| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 3593
|
|
| Author |
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
However, one of the problems relates to the definition of 'reliable'. Because of the length of time needed to complete each unit and the need to hold a cache, very few machines can classify as 'reliable' even if they never have an error. I would say they need to tighten the criteria. If you keep the default buffer of 0.1 + 0.5 days, you would probably qualify with no problem. "Reliable" should be special, not ordinary. The intent should be to get good results back early of course. [Edit 1 times, last edit by Jim1348 at Nov 26, 2019 4:03:49 PM] |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Jim
If I was to implement your settings which would mean a minimum cache of 2.4 hours and a maximum cache of 14.4 hours, I would never get any WUs as they are taking 27 hours without counting any queuing time. Owing to the paucity of availability, the settings need to be at least 1.5 days + 1.5 days in order to get 1 and have another waiting. That would mean a turnaround of 3 days which is less than half the allowed time.
I think a better definition of 'reliable' would be half the allowed time, which could be implemented as an across the board definition.
Mike
[Edit 1 times, last edit by Mike.Gibson at Nov 26, 2019 6:26:51 PM] |
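The cache arithmetic in this post can be checked with a minimal sketch. The function names are hypothetical, the 7-day deadline is an assumption (it is not stated in this post), and the "worst case" model, where a task may wait for the whole buffer before it is returned, is a simplification of the reasoning above rather than the BOINC client's actual work-fetch logic.

```python
HOURS_PER_DAY = 24


def cache_window_hours(min_days, extra_days):
    """Return (minimum, maximum) buffered work in hours for the two cache settings."""
    minimum = round(min_days * HOURS_PER_DAY, 2)
    maximum = round((min_days + extra_days) * HOURS_PER_DAY, 2)
    return minimum, maximum


def worst_case_turnaround_days(min_days, extra_days):
    """Hypothetical simplification: a task may wait for the whole buffer before return."""
    return min_days + extra_days


DEADLINE_DAYS = 7  # assumed deadline, not taken from this post

# The 0.1 + 0.5 day settings expressed in hours.
print(cache_window_hours(0.1, 0.5))  # (2.4, 14.4)

# The 1.5 + 1.5 day settings compared against half the assumed deadline.
print(worst_case_turnaround_days(1.5, 1.5) < DEADLINE_DAYS / 2)  # True
```

With these numbers the 14.4-hour maximum is shorter than a single 27-hour work unit, while a 3-day turnaround still fits inside half of a 7-day deadline.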
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
"Jim If I was to implement your settings which would mean a minimum cache of 2.4 hours and a maximum cache of 14.4 hours, I would never get any WUs as they are taking 27 hours without counting any queuing time. Owing to the paucity of availability, the settings need to be at least 1.5 days + 1.5 days in order to get 1 and have another waiting. That would mean a turnaround of 3 days which is less than half the allowed time. I think a better definition of 'reliable' would be half the allowed time, which could be implemented as an across the board definition."
I have no problem with half the allowed time, if that works for you. But I have not had problems with the 0.1 + 0.5 day cache settings on any of my machines, except for work unit availability. I run Ryzens (1700, 2600, 2700) mainly, but also Coffee Lakes, all under Ubuntu 18.04.3. I think they should tailor "reliable" for the better machines, which are becoming more common anyway. However, whatever works for them is OK with me.
EDIT: The bottleneck seems to be identifying "reliable" machines. There may be no solution except sending out enough to find them. The more the "unreliable" machines suck them up, the longer that will take.
[Edit 2 times, last edit by Jim1348 at Nov 26, 2019 7:06:33 PM] |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Greetings,
I am making a few configuration changes after discussing them with the team. The rules for reliable hosts (per app version) had a clause that said avg turn around was 1.5 days. This was set for all projects. We are relaxing that to 2.5 days avg turn around to see if this helps get us more reliable hosts. This change is project wide.
Also, we are changing the reliable host time to complete from 35% of the original deadline to 50% of the original deadline. This means a 7 day workunit will have 3.5 days instead of the 2.8 we allocated before.
We will be trying these settings for the next few weeks, as this problem didn't show up right away for this project. It will give us a chance to evaluate if this is the right solution project wide or if changes in the code are needed to do it application by application.
I'm working on the deployment now, so it should be in place in about 30-45 minutes.
Thanks,
-Uplinger |
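As a quick check of the percentages in this post, here is a minimal sketch; the function name is hypothetical and this is plain arithmetic, not the server's actual rule. Note that 35% of a 7-day deadline works out to 2.45 days, while the 2.8 days quoted would correspond to 40%, so one of the two figures in the post may be approximate.

```python
def reliable_window_days(deadline_days, fraction):
    """Days a reliable host gets when its deadline is a fraction of the original."""
    return round(deadline_days * fraction, 2)


print(reliable_window_days(7, 0.35))  # 2.45
print(reliable_window_days(7, 0.40))  # 2.8
print(reliable_window_days(7, 0.50))  # 3.5
```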
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Thanks for reading the comments 🙌
PS, could not fathom 21 per 30 minutes (1008 a day) until seeing the noon stats.
[Statistics table: Date, Total Run Time]
Quite a bit less than the daily validation suggested... 1700-2000 before the randomization, then catching up to 2500. [Edit 1 times, last edit by Former Member at Nov 26, 2019 7:52:22 PM] |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
Since the average time to process one of these units is a little over a day, now that Uplinger has loosed the log jam, I expect the daily totals to rebound in a day or two. Not only is the availability back to a steady trickle, but since he has tweaked the reliable host issue, that should also increase the throughput. Hopefully we will see the effect in completed units by tomorrow.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Keith
Thank you for listening to our raving. If you consider an analogy with marketing, Delft would be the manufacturer, WCG would be the intermediary (or shopkeeper) and we would be the customer. Whilst the customer is not always right, woe betide a shopkeeper who doesn't listen to his customers.
I have now changed my cache settings in the device profile to connect every 1.5 days with 1.5 days extra cache, to allow for my 27 hours crunching time, in order to have a WU waiting when crunching finishes.
Mike
[Edit 1 times, last edit by Mike.Gibson at Nov 26, 2019 9:02:23 PM] |
||
|
|
littlepeaks
Veteran Cruncher USA Joined: Apr 28, 2007 Post Count: 748 Status: Offline Project Badges:
|
Keith --
Thanks for making work units available again for this project. I wondered why I hadn't received any in a long time. And I had rigged a solenoid to press my left mouse button every 15 seconds to hit the update button. (Just kidding, of course.) But anyway, I did receive 2 WUs this afternoon, which should give me my bronze badge. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I'd like to talk a little more about what "reliable" means, if I may.
I've come to the conclusion that I have a problem with any system which relies upon saying that a reliable host is one that returns a result within x days. That immediately excludes any machine with a cache larger than x. Why exclude them? BOINC Manager is designed to try to ensure that all tasks are returned by their deadline. Surely the test should be more along the lines of 'How many WUs has this machine returned which were over deadline?', or perhaps 'How long ago is it since this machine returned a WU which was over deadline?'.
Example: A (sub-)project has a deadline of 7 days. WUs average 1 day to process. I have a machine which is always on and has a cache of three days to guard against outages over a weekend. Most WUs will be returned after around 3 days -- too late to record the machine as reliable if the definition is 'within 2 days'. BUT, if you send my machine a WU with a deadline of 2 days, BOINC Manager will panic and start that WU straight away, and you'll get it back in 1 day - well within the two day deadline.
Shouldn't you be measuring how well machines do what they're told to do, and not just performance against some unknowable deadline? If BOINC Manager doesn't know about it, and so cannot react to it, it is simply arbitrary and not a useful measure. |
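The two ways of judging a host described in this post can be compared with a small sketch. Everything here is hypothetical (the `Result` record, both function names and the thresholds); it illustrates the argument only and is not how the WCG or BOINC server code actually classifies hosts.

```python
from dataclasses import dataclass


@dataclass
class Result:
    turnaround_days: float  # time from task being sent to result being returned
    deadline_days: float    # deadline the task was issued with


def reliable_by_turnaround(results, max_avg_days):
    """Rule of the 'returns within x days' kind: average turnaround under a threshold."""
    if not results:
        return False
    avg = sum(r.turnaround_days for r in results) / len(results)
    return avg <= max_avg_days


def reliable_by_deadline(results, max_missed=0):
    """Alternative rule from the post: count results returned after their deadline."""
    if not results:
        return False
    missed = sum(1 for r in results if r.turnaround_days > r.deadline_days)
    return missed <= max_missed


# The example machine: 3-day cache, 7-day deadline, results back after about 3 days.
history = [Result(turnaround_days=3.0, deadline_days=7.0) for _ in range(10)]

print(reliable_by_turnaround(history, max_avg_days=2.0))  # False
print(reliable_by_deadline(history))                      # True
```

The same history fails the turnaround rule and passes the deadline rule, which is the mismatch the post is describing.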
||
|
|
RTorpey
Advanced Cruncher Joined: Aug 24, 2005 Post Count: 67 Status: Offline Project Badges:
|
Agreed. I have a mix of older and newer machines, so the return time varies from 10-60 hrs. But they are solid machines and I very rarely get an error (so far, none on ARP)!
|
||
|
|
|