World Community Grid Forums
Thread Status: Active | Total posts in this thread: 3595
Mike.Gibson
Ace Cruncher, England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
I thought that might provoke something.
Generation 147 has now started.

Mike
TonyEllis
Senior Cruncher, Australia | Joined: Jul 9, 2008 | Post Count: 286 | Status: Offline
Once "stuck" workunits are released, is it feasible to issue workunits in the Extreme category only to machines considered as reliable. This should also help to minimise delays caused by overdue workunits.
----------------------------------------
Run Time Stats https://grassmere-productions.no-ip.biz/
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1327 | Status: Offline
Regarding "stuck" workunits...
It is possible that some stuck units don't need a change of time step to get them moving. An example would be some Darwin tasks which went "Too Late" when 5 or 6 returned results failed to validate for some reason (Unixchick saw a few of these during April/May, along with some where it took 6 or 7 tasks to get a validation!). I don't think we have ever been told what tools IBM set up to look at why ARP1 WUs stall, so it is unclear how easy it is for them to determine an appropriate restart. (There was some [brief] discussion on this in the Extremes thread at one point.)

As for "reliable" hosts, I take that to mean systems that have returned a sequence of validated results. Unfortunately, that takes no account of how long it took to do so! However, as the client should give precedence to tasks with shorter deadlines at some point (the infamous "panic mode"...), the combination of that with sending out three tasks instead of two should be sufficient.

As an aside, according to the server documentation there is also the capability to analyse host performance to see things like average return times, which might be helpful here. However, I believe it entails running a periodic task to perform that analysis, and I don't know whether WCG runs it or not (if the facility actually exists!). I suspect it hammers the database, so it would probably only be run once or twice a day.

Cheers - Al.
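As a rough illustration of the two criteria discussed here -- a run of validated results plus an acceptable average turnaround -- the following is a minimal Python sketch. It is not WCG's or BOINC's actual scheduler code; the class, field names and threshold values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-host statistics; the field names are illustrative and
# do not correspond to the real BOINC database schema.
@dataclass
class HostStats:
    consecutive_valid: int        # validated results returned in a row
    avg_turnaround_hours: float   # average time from send to report

def is_reliable(host: HostStats,
                min_consecutive_valid: int = 10,
                max_avg_turnaround_hours: float = 48.0) -> bool:
    """Treat a host as 'reliable' only if it has a run of validated results
    AND its average turnaround is short enough -- combining the two criteria
    mentioned above (the thresholds are made-up examples)."""
    return (host.consecutive_valid >= min_consecutive_valid
            and host.avg_turnaround_hours <= max_avg_turnaround_hours)

if __name__ == "__main__":
    fast_host = HostStats(consecutive_valid=25, avg_turnaround_hours=18.0)
    slow_host = HostStats(consecutive_valid=25, avg_turnaround_hours=90.0)
    print(is_reliable(fast_host))  # True
    print(is_reliable(slow_host))  # False: plenty of valid results, but slow
```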
geophi
Advanced Cruncher, U.S. | Joined: Sep 3, 2007 | Post Count: 113 | Status: Offline
On the definition of reliable, this is what knreed said in an Aug 2, 2021 post in this thread:
https://www.worldcommunitygrid.org/forums/wcg...,41910_offset,1220#662797
"... 'reliable' hosts (which are hosts with a history of returning results quickly and that returned a number of consecutive jobs without errors)."

In an earlier post (July 26, 2021) he said the definition of "reliable" included hosts returning results in 2 days or less; previously it was 2.5 days.
https://www.worldcommunitygrid.org/forums/wcg...,41910_offset,1200#662425

Whether those same criteria and configuration are set up and working in the Jurisica Lab implementation of WCG, I don't know.
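For what it's worth, the 2-day figure quoted above is just an average-turnaround cutoff, so it can be checked with very little arithmetic. Here is a small Python sketch using made-up (sent, reported) timestamps for a single host; only the 2-day cutoff comes from the quoted post.

```python
from datetime import datetime, timedelta

# Made-up (sent, reported) timestamps for one host's recent results.
results = [
    (datetime(2025, 6, 10, 8, 0), datetime(2025, 6, 11, 6, 30)),
    (datetime(2025, 6, 11, 9, 0), datetime(2025, 6, 12, 20, 0)),
    (datetime(2025, 6, 12, 7, 0), datetime(2025, 6, 13, 11, 45)),
]

turnarounds = [reported - sent for sent, reported in results]
avg_turnaround = sum(turnarounds, timedelta()) / len(turnarounds)

# The quoted posts give a 2-day (previously 2.5-day) cutoff for "reliable".
print(f"average turnaround: {avg_turnaround}")
print("within the 2-day cutoff:", avg_turnaround <= timedelta(days=2))
```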
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1327 | Status: Offline
geophi,
Thanks for that... It sent me back to the documentation to see what I'd missed! The "reliable" hosts thing is, indeed, as Kevin said - I'd forgotten that there was a configurable option for the response time expected of "reliable" hosts...

So the answer to TonyEllis's question is yes -- all they have to do is tag the initial WUs with a priority at least as high as the need-reliable priority configured for the particular application (and that should already be set in order to deal with retries...). I'm guessing it's already set up like that now -- this year's turnaround on Extremes seemed to be quite good :-)

As for the periodic task I remembered... it is only relevant if the "Multi-size apps" option is applied to a project. That is supposed to help in sending large jobs to fast systems and smaller jobs to slower ones (e.g. Android devices that are likely not to be running 24/7). That may be useful if MAM1 production runs show as much variability as some of the beta runs!

Cheers - Al
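The dispatch rule being described -- send a work unit only to reliable hosts once its priority reaches the configured need-reliable threshold -- can be sketched in a few lines. This is loosely modelled on the behaviour discussed here, not taken from WCG's actual configuration; the threshold value and names are illustrative.

```python
# Hypothetical project setting: work units at or above this priority are
# only handed to hosts currently classified as reliable.
NEED_RELIABLE_PRIORITY = 10

def may_send(wu_priority: int, host_is_reliable: bool) -> bool:
    if wu_priority >= NEED_RELIABLE_PRIORITY:
        return host_is_reliable   # high-priority work: reliable hosts only
    return True                   # ordinary work: any host will do

# Tagging the *initial* tasks of an Extreme work unit with the threshold
# priority would therefore restrict them to reliable hosts from the start.
print(may_send(wu_priority=10, host_is_reliable=False))  # False
print(may_send(wu_priority=10, host_is_reliable=True))   # True
print(may_send(wu_priority=0,  host_is_reliable=False))  # True
```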
Mike.Gibson
Ace Cruncher, England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
As I understood it, reliable meant returning within the deadline and validating. If either criterion failed, another 10 results had to conform to regain reliable status.

The problem is in setting as low a deadline as possible, to speed up the process, without so many results missing the deadline that the number of eligible machines is reduced too much. We are currently at about 1.5% in the current Extreme setting, so a 36-hour deadline should get enough done to catch up.

But please bear in mind that reducing the time step means increasing crunching time, so fewer machines would qualify as reliable.

Mike

[Edit 1 times, last edit by Mike.Gibson at Jun 14, 2025 1:47:21 PM]
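To put a rough number on that trade-off: if runtime scales inversely with the time step (a simplification, assumed here purely for illustration), halving the step roughly doubles the crunching time, which pushes some hosts past a fixed deadline.

```python
# Back-of-the-envelope sketch: halving the model time step roughly doubles
# the number of steps, so -- assuming runtime scales inversely with the
# step, which is a simplification -- task runtime roughly doubles.

def scaled_runtime(current_hours: float, old_step: float, new_step: float) -> float:
    return current_hours * (old_step / new_step)

deadline_hours = 36.0
for current in (10.0, 15.0, 20.0):
    new = scaled_runtime(current, old_step=1.0, new_step=0.5)  # step halved
    verdict = "still inside" if new <= deadline_hours else "now outside"
    print(f"{current:4.1f} h task -> {new:4.1f} h, {verdict} a {deadline_hours:.0f} h deadline")
```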
gj82854
Advanced Cruncher | Joined: Sep 26, 2022 | Post Count: 122 | Status: Offline
Send them all to me. I'll crunch them. I can run 150 concurrently and meet the 36-hour window.
I'm running 30 concurrently now...
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1327 | Status: Offline
According to the [old] BOINC wiki, the "deadline" for reliable is a project parameter, not the work-unit deadline (but that's me being pedantic!)...
If that's still correct, setting that parameter to (say) 36 hours would work fine at present, but if it turns out that they need to halve the time step it might need to be pushed out to 48 hours...

On the other hand, there are quite a lot of systems out there nowadays that can run an ARP1 task in well under 10 hours, and [at least, on Linux] I'm seeing about 50% of each day's wingmen's work being returned within 24 hours. So perhaps leaving it at the [hypothesised] 36-hour setting would be fine anyway :-) -- there ought to be lots of capacity to handle such tasks in under 24 hours (especially if not all stalled tasks need a time step change anyway...).

[Edited to note that gj82854's post that landed whilst I was compiling this serves to confirm my point!]

Perhaps someone from WCG might chip in to tell us the current setting of the reliable_max_avg_turnaround parameter for ARP1 if they see these posts :-)

Cheers - Al.

[Edit 2 times, last edit by alanb1951 at Jun 14, 2025 4:11:12 PM]
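One way to sanity-check a candidate value for that parameter is simply to count how many recent wingman returns would fall inside it. A small sketch with made-up return times (in hours); the real distribution would have to come from WCG's own data.

```python
# Made-up wingman return times in hours, purely for illustration.
returns_hours = [8, 14, 19, 22, 26, 31, 35, 40, 47, 55, 70, 90]

for candidate in (24, 36, 48):
    within = sum(1 for h in returns_hours if h <= candidate)
    pct = 100.0 * within / len(returns_hours)
    print(f"<= {candidate:2d} h: {within}/{len(returns_hours)} returns ({pct:.0f}%)")
```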
Mike.Gibson
Ace Cruncher, England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
Please also bear in mind that the latest average completion time is almost 85 hours.
Mike
gj82854
Advanced Cruncher | Joined: Sep 26, 2022 | Post Count: 122 | Status: Offline
Is that wall-clock time or CPU time? A lot can affect the wall-clock time, such as suspensions due to higher-priority tasks, etc.
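The distinction matters because wall-clock time keeps running while a task is suspended or waiting, whereas CPU time only advances while the task is actually computing. A tiny Python illustration (the sleep stands in for a suspension by a higher-priority task):

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU-time timer for this process

total = sum(i * i for i in range(2_000_000))  # some real computation
time.sleep(2)                                  # "suspended": no CPU used

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall-clock: {wall:.2f} s, CPU: {cpu:.2f} s")  # wall >> CPU here
```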