Thread Status: Active
Total posts in this thread: 3595
Posts: 3595   Pages: 360
This topic has been viewed 5883747 times and has 3594 replies
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Work Available

I thought that might provoke something.

Generation 147 has now started.

Mike
[Jun 12, 2025 11:07:53 PM]
TonyEllis
Senior Cruncher
Australia
Joined: Jul 9, 2008
Post Count: 286
Status: Offline
Re: Work Available

Once "stuck" workunits are released, is it feasible to issue workunits in the Extreme category only to machines considered reliable? This should also help to minimise delays caused by overdue workunits.
[Jun 13, 2025 2:14:20 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1327
Status: Offline
Re: Work Available

Regarding "stuck" workunits...

It is possible that some stuck units don't need a change of time step to get them moving. An example would be some Darwin tasks which went "Too Late" when 5 or 6 returned results failed to validate for some reason (Unixchick saw a few of these during April/May, along with some where it took 6 or 7 tasks to get a validation!).

I don't think we have ever been told what tools IBM set up to look at why ARP1 WUs stall, so it is unclear how easy it is for them to determine an appropriate restart. (There was some [brief] discussion on this in the Extremes thread at one point.)

As for "Reliable" hosts, I take that to mean systems that have returned a sequence of validated results. Unfortunately, that takes no account of how long it took to do so! However, as the client should give precedence to tasks with shorter deadlines at some point (the infamous "panic mode"...) the combination of that with sending out three tasks instead of two should be sufficient.

As an aside, according to the server documentation there is also the capability to analyse host performance to see things like average return times, which might be helpful here. However, I believe it entails running a periodic task to perform said analysis, and I don't know whether WCG runs that or not (if the facility actually exists!). I suspect it hammers the database, so it would probably only be run once or twice a day.

Cheers - Al.
[Jun 13, 2025 3:40:42 AM]
geophi
Advanced Cruncher
U.S.
Joined: Sep 3, 2007
Post Count: 113
Status: Offline
Re: Work Available

On the definition of reliable, this is how knreed described reliable hosts in an Aug 2 2021 post in this thread:

https://www.worldcommunitygrid.org/forums/wcg...,41910_offset,1220#662797

"'reliable' hosts (which are hosts with a history of returning results quickly and that returned a number of consecutive jobs without errors)."

In an earlier post (July 26 2021), he said the definition of "reliable" included hosts returning results in 2 days or less. Previously it was 2.5 days.

https://www.worldcommunitygrid.org/forums/wcg...,41910_offset,1200#662425
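
As a rough illustration of the criteria quoted above, a reliability check might look something like the sketch below. The names `HostStats` and `is_reliable` and the default of 10 consecutive valid results are assumptions made up for the example, not WCG's actual code or settings; only the 2-day turnaround figure comes from the quoted posts.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class HostStats:
    """Hypothetical per-host summary; field names are illustrative only."""
    avg_turnaround_s: float  # average time from task sent to result reported
    consecutive_valid: int   # results returned in a row without an error


def is_reliable(host: HostStats,
                max_avg_turnaround_s: float = 2 * SECONDS_PER_DAY,
                min_consecutive_valid: int = 10) -> bool:
    """Return True if the host is both quick and error-free.

    max_avg_turnaround_s: the "2 days or less" figure from the quoted post.
    min_consecutive_valid: an assumed placeholder; the real value is unknown.
    """
    quick = host.avg_turnaround_s <= max_avg_turnaround_s
    clean = host.consecutive_valid >= min_consecutive_valid
    return quick and clean
```

With these assumed defaults, a host averaging 36 hours with 12 clean results in a row would count as reliable, while one averaging 60 hours would not, however long its clean streak.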

Whether those same criteria and configuration are set up and working in the Jurisica Lab implementation of WCG, I don't know.
[Jun 13, 2025 6:18:46 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1327
Status: Offline
Re: Work Available

geophi,

Thanks for that... It sent me back to the documentation to see what I'd missed!

The "reliable" hosts thing is, indeed, as Kevin said - I'd forgotten that there was a configurable option for the response time expected for "reliable"...

So the answer to TonyEllis's question is yes -- all they have to do is tag the initial WUs with a priority at least as high as the need-reliable priority configured for the particular application (and that should already be set in order to deal with retries...)
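
To make the priority idea concrete, here is a minimal sketch of the gating rule described above. The names `need_reliable_priority`, `needs_reliable`, and `may_send` are hypothetical, and the strict "reliable hosts only" policy is an assumption; the real scheduler may merely prefer reliable hosts rather than exclude the others.

```python
def needs_reliable(wu_priority: int, need_reliable_priority: int) -> bool:
    """A workunit is 'need-reliable' once its priority reaches the threshold."""
    return wu_priority >= need_reliable_priority


def may_send(wu_priority: int, host_is_reliable: bool,
             need_reliable_priority: int) -> bool:
    """Assumed strict policy: need-reliable work goes only to reliable hosts.

    Work below the threshold can go to any host.
    """
    if needs_reliable(wu_priority, need_reliable_priority):
        return host_is_reliable
    return True
```

Under this sketch, tagging the initial Extreme workunits with a priority equal to or above the threshold would keep them away from hosts not yet classed as reliable, which is the effect TonyEllis asked about.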

I'm guessing it's already set up like that now -- this year's turnaround on Extremes seemed to be quite good :-)

As for the periodic task I remembered... it is only relevant if the "Multi-size apps" option is applied to a project. That is supposed to help in sending large jobs to fast systems and smaller jobs to slower ones (e.g. Android devices that are likely to not be running 24/7). That may be useful if MAM1 production runs show as much variability as some of the beta runs!

Cheers - Al
[Jun 14, 2025 8:42:54 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Work Available

As I understood it, reliable meant returning within the deadline and validating. If either criterion failed, another 10 results had to conform to regain reliable status.

The problem lies in setting the deadline as low as possible to speed up the process, without so many hosts missing it that the number of eligible machines drops too far.

We are currently at about 1.5% under the current Extreme setting, so a 36-hour deadline should get enough done to catch up.

But please bear in mind that reducing the TimeStep means increasing crunching time, so fewer machines would qualify as reliable.

Mike
----------------------------------------
[Edit 1 times, last edit by Mike.Gibson at Jun 14, 2025 1:47:21 PM]
[Jun 14, 2025 1:34:41 PM]
gj82854
Advanced Cruncher
Joined: Sep 26, 2022
Post Count: 122
Status: Offline
Re: Work Available

Send them all to me. I'll crunch them. I can run 150 concurrently and meet the 36-hour window.
I'm running 30 concurrently now....
[Jun 14, 2025 3:52:06 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1327
Status: Offline
Re: Work Available

According to the [old] BOINC wiki, the "deadline" for reliable is a project parameter, not the work-unit deadline (but that's me being pedantic!)...

If that's still correct, setting that parameter to (say) 36 hours would work fine at present, but if it turns out that they need to halve the time step it might need to be pushed out to 48 hours...

On the other hand, there are quite a lot of systems out there nowadays that can run an ARP1 task in well under 10 hours and [at least, on Linux] I'm seeing about 50% of each day's wingmen's work being returned within 24 hours. So perhaps leaving it at the [hypothesised] 36-hour setting would be fine anyway :-) -- there ought to be lots of capacity to handle such tasks in under 24 hours (especially if not all stalled tasks need a time step change anyway...).
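
A back-of-the-envelope check of the 36-hour versus 48-hour reasoning above, as a sketch. It assumes that run time scales inversely with the time step (so halving the step doubles the run time) and that a host may crunch for only a fraction of each day. The function name, the `duty_cycle` parameter, and the 50% duty cycle in the example are illustrative assumptions, not measurements.

```python
def fits_window(base_runtime_h: float, window_h: float,
                timestep_factor: float = 1.0,
                duty_cycle: float = 1.0) -> bool:
    """Estimate whether a task can be returned inside the turnaround window.

    base_runtime_h:  run time in hours at the current time step
    timestep_factor: new time step / current time step (0.5 means halved)
    duty_cycle:      fraction of wall-clock time the host actually crunches
    """
    # Assumption: a smaller time step means proportionally more steps to run.
    scaled_runtime_h = base_runtime_h / timestep_factor
    # A host that crunches only part of the time needs more wall-clock hours.
    wall_clock_h = scaled_runtime_h / duty_cycle
    return wall_clock_h <= window_h


# A 10-hour task with the time step halved, on a host crunching 24/7:
always_on_ok = fits_window(10, 36, timestep_factor=0.5)

# The same task on a host that crunches only half the time:
part_time_36 = fits_window(10, 36, timestep_factor=0.5, duty_cycle=0.5)
part_time_48 = fits_window(10, 48, timestep_factor=0.5, duty_cycle=0.5)
```

Under these assumptions the always-on host needs about 20 hours and fits a 36-hour setting comfortably, while the part-time host needs about 40 hours and would fit only if the setting were pushed out to 48 hours.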

[Edited to note that gj82854's post that landed whilst I was compiling this serves to confirm my point!]

Perhaps someone from WCG might chip in to tell us the current setting of the reliable_max_avg_turnaround parameter for ARP1 if they see these posts :-)

Cheers - Al.
----------------------------------------
[Edit 2 times, last edit by alanb1951 at Jun 14, 2025 4:11:12 PM]
[Jun 14, 2025 4:05:37 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Project Badges:
Reply to this Post  Reply with Quote 
Re: Work Available

Please also bear in mind that the latest average completion time is almost 85 hours.

Mike
[Jun 14, 2025 4:38:28 PM]
gj82854
Advanced Cruncher
Joined: Sep 26, 2022
Post Count: 122
Status: Offline
Re: Work Available

Is that wall-clock time or CPU time? A lot can affect the wall-clock time, such as suspensions due to higher-priority tasks.
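
For anyone unsure of the distinction, the small sketch below shows how the two measures can diverge. It uses Python's standard `time.perf_counter` (wall clock) and `time.process_time` (CPU time of the current process); the `measure` helper is a made-up name, and the sleep stands in for a task being suspended, which is only an analogy to what the BOINC client does.

```python
import time


def measure(fn):
    """Run fn() and return (wall_clock_seconds, cpu_seconds) for the call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping advances the wall clock but uses almost no CPU time,
# much like a task that sits suspended while something else runs.
wall_s, cpu_s = measure(lambda: time.sleep(0.2))
print(f"wall clock: {wall_s:.3f} s, CPU: {cpu_s:.3f} s")
```

An average based on wall-clock time would therefore include any periods when tasks were suspended or waiting, whereas a CPU-time average would not.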
[Jun 14, 2025 6:19:52 PM]