| World Community Grid Forums
Thread Status: Active Total posts in this thread: 3593
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Nearly 2,000 out of 35,609 - see my Sunday Reports.
Mike |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
My "15 or so average" is across 3 Ryzen systems (running Linux) -- the way I have them set up means I could process about 30 a day if only I could get the tasks in the first place, so I have days with 27 or 28 along with days with [low] single figures (ignoring days when there simply isn't any ARP1 work at all!). Improving throughput would help lower the expected project completion date!!
As for reporting stuck units -- I've not actually reported one since that mass of errors that needed WUs to be re-run with smaller time steps way back in 2021 and 2022, and a single task in late 2024 that got too many download errors! In the 2021/22 cases, I suspect nearly all of those cells got going again, as IBM had enough folks available to concentrate on that; as for later cases, I suspect that they've not all been examined. [There was more but it was getting too long so I've cut most of it to make another post...] Cheers - Al. |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Mike Gibson is trying to track the number of stuck tasks, and I see that he has [re]posted the latest estimate whilst I was preparing an earlier version of this post. It appears to be based on the premise that anything not in a generation currently displaying movement is stuck, which is as good as any other method but assumes that all non-stuck candidates are considered for processing on each sweep (which may or may not be true, depending on the queries used by the work management system...). Without access to the server-side databases and/or knowledge of the queries used we can only estimate, after all :-(
My definition of a "stuck" cell is one where the last WU processed failed in some fashion. Some other folks might define it more simply as "hasn't moved for quite a while" and those may not be identical sets :-), in which case why aren't they moving?... There are 448 cells that don't get counted in the WCG-provided daily state.txt file but the generations.txt file still refers to the full 35609 cells -- are those stuck or has the ARP1 flow management system just lost track of them for some reason?
There are currently about 710 "Extreme" and "Accelerated" cells according to generations.txt, and all the "lost" cells would appear to be from that set. Of course, we can't be 100% sure whether all those cells regarded as needing faster processing are not being processed because they are stuck or "lost"; some may just be waiting for the work generation mechanism to get round to those categories again (and that may be waiting on WCG looking into why genuinely stuck units are stuck and doing something about it). We also don't know whether some cells may have been written off permanently.
The latest update implied complete final coverage but the "within 12 months" looks extremely unrealistic given that some Extremes are so far behind that it would take at least 100 days for them to get up to "Normal" status, then it would be down to whether WCG can significantly improve the throughput of the 6-day limit jobs (see Mike's weekly summaries for completion estimates). As it is, I suspect that sorting out stalled/lost ARP1 grid cells is, understandably, fairly low on their list of priorities at present... Cheers - Al. |
||
|
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 2173 Status: Offline Project Badges:
|
"Und die Fahrt will schneller!"
All 4 WUs that my hosts returned today were followed by an almost immediate new ARP1 WU. If only this became the norm... Ralf |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Al
My stuck units are taken from generations.txt, which is the only complete list. My definition is any units that have not moved in 3 weeks, which has worked except in 1 instance of a late runner, since corrected. I used the word "stuck" to refer to any not moving, for whatever reason. I am assuming that all 1,965 units still in generations 021 - 145 are stuck, plus most of the 13 in generation 146. My forecast of completion in December is based on movement over the last 5 weeks but qualified by the fact that the oldest stuck units - the ultras - are now 127 generations behind. Mike |
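For anyone wanting to try the 3-week rule on their own saved copies, here is a minimal sketch. It is not Mike's spreadsheet: the layout of generations.txt is not shown in this thread, so the code assumes the weekly files have already been parsed into one dictionary per Sunday (cell ID to generation number). The name `find_stuck` and the cell IDs in the usage note are hypothetical.

```python
from typing import Dict, List

# Assumed input: one dict per weekly snapshot, oldest first, mapping a
# cell ID to the generation that cell was on. Parsing generations.txt
# into this shape is left out because its format is not given here.
Snapshot = Dict[str, int]


def find_stuck(snapshots: List[Snapshot], weeks: int = 3) -> List[str]:
    """Return cells whose generation is unchanged over the last `weeks` weeks."""
    if len(snapshots) < weeks + 1:
        raise ValueError("need at least weeks + 1 snapshots")
    window = snapshots[-(weeks + 1):]  # weeks + 1 snapshots span `weeks` intervals
    latest = window[-1]
    stuck = []
    for cell, gen in latest.items():
        # Stuck only if the cell shows the same generation in every
        # snapshot of the window; a cell missing from one is not counted.
        if all(snap.get(cell) == gen for snap in window):
            stuck.append(cell)
    return sorted(stuck)
```

With four made-up Sunday snapshots such as `[{"a": 21, "b": 144}, {"a": 21, "b": 145}, {"a": 21, "b": 146}, {"a": 21, "b": 147}]`, only cell `a` is flagged. Note that this flags "not moving for whatever reason", so it cannot tell a failed cell from a "lost" one.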
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Mike,
Thanks for the clarification. I presumed you counted the "lost" and the known, but wondered if you had a time interval on the latter; given the current generation clearance rate, 3 weeks makes sense! Although I have all the relevant data as generations.txt files I've not done that level of detail on analysis of movement. I do have a little spreadsheet that I use to run a moving average of units completed and watch the [approximate] balance between units completed and results returned; I can then compare those with my [small-scale] personal experience of work gathered and the effect of slow returners. It's quite interesting spotting the days where there seem to be significant numbers of late returners causing triple validations, but it doesn't do much for my blood pressure... :-) I'd love to have read-only access to [part of] their ARP1-related databases to have a lot more data to work with, but that's [quite rightly] not going to happen.... :-) Cheers - Al. P.S. This is my second attempt at posting this -- the first attempt kicked me out to the login prompt and when I eventually got past the system errors to log in again it auto-posted a repeat copy of my last post about that rogue Windows job (which I deleted before posting this!) |
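The moving average Al mentions could be sketched roughly as below. This is only an illustration: the post does not say what window length the spreadsheet uses, so the 7-day default is an assumption, and `moving_average` is a hypothetical helper name.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int = 7) -> List[float]:
    """Trailing moving average of daily counts.

    The 7-day window is an assumption. Early entries average whatever
    data is available so far instead of being dropped.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    buf = deque(maxlen=window)  # keeps only the most recent `window` values
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```

Running the same function over the daily "units completed" and "results returned" series gives two smoothed curves; days with many late returners should show up as the second pulling away from the first.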
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
"P.S. This is my second attempt at posting this -- the first attempt kicked me out to the login prompt and when I eventually got past the system errors to log in again it auto-posted a repeat copy of my last post"
I recognize that. It happens to me, too, sometimes. One trick I've learned is to open a new tab, log in there, and close it. That returns me to the original tab, where the forum wanted me to log in (which I have already done in the tab I just closed); I then go back one page in history to return to the post in progress. It worked for me in Firefox 140, YMMV. Adri |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Al
I use a spreadsheet that started in July 2022, when generations.txt started, and the data has been retained for every Sunday since, apart from when ARP has been suspended. It uses the max units currently on generation, calculates the validated units in the week, and forecasts the end date based on the last 5 weeks of validated units. Mike |
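A rough sketch of a forecast along those lines, for illustration only. The spreadsheet's actual formula is not given in the thread; this version assumes a simple straight-line projection from the mean of the last 5 weekly totals, and `forecast_end` and its parameters are hypothetical names.

```python
from datetime import date, timedelta
from math import ceil
from typing import Sequence


def forecast_end(today: date, remaining_units: int,
                 weekly_validated: Sequence[int], window: int = 5) -> date:
    """Project a completion date from the mean of the last `window` weeks.

    Assumes the recent weekly rate continues unchanged, which ignores
    stuck units that never get validated.
    """
    recent = list(weekly_validated)[-window:]
    if not recent:
        raise ValueError("no weekly data")
    rate = sum(recent) / len(recent)  # units validated per week
    if rate <= 0:
        raise ValueError("no progress in the window, cannot forecast")
    weeks_left = ceil(remaining_units / rate)
    return today + timedelta(weeks=weeks_left)
```

For example, 1000 remaining units at a steady 100 validated per week gives 10 weeks from the starting Sunday. Any forecast of this kind carries the same qualification Mike gives: units that are stuck do not clear at the average rate.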
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Harking back to bfmorse's rogue Windows task (ARP1_0012850_148, WU 738550711), it would finally appear to be about to get stuck... The first 6 attempts have one No Reply and 5 more or less identical Error reports, so unless the 7th attempt actually succeeds (no odds on that!) it'll have 6 Errors and out...
I'll report the next stage later (if someone else doesn't beat me to it...) Cheers - Al. |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
It looks as though WCG are anticipating failure as they haven't replaced _5 yet.
Mike |
||
|
|
|