World Community Grid Forums
Category: Community Forum: Chat Room Thread: Project Status (First Post Updated) |
Thread Status: Active Total posts in this thread: 141
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 873 Status: Offline Project Badges: |
Might there be a policy of not sending out replacements to give crunchers more time to return units after the deadlines? That way less crunching time is wasted. However some of mine in PVal jail have a wingman errored - those should not be held. The resends seem to go out in batches, which may be more efficient.
Mike
Mike,
One possible use of "grace periods" is to try to reduce the number of retries that end up Server Aborted (if they aren't stuck Waiting to be sent...); however, I don't think there's any other mechanism that mitigates missed deadlines... And yes, I wish they'd re-introduce grace periods, preferably with shorter underlying deadlines (given that most modern systems don't need 4+ days to process a single MCM1 (or SCC1) task!) -- it might discourage some of the proponents of large buffers and lots of missed deadlines :-)
As for other mechanisms, given the number of different ways a standard BOINC feeder can select what to pass to the scheduler's buffer, it's quite possible that some selections might act to give preference to more recent work rather than to retries for older WUs. There are various ways in which clusters of tasks might be grouped together as far as feeding is concerned, and it's not clear to me how some of them are supposed to work -- I did a code dive but it didn't help much, as I then needed to know how the workunits were created and made visible to the feeder in the first place [which I don't!] :-)
So I've been waiting to see a workunit with a quick error return, to see whether it is blocking all retries or just retries that are requested several days after initial distribution -- finally one turned up (with a download error) and, lo and behold, it scored a retry almost immediately. Looks as if there's something sensitive to how old the WUs are (whichever feeder parameters they use), and that might cause retries requested after several days to back up until the feeder is starved of new work, something we have seen happen in the past.
Cheers - Al. |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 873 Status: Offline Project Badges: |
Looks like I've collected one of those strange WUs where one of the initial tasks is still waiting to be sent. It can keep Sgt. Joe's oddball task company :-)
Here's the current state of said WU as reported by one of my scripts -- I've checked it on the web site too...:
Task MCM1_0219547_8363_1 was returned by [redacted] at 2024-06-24T02:25:06+0000:
Cheers - Al. |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2089 Status: Recently Active Project Badges: |
It's always nice seeing a workunit with one task "Waiting to be sent" turning into "Other" when one wingman finally returns their task (a little bit late):
Result name Status Sent time Due / Return time CPUtime/Elapsed Claimed/Granted
[Copied from Workunit Status, generated by wcgformat (using these options: -oo)]
Adri |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline Project Badges: |
I hope Ralf's definition of PVal jail is any task waiting for validation, rather than tasks where two [or more] wingmen are waiting for validation :-) -- as far as I can tell, validations that don't need retries still seem to be happening at normal rates, so I'd be seriously worried if he's seeing something very different....
Well, "PVa Jail", at least for as long as I have been part of WCG (13 1/2 years now), refers to WUs that have a status of "Pending VAlidation" (in contrast to PVe/Pending VErification). The ones I referred to (as mentioned, I did only a very quick check, as I had to leave for a number of off-site appointments and didn't come back until late at night) were WUs where one of my hosts had returned a result 8 days ago and the wingman, sent out 8 days ago, either had not returned a result at all by the deadline (typical hoarder result) or had returned a result labeled "Error", and the subsequent resend (either _2 or _3, on those quick samples) was sitting at "Waiting to be sent".
One of those results, just as an example, is one returned from my Android tablet (https://www.worldcommunitygrid.org/contribution/workunit/541041235), which shows
MCM1_0219026_4289_1 Android 4.19.157-perf-g19050d39787c (Android 13) Error 2024-06-14 23:55:31 UTC 2024-06-20 23:56:09 UTC
MCM1_0219026_4289_2 Android 4.14.87+ (Android 9) No Reply 2024-06-20 23:56:46 UTC 2024-06-23 23:56:46 UTC
MCM1_0219026_4289_3 Waiting to be sent
Another one is this WU (https://www.worldcommunitygrid.org/contribution/workunit/542238013), which a Windows 10 host of mine returned with 25; the wingman sent the WU back after 8 days, resulting in an "Error" again (2 days after the deadline), and now a _2 WU is sitting at "Waiting to be sent".
A third example is the WU (https://www.worldcommunitygrid.org/contribution/workunit/543551725) from my Windows 10 programming laptop, which is a very reliable cruncher and returned it within 13.5h of receiving it; the wingman sent back his/her WU 5h after the deadline, resulting in an "Error", and again a _2 resend is sitting at "Waiting to be sent".
These are now 3 from the first page of my oldest WUs sitting in PVa jail, with the overall number having increased by about 10% since yesterday (roughly 20h since I made the last reply).
It seems interesting (OK, that perception might be relative) that all those WUs are stuck seemingly because one of the original 2 WUs was returned with a rather non-descriptive state of "Error", with the only info on the Error link being the client version of the wingman (7.16.11 and 8.02, respectively, for the wingmen of those two Windows 10 hosts I mentioned).
Ralf
[Edit 3 times, last edit by TPCBF at Jun 25, 2024 3:39:29 PM] |
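The timestamps quoted for workunit 541041235 can be turned into intervals with a few lines of Python. This is only a sketch under stated assumptions: it assumes the second timestamp on the _1 row is the return time of the Error result and the second timestamp on the _2 row is the due time of the No Reply result (the page shows a combined "Due / Return time" column); the variable names are made up for illustration.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"

def parse_utc(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as a UTC timestamp."""
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)

# Values copied from the post; their interpretation is an assumption.
sent_1 = parse_utc("2024-06-14 23:55:31")      # _1 sent
returned_1 = parse_utc("2024-06-20 23:56:09")  # _1 came back as Error
sent_2 = parse_utc("2024-06-20 23:56:46")      # _2 sent
due_2 = parse_utc("2024-06-23 23:56:46")       # _2 due (No Reply)

turnaround_1 = returned_1 - sent_1   # how long _1 was out
resend_delay = sent_2 - returned_1   # gap between the Error and the _2 resend
window_2 = due_2 - sent_2            # time allowed for the _2 resend

print("turnaround of _1:", turnaround_1)
print("delay before _2 was sent:", resend_delay)
print("window given to _2:", window_2)
```

Under those assumptions the _1 task was out for just over six days, the _2 resend followed within a minute, and _2 was given a three-day window; only the _3 retry is stuck.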
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 873 Status: Offline Project Badges: |
Ralf,
Thanks for that; I might be able to account for tasks that return Error with only the client information... I suspect most [if not all] of those are probably results that return an exit code that means "Not started by deadline" -- unlike many other BOINC sites, WCG doesn't ever seem to flag tasks as NSD, but there must be many such...
Evidence for that hypothesis is two-fold; a lot of the wingmen I see that return Error with just client version seem to do so fairly close to the task deadline, whilst some others seem to be marked No Reply at first, then transition to that strange Error state (presumably when the client returns the NSD code somewhat later.)
Unfortunately, without access to the process exit code (which the new API doesn't provide so no access to it for wingmen), this has to remain unproven unless someone from WCG can look at some of the records concerned and identify the code.
Cheers - Al.
P.S. I don't care how "PVal jail" is used as long as it's obvious in context -- I first encountered the term during one of the periods when some validators weren't running; hence my different usage, as it appeared that the term was applied to the WU, not a single result... Given how imprecise a lot of posters can be, there's really no point in debating it, is there? :-) |
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2089 Status: Offline Project Badges: |
Oh yeah, the MCM "waiting to be sent" issue is back, with a vengeance.
|
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2089 Status: Recently Active Project Badges: |
This afternoon I noticed a hiccup in the distribution of MCM1-workunits when I looked at my hourly reports:
547922751 │ * 10:46 (MCM1)
This is a survey of 10,000 workunits, split into 10 equal pieces, performed with the data of my latest received workunit (at the top), and 10 older ones. Each workunit has a unique ID and the workunits in this survey are exactly 1,000 IDs apart from each other.
Now, you may have noticed that the number (or ID) 547918751 (let's call it M for middle one) is missing from the survey. OK, I've changed my mind, M is for missing one. So I decided to perform a small investigation by testing what had happened to 'M', plus 40 other workunit-IDs in the range from M minus 20 to M plus 20. The outcome is spectacular, to say the least, if I may say so! Let's have a look:
The missing one (M) is still Waiting to be sent:
https://www.worldcommunitygrid.org/contribution/workunit/547918751 -
Result name OS type Status Sent time
Furthermore, five other ones are also still Waiting to be sent and their IDs are:
547918731
547918735
547918739
547918743
547918747
Looking a bit closer, the difference to the other ones is constantly exactly 4. Digging a bit deeper, I could find some more Waiting workunits with their IDs:
547918715
547918719
547918723
547918727
I didn't dig any deeper than 547918651, so this is it for the moment.
So if there are 4 instances that should be distributing work, then one of them is not doing its work properly.
Adri |
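The "exactly 4 apart" observation can be checked mechanically. The following is a minimal sketch using only the workunit IDs listed in the post; the helper names `id_gaps` and `residues` are made up for illustration, and nothing here says how WCG actually assigns IDs to its servers.

```python
# Workunit IDs reported in the post as still "Waiting to be sent".
waiting_ids = [
    547918715, 547918719, 547918723, 547918727,
    547918731, 547918735, 547918739, 547918743, 547918747,
    547918751,  # 'M', the missing one
]

def id_gaps(ids):
    """Differences between consecutive IDs after sorting."""
    ordered = sorted(ids)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def residues(ids, modulus):
    """Distinct remainders of the IDs when divided by modulus."""
    return {wu_id % modulus for wu_id in ids}

gaps = id_gaps(waiting_ids)
stride = gaps[0]

print("gaps between waiting IDs:", gaps)
print("remainders modulo", stride, ":", residues(waiting_ids, stride))
```

If every gap is the same and all IDs share a single remainder modulo that gap, the stuck workunits fall into one "lane" out of that many, which fits the guess that one distributing instance is not doing its work. It does not prove it, since only IDs down to 547918651 were examined.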
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Recently Active Project Badges: |
Very interesting Adri. One distribution channel seems to be plugged.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
TigerLily
Senior Cruncher Joined: May 26, 2023 Post Count: 280 Status: Offline Project Badges: |
Hi Adri,
Thanks for passing this along. I forwarded your post to a member of the tech team. They believe this may be in some way related to the hardware failure we experienced last week, in which one of the 6 workunit management servers was lost. They are going to investigate this issue and I will pass on any updates that I receive from them. |
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 858 Status: Offline Project Badges: |
Thank you TigerLily ! Thank you Adri and others for discussing this and posting useful information!!
On a personal note: I'm taking a holiday, so of course APR is starting back up. I hope it goes smoothly, and save some for me. |
||
|
|