Total posts in this thread: 141
Posts: 141   Pages: 15   [ Previous Page | 6 7 8 9 10 11 12 13 14 15 | Next Page ]
This topic has been viewed 9930 times and has 140 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline
Re: Project Status (First Post Updated)

Might there be a policy of not sending out replacements to give crunchers more time to return units after the deadlines? That way less crunching time is wasted.

However some of mine in PVal jail have a wingman errored - those should not be held.

The resends seem to go out in batches, which may be more efficient.

Mike

Mike,

One possible use of "grace periods" is to try to reduce the number of retries that end up Server Aborted (if they aren't stuck Waiting to be sent...); however, I don't think there's any other mechanism that mitigates missed deadlines... And yes, I wish they'd re-introduce grace periods, preferably with shorter underlying deadlines (given that most modern systems don't need 4+ days to process a single MCM1 (or SCC1) task!) -- it might discourage some of the proponents of large buffers and lots of missed deadlines :-)

As for other mechanisms, given the number of different ways a standard BOINC feeder can select what to pass to the scheduler's buffer, it's quite possible that some selections might act to give preference to more recent work rather than to retries for older WUs. There are various ways in which clusters of tasks might be grouped together as far as feeding is concerned, and it's not clear to me how some of them are supposed to work -- I did a code dive but it didn't help much, as I then needed to know how the workunits were created and made visible to the feeder in the first place [which I don't!] :-)

So I've been waiting to see a workunit with a quick error return, to see whether it is blocking all retries or just retries that are requested several days after initial distribution -- finally one turned up (with a download error) and, lo and behold, it scored a retry almost immediately. Looks as if there's something sensitive to how old the WUs are (whichever feeder parameters they use), and that might cause retries requested after several days to back up until the feeder is starved of new work, something we have seen happen in the past.

Cheers - Al.
[Jun 25, 2024 3:44:41 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline
Re: Project Status (First Post Updated)

Looks like I've collected one of those strange WUs where one of the initial tasks is still waiting to be sent. It can keep Sgt. Joe's oddball task company :-)

Here's the current state of said WU as reported by one of my scripts -- I've checked it on the web site too...
:
Task MCM1_0219547_8363_1 was returned by [redacted] at 2024-06-24T02:25:06+0000:
Work-unit 546414343 created 2024-06-23T20:31:11+0000
Sent date 2024-06-23T22:12:32+0000, deadline 2024-06-29T22:12:32+0000.
CPU time 1.0541 hours, elapsed time 1.05572 hours,
status is Pending Validation
O/S version is Ubuntu 22.04.4 LTS [6.5.0-35-generic|libc 2.35]
The workunit has 2 potential results: wingman data follows.
MCM1_0219547_8363_0 assigned to an unknown device on unknown O/S
O/S version is unknown
time sent was unknown
due time was unknown
no return time available
status is Waiting to be sent

Cheers - Al.
[Jun 25, 2024 4:22:30 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Recently Active
Re: Project Status (First Post Updated)

It's always nice to see a workunit with one task "Waiting to be sent" turn into "Other" when a wingman finally returns their task (a little bit late):
Result name         Status Sent time           Due / Return time   CPUtime/Elapsed Claimed/Granted
MCM1_0219113_2072_0 Valid 2024-06-16 14:14:39 2024-06-24 20:32:03 2.12/2.19 47.4/60.5
MCM1_0219113_2072_1 Valid 2024-06-16 14:14:50 2024-06-17 03:05:23 2.13/2.14 73.6/60.5
MCM1_0219113_2072_2 Other - - -/- -/-
[Copied from Workunit Status, generated by wcgformat (using these options: -oo)]

Adri
[Jun 25, 2024 1:07:26 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1932
Status: Offline
Re: Project Status (First Post Updated)

I hope Ralf's definition of PVal jail is any task waiting for validation, rather than tasks where two [or more] wingmen are waiting for validation :-) -- as far as I can tell, validations that don't need retries still seem to be happening at normal rates, so I'd be seriously worried if he's seeing something very different....
Well, "PVa Jail", at least for as long as I have been part of WCG (13 1/2 years now), refers to WUs that have a status of "Pending VAlidation" (in contrast to PVe/Pending VErification). And those that I referred to (as mentioned, I did only a very quick check, as I had to leave for a number of off-site appointments and didn't come back until late in the night) were WUs where one of my hosts had returned a result 8 days ago, and the wingman, sent out 8 days ago, either had not returned a result at all by the deadline (typical hoarder result :-( ) or had returned a result labeled as "Error", and the subsequent resend (either _2 or _3, on those quick samples) was sitting at "Waiting to be sent".
One of those results, just as an example, would be the one returned from my Android tablet (https://www.worldcommunitygrid.org/contribution/workunit/541041235), which shows
    MCM1_0219026_4289_0 Android 3.10.49-12343953 (Android 7.1.1) Pending Validation 2024-06-14 23:55:05 UTC 2024-06-16 01:19:13 UTC 15.07 / 15.38 72.4 / 0
    MCM1_0219026_4289_1 Android 4.19.157-perf-g19050d39787c (Android 13) Error 2024-06-14 23:55:31 UTC 2024-06-20 23:56:09 UTC
    MCM1_0219026_4289_2 Android 4.14.87+ (Android 9) No Reply 2024-06-20 23:56:46 UTC 2024-06-23 23:56:46 UTC
    MCM1_0219026_4289_3 Waiting to be sent
So my WU was returned within roughly 25h; my wingman got his WU 26 sec after I got mine, which resulted 6 days later in an "Error"; the _2 resend wasn't returned within the 3-day deadline; and now a _3 resend has been sitting there as "Waiting to be sent" for roughly a day and a half.
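
The intervals quoted above can be checked against the timestamps in the listing. This is only a minimal Python sketch, assuming the times are UTC exactly as printed on the workunit page; the variable names are made up for illustration, and for the _1 row the second timestamp is simply taken as printed (the page does not say whether it is a due time or a return time).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def parse(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp as printed in the listing."""
    return datetime.strptime(ts, FMT)

# Timestamps copied from the MCM1_0219026_4289 listing above.
sent_0, returned_0 = parse("2024-06-14 23:55:05"), parse("2024-06-16 01:19:13")
sent_1, second_1 = parse("2024-06-14 23:55:31"), parse("2024-06-20 23:56:09")
sent_2, due_2 = parse("2024-06-20 23:56:46"), parse("2024-06-23 23:56:46")

turnaround_0 = returned_0 - sent_0   # how long the _0 task took to come back
wingman_lag = sent_1 - sent_0        # gap between the two initial sends
error_delay = second_1 - sent_1      # sent time to second timestamp on the _1 row
deadline_2 = due_2 - sent_2          # deadline window given to the _2 resend

print(f"_0 turnaround : {turnaround_0.total_seconds() / 3600:.1f} h")
print(f"wingman lag   : {wingman_lag.total_seconds():.0f} s")
print(f"_1 error delay: {error_delay}")
print(f"_2 deadline   : {deadline_2}")
```

The figures agree with the description: about 25.4 hours, 26 seconds, a little over 6 days, and exactly 3 days.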

Another one is this WU (https://www.worldcommunitygrid.org/contribution/workunit/542238013), which a Windows 10 host of mine returned with 25; the wingman sent the WU back after 8 days, resulting in an "Error" again (2 days after the deadline), and now a _2 WU is sitting at "Waiting to be sent".

A third example is the WU (https://www.worldcommunitygrid.org/contribution/workunit/543551725) from my Windows 10 programming laptop, which is a very reliable cruncher: it was returned within 13.5h of receiving it, the wingman sent back his/her WU 5h after the deadline, resulting in an "Error", and again a _2 resend is sitting at "Waiting to be sent".

These are 3 from the first page of my oldest WUs sitting in PVa jail; the overall number has increased by about 10% since yesterday (roughly 20h since I made the last reply).

It seems interesting (Ok, that perception might be relative) that all those WUs are stuck seemingly because one of the original 2 WUs was returned with a rather non-descriptive state of "Error", with the only info on the Error link being the client version of the wingman (7.16.11 and 8.02, for those two Windows 10 hosts I mentioned, respectively).


Ralf
[Edit 3 times, last edit by TPCBF at Jun 25, 2024 3:39:29 PM]
[Jun 25, 2024 3:34:05 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline
Re: Project Status (First Post Updated)

Ralf.

Thanks for that; I might be able to account for tasks that return Error with only the client information...

I suspect most [if not all] of those are probably results that return an exit code that means "Not started by deadline" -- unlike many other BOINC sites, WCG doesn't ever seem to flag tasks as NSD, but there must be many such...

Evidence for that hypothesis is two-fold; a lot of the wingmen I see that return Error with just client version seem to do so fairly close to the task deadline, whilst some others seem to be marked No Reply at first, then transition to that strange Error state (presumably when the client returns the NSD code somewhat later.)

Unfortunately, without access to the process exit code (which the new API doesn't provide so no access to it for wingmen), this has to remain unproven unless someone from WCG can look at some of the records concerned and identify the code.

Cheers - Al.

P.S. I don't care how "PVal jail" is used as long as it's obvious in context -- I first encountered the term during one of the periods when some validators weren't running; hence my different usage, as it appeared that the term was applied to the WU, not a single result... Given how imprecise a lot of posters can be, there's really no point in debating it, is there? :-)
[Jun 25, 2024 4:22:15 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2089
Status: Offline
Re: Project Status (First Post Updated)

Oh yeah, the MCM "waiting to be sent" issue is back, with a vengeance. :-(
[Jun 26, 2024 2:48:50 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Recently Active
Re: Project Status (First Post Updated)

This afternoon I noticed a hiccup in the distribution of MCM1-workunits when I looked at my hourly reports:
 547922751 │ * 10:46 (MCM1)
547921751 │ * 10:43 (MCM1)
547920751 │ * 10:40 (MCM1)
547919751 │ * 10:38 (MCM1)
547917751 │ (MCM1) * 10:33
547916751 │ (MCM1) * 10:30
547915751 │ (MCM1) * 10:28
547914751 │ (MCM1) * 10:26
547913751 │ (MCM1) * 10:23
547912751 │ (MCM1) * 10:22
───────────┼────────────────────────────────────────────────────────────────────────────
WorkunitId │ 2024-06-26 2024-06-26

This is a survey of 10,000 workunits, split into 10 equal pieces, performed with the data of my latest received workunit (at the top), and 10 older ones. Each workunit has a unique ID and the workunits in this survey are exactly 1,000 IDs apart from each other.

Now, you may have noticed that the number (or ID) 547918751 (let's call it M for middle one) is missing from the survey. OK, I've changed my mind, M is for missing one. ;-) So I decided to perform a small investigation by testing what had happened to 'M', plus 40 other workunit-IDs in the range from M minus 20 to M plus 20.
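
The gap in the survey can also be found mechanically. This is only a minimal Python sketch, assuming the survey samples every 1,000th ID between the oldest and newest rows of the chart; `received`, `STEP` and the other names are illustrative and are not taken from the script that produced the report.

```python
# IDs copied from the hourly report above (newest first).
received = [
    547922751, 547921751, 547920751, 547919751, 547917751,
    547916751, 547915751, 547914751, 547913751, 547912751,
]
STEP = 1000  # spacing between surveyed workunit IDs

# Every ID the survey should contain if nothing were missing.
received_set = set(received)
expected = range(min(received), max(received) + 1, STEP)
missing = [wu for wu in expected if wu not in received_set]
print(missing)

# The 40 neighbouring IDs to probe: M - 20 .. M + 20, excluding M itself.
m = missing[0]
neighbours = [wu for wu in range(m - 20, m + 21) if wu != m]
print(len(neighbours), neighbours[0], neighbours[-1])
```

Checking the status of each ID in `neighbours` still has to be done against the workunit pages; the sketch only builds the list of IDs to look at.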

The outcome is spectacular, to say the least, if I may say so! Let's have a look:

The missing one (M) is still Waiting to be sent: https://www.worldcommunitygrid.org/contribution/workunit/547918751 -
Result name         OS type Status             Sent time
MCM1_0219689_4417_0 Waiting to be sent -
MCM1_0219689_4417_1 Waiting to be sent -

Furthermore, five other ones are also still Waiting to be sent and their IDs are:
547918731 547918735 547918739 547918743 547918747
Looking a bit closer, the difference between consecutive IDs is always exactly 4.
Digging a bit deeper, I could find some more Waiting workunits with their IDs:
547918715 547918719 547918723 547918727
I didn't dig any deeper than 547918651, so this is it for the moment.

So if there are 4 instances that should be distributing work, then one of them is not doing its work properly.
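
The stride can be verified in a few lines. This is only a minimal Python sketch: the IDs are the ones listed above, and the idea that workunit IDs are handed out round-robin by several instances is an assumption based on the pattern, not something confirmed by WCG.

```python
# Workunit IDs found "Waiting to be sent", including M itself.
waiting = [
    547918715, 547918719, 547918723, 547918727,
    547918731, 547918735, 547918739, 547918743, 547918747,
    547918751,
]
waiting.sort()

# Distinct gaps between consecutive stuck IDs; a single value means a fixed stride.
gaps = {b - a for a, b in zip(waiting, waiting[1:])}

# If the stride is 4, every stuck ID should share one remainder modulo 4.
residues = {wu % 4 for wu in waiting}

print("gaps between consecutive IDs:", gaps)
print("remainders modulo 4:", residues)
```

A single gap value and a single remainder are what one would expect if every fourth ID in this range is affected, but a regular pattern like this cannot by itself show how many instances are involved.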

Adri
[Jun 26, 2024 11:35:29 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7579
Status: Recently Active
Re: Project Status (First Post Updated)

Very interesting Adri. One distribution channel seems to be plugged.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jun 27, 2024 1:52:51 AM]
TigerLily
Senior Cruncher
Joined: May 26, 2023
Post Count: 280
Status: Offline
Re: Project Status (First Post Updated)

Hi Adri,

Thanks for passing this along. I forwarded your post to a member of the tech team. They believe this may be in some way related to the hardware failure we experienced last week, in which one of the 6 workunit management servers was lost. They are going to investigate this issue and I will pass on any updates that I receive from them.
[Jun 27, 2024 3:56:49 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 858
Status: Offline
Re: Project Status (First Post Updated)

Thank you TigerLily ! Thank you Adri and others for discussing this and posting useful information!!

On a personal note: I'm taking a holiday, so of course APR is starting back up. I hope it goes smoothly, and save some for me.
[Jun 27, 2024 9:18:11 PM]