| World Community Grid Forums
Thread Status: Active Total posts in this thread: 3593
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline Project Badges:
|
I have a feeling there is a problem with WU:
ARP1_0012850_148, as one other person and I reported the identical access violation error. The other two WUs are still "In Progress" at this time. (The error messages below are from my system.)
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
(unknown error) (317) - exit code 3221225477 (0xc0000005)</message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[04:22:55] INFO: Checkpoint taken at 2019-04-23_06:00:00
[07:22:14] INFO: Checkpoint taken at 2019-04-23_12:00:00
[10:27:33] INFO: Checkpoint taken at 2019-04-23_18:00:00
[12:19:23] INFO: Checkpoint taken at 2019-04-24_00:00:00
[14:13:06] INFO: Checkpoint taken at 2019-04-24_06:00:00
[17:10:39] INFO: Checkpoint taken at 2019-04-24_12:00:00
[20:11:45] INFO: Checkpoint taken at 2019-04-24_18:00:00
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF6AB6148E7 read attempt to address 0xAE3410A0
Engaging BOINC Windows Runtime Debugger...
<snip> |
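For anyone puzzled by the two numbers in the message line, they are the same value written in decimal and in hexadecimal. The short Python sketch below shows the conversion; the lookup table and the function name `describe_exit_code` are illustrative assumptions, not anything supplied by BOINC or WCG, and the table is far from exhaustive.

```python
# Exit code as printed in the <message> line of the failed task.
exit_code = 3221225477

# Windows exit statuses are unsigned 32-bit values; hex makes them recognisable.
as_hex = f"0x{exit_code:08x}"

# Small illustrative table of common NTSTATUS crash codes (not exhaustive).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
}


def describe_exit_code(code):
    """Return a readable name for a known code, or the bare hex value."""
    return KNOWN_NTSTATUS.get(code, f"unknown (0x{code:08x})")


print(as_hex, describe_exit_code(exit_code))
```

The name it reports matches the "Access Violation (0xc0000005)" reason given further down in the same log.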
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1316 Status: Recently Active Project Badges:
|
bfmorse - can you supply the workunit number (e.g. by a link to the WU or your result) so that folks can have a decent look at all the result logs before they disappear?
Also, can we assume from your "identical access violation error" that both failures had the same number of successful checkpoints (some folks would only be referring to the error, not the total report, if they said that)?
It'll be interesting to see whether this is a case of another grid cell having a problem with time step size or some other data aspect, or whether (as happens sometimes) one or two systems have problems but others don't.
Thanks in advance - Al.
[Edit 2 times, last edit by alanb1951 at Jul 6, 2025 11:51:28 AM] |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Sunday Report
This is an 8-day report to balance up last week's 6-day one.
1,965 units in generations up to and including 145 remain stuck and have probably been joined by 13 from generation 146. There are 30 units in generation 147 and 22,363 in generation 148, which is the current generation. We are now 52% of the way through generation 148. There are now 11,238 units held in generation 149.
15,068 units have validated in the week, but there are 1,258,569 units to go. Based on the last 5 weeks, we would complete ARP1 on 22 December 2026, but we are getting close to where the stuck units will hold the completion up.
Mike |
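The arithmetic behind a projection like this can be sketched in a few lines of Python. This is only a rough constant-rate model, not necessarily the method used for the report: the report date of 6 July 2025 is an assumption (the post itself is undated), it assumes the 15,068 validations cover the full 8 days, and the actual 5-week figures are not given in the thread, so the sketch works backwards from the quoted finish date instead.

```python
from datetime import date, timedelta
import math


def projected_completion(report_date, units_remaining, units_per_day):
    """Date on which the backlog reaches zero at a constant daily rate."""
    days_needed = math.ceil(units_remaining / units_per_day)
    return report_date + timedelta(days=days_needed)


# Figures from the report; the report date itself is an assumption.
report_date = date(2025, 7, 6)
units_remaining = 1_258_569
validated_this_period = 15_068
period_days = 8  # assumed: the validated count spans the 8-day period

# Projection using this report's rate alone.
rate_now = validated_this_period / period_days
print("At this report's rate:",
      projected_completion(report_date, units_remaining, rate_now))

# Average weekly rate implied by the quoted 22 December 2026 estimate.
quoted_finish = date(2026, 12, 22)
implied_rate = units_remaining / (quoted_finish - report_date).days
print("Implied units per week:", round(implied_rate * 7))
```

Comparing the two printed figures shows how sensitive the finish date is to the weekly validation rate, quite apart from the stuck units.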
||
|
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline Project Badges:
|
I THINK that work unit ID is: 738550711
I am uncertain because I have not been shown where and how to obtain that info before. (At work atm & using my cellphone - kind of awkward for this task) |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
That is the correct number.
I obtain it by clicking on Result Status, then the Result Name; the full address, including the workunit number, then appears in the internet link.
Mike |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1316 Status: Recently Active Project Badges:
|
Thanks for the WU ID...
It's looking ominous for this WU, as both failed tasks did indeed make the same number of checkpoints and (after allowing for things being loaded into memory in different places) the error location and call stacks look similar enough to suggest a crash at the same point in the code (and possibly at a similar stage of execution)... [Someone more used to MS diagnostics may be able to clarify (or deny) that!]
Given that, the other tasks may well fail the same way at about the same stage. So now we wait and see whether this is a candidate for a shorter time step or something else...
By the way, I note that the two failed tasks are an initial wingman and a retry (and on different Windows releases); the second retry is actually scheduled to return before the other initial wingman because of reduced deadlines. I just hope we don't see any No Reply tasks, as if this is going to be a dead unit it needs to be killed off sooner rather than later :-)
Once again, thanks (and yes, using a mobile phone screen for WCG access isn't much fun...)
Cheers - Al.
P.S. I don't have a record of any cell near that one having had problems in the past; that said, my list of 132 cells (and 320 different cell+generation combinations) won't be anywhere near a complete record of problem tasks (failing or otherwise)...
|
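The comparison described above (same number of checkpoints, same kind of error, similar crash location) can be roughed out in Python for anyone who wants to check result logs themselves. This is only a sketch under assumptions: the function names `summarize` and `same_failure` are made up for illustration, the sample text is the excerpt posted earlier in this thread, and comparing the low 16 bits of the fault address is a heuristic for "same place in the code despite a different load address", not a proper substitute for reading the call stacks.

```python
import re

CHECKPOINT_RE = re.compile(
    r"Checkpoint taken at (\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})")
FAULT_RE = re.compile(r"\(0x([0-9a-fA-F]+)\) at address 0x([0-9a-fA-F]+)")


def summarize(stderr_text):
    """Pull out the details that matter when comparing two failed results."""
    checkpoints = CHECKPOINT_RE.findall(stderr_text)
    fault = FAULT_RE.search(stderr_text)
    return {
        "checkpoints": len(checkpoints),
        "last_checkpoint": checkpoints[-1] if checkpoints else None,
        "exception": fault.group(1).lower() if fault else None,
        # Heuristic: keep only the low 16 bits so a different load
        # address does not hide a crash at the same code offset.
        "fault_low16": int(fault.group(2), 16) & 0xFFFF if fault else None,
    }


def same_failure(a, b):
    """True when two summaries agree on every compared field."""
    return a == b


# Excerpt of the stderr posted earlier in this thread (middle lines omitted).
posted_log = """\
[04:22:55] INFO: Checkpoint taken at 2019-04-23_06:00:00
[20:11:45] INFO: Checkpoint taken at 2019-04-24_18:00:00
Reason: Access Violation (0xc0000005) at address 0x00007FF6AB6148E7 read attempt to address 0xAE3410A0
"""

print(summarize(posted_log))
```

Running `summarize` on each wingman's full stderr and passing the two results to `same_failure` gives a quick first check before looking at the call stacks in detail.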
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Al
I have belatedly started to record newly issued units with their generations. It will slowly narrow down the stuck units but I have only narrowed down to 31k units so far in 4 generations because I am only getting a very few units these days. However, I am also using Adri's current task lists. Perhaps your data might speed up my process? Mike |
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1293 Status: Offline Project Badges:
|
no ARPs are going out at the moment... not even resends....
Here is a link to one of mine "waiting to be sent": https://www.worldcommunitygrid.org/contribution/workunit/736955953 |
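Links of this form carry the workunit ID as the last part of the address, which is the number asked for earlier in the thread. A minimal Python sketch for pulling it out follows; the helper name `workunit_id_from_link` is hypothetical, and it assumes the link always ends in the numeric ID as this one does.

```python
from urllib.parse import urlparse


def workunit_id_from_link(url):
    """Return the trailing numeric path segment of a workunit link, or None."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    return int(last_segment) if last_segment.isdigit() else None


link = "https://www.worldcommunitygrid.org/contribution/workunit/736955953"
print(workunit_id_from_link(link))
```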
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1316 Status: Recently Active Project Badges:
|
Regarding bfmorse's problem WU, SIGSEGV and Waiting to be sent...
The _3 retry has also failed (looking extremely similar, again!) so it definitely looks as if this WU is not going to complete.
The next retry for this WU is also stuck at "Waiting to be sent" so Unixchick is not alone in this...
And on checking what I returned yesterday I discovered that I handled a retry for a Linux wingman that went SIGSEGV -- that WU has validated! Just a reminder that not all such errors indicate a doomed WU :-)
Cheers - Al.
[Edit 1 times, last edit by alanb1951 at Jul 7, 2025 3:53:04 PM] |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7844 Status: Offline Project Badges:
|
Nothing from ARP since July 5.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
|