| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 31
|
|
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
It seems that all results of this workunit are erroring out in the same way.
workunit 800486386:
ARP1_0035156_084_5 -- Linux Ubuntu 727 Error 9/5/21 06:16:11 9/5/21 16:20:35 7.09 510.7 / 0.0
Details:
Project Name: Africa Rainfall Project
[Edit 2 times, last edit by adriverhoef at Sep 5, 2021 7:43:08 PM] |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Yup; I saw a different one where they all failed after only two checkpoints - unfortunately, I didn't record the task name (busy doing something else at the time) but I think it was from iteration 88.
Fortunately, these seem to be fairly uncommon, and I believe the technicians check up on anything that ends up with all tasks in Error and/or Invalid status... Whether they'll let us know whether the problem is terminal for those individual little areas (model going out of bounds because of normal calculation) or the result of something strange happening in the "next iteration" task generation remains to be seen :-) Cheers - Al. |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Two more:
ARP1_0033555_090 -- this one failed before any checkpoints were taken, 17 items in stack trace;
ARP1_0034321_089 -- this one failed after 6 checkpoints, 18 elements in stack trace.
The stack trace in the error report appears more or less identical in all cases; the only difference appears to be an extra call level in the 18-element trace (all other addresses being the same). It could be data-driven (rather than hardware or O/S-related), as all the wingmen fail after the same number of checkpoints.
I do hope the effort involved in the WCG move to Krembil doesn't cause solving this to fall through the cracks, especially if something about the data in some of the sample areas has found a glitch in the software. Data issues should produce error messages (even if they're like some of the relatively incomprehensible ones at CPDN - e.g. "Invalid THETA"...), not consistently crash the software!
That said, at CPDN, one user's SIGSEGV task often completes successfully when run by someone else, which could point to hardware or O/S issues -- less likely here, I fear... Weather/climate modelling does seem to exercise the computer :-)
Cheers - Al. |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
I experienced the same with WU ARP1_0033315_088 after about 12 hours of computation.
ARP1_0033315_088_4 -- 727 Error 9/14/21 07:27:14 9/15/21 06:17:01 22.73 662.9 / 0.0
ARP1_0033315_088_3 -- 727 Error 9/14/21 00:45:22 9/15/21 08:28:02 12.86 518.4 / 0.0
ARP1_0033315_088_2 -- 727 Error 9/13/21 10:19:48 9/14/21 07:26:33 20.18 550.8 / 0.0
ARP1_0033315_088_0 -- 727 Error 9/12/21 14:12:58 9/14/21 00:45:12 22.49 522.4 / 0.0
ARP1_0033315_088_1 -- 727 Error 9/12/21 14:12:56 9/13/21 10:19:31 19.76 610.4 / 0.0
All on Linux hosts (various kernel versions and distributions). After so many errors, I don't understand why the WU is still being redistributed to other machines (one task is still "in progress").
Cheers, Yves |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Error file for WU ARP1_0033315_088
----------------------------------------
Yves |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
I don't recall anyone reporting similar issues with Windows ARP1 (especially not recently!) so I wonder whether this is a Linux-specific issue or whether it's just that Windows users who have had the problem that's causing SIGSEGV on Linux simply haven't noticed or haven't bothered to hit the forums about it...
Ah, well; whilst the WCG move is going on I suspect this is (quite rightly) fairly low on the WCG priority lists, so all we can do is log them when we see them, in case it's of some use at a later date. Cheers - Al. P.S. In my earlier post I mentioned an 18th level in the stack trace in some cases... The extra level appears thus: [0x11f86d4] |
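For anyone who wants to keep such a log, here is a minimal sketch in Python that groups the failing names by iteration. It assumes the names follow the pattern ARP1_<area>_<iteration>[_<replica>], which is only a reading of the names posted in this thread and not a documented format; `parse_task` and `group_by_iteration` are made-up helper names for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: ARP1_<7-digit area>_<3-digit iteration>[_<replica>].
# Inferred from names posted in this thread, not from WCG documentation.
TASK_RE = re.compile(r"^ARP1_(\d{7})_(\d{3})(?:_(\d+))?$")

def parse_task(name):
    """Split a task or workunit name into (area, iteration, replica or None)."""
    m = TASK_RE.match(name.strip())
    if m is None:
        raise ValueError(f"unrecognised ARP1 name: {name!r}")
    area, iteration, replica = m.groups()
    return area, int(iteration), (int(replica) if replica is not None else None)

def group_by_iteration(names):
    """Map iteration number -> sorted list of area ids seen failing there."""
    groups = defaultdict(set)
    for name in names:
        area, iteration, _ = parse_task(name)
        groups[iteration].add(area)
    return {it: sorted(areas) for it, areas in sorted(groups.items())}

# Names reported so far in this thread.
failed = [
    "ARP1_0035156_084_5",
    "ARP1_0033555_090",
    "ARP1_0034321_089",
    "ARP1_0033315_088",
]

for iteration, areas in group_by_iteration(failed).items():
    print(iteration, areas)
```

Keeping the area id as a string preserves the leading zeros, so the output can be matched directly against the names on the results page.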
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Again an error: ARP1_0035341_092
5 in Error, 1 aborted.
Cheers, Yves |
||
|
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges:
|
ARP1_0034319_089, 6 tasks, all Linux 64-bit. All 6 tasks fail with the same error: Process exited with code 193 (0xc1, -63)
I do wonder whether this workunit would work on Windows. If it does, then there could be something wrong with the Linux build. If it doesn't, then there is possibly a bug in the app or the work unit.
[03:35:33] INFO: Checkpoint taken at 2018-12-26_06:00:00 |
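The three numbers in "code 193 (0xc1, -63)" are the same byte written three ways: 193 in decimal, 0xc1 in hex, and -63 when that byte is read as a signed 8-bit value. A small Python sketch of the arithmetic (the function name is hypothetical, purely for illustration):

```python
def exit_code_views(code):
    """Return (unsigned, hex string, signed 8-bit) views of a process exit code."""
    unsigned = code & 0xFF  # keep only the low byte of the exit status
    # Two's-complement reading of the same byte: values >= 128 are negative.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return unsigned, hex(unsigned), signed

print(exit_code_views(193))  # → (193, '0xc1', -63)
```

As far as I know, BOINC uses 193 for a task that was ended by a signal, which would be consistent with the SIGSEGV reports in this thread, but treat that as an assumption rather than something confirmed here.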
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Another workunit ending up in SIGSEGV: segmentation violation.
Remarkable: it seems that task ARP1_0034522_089_1 was resent as ARP1_0034522_089_2 to the same device. Both are called WCG-10-5-173-149, which can hardly be a coincidence, especially when both share the same 4.4.0-62-generic OS version and also, most probably, the same timezone. The alternative is that one individual gave two different devices in their computer farm the same name, which doesn't sound plausible to me either. If it is indeed the same device, resending a repair task to it doesn't seem to be the correct thing to do.
workunit 854352072
ARP1_0034522_089_0 LinuxMint Error 2021-10-21T18:25:45 2021-10-22T08:30:52 9.32/9.33 546.3/0.0
---------------------------------------------------------------------------------------------------------------------------------------
Details:
ARP1_0034522_089_0 LinuxMint Error 2021-10-21T18:25:45 2021-10-22T08:30:52 9.32/9.33 546.3/0.0
etc.
EDIT: Result for wingman _3 updated (also ended in Error).
[Edit 1 times, last edit by adriverhoef at Oct 22, 2021 9:54:13 PM] |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
A few hours later, a second workunit ending up in SIGSEGV: segmentation violation.
workunit 855488444
ARP1_0034171_090_0 Linux Fedora Error 2021-10-22T20:11:25 2021-10-22T22:33:38 0.00/0.00 0.0/0.0
---------------------------------------------------------------------------------------------------------------------------------------
Details:
ARP1_0034171_090_0 Linux Fedora Error 2021-10-22T20:11:25 2021-10-22T22:33:38 0.00/0.00 0.0/0.0 |
||
|
|
|