| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 31
|
|
| Author |
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
I think Kevin's response below probably includes these units.
My query related to Unhandled Exceptions.
Mike

"There is a small number of workunits that are having an issue (ARP1_0033558_091 and ARP1_0033636_089 are among them). I need to collect and send the data back to Delft but I had some critical transition work I needed to do first. I'm hoping to be able to send that information to them soon." |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
This is another segmentation violation workunit:
(Three of them have ended up with it thus far, the other two are still in progress at this moment.)

workunit 862217006
ARP1_0033639_098_0 Linuxmint Error 2021-10-28T22:21:12 2021-10-29T04:31:29 6.11/6.13 267.2/0.0
Details:
ARP1_0033639_098_1 Linux Error 2021-10-28T22:24:59 2021-10-29T01:46:00 3.24/3.25 179.8/0.0

UPDATE: All results are in; they all errored out with SIGSEGV, except the Server Aborted one:

workunit 862217006
ARP1_0033639_098_0 Linuxmint Error 2021-10-28T22:21:12 2021-10-29T04:31:29 6.11/6.13 267.2/0.0
Details:
ARP1_0033639_098_0 Linuxmint Error 2021-10-28T22:21:12 2021-10-29T04:31:29 6.11/6.13 267.2/0.0

Three times out of six, a task from this same workunit was sent to the same device (WCG-10-5-215-117). Is this really true?

[Edit 3 times, last edit by adriverhoef at Oct 29, 2021 6:05:21 PM] |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Adri,

"Three times out of six, a task from this same workunit was sent to the same device (WCG-10-5-215-117). Is this really true?"

I wonder if this is a case of several accounts running on one [big?] machine - if so, the machine name being the same can't be used as a blocker...

As we don't seem to be able to see device IDs (other than our own) and system "owners" (understandable with the WCG attitude to privacy, but unlike a lot of other BOINC projects!) I can't think of any way to pursue this :-(

Just a thought...

Cheers - Al. |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Thanks for your thoughts, Al.

After some futile searching on the WCG site I decided to do a Google search for their OS version, "4.4.0-62-generic". This is what I found: it could have something to do with grcpool.com, grcpool.com-2, grcpool.com-3 & grcpool.com-4, or someone like that.

At this LHC webpage you'll see the same OS version, 4.4.0-62-generic; their name is grcpool.com-2 and their last contact was 6 Aug 2017. Now look at WCG member grcpool.com-2 again. They have been a registered member since 06/27/2017 (that's 27-06-2017), and they have 4,698 device installations.

Or this Rosetta webpage: same OS version (4.4.0-62-generic), their name is grcpool.com-3 and their last contact was 19 Aug 2017. Here at WCG their number of devices is 3,964 and they registered on 08-08-2017.

Of these three members, only grcpool.com has a badge for ARP1 (and it is only a 'small' badge: for 14 days).

WCG-10-5-215-117 clearly points to an IP address: 10.5.215.117. Addresses like these (10.*.*.*) are reserved for private-use networks.

Then I found this text at LHCathome: "Please note that some of the applications on LHC@home requre(sic) Virtual Box to be installed." (Insert some headscratching here.)

So, what if there are multiple instances of BOINC, each inside its own Virtual Box, each with a different device ID, running on the same computer? Suppose they each have the same name inside their own Virtual Box (in this case WCG-10-5-215-117). It may not relate to grcpool.com etc., but … for me, a multitude of Virtual Box instances on one mainframe would explain the whole case of the devices with the same name.

Could it be that we're on the right track?

[Edit 1 times, last edit by adriverhoef at Oct 30, 2021 8:22:45 AM] |
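As a small illustration of the naming observation above, the sketch below reads a device name of the form WCG-a-b-c-d as an IPv4 address and checks whether it lies in the private 10.0.0.0/8 block. The helper name `device_name_to_ip` is hypothetical, and the idea that the device name encodes an address is an assumption taken from the post; only Python's standard `ipaddress` module is used.

```python
import ipaddress

def device_name_to_ip(name, prefix="WCG-"):
    """Hypothetical helper: read 'WCG-10-5-215-117' as the address 10.5.215.117."""
    if not name.startswith(prefix):
        raise ValueError(f"unexpected device name: {name!r}")
    # Assumption: the four dash-separated numbers are the octets of an IPv4 address.
    dotted = name[len(prefix):].replace("-", ".")
    return ipaddress.ip_address(dotted)

addr = device_name_to_ip("WCG-10-5-215-117")
# 10.0.0.0/8 is one of the blocks reserved for private-use networks.
in_ten_slash_eight = addr in ipaddress.ip_network("10.0.0.0/8")
print(addr, addr.is_private, in_ten_slash_eight)
```

A private address is not unique across networks, so a name built this way cannot identify a single physical machine on its own.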
||
|
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges:
|
ARP1_0033395_098_2: all 6 results errored on Linux after 2 checkpoints; the _2 result is from my Linux machine.
https://www.worldcommunitygrid.org/contribution/workunit/859229171

This is possibly caused by a multiply overflow, that is, a number too big to fit into a 32-bit float.

Note: users who aren't programmers may not understand everything in the rest of this reply.

I went back to old data for ARP1_0033395_098_2, thanks to frequent Btrfs snapshots in Linux. I moved the ARP1_0033395_098_2 slot to Windows and used x64dbg to run the 64-bit Windows app. After 2 checkpoints and a few hours, it failed with Exception_access_violation on a memory read, with all the CPU registers filled with float NAN (not a number), hex 7FC00000.

Floating point exceptions are normally off. I started it again with float exceptions on, using _mm_setcsr(0) with some edits to the assembly in x64dbg. Patched in setcsr and re-ran. This time it failed with Exception_Flt_invalid_operation on the multiply 5.3384e+010 (53384000000) x 1.43248e+037 (14324800000000000000000000000000000000) = NAN, as this overflowed the 32-bit float. I turned off float exceptions. 20 seconds later, it failed with Exception_access_violation.

This looks like a multiply overflow error that eventually fails with a memory access violation. I wonder why there is such an extremely huge number? Without access to the ARP1 app source code I can't see the full picture of what it does. |
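The numbers in this post can be checked with a short sketch. It decodes the 7FC00000 bit pattern, compares the reported product with the largest finite 32-bit float, and shows one way an infinity can later turn into NaN. In IEEE 754 arithmetic an overflowing multiply by itself normally yields infinity; whether a later step such as inf - inf is what actually happens inside the ARP1 app is an assumption, since the source is not available here.

```python
import math
import struct

# Largest finite 32-bit float (bit pattern 0x7F7FFFFF).
FLT_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]

# 0x7FC00000 is the bit pattern seen in the registers; decode it as a 32-bit float.
nan_from_bits = struct.unpack("<f", struct.pack("<I", 0x7FC00000))[0]

# The two factors reported in the post; Python computes the product as a 64-bit double.
a = 5.3384e10
b = 1.43248e37
product = a * b
overflows_float32 = abs(product) > FLT_MAX

# Illustration only: once a value is infinite, ordinary arithmetic can give NaN.
inf = math.inf
nan_from_inf = inf - inf

print(math.isnan(nan_from_bits), product, overflows_float32, math.isnan(nan_from_inf))
```

The product is many orders of magnitude above the 32-bit float limit, which is consistent with the overflow described in the post.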
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Adri,

"Then I found this text at LHCathome: "Please note that some of the applications on LHC@home requre(sic) Virtual Box to be installed." (Insert some headscratching here.)"

In case you weren't aware, some of the LHC stuff uses a VM to ensure a consistent working environment (libraries &c) regardless of the host O/S, thus knowing exactly how any floating point arithmetic will behave and avoiding the sorts of potential cross-platform discrepancies that can cause verification/validation issues! I think they use Scientific Linux (though I don't run LHC work so I could be mistaken...)

sam6861,

Interesting bit of forensic work there! I think the WRF source is available at github, but I don't know how much tweaking WCG might have done, and without the symbol tables from a WCG ARP1 compile to unpick the stack frame at crash time I'm not sure how fruitful the source would actually be (so I've not bothered trying to look). :-(

Cheers - Al. |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
"Floating point exception are normally off. I start it again with float exception on with _mm_setcsr(0) with some edits to assembly using x64dbg. Patched in setcsr and Re-run. This time it fails Exception_Flt_invalid_operation with multiply 5.3384e+010 (53384000000) x 1.43248e+037 (14324800000000000000000000000000000000) = NAN as this overflowed the 32 bit float number. I turned off float exception. 20 seconds later, fails with Exception_access_violation. This looks like a multiply overflow error, eventually fails with memory access violation. I wonder why an extremely huge number? Without access to ARP1 app source code I can't see the full picture on what it does."

The simplest explanation could be a typo in one or both of the exponents.
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges:
|
ARP1_0033397_099_3 has a float NAN (Not a Number) problem in its input data.
https://www.worldcommunitygrid.org/contribution/workunit/865982654

I decompressed the 7z input files and, with the HxD hex editor/viewer, I see lots of 7FC00000 (32-bit float NAN) and FFC00000 (negative NAN). The 64-bit app just crashes with a computation error in 2 seconds; something went wrong before iteration 099. The 32-bit ARP1 app can happily crunch float NAN numbers to completion with "no error". I'm unsure whether NAN gives any useful data or whether the 32-bit ARP app is just wasting computation on NAN data.

When I looked into how it handles this, the 64-bit ARP app does the following:
1. Converts the 32-bit float (NAN) into a 32-bit integer. (0x80000000)
2. Converts it to a 64-bit integer. (0xFFFFFFFF80000000)
3. Multiplies by 64. (0xFFFFFFE000000000)
4. Uses it as part of a memory index. Memory read error. "Computation error".

The 32-bit app appears not to error on NAN:
1. Converts the 32-bit float (NAN) into a 32-bit integer. (0x80000000)
2. As a 32-bit app, keeps it as a 32-bit integer. (0x80000000)
3. Multiplies by 64, and the integer overflows to zero. (0x00000000) Note: there are no exceptions or app crashes for integer overflows; it just keeps on going.
4. Uses it as part of a memory index. Reads data_array[0] just fine, although this may be a float NAN or otherwise broken data.
5. Completes the task with NAN in the results.
6. The server keeps going with new tasks containing NAN numbers.

For ARP1_0033397, something went wrong in a previous iteration (any point earlier than 099), and I guess this NAN problem probably came from the 32-bit ARP app. |
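The two sequences of steps above can be reproduced with plain integer arithmetic. This is only a sketch of the arithmetic described in the post, not the ARP1 code: the names `to_signed` and `scaled_index` are hypothetical, and the scale factor of 64 and the starting value 0x80000000 are taken from the post. Computing the 64-bit product directly gives 0xFFFFFFE000000000.

```python
INT_INDEFINITE = 0x80000000  # 32-bit result reported for the NaN-to-integer conversion

def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def scaled_index(raw32, bits, scale=64):
    """Sign-extend raw32, multiply by scale, and wrap to a `bits`-wide register."""
    wide = to_signed(raw32, 32)              # sign-extend the 32-bit value
    return (wide * scale) & ((1 << bits) - 1)

index64 = scaled_index(INT_INDEFINITE, 64)   # 64-bit app: huge negative offset
index32 = scaled_index(INT_INDEFINITE, 32)   # 32-bit app: wraps around to zero
print(hex(index64), to_signed(index64, 64), hex(index32))
```

In the 64-bit case the offset is a large negative number, which fits the memory read error; in the 32-bit case the wrapped offset is zero, which fits the read of data_array[0] succeeding.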
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Probably the first segmentation violation workunit in July 2022 and presumably also the first SIGSEGV in the 0020000-0029999 range:

workunit 152522928
App: Africa Rainfall Project

Each stderr output has the same error ("process exited with code 193 (0xc1, -63)") and stacktrace as below.

Logfile: |
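The three numbers in that error message are the same value written three ways, which a few lines of arithmetic confirm. This sketch only checks the number formats; it makes no claim about what the exit code means to the client.

```python
exit_code = 193

# Hexadecimal form of the same value.
as_hex = hex(exit_code)

# The same 8 bits read as a signed (two's-complement) byte.
as_signed_byte = exit_code - 256 if exit_code >= 128 else exit_code

print(exit_code, as_hex, as_signed_byte)
```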
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Oh, bother! I'll add ARP1_0029301 to my database of stuck and other delayed units...

Interesting that it's around 1800 cells away from the lowest problem children we'd seen before; I wonder if it will turn out that as simulated conditions change with time it will expose a different set of modelling problems (though that's exactly the same stack trace as before, the cause may be different...)

Thanks for the heads up - Al.

[Edit 1 times, last edit by alanb1951 at Jul 21, 2022 4:36:40 AM] |
||
|
|
|