Thread Status: Active
Total posts in this thread: 31
This topic has been viewed 9086 times and has 30 replies.
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

I think Kevin's response below probably includes these units.

My query related to Unhandled Exceptions.

Mike


There is a small number of workunits that are having an issue (ARP1_0033558_091 and ARP1_0033636_089 are among them). I need to collect and send the data back to Delft but I had some critical transition work I needed to do first. I'm hoping to be able to send that information to them soon.
[Oct 26, 2021 5:01:47 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346

This is another segmentation violation workunit:
(Three of them have ended up with it thus far, the other two are still in progress at this moment.)
workunit 862217006
ARP1_0033639_098_0  Linuxmint     Error      2021-10-28T22:21:12  2021-10-29T04:31:29    6.11/6.13     267.2/0.0   
ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0
ARP1_0033639_098_2  Linux Fedora  Error      2021-10-29T01:48:03  2021-10-29T08:03:49    3.39/3.43     184.5/0.0
ARP1_0033639_098_3  Linuxmint     In Progr.  2021-10-29T04:34:38  2021-11-02T16:34:38    0.00/0.00     0.0/0.0
ARP1_0033639_098_4  Linux         In Progr.  2021-10-29T08:04:59  2021-11-02T20:04:59    0.00/0.00     0.0/0.0
---------------------------------------------------------------------------------------------------------------------------------------
Details:
ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0   
Devicename: WCG-10-5-215-117
Logfile:
<core_client_version>7.6.31</core_client_version>
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[23:49:37] INFO: Checkpoint taken at 2019-01-13_06:00:00
[01:24:44] INFO: Checkpoint taken at 2019-01-13_12:00:00
SIGSEGV: segmentation violation
Stack trace (17 frames):
[0x2d13b72]
[0x2da0400]
[0x1ed9107]
[0x1e9c664]
[0x1e9444a]
[0x1e8997c]
[0x188518c]
[0x1b6f8e2]
[0x135f570]
[0x11f86d4]
[0x5848b7]
[0x448f61]
[0x4475c9]
[0x440967]
[0x2eb2344]
[0x2eb25c1]
[0x405466]

Exiting...

</stderr_txt>
... sent to the same machine ...
ARP1_0033639_098_4  Linux         In Progr.  2021-10-29T08:04:59  2021-11-02T20:04:59    0.00/0.00     0.0/0.0
Devicename: WCG-10-5-215-117


UPDATE:
All results are in; all of them, except the Server Aborted one, errored out with SIGSEGV:
workunit 862217006
ARP1_0033639_098_0  Linuxmint     Error      2021-10-28T22:21:12  2021-10-29T04:31:29    6.11/6.13     267.2/0.0   
ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0
ARP1_0033639_098_2  Linux Fedora  Error      2021-10-29T01:48:03  2021-10-29T08:03:49    3.39/3.43     184.5/0.0
ARP1_0033639_098_3  Linuxmint     S.Aborted  2021-10-29T04:34:38  2021-10-29T15:16:08    0.00/0.00     0.0/0.0
ARP1_0033639_098_4  Linux         Error      2021-10-29T08:04:59  2021-10-29T11:24:29    3.27/3.27     176.9/0.0
ARP1_0033639_098_5  Linux         Error      2021-10-29T11:24:57  2021-10-29T14:46:33    3.30/3.31     183.9/0.0
---------------------------------------------------------------------------------------------------------------------------------------
Details:
ARP1_0033639_098_0  Linuxmint     Error      2021-10-28T22:21:12  2021-10-29T04:31:29    6.11/6.13     267.2/0.0   
OS-Version: Linux Mint 20.2 [5.11.0-38-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]
ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0
OS-Version: 4.4.0-62-generic
Devicename: WCG-10-5-215-117
ARP1_0033639_098_2  Linux Fedora  Error      2021-10-29T01:48:03  2021-10-29T08:03:49    3.39/3.43     184.5/0.0
OS-Version: Fedora 34 (Xfce) [5.13.16-200.fc34.x86_64|libc 2.33 (GNU libc)]
ARP1_0033639_098_3  Linuxmint     S.Aborted  2021-10-29T04:34:38  2021-10-29T15:16:08    0.00/0.00     0.0/0.0
OS-Version: Linux Mint 20.1 [5.4.0-84-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]
ARP1_0033639_098_4  Linux         Error      2021-10-29T08:04:59  2021-10-29T11:24:29    3.27/3.27     176.9/0.0
OS-Version: 4.4.0-62-generic
Devicename: WCG-10-5-215-117
ARP1_0033639_098_5  Linux         Error      2021-10-29T11:24:57  2021-10-29T14:46:33    3.30/3.31     183.9/0.0
OS-Version: 4.4.0-62-generic
Devicename: WCG-10-5-215-117

Three times out of six, a task from this same workunit was sent to the same device (WCG-10-5-215-117).
Is this really true?
----------------------------------------
[Edit 3 times, last edit by adriverhoef at Oct 29, 2021 6:05:21 PM]
[Oct 29, 2021 9:23:53 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317

Adri,
Three times out of six, a task from this same workunit was sent to the same device (WCG-10-5-215-117).
Is this really true?

I wonder if this is a case of several accounts running on one [big?] machine - if so, the machine name being the same can't be used as a blocker...

As we don't seem to be able to see device IDs (other than our own) and system "owners" (understandable with the WCG attitude to privacy, but unlike a lot of other BOINC projects!) I can't think of any way to pursue this :-(

Just a thought...

Cheers - Al.
[Oct 29, 2021 9:13:05 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346

Thanks for your thoughts, Al.

After some futile searching on the WCG site I decided to do a Google search for their OS version, "4.4.0-62-generic". This is what I found: it could have something to do with grcpool.com, grcpool.com-2, grcpool.com-3 and grcpool.com-4, or someone like that. On an LHC web page you'll see the same OS version, 4.4.0-62-generic; the name there is grcpool.com-2 and the last contact was 6 Aug 2017. Now look at WCG member grcpool.com-2: they have been a registered member since 06/27/2017 (that's 27-06-2017) and they have 4,698 device installations. A Rosetta web page shows the same OS version (4.4.0-62-generic); the name there is grcpool.com-3 and the last contact was 19 Aug 2017. Here at WCG their number of devices is 3,964 and they registered on 08-08-2017.

Of these three members, only grcpool.com has a badge for ARP1 (and it is only a 'small' badge: for 14 days).

WCG-10-5-215-117 clearly points to an IP address: 10.5.215.117. Addresses like these (10.*.*.*) are reserved for private-use networks.
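
A quick way to check that reading, as a minimal Python sketch. It assumes the device name really does encode an IPv4 address after the `WCG-` prefix (a guess from the name, not something WCG documents); `device_name_to_ip` is a hypothetical helper.

```python
import ipaddress

def device_name_to_ip(name, prefix="WCG-"):
    """Turn a name like 'WCG-10-5-215-117' into an IPv4Address.

    Assumption: the part after the prefix is four dash-separated octets.
    Raises ValueError if the name does not fit that pattern.
    """
    if not name.startswith(prefix):
        raise ValueError(f"unexpected device name: {name!r}")
    dotted = name[len(prefix):].replace("-", ".")
    return ipaddress.ip_address(dotted)

addr = device_name_to_ip("WCG-10-5-215-117")
print(addr, addr.is_private)                       # 10.5.215.117 True
print(addr in ipaddress.ip_network("10.0.0.0/8"))  # True
```

Because the address is private, the name by itself says nothing about which public host or owner is behind it; unrelated machines on different networks can carry the same 10.x.x.x address.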

Then I found this text at LHCathome: "Please note that some of the applications on LHC@home requre(sic) Virtual Box to be installed."
(Insert some headscratching here.)

So, what if there are multiple instances of BOINC, each inside their own Virtual Box, each with a different device-ID, running on the same computer? Suppose they each have the same name inside their own Virtual Box (in this case WCG-10-5-215-117). It may not relate to grcpool.com etc., but … for me, a multitude of Virtual Box instances on one mainframe would explain the whole case of the devices with the same name. Could it be that we're on the right track?
----------------------------------------
[Edit 1 times, last edit by adriverhoef at Oct 30, 2021 8:22:45 AM]
[Oct 29, 2021 11:03:11 PM]
sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107

ARP1_0033395_098_2: all 6 tasks took 2 checkpoints and then errored on Linux; the _2 task is from my Linux machine.
https://www.worldcommunitygrid.org/contribution/workunit/859229171
This is possibly caused by a multiplication overflow, i.e. a number too big to fit into a 32-bit float.

Note: Users who aren't programmers possibly won't understand everything in the rest of this reply.

I went back to old data for ARP1_0033395_098_2, thanks to frequent Btrfs snapshots in Linux. I moved the ARP1_0033395_098_2 slot to Windows and used x64dbg to run the Windows 64-bit app.

After 2 checkpoints and a few hours, it failed with Exception_access_violation on a memory read, with all the CPU registers filled with float NaN (not a number), hex 7FC00000.
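
For readers who want to check that hex value: a small Python sketch that splits a 32-bit pattern into its IEEE 754 single-precision fields. `decode_float32` is a name made up for this illustration.

```python
import math
import struct

def decode_float32(bits):
    """Split a 32-bit pattern into IEEE 754 single-precision fields."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
    return sign, exponent, mantissa, value

sign, exponent, mantissa, value = decode_float32(0x7FC00000)
# An exponent of all ones (0xFF) with a non-zero mantissa means NaN;
# the top mantissa bit (0x400000) marks it as a quiet NaN.
print(sign, hex(exponent), hex(mantissa), math.isnan(value))  # 0 0xff 0x400000 True
```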

Floating-point exceptions are normally off. I started it again with float exceptions turned on via _mm_setcsr(0), using x64dbg to edit the assembly: I patched in the setcsr call and re-ran. This time it failed with Exception_Flt_invalid_operation on the multiplication 5.3384e+010 (53384000000) x 1.43248e+037 (14324800000000000000000000000000000000) = NaN, as this overflowed the 32-bit float. I turned float exceptions off again; 20 seconds later it failed with Exception_access_violation. This looks like a multiplication overflow that eventually leads to a memory access violation.
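
To put numbers on the overflow: the product of the two operands is about 7.6e+47, while the largest finite 32-bit float is about 3.4e+38. The Python sketch below checks that. It assumes plain IEEE 754 single-precision rules, under which an overflowing multiply yields infinity and NaN only shows up in a later step such as inf - inf or inf * 0; whether the ARP1 code takes exactly that path is not known. `to_float32` is a helper invented for the example.

```python
import math
import struct

FLOAT32_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]  # ~3.4028e+38

def to_float32(x):
    """Round a Python float to single precision; overflow becomes +/-inf."""
    try:
        return struct.unpack(">f", struct.pack(">f", x))[0]
    except OverflowError:
        # struct refuses values outside the float32 range; hardware with
        # exceptions masked would store infinity instead.
        return math.copysign(math.inf, x)

a = to_float32(5.3384e+010)
b = to_float32(1.43248e+037)
product = to_float32(a * b)

print(a * b > FLOAT32_MAX)   # True: the product does not fit in a float32
print(product)               # inf
print(math.isnan(product - product), math.isnan(product * 0.0))  # True True
```

Once an infinity is in the data, a NaN is usually only one subtraction or multiplication away, and from there it spreads to every value computed from it.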

I wonder why such an extremely huge number appears. Without access to the ARP1 app source code I can't see the full picture of what it does.
[Oct 29, 2021 11:34:10 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317

Adri,
Then I found this text at LHCathome: "Please note that some of the applications on LHC@home requre(sic) Virtual Box to be installed."
(Insert some headscratching here.)

In case you weren't aware, some of the LHC stuff uses a VM to ensure a consistent working environment (libraries &c) regardless of the host O/S, thus knowing exactly how any floating point arithmetic will behave and avoiding the sorts of potential cross-platform discrepancies that can cause verification/validation issues! I think they use Scientific Linux (though I don't run LHC work so I could be mistaken...)

sam6861,

Interesting bit of forensic work there! I think the WRF source is available on GitHub, but I don't know how much tweaking WCG might have done, and without the symbol tables from a WCG ARP1 compile to unpick the stack frame at crash time I'm not sure how fruitful the source would actually be (so I've not bothered trying to look). :-(

Cheers - Al.
[Oct 30, 2021 2:07:52 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846

Floating-point exceptions are normally off. I started it again with float exceptions turned on via _mm_setcsr(0), using x64dbg to edit the assembly: I patched in the setcsr call and re-ran. This time it failed with Exception_Flt_invalid_operation on the multiplication 5.3384e+010 (53384000000) x 1.43248e+037 (14324800000000000000000000000000000000) = NaN, as this overflowed the 32-bit float. I turned float exceptions off again; 20 seconds later it failed with Exception_access_violation. This looks like a multiplication overflow that eventually leads to a memory access violation. I wonder why such an extremely huge number appears. Without access to the ARP1 app source code I can't see the full picture of what it does.

The simplest explanation could be a typo in one or both of the exponents.
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Oct 30, 2021 3:32:00 AM]
sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107

ARP1_0033397_099_3 has a float NaN (not a number) problem in its input data.
https://www.worldcommunitygrid.org/contribution/workunit/865982654
I decompressed the 7z input files and, with the HxD hex editor/viewer, I see lots of 7FC00000 (32-bit float NaN) and FFC00000 (negative NaN). The 64-bit app just crashes with a computation error within 2 seconds; something went wrong before iteration 099.
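
The same check can be scripted instead of eyeballed in a hex editor. This is a minimal Python sketch, assuming the decompressed file is a flat run of 32-bit floats stored big-endian (which is how 7FC00000 would read byte by byte in a hex view). The real ARP1 input layout is not known here, and `count_nan_float32` is a hypothetical helper.

```python
import struct

def count_nan_float32(data, byteorder=">"):
    """Count NaN bit patterns in a buffer of consecutive 32-bit floats.

    Returns (positive_nans, negative_nans, total_values). Trailing bytes
    that do not fill a whole 4-byte value are ignored.
    """
    usable = len(data) - len(data) % 4
    positive = negative = 0
    for (bits,) in struct.iter_unpack(byteorder + "I", data[:usable]):
        # NaN: exponent bits all ones and a non-zero mantissa.
        is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
        if is_nan:
            if bits >> 31:
                negative += 1
            else:
                positive += 1
    return positive, negative, usable // 4

# Synthetic sample: 1.0, NaN, negative NaN, NaN
sample = bytes.fromhex("3f800000" "7fc00000" "ffc00000" "7fc00000")
print(count_nan_float32(sample))  # (2, 1, 4)
```

For a real file, the bytes would come from something like `open(path, "rb").read()`; if the data turns out to be little-endian, pass `byteorder="<"`.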

The 32-bit ARP1 app can happily crunch float NaN numbers to completion with "no error". I'm unsure whether NaN gives any useful data or whether the 32-bit ARP app is just wasting computation on NaN data.

This is what the 64-bit ARP app does, from what I can see:
1. Converts 32 bit float (NAN) into 32 bit integer. (0x80000000)
2. Convert to 64 bit integer. (0xFFFFFFFF80000000)
3. Multiply by 64 (0xFFFFFFE000000000)
4. Use it as part of memory index. Memory read error. "Computation error".

The 32-bit app appears not to error on NaN:
1. Converts 32 bit float (NAN) into 32 bit integer (0x80000000)
2. As a 32 bit app, keeps it as 32 bit integer. (0x80000000)
3. Multiply by 64 and integer overflow to zero. (0x00000000)
- Note: No exceptions or app crashes for integer overflows, it just keeps on going.
4. Use it as part of memory index. Reads data_array[0] just fine; however, that value may itself be a float NaN or otherwise broken data.
5. Completed task with NAN in results.
6. Server keeps going with new tasks with NAN numbers.
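
The two step lists above can be replayed with plain integer arithmetic. The Python sketch below starts from 0x80000000, the value the post reports for the float-to-integer conversion of NaN, and assumes the multiplication wraps at the register width (64 or 32 bits). The function names are made up for the illustration; the real code paths in the ARP1 binaries are not known.

```python
NAN_AS_INT32 = 0x80000000  # conversion result for NaN, as reported in the post

def sign_extend_32_to_64(value):
    """Interpret a 32-bit pattern as signed and widen it to 64 bits."""
    if value & 0x80000000:
        value -= 1 << 32
    return value & 0xFFFFFFFFFFFFFFFF

def index_64bit(value, scale=64):
    """64-bit app: sign-extend, then multiply, wrapping at 64 bits."""
    return (sign_extend_32_to_64(value) * scale) & 0xFFFFFFFFFFFFFFFF

def index_32bit(value, scale=64):
    """32-bit app: multiply directly, wrapping at 32 bits."""
    return (value * scale) & 0xFFFFFFFF

print(hex(sign_extend_32_to_64(NAN_AS_INT32)))  # 0xffffffff80000000
print(hex(index_64bit(NAN_AS_INT32)))           # 0xffffffe000000000
print(hex(index_32bit(NAN_AS_INT32)))           # 0x0
```

With 64-bit arithmetic the offset is -2^37, far outside any mapped array, so the read faults. With 32-bit arithmetic it wraps to 0 and the read quietly lands on the first element.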

For ARP1_0033397, something went wrong in a previous iteration (some point before 099), and I guess this NaN problem probably came from the 32-bit ARP app.
[Nov 1, 2021 9:09:25 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346

Probably the first segmentation violation workunit in July 2022 and presumably also the first SIGSEGV in the 0020000-0029999 range:

workunit 152522928
App: Africa Rainfall Project
Workunit: ARP1_0029301_127
Created: 2022-07-16T14:06:10
Quorum: 2
Replication: 2

ARP1_0029301_127_0  Linux Ubuntu  Error      2022-07-16T15:17:28  2022-07-17T20:13:14    12.62/12.66   675.4/0.0
ARP1_0029301_127_1  Linux Ubuntu  Error      2022-07-16T15:20:06  2022-07-18T20:49:11    48.94/49.05   772.8/0.0
ARP1_0029301_127_2  Fedora Linux  Error      2022-07-17T20:13:20  2022-07-19T07:41:57    8.72/8.79     592.4/0.0
ARP1_0029301_127_3  Linux Debian  Error      2022-07-18T20:49:15  2022-07-20T02:00:08    24.42/24.48   703.2/0.0
ARP1_0029301_127_4  Linux Ubuntu  Error      2022-07-19T07:42:46  2022-07-20T08:45:59    9.91/14.43    346.6/0.0
ARP1_0029301_127_5  Linux Debian  Error      2022-07-20T02:00:26  2022-07-20T23:45:11    21.42/21.55   823.7/0.0

Each stderr output has the same error ("process exited with code 193 (0xc1, -63)") and stacktrace as below.
Logfile:
<core_client_version>7.16.11</core_client_version>
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[01:36:21] INFO: Checkpoint taken at 2019-03-12_06:00:00
[02:58:58] INFO: Checkpoint taken at 2019-03-12_12:00:00
[04:22:37] INFO: Checkpoint taken at 2019-03-12_18:00:00
[05:34:37] INFO: Checkpoint taken at 2019-03-13_00:00:00
[07:04:04] INFO: Checkpoint taken at 2019-03-13_06:00:00
[08:26:38] INFO: Checkpoint taken at 2019-03-13_12:00:00
SIGSEGV: segmentation violation
Stack trace (19 frames):
[0x2d13b72]
[0x2da0400]
[0x1ed9107]
[0x1e9c664]
[0x1e9444a]
[0x1e8997c]
[0x188518c]
[0x1b6f8e2]
[0x135f570]
[0x11f86d4]
[0x5848b7]
[0x584ece]
[0x584ece]
[0x448f61]
[0x4475c9]
[0x440967]
[0x2eb2344]
[0x2eb25c1]
[0x405466]

Exiting...
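
As an aside on the three numbers in the exit message: 193, 0xc1 and -63 are the same byte written three ways (decimal, hexadecimal, and reinterpreted as a signed 8-bit value). A short Python check; `as_signed_byte` is a name invented for this sketch.

```python
def as_signed_byte(code):
    """Reinterpret an exit code's low 8 bits as a signed value."""
    code &= 0xFF
    return code - 256 if code >= 128 else code

exit_code = 193
print(exit_code, hex(exit_code), as_signed_byte(exit_code))  # 193 0xc1 -63
```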

[Jul 21, 2022 12:59:25 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317

Oh, bother! I'll add ARP1_0029301 to my database of stuck and other delayed units...

Interesting that it's around 1800 cells away from the lowest problem children we'd seen before; I wonder if it will turn out that as simulated conditions change with time it will expose a different set of modelling problems (though that's exactly the same stack trace as before, the cause may be different...)

Thanks for the heads up - Al.
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jul 21, 2022 4:36:40 AM]
[Jul 21, 2022 4:33:33 AM]