World Community Grid - View Thread - Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

World Community Grid Forums

Category: Active Research

Forum: Africa Rainfall Project

Thread: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

Thread Status: Active
Total posts in this thread: 31

This topic has been viewed 9238 times and has 30 replies

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

I think Kevin's response below probably includes these units.

My query related to Unhandled Exceptions.

Mike

There is a small number of workunits that are having an issue (ARP1_0033558_091 and ARP1_0033636_089 are among them). I need to collect and send the data back to Delft but I had some critical transition work I needed to do first. I'm hoping to be able to send that information to them soon.

[Oct 26, 2021 5:01:47 PM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2360
Status: Recently Active

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

This is another segmentation violation workunit:
(Three of them have ended up with it thus far, the other two are still in progress at this moment.)
workunit 862217006

ARP1_0033639_098_0  Linuxmint     Error      2021-10-28T22:21:12  2021-10-29T04:31:29    6.11/6.13     267.2/0.0   
ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0   
ARP1_0033639_098_2  Linux Fedora  Error      2021-10-29T01:48:03  2021-10-29T08:03:49    3.39/3.43     184.5/0.0   
ARP1_0033639_098_3  Linuxmint     In Progr.  2021-10-29T04:34:38  2021-11-02T16:34:38    0.00/0.00       0.0/0.0   
ARP1_0033639_098_4  Linux         In Progr.  2021-10-29T08:04:59  2021-11-02T20:04:59    0.00/0.00       0.0/0.0

---------------------------------------------------------------------------------------------------------------------------------------
Details:

ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0   
	Devicename: WCG-10-5-215-117
	Logfile:
	<core_client_version>7.6.31</core_client_version>
	<message>
	process exited with code 193 (0xc1, -63)
	</message>
	<stderr_txt>
	INFO: Initializing
	INFO: No state to restore.  Start from the beginning.
	Starting WRFMain
	[23:49:37] INFO: Checkpoint taken at 2019-01-13_06:00:00
	[01:24:44] INFO: Checkpoint taken at 2019-01-13_12:00:00
	SIGSEGV: segmentation violation
	Stack trace (17 frames):
	[0x2d13b72]
	[0x2da0400]
	[0x1ed9107]
	[0x1e9c664]
	[0x1e9444a]
	[0x1e8997c]
	[0x188518c]
	[0x1b6f8e2]
	[0x135f570]
	[0x11f86d4]
	[0x5848b7]
	[0x448f61]
	[0x4475c9]
	[0x440967]
	[0x2eb2344]
	[0x2eb25c1]
	[0x405466]
	
	Exiting...
	
	</stderr_txt>
... sent to the same machine ... 
ARP1_0033639_098_4  Linux         In Progr.  2021-10-29T08:04:59  2021-11-02T20:04:59    0.00/0.00       0.0/0.0   
	Devicename: WCG-10-5-215-117

UPDATE:
All results are in, they all errored out — except the Server Aborted one — with SIGSEGV:
workunit 862217006

ARP1_0033639_098_0  Linuxmint     Error      2021-10-28T22:21:12  2021-10-29T04:31:29    6.11/6.13     267.2/0.0   
ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0   
ARP1_0033639_098_2  Linux Fedora  Error      2021-10-29T01:48:03  2021-10-29T08:03:49    3.39/3.43     184.5/0.0   
ARP1_0033639_098_3  Linuxmint     S.Aborted  2021-10-29T04:34:38  2021-10-29T15:16:08    0.00/0.00       0.0/0.0   
ARP1_0033639_098_4  Linux         Error      2021-10-29T08:04:59  2021-10-29T11:24:29    3.27/3.27     176.9/0.0   
ARP1_0033639_098_5  Linux         Error      2021-10-29T11:24:57  2021-10-29T14:46:33    3.30/3.31     183.9/0.0

---------------------------------------------------------------------------------------------------------------------------------------
Details:

ARP1_0033639_098_0  Linuxmint     Error      2021-10-28T22:21:12  2021-10-29T04:31:29    6.11/6.13     267.2/0.0   
	OS-Version: Linux Mint 20.2 [5.11.0-38-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]
ARP1_0033639_098_1  Linux         Error      2021-10-28T22:24:59  2021-10-29T01:46:00    3.24/3.25     179.8/0.0   
	OS-Version: 4.4.0-62-generic
	Devicename: WCG-10-5-215-117
ARP1_0033639_098_2  Linux Fedora  Error      2021-10-29T01:48:03  2021-10-29T08:03:49    3.39/3.43     184.5/0.0   
	OS-Version: Fedora 34 (Xfce) [5.13.16-200.fc34.x86_64|libc 2.33 (GNU libc)]
ARP1_0033639_098_3  Linuxmint     S.Aborted  2021-10-29T04:34:38  2021-10-29T15:16:08    0.00/0.00       0.0/0.0   
	OS-Version: Linux Mint 20.1 [5.4.0-84-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]
ARP1_0033639_098_4  Linux         Error      2021-10-29T08:04:59  2021-10-29T11:24:29    3.27/3.27     176.9/0.0   
	OS-Version: 4.4.0-62-generic
	Devicename: WCG-10-5-215-117
ARP1_0033639_098_5  Linux         Error      2021-10-29T11:24:57  2021-10-29T14:46:33    3.30/3.31     183.9/0.0   
	OS-Version: 4.4.0-62-generic
	Devicename: WCG-10-5-215-117

Three times out of six, a task from this same workunit was sent to the same device (WCG-10-5-215-117).
Is this really true?

----------------------------------------
[Edit 3 times, last edit by adriverhoef at Oct 29, 2021 6:05:21 PM]

[Oct 29, 2021 9:23:53 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1337
Status: Recently Active

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

Adri,

Three times out of six, a task from this same workunit was sent to the same device (WCG-10-5-215-117).
Is this really true?

I wonder if this is a case of several accounts running on one [big?] machine - if so, the machine name being the same can't be used as a blocker...

As we don't seem to be able to see device IDs (other than our own) and system "owners" (understandable with the WCG attitude to privacy, but unlike a lot of other BOINC projects!) I can't think of any way to pursue this :-(

Just a thought...

Cheers - Al.

[Oct 29, 2021 9:13:05 PM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2360
Status: Recently Active

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

Thanks for your thoughts, Al.

After some futile searching on the WCG site I decided to do a Google search for their OS version, "4.4.0-62-generic". This is what I found: it could have something to do with grcpool.com, grcpool.com-2, grcpool.com-3 & grcpool.com-4, or someone like that. At this LHC webpage you'll see the same OS version, 4.4.0-62-generic; their name is grcpool.com-2 and their last contact was 6 Aug 2017. Now look at WCG member grcpool.com-2 again. They have been a registered member since 06/27/2017 (that's 27-06-2017) and they have 4,698 device installations. Or this Rosetta webpage: same OS version (4.4.0-62-generic), their name is grcpool.com-3 and their last contact was 19 Aug 2017. Here at WCG their number of devices is 3,964 and they registered on 08-08-2017.

Of these three members, only grcpool.com has a badge for ARP1 (and it is only a 'small' badge: for 14 days).

WCG-10-5-215-117 clearly points to an IP address: 10.5.215.117. Addresses like these (10.*.*.*) are reserved for private-use networks.
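
A minimal sketch of that check, assuming the device name really does encode an IPv4 address after a `WCG-` prefix. The helper name `device_name_to_ip` is made up for illustration; only Python's standard `ipaddress` module is used.

```python
import ipaddress

def device_name_to_ip(name, prefix="WCG-"):
    """Hypothetical helper: turn 'WCG-10-5-215-117' into the address 10.5.215.117."""
    if not name.startswith(prefix):
        raise ValueError(f"unexpected device name: {name!r}")
    # Dashes in the remainder are assumed to stand in for the dots of an IPv4 address.
    return ipaddress.IPv4Address(name[len(prefix):].replace("-", "."))

ip = device_name_to_ip("WCG-10-5-215-117")
print(ip)                                         # 10.5.215.117
print(ip.is_private)                              # True
print(ip in ipaddress.ip_network("10.0.0.0/8"))   # True
```

Because 10.0.0.0/8 is a private block, any number of unrelated hosts on different networks could end up with this same name, so the name alone does not prove it is one physical machine.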

Then I found this text at LHCathome: "Please note that some of the applications on LHC@home requre(sic) Virtual Box to be installed."
(Insert some headscratching here.)

So, what if there are multiple instances of BOINC, each inside their own Virtual Box, each with a different device-ID, running on the same computer? Suppose they each have the same name inside their own Virtual Box (in this case WCG-10-5-215-117). It may not relate to grcpool.com etc., but … for me, a multitude of Virtual Box instances on one mainframe would explain the whole case of the devices with the same name. Could it be that we're on the right track?

----------------------------------------
[Edit 1 times, last edit by adriverhoef at Oct 30, 2021 8:22:45 AM]

[Oct 29, 2021 11:03:11 PM]

sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107
Status: Offline

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

ARP1_0033395_098_2: all 6 tasks reached 2 checkpoints and then errored on Linux; the _2 task ran on my Linux machine.
https://www.worldcommunitygrid.org/contribution/workunit/859229171
This is possibly caused by a multiplication overflow, that is, a result too big to fit into a 32-bit float.

Note: Users who aren't programmers may not understand everything in the rest of this reply.

I went back to old data for ARP1_0033395_098_2, thanks to frequent Btrfs snapshots on Linux. I moved the ARP1_0033395_098_2 slot to Windows and used x64dbg to run the Windows 64-bit app.

After 2 checkpoints and a few more hours, it failed with Exception_access_violation on a memory read, with all the CPU registers filled with float NAN (not a number), hex 7FC00000.

Floating-point exceptions are normally off. I started it again with floating-point exceptions on, using _mm_setcsr(0) and some edits to the assembly in x64dbg: I patched in setcsr and re-ran. This time it failed with Exception_Flt_invalid_operation on the multiplication 5.3384e+010 (53384000000) x 1.43248e+037 (14324800000000000000000000000000000000) = NAN, as this overflowed the 32-bit float. I then turned floating-point exceptions off again; 20 seconds later, it failed with Exception_access_violation. This looks like a multiplication overflow that eventually ends in a memory access violation.

I wonder why such an extremely huge number? Without access to the ARP1 app source code I can't see the full picture of what it does.
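
For readers who want to check the arithmetic, here is a small sketch of the multiplication reported above. It only emulates 32-bit floats in Python and says nothing about what the ARP1 code does internally; the helper `to_float32` is made up for illustration. Under IEEE 754 default rounding an overflowing multiply delivers infinity, and NaN typically appears once that infinity meets an operation such as inf - inf.

```python
import math
import struct

FLT_MAX = 3.4028234663852886e38   # largest finite 32-bit float

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        # struct refuses finite values outside the float32 range;
        # IEEE 754 arithmetic would deliver a signed infinity instead.
        return math.copysign(math.inf, x)

a = to_float32(5.3384e10)
b = to_float32(1.43248e37)
product = to_float32(a * b)       # computed in 64-bit, then rounded to 32-bit

print(a * b > FLT_MAX)            # True: the product is far outside float32 range
print(product)                    # inf
print(product - product)          # nan: one way an overflow can turn into NaN
```

Whether the NAN values seen in the debugger arise this way is an assumption; the sketch only shows that the two operands cannot be multiplied within 32-bit float range.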

[Oct 29, 2021 11:34:10 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1337
Status: Recently Active

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

Adri,

Then I found this text at LHCathome: "Please note that some of the applications on LHC@home requre(sic) Virtual Box to be installed."
(Insert some headscratching here.)

In case you weren't aware, some of the LHC stuff uses a VM to ensure a consistent working environment (libraries &c) regardless of the host O/S, thus knowing exactly how any floating point arithmetic will behave and avoiding the sorts of potential cross-platform discrepancies that can cause verification/validation issues! I think they use Scientific Linux (though I don't run LHC work so I could be mistaken...)

sam6861,

Interesting bit of forensic work there! I think the WRF source is available at github, but I don't know how much tweaking WCG might have done, and without the symbol tables from a WCG ARP1 compile to unpick the stack frame at crash time I'm not sure how fruitful the source would actually be (so I've not bothered trying to look). :-(

Cheers - Al.

[Oct 30, 2021 2:07:52 AM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7854
Status: Offline

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

Floating-point exceptions are normally off. I started it again with floating-point exceptions on, using _mm_setcsr(0) and some edits to the assembly in x64dbg: I patched in setcsr and re-ran. This time it failed with Exception_Flt_invalid_operation on the multiplication 5.3384e+010 (53384000000) x 1.43248e+037 (14324800000000000000000000000000000000) = NAN, as this overflowed the 32-bit float. I then turned floating-point exceptions off again; 20 seconds later, it failed with Exception_access_violation. This looks like a multiplication overflow that eventually ends in a memory access violation. I wonder why such an extremely huge number? Without access to the ARP1 app source code I can't see the full picture of what it does.

The simplest explanation could be a typo in one or both of the exponents.

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Oct 30, 2021 3:32:00 AM]

sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107
Status: Offline

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

ARP1_0033397_099_3 has a float NAN (not a number) problem in its input data.
https://www.worldcommunitygrid.org/contribution/workunit/865982654
I decompressed the 7z input files and, with the HxD hex editor/viewer, I see lots of 7FC00000 (32-bit float NAN) and FFC00000 (negative NAN). The 64-bit app just crashes with a computation error within 2 seconds; something went wrong before iteration 099.
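
A rough sketch of how those hex words decode, and how one might count NaN words in a block of raw data. The function names are made up for illustration, and the byte order and layout of the real ARP1 input files are not known from this thread, so `byteorder` is left as a parameter.

```python
import math
import struct

def decode_float32(hex_word):
    """Interpret 8 hex digits, most significant byte first, as a 32-bit float."""
    return struct.unpack(">f", bytes.fromhex(hex_word))[0]

def count_nan_words(data, byteorder="<"):
    """Count 4-byte-aligned words in `data` that decode to NaN."""
    usable = len(data) - len(data) % 4          # ignore a trailing partial word
    return sum(
        math.isnan(value)
        for (value,) in struct.iter_unpack(byteorder + "f", data[:usable])
    )

for word in ("7FC00000", "FFC00000", "3F800000"):
    print(word, decode_float32(word))           # nan, nan, 1.0

# Made-up sample buffer: two ordinary values and two NaNs.
sample = struct.pack("<4f", 1.0, math.nan, -math.nan, 2.5)
print(count_nan_words(sample))                  # 2
```

Both 7FC00000 and FFC00000 have all exponent bits set and a non-zero mantissa, which is what makes them NaN; they differ only in the sign bit.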

The 32-bit ARP1 app can happily crunch float NAN numbers to completion with "no error". I'm unsure whether NAN gives any useful data or whether the 32-bit ARP app is just wasting computation on NAN data.

The 64-bit ARP app does this, as far as I can tell from looking into it:
1. Converts the 32-bit float (NAN) into a 32-bit integer. (0x80000000)
2. Converts that to a 64-bit integer. (0xFFFFFFFF80000000)
3. Multiplies by 64. (0xFFFFFFE000000000)
4. Uses it as part of a memory index. Memory read error. "Computation error".
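
The 64-bit steps can be sketched with plain integer arithmetic. This is an illustration, not the app's code: it assumes the NaN-to-integer conversion yields 0x80000000, the "integer indefinite" value that x86 SSE conversions return for NaN, and the helper names are made up. With sign extension, the product works out to 0xFFFFFFE000000000.

```python
MASK64 = (1 << 64) - 1

# Assumption: converting NaN to int32 yields 0x80000000 (x86 "integer indefinite").
INT32_INDEFINITE = 0x80000000

def sign_extend_32_to_64(value):
    """Widen a 32-bit two's-complement pattern to 64 bits."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:            # negative as int32: fill the upper half with ones
        value |= 0xFFFFFFFF00000000
    return value

def as_signed64(pattern):
    """Read a 64-bit pattern as a signed integer."""
    return pattern - (1 << 64) if pattern >> 63 else pattern

step1 = INT32_INDEFINITE
step2 = sign_extend_32_to_64(step1)
step3 = (step2 * 64) & MASK64         # 64-bit multiply, wrapped to register width

print(f"0x{step1:08X} -> 0x{step2:016X} -> 0x{step3:016X}")
print(as_signed64(step3))             # -137438953472, i.e. -2**37
```

An offset of roughly minus 128 GiB lands far outside any mapped array, which fits the memory read error in step 4.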

The 32-bit app appears not to error on NAN:
1. Converts the 32-bit float (NAN) into a 32-bit integer. (0x80000000)
2. As a 32-bit app, keeps it as a 32-bit integer. (0x80000000)
3. Multiplies by 64, and the integer overflows to zero. (0x00000000)
- Note: Integer overflows cause no exceptions or app crashes; it just keeps on going.
4. Uses it as part of a memory index. Reads data_array[0] just fine, although this may be a float NAN or otherwise broken data.
5. Completes the task with NAN in the results.
6. The server keeps going, issuing new tasks with NAN numbers.
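
The 32-bit case can be sketched the same way. Again this assumes the NaN-to-integer conversion yields 0x80000000; the contents of `data_array` are made up for illustration.

```python
import math

MASK32 = 0xFFFFFFFF
INT32_INDEFINITE = 0x80000000                 # assumed NaN-to-int32 result, as on x86 SSE

index32 = (INT32_INDEFINITE * 64) & MASK32    # 32-bit multiply wraps silently
print(index32)                                # 0

data_array = [math.nan, 1.5, 2.5]             # hypothetical data; element 0 happens to be NaN
value = data_array[index32]                   # a valid read, so nothing faults
result = value * 0.5 + 1.0                    # NaN propagates quietly through arithmetic
print(math.isnan(result))                     # True
```

The wrapped index is in range, so nothing stops the run, and any NaN read this way carries through to the output without raising an error.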

For ARP1_0033397, something went wrong in a previous iteration (any point earlier than 099), and my guess is that this NAN problem probably came from the 32-bit ARP app.

[Nov 1, 2021 9:09:25 PM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2360
Status: Recently Active

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

Probably the first segmentation violation workunit in July 2022 and presumably also the first SIGSEGV in the 0020000-0029999 range:

workunit 152522928

App: Africa Rainfall Project
Workunit: ARP1_0029301_127
Created: 2022-07-16T14:06:10
Quorum: 2
Replication: 2

ARP1_0029301_127_0  Linux Ubuntu  Error  2022-07-16T15:17:28  2022-07-17T20:13:14   12.62/12.66    675.4/0.0   
ARP1_0029301_127_1  Linux Ubuntu  Error  2022-07-16T15:20:06  2022-07-18T20:49:11   48.94/49.05    772.8/0.0   
ARP1_0029301_127_2  Fedora Linux  Error  2022-07-17T20:13:20  2022-07-19T07:41:57    8.72/8.79     592.4/0.0   
ARP1_0029301_127_3  Linux Debian  Error  2022-07-18T20:49:15  2022-07-20T02:00:08   24.42/24.48    703.2/0.0   
ARP1_0029301_127_4  Linux Ubuntu  Error  2022-07-19T07:42:46  2022-07-20T08:45:59    9.91/14.43    346.6/0.0   
ARP1_0029301_127_5  Linux Debian  Error  2022-07-20T02:00:26  2022-07-20T23:45:11   21.42/21.55    823.7/0.0

Each stderr output has the same error ("process exited with code 193 (0xc1, -63)") and stacktrace as below.

Logfile:
	<core_client_version>7.16.11</core_client_version>
	<message>
	process exited with code 193 (0xc1, -63)</message>
	<stderr_txt>
	INFO: Initializing
	INFO: No state to restore.  Start from the beginning.
	Starting WRFMain
	[01:36:21] INFO: Checkpoint taken at 2019-03-12_06:00:00
	[02:58:58] INFO: Checkpoint taken at 2019-03-12_12:00:00
	[04:22:37] INFO: Checkpoint taken at 2019-03-12_18:00:00
	[05:34:37] INFO: Checkpoint taken at 2019-03-13_00:00:00
	[07:04:04] INFO: Checkpoint taken at 2019-03-13_06:00:00
	[08:26:38] INFO: Checkpoint taken at 2019-03-13_12:00:00
	SIGSEGV: segmentation violation
	Stack trace (19 frames):
	[0x2d13b72]
	[0x2da0400]
	[0x1ed9107]
	[0x1e9c664]
	[0x1e9444a]
	[0x1e8997c]
	[0x188518c]
	[0x1b6f8e2]
	[0x135f570]
	[0x11f86d4]
	[0x5848b7]
	[0x584ece]
	[0x584ece]
	[0x448f61]
	[0x4475c9]
	[0x440967]
	[0x2eb2344]
	[0x2eb25c1]
	[0x405466]
	
	Exiting...
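
As a side note on the message itself, the three numbers in "code 193 (0xc1, -63)" are consistent with one byte shown as unsigned decimal, hex, and signed 8-bit. The sketch below only checks that arithmetic; it is not BOINC's formatting code, and the function name is made up.

```python
def exit_code_views(code):
    """Return the low byte of an exit code as (unsigned, hex string, signed 8-bit)."""
    low = code & 0xFF
    signed = low - 256 if low >= 128 else low   # two's-complement reading of one byte
    return low, hex(low), signed

print(exit_code_views(193))   # (193, '0xc1', -63)
```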

[Jul 21, 2022 12:59:25 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1337
Status: Recently Active

Re: Error: SIGSEGV: segmentation violation, process exited with code 193 (0xc1, -63)

Oh, bother! I'll add ARP1_0029301 to my database of stuck and other delayed units...

Interesting that it's around 1800 cells away from the lowest problem children we'd seen before; I wonder if it will turn out that as simulated conditions change with time it will expose a different set of modelling problems (though that's exactly the same stack trace as before, the cause may be different...)
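
A quick comparison of the two traces posted in this thread (the 17-frame one from October 2021 and the 19-frame one from July 2022), with the frame addresses copied from the logs above. The variable names are made up for illustration.

```python
trace_2021 = """
0x2d13b72 0x2da0400 0x1ed9107 0x1e9c664 0x1e9444a 0x1e8997c 0x188518c
0x1b6f8e2 0x135f570 0x11f86d4 0x5848b7 0x448f61 0x4475c9 0x440967
0x2eb2344 0x2eb25c1 0x405466
""".split()

trace_2022 = """
0x2d13b72 0x2da0400 0x1ed9107 0x1e9c664 0x1e9444a 0x1e8997c 0x188518c
0x1b6f8e2 0x135f570 0x11f86d4 0x5848b7 0x584ece 0x584ece 0x448f61
0x4475c9 0x440967 0x2eb2344 0x2eb25c1 0x405466
""".split()

# Frames that appear only in the newer trace.
extra = [frame for frame in trace_2022 if frame not in trace_2021]

# Number of leading frames the two traces share.
common_prefix = 0
for old, new in zip(trace_2021, trace_2022):
    if old != new:
        break
    common_prefix += 1

print(len(trace_2021), len(trace_2022))   # 17 19
print(extra)                              # ['0x584ece', '0x584ece']
print(common_prefix)                      # 11
```

The first eleven frames listed and the last six match; the newer trace differs only by two repeated 0x584ece frames in between. Without symbol tables it is not possible to say from the thread what those frames are.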

Thanks for the heads up - Al.

----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jul 21, 2022 4:36:40 AM]

[Jul 21, 2022 4:33:33 AM]
