Total posts in this thread: 15
This topic has been viewed 4410 times and has 14 replies (page 1 of 2)
jay_Orlando
Senior Cruncher
USA
Joined: Jan 4, 2006
Post Count: 189
Status: Offline
Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Greetings!!

From 20 to 25 December 2020, 15 MIP WUs failed with signal 11 for me.
In the same period, 11 MIP WUs were valid.
African Rain, Covid, and Help Stop TB WUs were also valid.

I have plenty of memory. Running one Einstein@Home GPU task and 7 WCG tasks,
the PC currently uses 3.3 GiB of its 11.6 GiB of memory (28%, including the OS).
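
As a quick sanity check of the figures above, the usage fraction can be worked out directly. This is only a sketch using the two numbers quoted in this post; the variable and function names are made up for illustration.

```python
# Figures quoted in the post, in GiB. Names are illustrative only.
used_gib = 3.3
total_gib = 11.6


def percent_used(used, total):
    """Return memory usage as a percentage of the total."""
    return 100.0 * used / total


pct = percent_used(used_gib, total_gib)
print(f"{pct:.1f}% of RAM in use")  # prints "28.4% of RAM in use"
```

That agrees with the roughly 28% reported, so memory pressure looks unlikely on these numbers alone.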

Is anyone else seeing this? Or should I run memory diagnostics for a day?

Other:
Kernel Linux 5.8.0-33-generic x86_64
Ubuntu Mate Release 20.10 (Groovy Gorilla) 64-bit
Processor: AMD FX(tm)-8150 Eight-Core Processor × 8

BOINC log (startup), from syslog:
Dec 19 10:27:09 pc-14 boinc[1271]: 19-Dec-2020 10:27:09 [---] Starting BOINC client version 7.16.11 for x86_64-pc-linux-gnu
Dec 19 10:27:10 pc-14 kernel: [ 20.211036] radeon_dp_aux_transfer_native: 116 callbacks suppressed
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:09 [---] log flags: file_xfer, sched_ops, task
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:09 [---] Libraries: libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh/0.9.3/openssl/zlib nghttp2/1.41.0 librtmp/2.3
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:09 [---] Data directory: /var/lib/boinc-client
Dec 19 10:27:10 pc-14 systemd[1]: sssd-sudo.socket: Job sssd-sudo.socket/start failed with result 'dependency'.
Dec 19 10:27:10 pc-14 whoopsie-upload-all[1437]: ERROR: whoopsie is not running
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] OpenCL: AMD/ATI GPU 0: AMD VERDE (DRM 2.50.0, 5.8.0-33-generic, LLVM 11.0.0) (driver version 20.2.1, device version OpenCL 1.1 Mesa 20.2.1, 2048MB, 2048MB available, 512 GFLOPS peak)
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] libc: Ubuntu GLIBC 2.32-0ubuntu3 version 2.32
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Host name: pc-14
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Processor: 8 AuthenticAMD AMD FX(tm)-8150 Eight-Core Processor [Family 21 Model 1 Stepping 2]
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate ssbd ibpb vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] OS: Linux Ubuntu: Ubuntu 20.10 [5.8.0-33-generic|libc 2.32 (Ubuntu GLIBC 2.32-0ubuntu3)]
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Memory: 11.62 GB physical, 9.31 GB virtual
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Disk: 91.17 GB total, 83.09 GB free
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Local time is UTC -5 hours
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Config: GUI RPCs allowed from:
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Config: report completed tasks immediately
Dec 19 10:27:10 pc-14 cron[1450]: (CRON) INFO (pidfile fd = 3)
Dec 19 10:27:10 pc-14 whoopsie[1265]: [10:27:09] Using lock path: /var/lock/whoopsie/lock
Dec 19 10:27:10 pc-14 systemd[1]: Dependency failed for SSSD PAM Service responder socket.
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Last CPU benchmark was 31 days 17:15:51 ago
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [World Community Grid] General prefs: from World Community Grid (last modified 18-Dec-2020 16:54:47)
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [World Community Grid] Host location: none
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [World Community Grid] General prefs: using your defaults
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Reading preferences override file
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Preferences:
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] max memory usage when active: 8327.41 MB
Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] max memory usage when idle: 8327.41 MB
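
The syslog excerpt above mixes BOINC lines with kernel, systemd, cron and whoopsie messages. A minimal sketch of pulling out only the BOINC client lines, assuming the line layout seen in this excerpt (the regular expression and the function name `boinc_messages` are illustrative, not part of BOINC):

```python
import re

# Matches syslog lines written by the BOINC client, e.g.
# "Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Host name: pc-14"
BOINC_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+boinc\[\d+\]:\s+"  # syslog prefix
    r"(?P<stamp>\d{2}-\w{3}-\d{4} [\d:]+)\s+"           # BOINC's own timestamp
    r"\[(?P<project>[^\]]*)\]\s+(?P<message>.*)$"       # [project] message
)


def boinc_messages(lines):
    """Yield (timestamp, project, message) for BOINC lines; skip other daemons."""
    for line in lines:
        match = BOINC_LINE.match(line)
        if match:
            yield match.group("stamp"), match.group("project"), match.group("message")


# Three lines copied from the excerpt above.
sample = [
    "Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [---] Memory: 11.62 GB physical, 9.31 GB virtual",
    "Dec 19 10:27:10 pc-14 cron[1450]: (CRON) INFO (pidfile fd = 3)",
    "Dec 19 10:27:10 pc-14 boinc[1271]: 19-Dec-2020 10:27:10 [World Community Grid] Host location: none",
]

for stamp, project, message in boinc_messages(sample):
    print(stamp, project, message)
```

Only the two `boinc[...]` lines are printed; the cron line is dropped.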



So...
Anyone else seeing a SIGSEGV?

Thanks!!

Jay

PS
sample WU with error:
https://www.worldcommunitygrid.org/ms/device/...s.do?workunitId=452568867

and
https://www.worldcommunitygrid.org/ms/device/...og.do?resultId=1391588198

and

Result Log

Result Name: MIP1_ 00327420_ 7452_ 0--
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process got signal 11</message>

<stderr_txt>
[2020-12-25 0:43:29:] :: BOINC:: Initializing ... ok.
[2020-12-25 0:43:29:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00327420.flags -out::file::silent result_silent.out -run:jran 937053714 -nstruct 1 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Sequence Length = 345
Starting work on structure: _0001

</stderr_txt>
]]>
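
For reference, "signal 11" in the result log is the numeric form of SIGSEGV on Linux. A small sketch of mapping the number to its name with Python's standard `signal` module; the helper names `describe_signal` and `describe_exit` are made up for this example.

```python
import signal


def describe_signal(number):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


def describe_exit(returncode):
    """Interpret a return code as Python's subprocess module reports it on POSIX:
    a negative value -N means the child was terminated by signal N."""
    if returncode < 0:
        return f"killed by {describe_signal(-returncode)}"
    return f"exited with status {returncode}"


print(describe_signal(11))  # SIGSEGV on Linux
print(describe_exit(-11))   # killed by SIGSEGV
```

The name says only that the process touched memory it was not allowed to access; it does not say whether the cause is the application, a library, or the hardware.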


PPS

Merry Christmas
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by jay_Orlando at Dec 25, 2020 7:37:30 AM]
[Dec 25, 2020 7:36:16 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7848
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

How many MIP work units are you running at the same time?
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Dec 25, 2020 4:26:13 PM]
jay_Orlando
Senior Cruncher
USA
Joined: Jan 4, 2006
Post Count: 189
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Hey Joe,
Happy Boxing Day!!
In response to your post:
First, are you seeing the signal 11 problem on MIP WUs?

Second, the number of MIP WUs varies as WCG fills requests.
Right now, only two.

Why? I would like to understand your line of thought.

I have run memory diagnostics for about 10 hours: no errors.
I have turned on memory logging.
Turning on RPC logging destroys my BOINC Manager display.
Does it destroy your display too? Please check.
----------------------------------------

[Dec 26, 2020 9:26:32 AM]
jay_Orlando
Senior Cruncher
USA
Joined: Jan 4, 2006
Post Count: 189
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Update.
On 26 December 2020, there were 28 failures on MIP between midnight and 1 PM.
I will stop MIP for a while.
No errors on other projects.
Reinstalled BOINC and libraries; still errors.

Anyone else seeing this?

I have two machines with Ubuntu Linux.
One has the errors, the other does not.
Ubuntu 20.04.1 LTS BOINC 7.16.6+dfsg-1 --- No errors on MIP or others
Ubuntu 20.10 BOINC 7.16.15.11+dfsg-1 --- errors (signal 11)

Anyone else with Ubuntu 20.10 AND BOINC 7.16.15.11+dfsg-1
having sig 11??
Thanks Jay
PS will also check the Ubuntu forum.
----------------------------------------

[Dec 26, 2020 6:50:35 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7848
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Why? I would like to understand your line of thought.

I had some signal 11 problems years ago, but never did figure out a cause. I had a theory that there was a traffic jam on access to memory, the CPU, or the hard drive. I was partial to the memory access theory, but I have no way to tell for sure. I haven't seen this on any of my Linux systems for several years.
Edit: By the way, what are the CPUs in your systems, and how much memory do they have?
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at Dec 26, 2020 8:49:09 PM]
[Dec 26, 2020 8:47:51 PM]
jay_Orlando
Senior Cruncher
USA
Joined: Jan 4, 2006
Post Count: 189
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Another person reported this problem here:
https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,41744

Apparently the problems occur on AMD machines.
Maybe it was a compiler flag used when building the Debian/Ubuntu package.

My workaround (for now) is not to run MIP on the AMD machine.


Joe,
The info you want was listed in the original post above with the BOINC log startup.


:-)

Cheers,
jay
----------------------------------------

[Dec 28, 2020 4:41:30 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7848
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Joe, The info you want was listed in the original post above with the BOINC log startup.

Sorry for not reading the OP.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Dec 28, 2020 12:25:11 PM]
jay_Orlando
Senior Cruncher
USA
Joined: Jan 4, 2006
Post Count: 189
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Greetings,

Is there any debug output that would help?
I've tried the log flags, but nothing shows up at the time of the signal 11.

Does the project want a volunteer?
Jay
----------------------------------------

[Dec 29, 2020 4:40:19 PM]
Brian Nixon
Cruncher
United Kingdom
Joined: Oct 27, 2020
Post Count: 9
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

There have been reports of unexplained failures like this over at Rosetta@home, too – particularly with Linux on AMD. It seems like the kind of obscure bug that will be effectively impossible to track down without a debug build and the Rosetta source code.
[Dec 29, 2020 5:12:11 PM]
jay_Orlando
Senior Cruncher
USA
Joined: Jan 4, 2006
Post Count: 189
Status: Offline
Re: Anyone Else getting signal 11 on MIP WU? (December 25, 2020)

Brian,

Thanks for the info!!

I added some packages and slowed down the memory accesses as per Mxd1 in the MIP forum.
No luck.

I agree; probably something obscure like an instruction fault.
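
One way to look into the instruction-set idea is to check which vector extensions the CPU reports. The sketch below uses a subset of the "Processor features" line from the BOINC startup log in the first post. The list of extensions to look for is purely hypothetical; nothing in this thread says which instructions the MIP binary actually uses.

```python
# Subset of the "Processor features" line from the BOINC startup log
# in the first post (AMD FX-8150); the full list is longer.
fx8150_flags = "sse sse2 ssse3 sse4_1 sse4_2 popcnt aes avx sse4a xop fma4".split()

# Hypothetical list of extensions to check for. Which of these (if any)
# the MIP application relies on is an assumption, not a known fact.
extensions_of_interest = ["avx", "avx2", "fma", "fma4", "xop", "bmi2"]


def split_by_support(cpu_flags, wanted):
    """Split `wanted` into (supported, missing) given the CPU's flag list."""
    have = set(cpu_flags)
    supported = [flag for flag in wanted if flag in have]
    missing = [flag for flag in wanted if flag not in have]
    return supported, missing


supported, missing = split_by_support(fx8150_flags, extensions_of_interest)
print("supported:", supported)  # supported: ['avx', 'fma4', 'xop']
print("missing:", missing)      # missing: ['avx2', 'fma', 'bmi2']
```

On a running Linux system the same list is on the `flags` line of /proc/cpuinfo, so the two machines could be compared the same way.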

Oh well,

I set another machine, on a different venue, to run 100% MIP.

T H A N K S again,
Jay
----------------------------------------

[Dec 31, 2020 3:33:24 PM]