Total posts in this thread: 70
This topic has been viewed 46843 times and has 69 replies
Pekarius
Cruncher
Joined: Jun 8, 2016
Post Count: 2
Re: Lots of MIP1 WUs error out

Hi, I have also had problems with MIP1 WUs for about a month or so. Every time, they end with "signal 11". I have never had any problems with other projects and tasks.

Manjaro Linux (x64)
Kernel 5.8.18-1 and 5.9.1_rt19-1
BOINC 7.16.10 and 7.16.11
Intel G4600 / 4GB RAM / 8GB swap
[Nov 5, 2020 11:57:53 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7697
Re: Lots of MIP1 WUs error out

In my experience, "Signal 11" problems indicate a bottleneck somewhere. How many of the MIP units are running at a time?
Try running only 1 at a time and gradually increase the number to the point of failure to see if this may alleviate your problem.
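
One way to cap the number of simultaneous tasks is an app_config.xml file in the project's directory. Below is a minimal sketch that generates one. The app name "mip1" is an assumption on my part (check the <app> entries in client_state.xml for the real short name), and build_app_config is just a helper name made up for this example.

```python
from xml.sax.saxutils import escape

# BOINC's XML parsing is fairly strict, so keep one tag per line.
TEMPLATE = """<app_config>
  <app>
    <name>{name}</name>
    <max_concurrent>{limit}</max_concurrent>
  </app>
</app_config>
"""


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text that limits concurrent tasks for one app."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return TEMPLATE.format(name=escape(app_name), limit=max_concurrent)


if __name__ == "__main__":
    # "mip1" is an assumed app name, not confirmed anywhere in this thread.
    print(build_app_config("mip1", 1), end="")
```

Save the output as app_config.xml in the World Community Grid project directory, then have the client re-read its config files (or restart it). Raise the limit one step at a time to find the point of failure.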
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 5, 2020 1:53:14 PM]
Pekarius
Cruncher
Joined: Jun 8, 2016
Post Count: 2
Re: Lots of MIP1 WUs error out

I've just tried BOINC with a single MIP WU. It ended with an error at about 4%.

Terminal output:

mv: cannot stat 'slots/0/result_silent.out': No such file or directory
06-Nov-2020 10:59:52 [World Community Grid] Computation for task MIP1_00323822_0466_0 finished
06-Nov-2020 10:59:52 [World Community Grid] Output file MIP1_00323822_0466_0_r115058097_0 for task MIP1_00323822_0466_0 absent
[Nov 6, 2020 10:08:00 AM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Re: Lots of MIP1 WUs error out

Suddenly, I have MIP1 WUs succeeding again. I last updated the OS packages on my farm 9 days 14 hours ago, which would have been Nov 1st. A couple days ago I accidentally re-enabled all projects on the WCG website. And just now I've discovered that in the past 46 hours I've had 48 successful MIP1 WUs processed.

Before this my last successful one was timestamped 1598730443, which is Saturday, August 29, 2020 7:47:23 PM. I'm now thinking kernel update as the culprit, since Arch switched to a 5.9 series kernel less than a month ago.
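
For anyone who wants to double-check a joblog timestamp like the one above, here is a small sketch that converts it to a readable UTC date. The function name epoch_to_utc is just something chosen for this example.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts: int) -> str:
    """Format a Unix timestamp (seconds since 1970-01-01 UTC) as a UTC date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


if __name__ == "__main__":
    print(epoch_to_utc(1598730443))  # prints "2020-08-29 19:47:23 UTC"
```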

There are more WUs running on one of my machines right now, at 23% and 7% complete. I'll report back with how they go.

Edit/update: Confirmed. Here they are running:
 9  MIP1_00324451_0436_0  Run  48.00%  27m27s  9d12h
13  MIP1_00324469_0538_0  Run  31.43%  36m12s  9d13h

and here they are in the joblog:
1604992531 ue 3168.398600 ct 2580.507000 fe 21032005497806 nm MIP1_00324451_0436_0 et 2599.503068 es 0
1604993168 ue 3168.398600 ct 2852.457000 fe 21032005497806 nm MIP1_00324469_0538_0 et 2873.486387 es 0
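
If you want to check many joblog lines at once, something like the following sketch could help. It splits a line into its timestamp and tag/value pairs. The meanings in the comments (ct as CPU time, et as elapsed time, es as exit status, and so on) are my understanding of BOINC's job log format rather than anything stated in this thread, and parse_joblog_line is a name invented for the example.

```python
from typing import Dict, Union

# Assumed meanings: ue = estimated runtime, ct = CPU time,
# fe = estimated floating-point operations, et = elapsed time.
FLOAT_FIELDS = {"ue", "ct", "fe", "et"}
# Assumed meaning: es = exit status (0 would mean success).
INT_FIELDS = {"es"}


def parse_joblog_line(line: str) -> Dict[str, Union[int, float, str]]:
    """Parse one joblog line: a leading timestamp followed by tag/value pairs."""
    tokens = line.split()
    record: Dict[str, Union[int, float, str]] = {"timestamp": int(tokens[0])}
    for key, value in zip(tokens[1::2], tokens[2::2]):
        if key in FLOAT_FIELDS:
            record[key] = float(value)
        elif key in INT_FIELDS:
            record[key] = int(value)
        else:
            record[key] = value  # e.g. nm, the task name
    return record


if __name__ == "__main__":
    sample = ("1604992531 ue 3168.398600 ct 2580.507000 fe 21032005497806 "
              "nm MIP1_00324451_0436_0 et 2599.503068 es 0")
    rec = parse_joblog_line(sample)
    print(rec["nm"], "exit status:", rec["es"])
```

Filtering the parsed records for a nonzero es value would then give a quick per-node failure count.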

[Edit 1 times, last edit by mdxi at Nov 10, 2020 7:26:32 AM]
[Nov 10, 2020 6:47:50 AM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Re: Lots of MIP1 WUs error out

It's not the kernel: MIP WUs are only succeeding on one of my six x86_64 nodes. This problem just gets weirder the longer I look at it. Here are the facts:

  • I have six x86 compute nodes
  • All have identical hardware (excepting different SSD and RAM manufacturers)
  • One of them (node03) is now successfully finishing approximately 20 MIP1 WUs per day
  • The other five have a 100% failure rate
  • All have their OS packages and BOINC configuration centrally managed; they are in sync
  • By chance, node03 has not been reinstalled in a long time, while the other five have
  • All nodes have the same versions of packages installed, but node03 has a few packages installed that the other nodes do not
  • The list of packages on node03 but not any other node is: autoconf, automake, gcc-fortran, hwloc, libevent, and make

And here are some possibilities/thoughts:

  • node03 may have never been suffering MIP failures; I noticed the problem on multiple nodes and disconnected from MIP1, but at this point I can't remember if I checked every node or not
  • I don't think the autotools or make could be a contributing factor. They're there because an older version of my node management software compiled BOINC from source, and they're not a dependency for executing compiled binaries
  • hwloc seems unlikely as well. I believe I installed that one manually to examine some RAM specifics a long while back
  • gcc-fortran and libevent could both be possible explanations. libevent's job is to execute callbacks when file descriptors change state, or on timeouts
  • gcc-fortran is my leading candidate though, because on Arch it has libmpc as a dependency
  • libmpc provides arbitrary-precision maths support for complex numbers

The crazy thing is that none of this should matter, to the best of my knowledge. I was under the impression that BOINC WUs were statically compiled. And if they weren't, then they should fail at runtime when a dynlib isn't found.

But after a few months of wondering WTF, it's the only thing I've been able to point to as a difference between machines where MIP works 100% of the time, and where it fails 100% of the time.

More to come...
[Edit 1 times, last edit by mdxi at Nov 11, 2020 8:20:36 PM]
[Nov 11, 2020 8:14:42 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 987
Re: Lots of MIP1 WUs error out

@mdxi
The crazy thing is that none of this should matter, to the best of my knowledge. I was under the impression that BOINC WUs were statically compiled. And if they weren't, then they should fail at runtime when a dynlib isn't found.

Nearly, but (unfortunately) not quite!

Yes, WCG programs are [mostly] statically linked but there's a mechanism for a program to request a dynamic library explicitly at run time, which means that the library won't show up as a requirement when you use ldd, even if the program isn't statically linked - man dlopen for some basics. It's a useful mechanism for plugins...
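
To make the mechanism concrete, here is a rough sketch using Python's ctypes, which calls dlopen() under the hood on Linux. The library name "libc.so.6" is an assumption about the system it runs on, so the helper (load_library, a name made up here) falls back to the symbols already present in the running process. This only illustrates run-time loading in general; it says nothing about what MIP1 actually does.

```python
import ctypes
import os


def load_library(candidates=("libc.so.6",)):
    """Load the first shared library that can be opened at run time."""
    for name in candidates:
        try:
            # Equivalent to dlopen(name, ...): resolved now, not at link time.
            return ctypes.CDLL(name)
        except OSError:
            continue
    # Equivalent to dlopen(NULL): use symbols already loaded in this process.
    return ctypes.CDLL(None)


libc = load_library()

if __name__ == "__main__":
    # The symbol is looked up by name at run time (like dlsym) and then called.
    print("getpid via run-time lookup:", libc.getpid())
    print("getpid via os module:      ", os.getpid())
```

If the named library were missing or an incompatible version, the failure would only show up when this code path is reached, not when the program starts.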

[The graphics programs I've checked are all dynamically linked, hence the "mostly" above!]

This issue showed up once at CPDN, where all the programs are 32-bit (FORTRAN, no less!) so one needs to have certain 32-bit libraries installed! However, some of the programs were failing at wrapup time on some machines because they were using a static library that pulled in a dynamic library (I think it was a file compression library, but I may be misremembering...) and if the 32-bit library was missing or the wrong version, crash! Fortunately, I never got bitten by that one...

So it's just faintly possible that MIP1 is doing that trick at some point; it might be worth using whatever package tools come with Arch to get an accurate list of the versions of absolutely everything installed on your node that seems to work and one that doesn't - sort the lists and diff them and you might find that your working node has a different version of something MIP1 needs but doesn't announce because it isn't dynamically bound at linkage time.
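
As a rough sketch of that comparison, assuming each node's package list was saved from `pacman -Q` (which prints one "name version" pair per line), something like this would do the sort-and-diff in one go. The function names and the version numbers in the sample data are made up for illustration.

```python
def parse_pacman_q(text):
    """Parse `pacman -Q` output ("name version" per line) into a dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages


def compare_nodes(good, bad):
    """Return packages only on the good node, only on the bad node, and version mismatches."""
    only_good = sorted(set(good) - set(bad))
    only_bad = sorted(set(bad) - set(good))
    changed = {
        name: (good[name], bad[name])
        for name in sorted(set(good) & set(bad))
        if good[name] != bad[name]
    }
    return only_good, only_bad, changed


if __name__ == "__main__":
    # Hypothetical sample data; real input would be the saved pacman -Q output.
    working = parse_pacman_q("glibc 2.32-5\nhwloc 2.3.0-1\nzlib 1:1.2.11-4\n")
    failing = parse_pacman_q("glibc 2.32-4\nzlib 1:1.2.11-4\n")
    only_good, only_bad, changed = compare_nodes(working, failing)
    print("only on working node:", only_good)
    print("only on failing node:", only_bad)
    print("version differences: ", changed)
```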

If you've already done the above check, my apologies for "suggesting the obvious" :-) It was just a thought...

Cheers - Al.
[Nov 12, 2020 5:46:34 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2174
Re: Lots of MIP1 WUs error out

Is trying to use 'strace' an option?

Here is a snippet of the output you might see:

write(2, "Options::initialize() Check spec"..., 34) = 34
write(2, "\n", 1) = 1
write(2, "Options::initialize() End reach"..., 34) = 34
write(2, "\n", 1) = 1
write(2, "Loaded options.... ok ", 22) = 22
write(2, "\n", 1) = 1
write(2, "Processed options.... ok ", 25) = 25
write(2, "\n", 1) = 1
open("/dev/urandom", O_RDONLY) = 4
read(4, "\34b\356\16\35\331B\21Gu\246\10\241\"v\267\347\3230\360\202\340\20\272p\27H>h\311\354\320"..., 8191) = 8191
close(4) = 0
write(1, "\33[0mcore.init: \33[0mRosetta versi"..., 280) = 280
write(2, "Initializing random generators.."..., 37) = 37
write(2, "\n", 1) = 1
time(NULL) = 1605185618 (2020-11-12T13:53:38+0100)
readlink("/proc/self/exe", "/var/lib/boinc/projects/www.worl"..., 1024) = 95
write(1, "\33[0mcore.init.random: \33[0mRandom"..., 370) = 370
write(2, "Initialization complete. ", 25) = 25
write(2, "\n", 1) = 1
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
setrlimit(RLIMIT_STACK, {rlim_cur=32768*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "\33[0mcore.init: \33[0m\n\33[0mcore.ini"..., 130) = 130
write(2, "Setting WU description ...", 26) = 26
write(2, "\n", 1) = 1
write(2, "Setting database description ...", 32) = 32
write(2, "\n", 1) = 1
write(2, "Setting up checkpointing ...", 28) = 28
write(2, "\n", 1) = 1
write(2, "Setting up graphics native ...", 30) = 30
write(2, "\n", 1) = 1
write(2, "set_shared_memory_fully_initiali"..., 39) = 39
write(2, "\n", 1) = 1
write(2, "abrelax ...", 11) = 11
write(2, "\n", 1) = 1
write(2, "abrelax.run", 11) = 11
write(2, "\n", 1) = 1
[Nov 12, 2020 1:00:40 PM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Re: Lots of MIP1 WUs error out

Thanks for the suggestion to use strace. I don't have a solution yet, but I do have a far more concrete look at the problem. I cloned a MIP1 WU out of a running slot, halted BOINC, and ran it manually (after doing lots of file edits to repoint the needed soft links).

Everything went normally for a few minutes, then the following sequence happens:
stat("rotamer/shapovalov/StpDwn_0-0-0", 0x7fffe43f8680) = -1 ENOENT (No such file or directory)
stat("./database/rotamer/shapovalov/StpDwn_0-0-0", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat("./database/rotamer/shapovalov/StpDwn_0-0-0/Dunbrack10.lib.bin", 0x7fffe43f8390) = -1 ENOENT (No such file or directory)
stat("./database/rotamer/shapovalov/StpDwn_0-0-0/Dunbrack10.lib.bin.gz", 0x7fffe43f8390) = -1 ENOENT (No such file or directory)

The WU goes looking for this data file, which doesn't exist.
getuid()                                = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(4) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
connect(4, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(4) = 0
open("/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=312, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa1bbf88000
read(4, "# Name Service Switch configurat"..., 4096) = 312
read(4, "", 4096) = 0
close(4) = 0
munmap(0x7fa1bbf88000, 4096) = 0
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=58424, ...}) = 0
mmap(NULL, 58424, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7fa1bbf7a000
close(4) = 0
open("/usr/lib/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P5\0\0\0\0\0\0"..., 832) = 832
fstat(4, {st_mode=S_IFREG|0755, st_size=51376, ...}) = 0
mmap(NULL, 79320, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7fa1bbf66000
mmap(0x7fa1bbf69000, 28672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x3000) = 0x7fa1bbf69000
mmap(0x7fa1bbf70000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0xa000) = 0x7fa1bbf70000
mmap(0x7fa1bbf72000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0xb000) = 0x7fa1bbf72000
mmap(0x7fa1bbf74000, 21976, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa1bbf74000
close(4) = 0

and needs to be fetched from the network
open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 4
read(4, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\202\2\0\0\0\0\0"..., 832) = 832
lseek(4, 64, SEEK_SET) = 64
read(4, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784) = 784
lseek(4, 848, SEEK_SET) = 848
read(4, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32
lseek(4, 880, SEEK_SET) = 880
read(4, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\207\360\21\247\344\314?\306\nT\320\323\335i\16t"..., 68) = 68
fstat(4, {st_mode=S_IFREG|0755, st_size=2159552, ...}) = 0
lseek(4, 64, SEEK_SET) = 64
read(4, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784) = 784
mmap(NULL, 1868448, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7fa1bb85e000
mmap(0x7fa1bb884000, 1363968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x26000) = 0x7fa1bb884000
mmap(0x7fa1bb9d1000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x173000) = 0x7fa1bb9d1000
mmap(0x7fa1bba1d000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x1be000) = 0x7fa1bba1d000
mmap(0x7fa1bba23000, 12960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa1bba23000
close(4) = 0
open("/usr/lib/ld-linux-x86-64.so.2", O_RDONLY|O_CLOEXEC) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220 \0\0\0\0\0\0"..., 832) = 832
fstat(4, {st_mode=S_IFREG|0755, st_size=207944, ...}) = 0
mmap(NULL, 188824, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7fa1bb82f000
mmap(0x7fa1bb831000, 135168, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x2000) = 0x7fa1bb831000
mmap(0x7fa1bb852000, 36864, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x23000) = 0x7fa1bb852000
mmap(0x7fa1bb85b000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x2b000) = 0x7fa1bb85b000
close(4) = 0

Some stuff is read from shared libraries, and then...
mprotect(0x7fa1bb85b000, 4096, PROT_READ) = 0
mprotect(0x7fa1bba1d000, 12288, PROT_READ) = 0
mprotect(0x7fa1bbf72000, 4096, PROT_READ) = 0
munmap(0x7fa1bbf7a000, 58424) = 0
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa1bb72f000
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=1066, ...}) = 0
lseek(4, 0, SEEK_SET)

.../etc/passwd is opened. Then on the first read...
read(4, "root:x:0:0::/root:/bin/bash\nbin:"..., 4096) = 1066
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xe5} ---
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_CONTINUE or TCSETSF, {B38400 opost isig icanon echo ...}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV (core dumped) +++

Boom.

So now I have a lot more data, but it continues to not really make sense. Why would reading /etc/passwd fail? Its perms are 0644 on all my machines. I guess I'll try it on another failing machine, just to make sure this is reproducible.
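
To compare traces across machines without reading every line, a small sketch like the one below could pull out just the shared libraries that were opened successfully and any fatal signals. The regular expressions are written against the strace lines quoted in this post and may need adjusting for other strace versions; summarize_strace is a name made up for the example. One hedged observation: glibc's name-service lookups load libnss_* modules at run time even in statically linked programs, which would fit the run-time loading mechanism alanb1951 described, though that is not a confirmed root cause.

```python
import re

# Matches open()/openat() calls on shared objects, e.g. "libc.so.6" (but not "ld.so.cache").
LIB_OPEN = re.compile(
    r'^open(?:at)?\((?:AT_FDCWD, )?"([^"]+\.so(?:\.[0-9]+)*)", [^)]*\) = (-?\d+)'
)
# Matches signal lines such as "--- SIGSEGV {si_signo=..., si_addr=0xe5} ---".
SIGNAL = re.compile(r'^--- (SIG[A-Z0-9]+) \{.*?si_addr=([^,}]+)')


def summarize_strace(lines):
    """Return (libraries opened successfully, [(signal, faulting address), ...])."""
    libs, signals = [], []
    for raw in lines:
        line = raw.strip()
        match = LIB_OPEN.match(line)
        if match:
            if int(match.group(2)) >= 0:  # a negative return value means the open failed
                libs.append(match.group(1))
            continue
        match = SIGNAL.match(line)
        if match:
            signals.append((match.group(1), match.group(2)))
    return libs, signals


if __name__ == "__main__":
    sample = [
        'open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4',
        'open("/usr/lib/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 4',
        'open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 4',
        '--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xe5} ---',
    ]
    libs, signals = summarize_strace(sample)
    print("libraries:", libs)
    print("signals:  ", signals)
```

Running this over the trace from a failing node and from the node where the same WU completed would show whether both load the same libraries before the crash point.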

Edit: continuing the theme of this problem making no sense, when I ran the exact same WU on another machine, it completed successfully. Mind you, this machine has had MIP1 failures (via BOINC) in the past 24 hours. I've been a sysadmin a long time, and am almost never in favor of "solving" problems by taking drastic measures when you don't yet have a root cause -- but this is just crazy. I'm going to reinstall one of my nodes and see what happens.
[Edit 2 times, last edit by mdxi at Nov 24, 2020 8:03:57 PM]
[Nov 24, 2020 7:08:21 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2174
Re: Lots of MIP1 WUs error out

Hi mdxi,
I have been reading your post and looked back through some earlier posts to understand what is going on here. First, I googled "SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xe5}". Since you stated that you compiled your own static executables from BOINC source, I looked around some more and found this page, "static compiled example exits with segfault" on discourse.itk.org, which I tried to relate to your problem. This led me to Creating Static Executables on Linux: "It is difficult to create distributable executables for Linux because of issues like incompatible C libraries and C++ standard libraries. Creating static executables avoids some of the dependencies, although it may not necessarily help with portability." That in turn led to Static Linking Considered Harmful. (DSO probably means Dynamic Shared Object there.)

After some reading and contemplating, I came up with this: have you tried creating a dynamic binary instead? (I have never done it myself, but that is what came to mind.) Although this may not be the solution per se, it might help point you in the right direction, for instance toward a problematic library.
[Nov 25, 2020 1:32:28 PM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Re: Lots of MIP1 WUs error out

Sorry for inadvertently creating confusion, but those were two separate things:

1) I did, in the past, compile BOINC itself from source. But I've used distro packages for over a year, and I never compiled it statically in any case.

2) The talk about static compilation was me saying that I believed that project binaries (as a concrete example in this case, the MIP1 binary, wcgrid_mip1_rosetta_7.16_i686-pc-linux-gnu) were statically compiled, because that has always been the safest way to ship a binary that you need to work on disparate systems.

Thanks for trying to help. I really appreciate it.
[Nov 25, 2020 5:31:56 PM]