World Community Grid Forums
Thread Status: Active Total posts in this thread: 70
|
Pekarius
Cruncher Joined: Jun 8, 2016 Post Count: 2 Status: Offline
Hi, I have also had problems with MIP1 WUs for about a month or so. Every time they end with "signal 11". I have never had any problems with other projects and tasks.
Manjaro Linux (x64), kernel 5.8.18-1 and 5.9.1_rt19-1
BOINC 7.16.10 and 7.16.11
Intel G4600 / 4GB RAM / 8GB swap
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7697 Status: Recently Active
> Hi, I also have problems with MIP1 WUs for about a month or so. Every time they end with "signal 11". Never had any problems with other projects and tasks. Manjaro linux (x64) Kernel 5.8.18-1 and 5.9.1_rt19-1 Boinc 7.16.10 and 7.16.11 Intel G4600 / 4GB RAM / 8GB swap

In my experience, "signal 11" problems indicate a bottleneck somewhere. How many of the MIP units are running at a time? Try running only one at a time, then gradually increase the number until failures start, to see whether this alleviates your problem.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Pekarius
Cruncher Joined: Jun 8, 2016 Post Count: 2 Status: Offline
I've just tried BOINC with a single MIP WU. It ended with an error at about 4%.
Terminal output (first line translated from Czech):
mv: could not get information about 'slots/0/result_silent.out': Folder or file does not exist
06-Nov-2020 10:59:52 [World Community Grid] Computation for task MIP1_00323822_0466_0 finished
06-Nov-2020 10:59:52 [World Community Grid] Output file MIP1_00323822_0466_0_r115058097_0 for task MIP1_00323822_0466_0 absent
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline
Suddenly, I have MIP1 WUs succeeding again. I last updated the OS packages on my farm 9 days 14 hours ago, which would have been Nov 1st. A couple of days ago I accidentally re-enabled all projects on the WCG website. And just now I've discovered that in the past 46 hours I've had 48 MIP1 WUs processed successfully.

Before this, my last successful one was timestamped 1598730443, which is Saturday, August 29, 2020 7:47:23 PM. I'm now thinking the kernel is the culprit, since Arch switched to a 5.9 series kernel less than a month ago.

There are more WUs running on one of my machines right now, at 23% and 7% complete. I'll report back on how they go.

Edit/update: Confirmed. Here they are running:
9 MIP1_00324451_0436_0 Run 48.00% 27m27s 9d12h
and here they are in the joblog:
1604992531 ue 3168.398600 ct 2580.507000 fe 21032005497806 nm MIP1_00324451_0436_0 et 2599.503068 es 0

[Edit 1 times, last edit by mdxi at Nov 10, 2020 7:26:32 AM]
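
The epoch timestamp and the joblog line quoted above can be decoded with a few lines of Python. This is only a sketch: it assumes the joblog line is a Unix timestamp followed by key/value pairs (commonly read as ue = estimated runtime, ct = CPU time, fe = estimated FLOPs, nm = task name, et = elapsed time, es = exit status), and the helper name `parse_joblog_line` is made up for illustration.

```python
from datetime import datetime, timezone

def parse_joblog_line(line):
    """Split one joblog line into a dict (hypothetical helper, not part of BOINC)."""
    tokens = line.split()
    # First token: completion time as seconds since the Unix epoch.
    record = {"finished": datetime.fromtimestamp(int(tokens[0]), tz=timezone.utc)}
    # Remaining tokens come in key/value pairs: ue, ct, fe, nm, et, es.
    for key, value in zip(tokens[1::2], tokens[2::2]):
        record[key] = value
    return record

line = ("1604992531 ue 3168.398600 ct 2580.507000 fe 21032005497806 "
        "nm MIP1_00324451_0436_0 et 2599.503068 es 0")
rec = parse_joblog_line(line)
print(rec["nm"], rec["finished"].isoformat(), "exit status", rec["es"])

# The last earlier success mentioned in the post.
last_ok = datetime.fromtimestamp(1598730443, tz=timezone.utc)
print(last_ok.isoformat())
```

Both values are interpreted as UTC here; an exit status of 0 in the `es` field is what a successful task is expected to show.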
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline
It's not the kernel: MIP WUs are only succeeding on one of my six x86_64 nodes. This problem just gets weirder the longer I look at it. Here's the facts:
And here's some possibilities/thoughts:
The crazy thing is that none of this should matter, to the best of my knowledge. I was under the impression that BOINC WUs were statically compiled. And if they weren't, then they should fail at runtime when a dynlib isn't found. But after a few months of wondering WTF, it's the only thing I've been able to point to as a difference between the machines where MIP works 100% of the time and those where it fails 100% of the time. More to come...

[Edit 1 times, last edit by mdxi at Nov 11, 2020 8:20:36 PM]
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 987 Status: Offline
@mdxi
> The crazy thing is that none of this should matter, to the best of my knowledge. I was under the impression that BOINC WUs were statically compiled. And if they weren't, then they should fail at runtime when a dynlib isn't found.

Nearly, but (unfortunately) not quite! Yes, WCG programs are [mostly] statically linked, but there's a mechanism for a program to request a dynamic library explicitly at run time, which means that the library won't show up as a requirement when you use ldd, even if the program isn't statically linked. See man dlopen for some basics. It's a useful mechanism for plugins... [The graphics programs I've checked are all dynamically linked, hence the "mostly" above!]

This issue showed up once at CPDN, where all the programs are 32-bit (FORTRAN, no less!), so one needs to have certain 32-bit libraries installed! However, some of the programs were failing at wrapup time on some machines because they were using a static library that pulled in a dynamic library (I think it was a file compression library, but I may be misremembering...) and if the 32-bit library was missing or the wrong version, crash! Fortunately, I never got bitten by that one...

So it's just faintly possible that MIP1 is doing that trick at some point. It might be worth using whatever package tools come with Arch to get an accurate list of the versions of absolutely everything installed on a node that seems to work and on one that doesn't. Sort the lists and diff them, and you might find that your working node has a different version of something MIP1 needs but doesn't announce because it isn't dynamically bound at linkage time.

If you've already done the above check, my apologies for "suggesting the obvious" :-) It was just a thought...

Cheers - Al.
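
The sort-and-diff comparison suggested above could be sketched roughly as follows. This assumes each node's package list has been saved as plain text with one "name version" pair per line (the shape that `pacman -Q` prints on Arch); the function names and the sample package versions are invented for illustration and are not taken from anyone's machines.

```python
def parse_package_list(text):
    """Map package name -> version from lines of the form 'name version'."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages

def diff_packages(working, failing):
    """Return {name: (version_on_working, version_on_failing)} for every mismatch.

    A version of None means the package is missing on that node.
    """
    names = sorted(set(working) | set(failing))
    return {
        name: (working.get(name), failing.get(name))
        for name in names
        if working.get(name) != failing.get(name)
    }

# Made-up sample data standing in for two saved package lists.
working_node = parse_package_list("glibc 2.32-4\nzlib 1:1.2.11-4\n")
failing_node = parse_package_list(
    "glibc 2.32-5\nzlib 1:1.2.11-4\nlib32-glibc 2.32-5\n"
)

for name, (ok, bad) in diff_packages(working_node, failing_node).items():
    print(f"{name}: working={ok} failing={bad}")
```

Only packages whose versions differ, or that exist on just one node, are reported, which keeps the output short enough to eyeball.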
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2174 Status: Offline
Is trying to use 'strace' an option?
Here is a snippet of the output you might see:
write(2, "Options::initialize() Check spec"..., 34) = 34
write(2, "\n", 1) = 1
write(2, "Options::initialize() End reach"..., 34) = 34
write(2, "\n", 1) = 1
write(2, "Loaded options.... ok ", 22) = 22
write(2, "\n", 1) = 1
write(2, "Processed options.... ok ", 25) = 25
write(2, "\n", 1) = 1
open("/dev/urandom", O_RDONLY) = 4
read(4, "\34b\356\16\35\331B\21Gu\246\10\241\"v\267\347\3230\360\202\340\20\272p\27H>h\311\354\320"..., 8191) = 8191
close(4) = 0
write(1, "\33[0mcore.init: \33[0mRosetta versi"..., 280) = 280
write(2, "Initializing random generators.."..., 37) = 37
write(2, "\n", 1) = 1
time(NULL) = 1605185618 (2020-11-12T13:53:38+0100)
readlink("/proc/self/exe", "/var/lib/boinc/projects/www.worl"..., 1024) = 95
write(1, "\33[0mcore.init.random: \33[0mRandom"..., 370) = 370
write(2, "Initialization complete. ", 25) = 25
write(2, "\n", 1) = 1
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
setrlimit(RLIMIT_STACK, {rlim_cur=32768*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "\33[0mcore.init: \33[0m\n\33[0mcore.ini"..., 130) = 130
write(2, "Setting WU description ...", 26) = 26
write(2, "\n", 1) = 1
write(2, "Setting database description ...", 32) = 32
write(2, "\n", 1) = 1
write(2, "Setting up checkpointing ...", 28) = 28
write(2, "\n", 1) = 1
write(2, "Setting up graphics native ...", 30) = 30
write(2, "\n", 1) = 1
write(2, "set_shared_memory_fully_initiali"..., 39) = 39
write(2, "\n", 1) = 1
write(2, "abrelax ...", 11) = 11
write(2, "\n", 1) = 1
write(2, "abrelax.run", 11) = 11
write(2, "\n", 1) = 1
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline
Thanks for the suggestion to use strace. I don't have a solution yet, but I do have a far more concrete look at the problem. I cloned a MIP1 WU out of a running slot, halted BOINC, and ran it manually (after doing lots of file edits to repoint the needed soft links).

Everything went normally for a few minutes, then the following sequence happens:
stat("rotamer/shapovalov/StpDwn_0-0-0", 0x7fffe43f8680) = -1 ENOENT (No such file or directory)
The WU goes looking for this data file, which doesn't exist.
getuid() = 0
and needs to be fetched from the network
open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 4
Some stuff is read from shared libraries, and then...
mprotect(0x7fa1bb85b000, 4096, PROT_READ) = 0
.../etc/passwd is memory-mapped. Then on the first read...
read(4, "root:x:0:0::/root:/bin/bash\nbin:"..., 4096) = 1066
Boom.

So now I have a lot more data, but it continues to not really make sense. Why would reading /etc/passwd fail? Its perms are 0644 on all my machines. I guess I'll try it on another failing machine, just to make sure this is reproducible.

Edit: continuing the theme of this problem making no sense, when I ran the exact same WU on another machine, it completed successfully. Mind you, this machine has had MIP1 failures (via BOINC) in the past 24 hours. I've been a sysadmin a long time, and am almost never in favor of "solving" problems by taking drastic measures when you don't yet have a root cause -- but this is just crazy. I'm going to reinstall one of my nodes and see what happens.

[Edit 2 times, last edit by mdxi at Nov 24, 2020 8:03:57 PM]
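
A long strace log is easier to read once the lines that actually report trouble are pulled out. The sketch below is one possible way to do that in Python, assuming a log without PID prefixes (that is, not recorded with -f); the helper name `interesting_lines` and the two regular expressions are illustrative only. The sample lines are the ones quoted in this thread, with the SIGSEGV details wrapped in the `---` markers strace uses when it reports a signal.

```python
import re

# A syscall that returned -1, e.g.
#   stat("path", 0x7ff...) = -1 ENOENT (No such file or directory)
FAILED_CALL = re.compile(r'^(?P<call>\w+)\((?P<args>.*)\)\s+=\s+-1\s+(?P<errno>E\w+)')
# A signal report, e.g.  --- SIGSEGV {si_signo=SIGSEGV, ...} ---
SIGNAL = re.compile(r'^--- (?P<signal>SIG\w+) (?P<info>\{.*\}) ---')

def interesting_lines(lines):
    """Return (kind, name, detail) tuples for failed syscalls and signals."""
    events = []
    for line in lines:
        line = line.strip()
        m = FAILED_CALL.match(line)
        if m:
            events.append(("error", m["call"], m["errno"]))
            continue
        m = SIGNAL.match(line)
        if m:
            events.append(("signal", m["signal"], m["info"]))
    return events

sample = [
    r'stat("rotamer/shapovalov/StpDwn_0-0-0", 0x7fffe43f8680) = -1 ENOENT (No such file or directory)',
    r'getuid() = 0',
    r'open("/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 4',
    r'mprotect(0x7fa1bb85b000, 4096, PROT_READ) = 0',
    r'read(4, "root:x:0:0::/root:/bin/bash\nbin:"..., 4096) = 1066',
    r'--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xe5} ---',
]

for event in interesting_lines(sample):
    print(event)
```

On these sample lines the filter flags only the stat call and the signal. The read on /etc/passwd returns 1066, so the read itself completed, and whatever triggers the crash comes after it.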
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2174 Status: Offline
Hi mdxi,

I have been reading your post and looked back through some earlier posts to understand what is going on here. First, I googled "SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xe5}". Since you stated that you compiled your own static executables from BOINC source, I looked around some more and, trying to relate it to your problem, found this page: "static compiled example exits with segfault" on discourse.itk.org.

This led me to Creating Static Executables on Linux: "It is difficult to create distributable executables for Linux because of issues like incompatible C libraries and C++ standard libraries. Creating static executables avoids some of the dependencies, although it may not necessarily help with portability." That in turn led to Static Linking Considered Harmful. (DSO probably means Dynamic Shared Object there.)

After some reading and contemplating I came up with this: have you tried creating a dynamic binary instead? (I never did it myself, but that is what I thought.) Although this may not be the solution per se, it might help point you in the right direction (say, toward a problematic library being involved).
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline
Sorry for inadvertently creating confusion, but those were two separate things:

1) I did, in the past, compile BOINC itself from source. But I've used distro packages for over a year, and I never compiled it statically in any case.

2) The talk about static compilation was me saying that I believed that project binaries (as a concrete example in this case, the MIP1 binary, wcgrid_mip1_rosetta_7.16_i686-pc-linux-gnu) were statically compiled, because that has always been the safest way to ship a binary that you need to work on disparate systems.

Thanks for trying to help. I really appreciate it.
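
Whether a project binary really is statically linked can be checked from its ELF program headers. The sketch below is a heuristic only: it treats a binary as static when it has neither a PT_INTERP entry (a request for the runtime loader) nor a PT_DYNAMIC entry, which misclassifies unusual cases such as static-pie builds, and it cannot see libraries requested later through dlopen. The names `elf_program_header_types`, `looks_static`, and `fake_elf64` are made up for this example, and the fake images exist only so the block runs without a real binary at hand.

```python
import struct

PT_DYNAMIC = 2  # program header type: dynamic linking information
PT_INTERP = 3   # program header type: path of the runtime loader

def elf_program_header_types(data):
    """Return the p_type value of every program header in an ELF image."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                 # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little endian
    if is_64bit:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 0x20)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x36)
    else:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 0x1C)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x2A)
    # p_type is the first 32-bit field of each program header in both classes.
    return [
        struct.unpack_from(endian + "I", data, e_phoff + i * e_phentsize)[0]
        for i in range(e_phnum)
    ]

def looks_static(data):
    """Heuristic: no loader request and no dynamic section."""
    types = elf_program_header_types(data)
    return PT_INTERP not in types and PT_DYNAMIC not in types

def fake_elf64(p_types):
    """Build a minimal little-endian ELF64 image for demonstration only."""
    header = bytearray(64)
    header[:6] = b"\x7fELF\x02\x01"
    struct.pack_into("<Q", header, 0x20, 64)                  # e_phoff
    struct.pack_into("<HH", header, 0x36, 56, len(p_types))   # e_phentsize, e_phnum
    body = b"".join(struct.pack("<I", t) + bytes(52) for t in p_types)
    return bytes(header) + body

print(looks_static(fake_elf64([1])))        # only PT_LOAD
print(looks_static(fake_elf64([1, 3, 2])))  # PT_LOAD, PT_INTERP, PT_DYNAMIC
```

To try it on a real file, read the file in binary mode and pass the bytes to `looks_static`.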
||