World Community Grid Forums
Category: Completed Research Forum: Microbiome Immunity Project Thread: MIP units error on Linux |
Thread Status: Active Total posts in this thread: 32
katoda
Senior Cruncher Poland Joined: Apr 28, 2007 Post Count: 170 Status: Offline Project Badges: |
Will try to do that after finishing the work currently in progress. |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2084 Status: Recently Active Project Badges: |
Four days ago, I received 32 BETA-WUs on one device at the same time (8/19/17 01:08:40) of which one errored out with "process got signal 11". One other WU errored out with "finish file present too long". The other 30 BETA-WUs all went Valid.
My wingmen - for the WUs that went into Error - were successful afterwards: they got Valids. |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
So far all WUs returned are without error, on Windows (32/64bit), Linux (Mint 18/64). Got a few on a remote Mac with OS X 10.6.8, but those haven't been returned yet...
Seems to me it is running better than the recent Beta...
Ralf |
||
|
widdershins
Veteran Cruncher Scotland Joined: Apr 30, 2007 Post Count: 674 Status: Offline Project Badges: |
Ironically I'm having the opposite issue: the ancient BOINC app running on my clapped-out old Ubuntu 10 VM, which hasn't seen a patch in years and runs inside a BSD box, is turning in 100% OK results, as is normal for it. The much more modern, fully patched Win 7 Pro box is turning in a 100% error rate, but only on these work units.
However excluding MIP, all boxes show 99.9% valids (1 Scc unit threw an error last week). A single error in any science is rare for my machines, a 100% error rate is unheard of. So I think the problem lies with the science application mainly... |
||
|
guhsoftware
Cruncher Germany Joined: Nov 23, 2005 Post Count: 4 Status: Offline Project Badges: |
As in the beta, I see signal 11 on my two RHEL 6 machines. No overclocking; these systems have been running rock solid for many months now and have returned quite a few valid results for other projects.
Result Name: MIP1_ 00000076_ 0224_ 0--
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
[2017- 8-24 0: 2: 2:] :: BOINC:: Initializing ... ok.
[2017- 8-24 0: 2: 2:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.11_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00000076.flags -out::file::silent result_silent.out -run:jran 955047679 -nstruct 26 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Sequence Length = 40
Starting work on structure: _0001
</stderr_txt>
]]> |
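For anyone comparing several failed results, a rough sketch of how the signal line could be pulled out of a saved result log. The function name `find_signal` and the file name `result_log.txt` are made up for illustration; only the "process got signal N" wording comes from the log above.

```shell
# Print every "process got signal N" line found in a result log read from stdin.
# Assumes the message uses the same wording as the log quoted above.
find_signal() {
    grep -oE 'process got signal [0-9]+'
}

# Usage (result_log.txt is a hypothetical file holding the pasted log):
#   find_signal < result_log.txt
# On Linux, `kill -l 11` names the signal: SEGV, i.e. a segmentation fault.
```

This only tells you which signal killed the task, not why it was raised.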
||
|
guhsoftware
Cruncher Germany Joined: Nov 23, 2005 Post Count: 4 Status: Offline Project Badges: |
I did let the machine run dry. Did a reset project. Resumed work.
If I go into
projects/www.worldcommunitygrid.org
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08
cat stderrgfx.txt
07:17:25 (20310): Can't open init data file - running in standalone mode
SIGSEGV: segmentation violation
Stack trace (12 frames):
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08(boinc_catch_signal+0x4d)[0x49859d]
/lib64/libpthread.so.0(+0xf5e0)[0x7f505b37d5e0]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x4393b4]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x4935ca]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x4938e8]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x493a39]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x4affa6]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x4b0825]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x49384d]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x43ba57]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f505afccc05]
./wcgrid_mdds_gfx_prod_linux_64.x86.7.08[0x4381a9]
Exiting...
This might be expected without parameters, but it looks dubious to me. I can reproduce it on my two RHEL 6 hosts and a CentOS 6 host. Let me know if I can help troubleshoot this further. |
||
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Very hmmm. That's the graphics app part. Years ago it was split out from the main app in BOINC to prevent tasks crashing if the candy part goes down; a graphics crash then gets logged to the stderrgfx.txt file in the main data directory, not the job slot. I haven't even got such a file anywhere on the machines running MIP1 on W10 and Ubuntu 16.04 LTS, though I never tried viewing the graphics on the Ubuntu system. Will try that tonight. Repeat, graphics failing is not supposed to crash the main science app / task.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quote: If I go into projects/www.worldcommunitygrid.org ./wcgrid_mdds_gfx_prod_linux_64.x86.7.08
@guhsoftware mdds isn't mip1 (it was a long-ago beta) |
||
|
guhsoftware
Cruncher Germany Joined: Nov 23, 2005 Post Count: 4 Status: Offline Project Badges: |
Additional information for the hosts:
These are virtual machines running on VMware vSphere 6.5. Just the commandline "boinc" is running in an xterm. |
||
|
katoda
Senior Cruncher Poland Joined: Apr 28, 2007 Post Count: 170 Status: Offline Project Badges: |
I tried to run the graphics part of MIP1 (wcgrid_mip1_gfx_7.11_x86_64-pc-linux-gnu) and got the following error:
./wcgrid_mip1_gfx_7.11_x86_64-pc-linux-gnu: error while loading shared libraries: libglut.so.3: cannot open shared object file: No such file or directory
So apparently there is a problem with the "candy" part of the science application. I'm wondering, despite SekeRob's statement that it should not impact the main science application, whether our problem is somehow linked with it.
EDIT: and, just like @guhsoftware, I run BOINC in a terminal.
[Edit 1 times, last edit by katoda at Aug 25, 2017 7:42:11 AM] |
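A quick way to see which shared libraries the graphics binary cannot resolve is to filter `ldd` output. This is only a sketch: the helper name `missing_libs` is made up, and the package names in the comment (freeglut3 on Ubuntu/Mint, freeglut on RHEL/CentOS) are an assumption about what usually provides libglut.so.3, not something confirmed in this thread.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it also works on output saved from another host.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (binary name taken from the post above):
#   ldd ./wcgrid_mip1_gfx_7.11_x86_64-pc-linux-gnu | missing_libs
# If libglut.so.3 is listed, installing the distribution's freeglut package
# (assumed: freeglut3 on Ubuntu/Mint, freeglut on RHEL/CentOS) may resolve it.
```

Even if this fixes the graphics binary, it would not by itself explain the signal 11 in the science app.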
||
|
|