Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Re: High error rate on Ryzen

I am still considering that MIP1 is really not OK.
There is no acceptable justification for negative interactions between projects (except regarding performance).

It does seem to work the other way too. That is, I started running Ebola again after several days without it, along with MIP. I got a few errors on Ebola (about one out of ten), but also for the first time I picked up three errors on MIP. I don't recall ever seeing that before when they run by themselves.

One thing I do a little differently than most people is that I run Folding on a GPU (GTX 1070), and reserve a CPU core for it. Normally, there is no interaction with the BOINC projects, but I have seen it on rare occasions. However, I consider it high priority, and will not be stopping it to check. So my experience may not be the same as for others, but yes, MIP is a little more problematic than some.

The MIP errors were all the same:
<core_client_version>7.12.0</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
[2018-10-18 20:43:41:] :: BOINC:: Initializing ... ok.
[2018-10-18 20:43:41:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00109580.flags -out::file::silent result_silent.out -run:jran 1309508479 -nstruct 3 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Sequence Length = 223
Starting work on structure: _0001
Finished _0001 in 1984.02 seconds.
Starting work on structure: _0002

</stderr_txt>
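
For readers triaging the same failure: on Linux, signal 11 is SIGSEGV, a segmentation fault (the process touched memory it was not allowed to access). The log also shows how far the task got before it died. The sketch below is only an illustration, not part of BOINC or Rosetta; the helper names `describe_signal` and `unfinished_structures` are hypothetical, and the sample text is a shortened excerpt of the log above.

```python
import re
import signal

# Shortened excerpt of the stderr text quoted in the post above.
STDERR_EXCERPT = """\
process got signal 11
Sequence Length = 223
Starting work on structure: _0001
Finished _0001 in 1984.02 seconds.
Starting work on structure: _0002
"""


def describe_signal(text):
    """Return (number, name) for the first 'process got signal N' line, or None."""
    match = re.search(r"process got signal (\d+)", text)
    if match is None:
        return None
    number = int(match.group(1))
    # signal.Signals maps the number to the platform's symbolic name.
    return number, signal.Signals(number).name


def unfinished_structures(text):
    """Return structures that were started but have no matching 'Finished' line."""
    started = re.findall(r"Starting work on structure: (\S+)", text)
    finished = set(re.findall(r"Finished (\S+) in", text))
    return [name for name in started if name not in finished]


print(describe_signal(STDERR_EXCERPT))
print(unfinished_structures(STDERR_EXCERPT))
```

Run against the excerpt, this reports the signal by name and shows that the task crashed while working on the second of its three structures (`-nstruct 3`), after the first had completed normally.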

----------------------------------------
[Edit 1 times, last edit by Jim1348 at Oct 20, 2018 1:44:46 PM]
[Oct 20, 2018 1:40:39 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7633
Re: High error rate on Ryzen

process got signal 11</message>

This is the key message. When I have gotten this message in the distant past, it was an indication that some part of the machine was bottlenecked in some way. Some processes were competing for the same resource at the same time, and the software was not able to handle the conflict smoothly. I don't know where in the system the conflict may have occurred, but it only happened with CEP2 (which was very resource-intensive) on one system.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Oct 20, 2018 3:40:44 PM]
Jim1348
Re: High error rate on Ryzen

I don't know where in the system the conflict may have occurred, but it only happened with CEP2 (which was very resource-intensive) on one system.

Interesting. I started using a ramdisk or write-cache back in the CEP2 days (to protect the SSD), and it generally avoided problems. I still use a 12 GB write cache on the Ryzen 1700 (Ubuntu 18.04). But with 32 GB main memory, I still have 22 GB free at the moment (not all the cache is used). That is with four MIP running, and all the other cores busy on WCG or Folding/GPU. So it is probably some resource other than memory in conflict, though I have no idea what.

I actually limit MIP to four at a time with an app_config to prevent problems with run times; they are currently averaging 1 hour 30 minutes, and the maximum is under 3 hours. Maybe I will try limiting them to three at a time, though I think I will just drop Ebola first. It is near the end of its run, and they don't really need me now.

Thanks.
[Oct 20, 2018 4:23:42 PM]
Jim1348
Re: High error rate on Ryzen

I picked up another Signal 11 error on MIP this morning, even though I was not running any OET (Outsmart Ebola Together). So I will limit MIP to running three at a time with the app_config, and also set "Number of workunits per host for the Microbiome Immunity Project?" to 12, in order to prevent too many from downloading. That should fix it.
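
For anyone wanting to try the same limit, here is a minimal sketch of generating such an app_config.xml. `max_concurrent` is a standard BOINC app_config element; the app name `mip1` is an assumption inferred from the executable name in the log (`wcgrid_mip1_rosetta_...`), and `build_app_config` is a hypothetical helper, not anything the poster used. The resulting file would go in the project directory (projects/www.worldcommunitygrid.org/ in the log above) and be picked up when the client re-reads its config files.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Build app_config.xml text limiting one app to max_concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # assumed short name, e.g. "mip1"
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # ET.indent is available from Python 3.9; skip pretty-printing on older versions.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Limit MIP to three simultaneous tasks, as described in the post.
print(build_app_config("mip1", 3))
```

Note that `max_concurrent` only caps how many tasks run at once; the number downloaded per host is still governed by the device profile setting mentioned above.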

I have never gotten an MIP error on my i7-4771 (Win 7 64-bit), though they are usually limited to running two at a time there. But it could be the Intel machines are more resistant to MIP errors. I will be building a Ryzen 2700 shortly, and will see how it goes there.
[Oct 21, 2018 11:32:47 AM]