World Community Grid Forums
Category: Completed Research Forum: Microbiome Immunity Project Thread: Lots of MIP1 WUs error out |
Thread Status: Active Total posts in this thread: 70
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline Project Badges: |
Okay, here is the end of the saga from my end.
Four of my six nodes are now successfully completing most (> 90%) of the MIP1 WUs they are sent. All of them still have an occasional segfault, but things are mostly okay now.
The fix (I am loath to call it a "solution", because I still don't understand exactly what the problem was) had two parts. One part was turning off XMP in the BIOS. The other was installing a flock of software packages, most of which seem to have nothing to do with the sorts of things a BOINC task would be doing (automake, autoconf, binutils, elfutils, gc, gcc, gcc-fortran, guile, make). Installing the packages alone did not fix the problem. Changing the RAM's BIOS settings alone did not fix it either. It took both to make WUs stop failing.
Finally, the commonality among the four nodes which are now doing MIP1 work is that they all have Ryzen 3900X CPUs. The two which are still failing, 100% of the time, have Ryzen 2700s. My fix there was to create a new profile for them that is not attached to the MIP1 project.
If anyone else feels like chasing down this problem further, I'd be very interested to read your findings, but I'm done now :) |
||
|
jay_Orlando
Senior Cruncher USA Joined: Jan 4, 2006 Post Count: 181 Status: Offline Project Badges: |
Hi there!!
I *finally* saw your post. I am having the same thing on an AMD cpu - but not Intel. I have now excluded MIP from the AMD machine. I will wait for someone in the project to take note. (I used only the Ubuntu packages and their .so files.)
I had posted under 'support': https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,43083
Good luck,
Jay
PS: I noticed a zip file error that recurs every year or so. It was fixed at the project too, I believe.
Jay |
||
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
"I am having the same thing on an AMD cpu - but not Intel. I have now excluded MIP from the AMD machine."
On one or two projects (I don't remember which ones) I would get segfaults on my Ryzen 1700, but not on my Ryzen 2700 (running Ubuntu 14.04/16.04). And I had one of the "fixed" Ryzen 1700s that was supposed to avoid the problem. So if you are running an older AMD CPU, there may be no help for it except to upgrade (or use Intel). |
||
|
geophi
Advanced Cruncher U.S. Joined: Sep 3, 2007 Post Count: 90 Status: Offline Project Badges: |
"So if you are running an older AMD CPU, there may be no help for it except to upgrade (or use Intel)."
He's running a Bulldozer 8150. I haven't run MIP in a while, but my Piledriver 8320 never had a stability problem with MIP when I was running 2 or 3 at a time. That was on CentOS 7, though. |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7574 Status: Offline Project Badges: |
I am going to post in this thread about MIP errors, but the underlying cause is different from before. I have at least a page of errors all saying "WU download error: couldn't get input files". This has occurred across several of my machines, only sporadically until today. None of the other projects I am running have this problem, just MIP, so I don't think this is a problem on my end. Just wondering if anyone else might be seeing this.
Cheers
Edit: I have now cut off all work for MIP, as I have 18 units which errored on March 18 although it is only a couple of hours into the 18th. It is the same error and it is on all my machines. Other projects, mostly OPN, are receiving units normally. The problem appeared to start at about 17:00 US Central time (22:00 UTC) on March 17. I am really curious whether any others have experienced this problem, or whether the techs have noticed it.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at Mar 18, 2021 2:21:57 AM] |
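For anyone lining up their own error timestamps against the start time given above, here is a small sketch of the conversion. It assumes the 17:00 figure was read off a clock on US Central daylight time (UTC-5), which is what the quoted 22:00 UTC implies; strict Central Standard Time (UTC-6) would give 23:00 UTC instead. The variable and function names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the local clock was on US Central daylight time (UTC-5),
# which was in effect on March 17, 2021.
CENTRAL_DAYLIGHT = timezone(timedelta(hours=-5), "CDT")


def to_utc(local: datetime) -> datetime:
    """Convert a timezone-aware local timestamp to UTC."""
    return local.astimezone(timezone.utc)


first_failure_local = datetime(2021, 3, 17, 17, 0, tzinfo=CENTRAL_DAYLIGHT)
first_failure_utc = to_utc(first_failure_local)
print(first_failure_utc.isoformat())
```

Timestamps from a BOINC event log in another local zone can be converted the same way by swapping in that zone's offset.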
||
|
dondee
Advanced Cruncher Joined: Jan 16, 2006 Post Count: 100 Status: Offline Project Badges: |
Sgt.Joe,
I ran MIP before and quit. I decided to try again, and at first everything looked good, so I went full speed ahead on both machines. I was having the same problem as you, but on only one of my machines. There are more errors on this machine, and the "WU download error: couldn't get input files:" errors are new. They were around for about two or three days and have since ceased. The other machine has very few errors and none of the input-file errors.
I don't understand what is going on with this project; I have no problems with other projects. Also, the machine with most of the errors is dedicated, while the machine with few errors also runs other programs for email, internet, games, etc. Both machines run 24/7. Both machines are basically the same: the motherboards have only minor differences and come from the same manufacturer, and the CPUs are the same. The memory is the same speed, brand and model. The main drives are the same.
I have run other projects with very little to no trouble. I have quit this project over these errors before and am thinking of moving to another project again. I will stay with this project for a while longer and see what happens, but errors are a waste of production time that could be benefiting co-operative projects.
dondee
[Edit 1 times, last edit by dondee at Mar 23, 2021 4:04:42 AM] |
||
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges: |
I have an AMD FX 4100 on an Asus M5A97 R2.0, running Debian Linux. Bad RAM was the cause of my random errors and invalids.
A few months ago I was getting random MIP1 computation errors and random ARP1 invalids. I then ran the Linux command memtester 16G, and 20 seconds later the computer froze and became unresponsive. This looks like old DDR3 memory going bad. The old memory was 32GB (4x8GB) of non-ECC DDR3. About two months ago I got new DDR3L ECC UDIMM (unbuffered) memory, 32GB (4x8GB), and it works much better on the old AMD FX 4100. Linux memtester 24G passes for hours. MIP1 works with no more random errors, and ARP1 results are now all valid. To those with random errors and/or invalids: I suggest testing your computer's memory and replacing faulty memory with new memory. |
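The memory check described above could be scripted roughly as follows. This is only a sketch: it assumes the `memtester` utility is installed and invoked as `memtester <size> <loops>`, and that an exit status of 0 means no errors were found. The helper names `build_memtester_cmd` and `run_memtester` are made up for illustration, and memtester typically needs root privileges to lock that much memory.

```python
import shutil
import subprocess


def build_memtester_cmd(size: str, loops: int = 1) -> list[str]:
    """Return the argument list for `memtester <size> <loops>`."""
    if loops < 1:
        raise ValueError("loops must be at least 1")
    return ["memtester", size, str(loops)]


def run_memtester(size: str, loops: int = 1) -> bool:
    """Run memtester and return True if it exits with status 0.

    Assumption: a non-zero exit status indicates a failed test or a
    problem allocating or locking the requested memory.
    """
    if shutil.which("memtester") is None:
        raise FileNotFoundError("memtester is not installed or not on PATH")
    result = subprocess.run(build_memtester_cmd(size, loops))
    return result.returncode == 0
```

Calling `run_memtester("16G")` would correspond to the `memtester 16G` run mentioned in the post, limited to a single loop. Choose a size that leaves room for the operating system, and stop BOINC first so the test can get the memory it asks for.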
||
|
SolidAir79
Cruncher Joined: Dec 1, 2018 Post Count: 3 Status: Offline Project Badges: |
Hi all, wondering if you could help out. I'm getting errors on two of my Linux machines:
Result Log
Result Name: MIP1_00332990_4608_0--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
[2021- 5- 3 8: 3:14:] :: BOINC:: Initializing ... ok.
[2021- 5- 3 8: 3:14:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.16_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00332990.flags -out::file::silent result_silent.out -run:jran 1528695177 -nstruct 4 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Sequence Length = 204
Starting work on structure: _0001
</stderr_txt>
]]>
Regards
SolidAir |
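When comparing many result logs like the one above, it can help to pull out just the signal number and the last structure the task reached. The following is a rough sketch that relies only on the two phrases visible in this log ("process got signal" and "Starting work on structure:"). The function name and the small signal table are illustrative assumptions, and the table covers only a few common Linux x86-64 signal numbers.

```python
import re

# Partial lookup of Linux x86-64 signal numbers; not exhaustive.
LINUX_SIGNALS = {
    6: "SIGABRT",
    7: "SIGBUS",
    8: "SIGFPE",
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}

SIGNAL_RE = re.compile(r"process got signal\s+(\d+)")
STRUCT_RE = re.compile(r"Starting work on structure:\s*(\S+)")


def summarize_result_log(log_text: str) -> dict:
    """Extract the fatal signal and the last structure started from a result log."""
    match = SIGNAL_RE.search(log_text)
    number = int(match.group(1)) if match else None
    structures = STRUCT_RE.findall(log_text)
    return {
        "signal": number,
        "signal_name": LINUX_SIGNALS.get(number, "unknown") if number is not None else None,
        "last_structure": structures[-1] if structures else None,
    }


sample = (
    "<message> process got signal 11</message> ... "
    "Starting work on structure: _0001 </stderr_txt>"
)
print(summarize_result_log(sample))
```

For the log posted here this reports signal 11 (SIGSEGV, a segmentation fault) during the first structure, `_0001`, which matches the segfaults described earlier in the thread.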
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7574 Status: Offline Project Badges: |
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
||
|
jay_Orlando
Senior Cruncher USA Joined: Jan 4, 2006 Post Count: 181 Status: Offline Project Badges: |
Greetings!!
I run Linux on two machines. I recently upgraded one machine from Ubuntu 20.04 (Long Term Support) to Ubuntu 21.04, and I noticed the MIP1 errors. Lo and behold: the errors were/are on both machines. This makes me think the problem is the build of the WU. This has happened before. Any news from that front?
Jay |
||
|
|