World Community Grid Forums
Category: Completed Research Forum: Outsmart Ebola Together Thread: High error rate on Ryzen |
Thread Status: Active Total posts in this thread: 14
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
I am getting about as many errors as valid OET1 work units on my Ryzen 1700 running Ubuntu 18.04.1 (not overclocked, running 24/7 and cool). However, the work units that error out are usually completed OK on other machines, which I assume are running Intel.
Has anyone looked into this? Since the errors typically consume several hours of time, I will deselect OET1 for a while. (I am running all the other WCG projects without problems.) [Edit 1 times, last edit by Jim1348 at Oct 1, 2018 3:27:03 PM] |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7581 Status: Recently Active Project Badges: |
Is there anything in the results file that indicates the nature of the error?
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
They all (8 of them since 26 Sept.) look like this:
<core_client_version>7.12.0</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
INFO: result number = 0
INFO: No state to restore. Start from the beginning.
[14:10:20] Number of tasks = 1
[14:10:20] Running task 0,CPU time at start of task 0 was 0.000000
[14:10:20] ./ZINC05830100.pdbqt size = 25 6
../../projects/www.worldcommunitygrid.org/oet1.xMBGP-OM_rig.pdbqt size = 1930 0
SIGILL: illegal instruction
Stack trace (4 frames):
[0x4dc2c2]
[0x586b50]
[0x552c29]
[0x7f1c003e18b0]
Exiting... |
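As a quick sanity check on the message line above, the three values in "code 193 (0xc1, -63)" appear to be the same byte written in decimal, in hex, and as a signed 8-bit integer. A minimal Python sketch of that arithmetic follows. The function name `describe_exit_code` is made up for illustration, and this is not BOINC's own formatting code.

```python
def describe_exit_code(code: int) -> str:
    """Show an exit status as decimal, hex, and signed 8-bit (illustrative helper)."""
    byte = code & 0xFF                            # keep only the low 8 bits
    signed = byte - 256 if byte >= 128 else byte  # two's-complement view of that byte
    return f"{code} (0x{byte:x}, {signed})"


print(describe_exit_code(193))  # 193 (0xc1, -63)
```

This only confirms that the numbers agree with each other. The useful clue for diagnosis is still the "SIGILL: illegal instruction" line in the stderr output.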
||
|
rod4x4
Cruncher Joined: Apr 29, 2014 Post Count: 12 Status: Offline Project Badges: |
I am running a Ryzen 1700 on Ubuntu 18.04.1, overclocked, running 24/7.
OET is only running on 11 threads. It has processed over 240 jobs since 25/09/18, no errors encountered in my case. |
||
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
"It has processed over 240 jobs since 25/09/18, no errors encountered in my case."
Interesting, since I was running a mix of all the jobs (along with Folding on a GPU). That suggests that the errors are due to OET running along with something else. But I think it is not practical to determine what that something else is; there are too many combinations. But thanks for the info.
EDIT: I was also running the GPUGrid/QC work units, which are multi-core. I might try suspending those for a while and then trying OET again. (I have been wanting to put the QC on a separate machine anyway, and it would eliminate one variable. I would expect the WCG jobs to run OK with each other, but there is no guarantee of that.)
[Edit 3 times, last edit by Jim1348 at Oct 4, 2018 1:30:38 AM] |
||
|
rod4x4
Cruncher Joined: Apr 29, 2014 Post Count: 12 Status: Offline Project Badges: |
My machine is also running GPUgrid - single GPU task.
Have not run any GPUgrid QC CPU tasks... seems there is a lot of chatter on GPUgrid forums regarding "challenges" these CPU tasks are having. Sounds like suspending the QC tasks will be a good starting point. Good Luck!! |
||
|
mmonnin
Advanced Cruncher Joined: Jul 20, 2016 Post Count: 148 Status: Offline Project Badges: |
I've run OET on my 1950x on all threads before w/o issues. Did some of those QC tasks take all the memory or disk space? Not sure if those are still using extreme amounts of resources or not.
||
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
"Did some of those QC tasks take all the memory or disk space? Not sure if those are still using extreme amounts of resources or not."
They should not have overloaded my system. I have 32 GB of memory, and somewhere around 180 GB of disk space free. The QC were running on 2 cores each, and limited to a maximum of 4 work units at a time by an app_config.xml file (and on average only 2 work units at a time via the resource share settings). It would take an unusual combination of QC work to exceed that, though I suppose it is possible.
However, more tellingly, I ended the QC work on 4 Oct 2018, with the last one reporting at 15:40:45 UTC. I also restarted OET, and have already picked up my next error on an OET work unit sent on 10/5/18 at 04:48:42. So it appears that the problem is not due to QC.
My next guess would be interference from a MIP work unit, which also sometimes errors, though less frequently than OET. I will disable MIP for a while. |
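For readers unfamiliar with the limit mentioned above, here is a rough sketch of how a "maximum of 4 work units at a time" cap might be written into an app_config.xml file, generated with Python's standard library. The app name `QC_APP_NAME_PLACEHOLDER` is hypothetical (the thread does not give the real short name of the QC application), and the 2-cores-per-task setting is not covered here.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Build a minimal app_config.xml body that caps concurrent tasks for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # placeholder name, an assumption
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical app name; 4 is the cap described in the post above.
print(build_app_config("QC_APP_NAME_PLACEHOLDER", 4))
```

The resulting file would go in the project's directory under the BOINC data folder. The exact app name has to match what the project uses, so treat the value above purely as a stand-in.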
||
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
"My next guess would be interference from a MIP work unit, which sometimes errors also, though less frequently than OET. I will disable MIP for a while."
Since disabling MIP, I have completed 30 OET without error. This shows that MIP interfered with OET, and probably vice-versa. I think that answers the question. (There may have been other factors too, but this is one that I can identify.) |
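To put a rough number on why 30 clean results in a row is persuasive: the first post reports about as many errors as valid results, i.e. an error rate of roughly 50%. A minimal sketch of the chance of seeing 30 straight successes if that rate had not changed, assuming each work unit fails independently (a simplification, and the function name is made up for illustration):

```python
def prob_all_valid(n_tasks: int, error_rate: float) -> float:
    """Probability that n_tasks independent tasks all finish without error."""
    return (1.0 - error_rate) ** n_tasks


# Assumed inputs: 30 tasks, ~50% error rate taken from the first post.
p = prob_all_valid(30, 0.5)
print(f"{p:.2e}")  # 9.31e-10
```

Even if the true earlier error rate had been only 10%, the same calculation gives about a 4% chance of 30 clean results in a row, so the run is hard to explain by luck alone. It does not by itself rule out other changes made at the same time.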
||
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1671 Status: Offline Project Badges: |
I still think that MIP1 is really not OK.
There is no acceptable justification for negative interactions between projects (except regarding performance). If your observation accurately reflects reality, MIP1 could negatively affect even non-WCG work, i.e. other applications running on the machine. If that is the case, it would significantly damage the grid computing approach of "only utilizing idle CPU resources without disturbing the system behaviour". After a couple of weeks of trouble following the MIP1 launch, I stopped supporting this project entirely, although I do consider the topic to be really important.
Cheers,
Yves |
||
|
|