Thread Status: Active
Total posts in this thread: 14
This topic has been viewed 10599 times and has 13 replies
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
High error rate on Ryzen

I am getting about as many errors as valid results on OET1 work units on my Ryzen 1700 running Ubuntu 18.04.1 (not overclocked, running 24/7 and cool). However, the work units that error here usually complete OK on other machines, which I assume are running Intel.

Has anyone looked into this? Since the errors typically consume several hours of time, I will deselect it for a while.

(I am running all the other WCG projects without problems.)
----------------------------------------
[Edit 1 times, last edit by Jim1348 at Oct 1, 2018 3:27:03 PM]
[Oct 1, 2018 3:23:05 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7581
Status: Recently Active
Re: High error rate on Ryzen

Is there anything in the results file indicating the nature of the error?
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Oct 1, 2018 5:59:30 PM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
Re: High error rate on Ryzen

They all (8 of them since 26 Sept.) look like this:
<core_client_version>7.12.0</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)</message>
<stderr_txt>
INFO: result number = 0
INFO: No state to restore. Start from the beginning.
[14:10:20] Number of tasks = 1
[14:10:20] Running task 0,CPU time at start of task 0 was 0.000000
[14:10:20] ./ZINC05830100.pdbqt size = 25 6 ../../projects/www.worldcommunitygrid.org/oet1.xMBGP-OM_rig.pdbqt size = 1930 0
SIGILL: illegal instruction
Stack trace (4 frames):
[0x4dc2c2]
[0x586b50]
[0x552c29]
[0x7f1c003e18b0]

Exiting...
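
The three numbers in the exit message above (193, 0xc1, -63) are the same value written three ways: decimal, hexadecimal, and as a signed 8-bit integer. A minimal sketch of that arithmetic in Python follows; the helper name decode_exit_code is hypothetical and is not part of BOINC.

```python
def decode_exit_code(code: int) -> dict:
    """Return one exit status as an unsigned byte, a hex string, and a signed byte."""
    unsigned = code & 0xFF  # keep only the low 8 bits
    # Values of 128 and above wrap to negative when read as a signed byte.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return {"unsigned": unsigned, "hex": hex(unsigned), "signed": signed}


if __name__ == "__main__":
    print(decode_exit_code(193))
    # prints {'unsigned': 193, 'hex': '0xc1', 'signed': -63}
```

The exit code alone does not say which signal ended the task; the SIGILL line in stderr is what identifies it as an illegal instruction.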

[Oct 1, 2018 11:46:51 PM]
rod4x4
Cruncher
Joined: Apr 29, 2014
Post Count: 12
Status: Offline
Re: High error rate on Ryzen

I am running a Ryzen 1700 on Ubuntu 18.04.1, overclocked, running 24/7.
OET is only running on 11 threads.

It has processed over 240 jobs since 25/09/18, no errors encountered in my case.
[Oct 3, 2018 11:18:47 AM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
Re: High error rate on Ryzen

It has processed over 240 jobs since 25/09/18, no errors encountered in my case.

Interesting, since I was running a mix of all the jobs (along with Folding on a GPU). That suggests that the errors are due to OET running along with something else. But I think it is not practical to determine what the something else is; there are too many combinations. Thanks for the info.

EDIT: I was also running the GPUGrid/QC work units, which are multi-core. I might try suspending those for a while and try again on OET. (I have been wanting to put the QC on a separate machine anyway, and it would eliminate one variable. I would expect the WCG jobs to run OK with each other, but there is no guarantee of that.)
----------------------------------------
[Edit 3 times, last edit by Jim1348 at Oct 4, 2018 1:30:38 AM]
[Oct 4, 2018 1:20:35 AM]
rod4x4
Cruncher
Joined: Apr 29, 2014
Post Count: 12
Status: Offline
Re: High error rate on Ryzen

My machine is also running GPUgrid - single GPU task.

Have not run any GPUgrid QC CPU tasks... seems there is a lot of chatter on GPUgrid forums regarding "challenges" these CPU tasks are having.
Sounds like suspending the QC tasks will be a good starting point.
Good Luck!!
[Oct 4, 2018 8:20:22 AM]
mmonnin
Advanced Cruncher
Joined: Jul 20, 2016
Post Count: 148
Status: Offline
Re: High error rate on Ryzen

I've run OET on my 1950x on all threads before w/o issues. Did some of those QC tasks take all the memory or disk space? Not sure if those are still using extreme amounts of resources or not.
----------------------------------------

[Oct 5, 2018 3:01:56 PM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
Re: High error rate on Ryzen

Did some of those QC tasks take all the memory or disk space? Not sure if those are still using extreme amounts of resources or not.

They should not have overloaded my system. I have 32 GB of memory, and somewhere around 180 GB of disk space free. The QC were running on 2 cores each, and limited to a maximum of 4 work units at a time by an app_config.xml file (and on average only 2 work units at a time via the resource share settings). It would take an unusual combination of QC work to exceed that, though I suppose it is possible.
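
For reference, a limit of this kind can be sketched as below. This is an illustration under stated assumptions, not the actual file used here: the app name QC_PLACEHOLDER and the plan class mt are placeholders that would need to be replaced with the names the project really uses, and build_app_config is a hypothetical helper. The elements max_concurrent and avg_ncpus are standard app_config.xml settings in the BOINC client.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int,
                     ncpus: int, plan_class: str) -> str:
    """Build an app_config.xml document as a string."""
    root = ET.Element("app_config")

    # Cap how many tasks of this app may run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # Tell the client how many CPUs to budget for each task of this app version.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)

    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder names: substitute the project's real app name and plan class.
    print(build_app_config("QC_PLACEHOLDER", max_concurrent=4, ncpus=2, plan_class="mt"))
```

Note that avg_ncpus only sets what the client budgets per task; whether the application actually keeps to that many threads depends on the application itself.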

However, more tellingly, I ended the QC work on 4 Oct 2018, with the last one reporting at 15:40:45 UTC. I also restarted OET, and have already picked up my next error on an OET work unit sent on 5 Oct 2018 at 04:48:42. So it appears that the problem is not due to QC. My next guess would be interference from a MIP work unit, which also errors sometimes, though less frequently than OET. I will disable MIP for a while.
[Oct 6, 2018 12:59:35 AM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
Re: High error rate on Ryzen

My next guess would be interference from a MIP work unit, which sometimes errors also, though less frequently than OET. I will disable MIP for a while.

Since disabling MIP, I have completed 30 OET without error. This shows that MIP interfered with OET, and probably vice-versa. I think that answers the question. (There may have been other factors too, but this is one that I can identify.)
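
A rough way to see why 30 clean results in a row is persuasive: if the earlier error rate of roughly one in two had still applied, and if errors were independent from one work unit to the next (both are simplifying assumptions), a streak like this would be extremely unlikely. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def prob_all_valid(error_rate: float, n_tasks: int) -> float:
    """Probability that n_tasks independent tasks all succeed at the given error rate."""
    return (1.0 - error_rate) ** n_tasks


if __name__ == "__main__":
    # Assumed 50% error rate, matching "about an equal number of errors as valid".
    print(f"{prob_all_valid(0.5, 30):.2e}")
    # prints 9.31e-10
```

Even at an assumed error rate of only 10%, the same calculation gives about a 4% chance of 30 clean results, so the streak is hard to put down to luck. It does not rule out other contributing factors.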
[Oct 8, 2018 1:30:10 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1671
Status: Offline
Re: High error rate on Ryzen

I still consider that MIP1 is really not OK.
There is no acceptable justification for negative interactions between projects (except regarding performance).
If your observation accurately reflects reality, MIP1 could negatively affect even non-WCG work, i.e. other applications running on the machine. If that is the case, it would significantly damage the grid computing approach of "only utilizing idle CPU resources without disturbing the system behaviour".
After a couple of weeks of trouble following the MIP1 launch, I strictly stopped supporting this project, although I do consider the topic to be really important.
Cheers,
Yves
----------------------------------------
[Oct 13, 2018 11:46:24 PM]