Thread Status: Active
Total posts in this thread: 27
This topic has been viewed 4493 times and has 26 replies
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
CPU failure ? - SOLVED

I am currently facing an interesting but disturbing problem.
Five months ago, I bought 2 new identical mainboards (Asus), CPUs (Ryzen 2700), and 16 GB RAM for each machine. Both machines run Linux:
- Ubuntu 18.04 x64
- LinuxMint 19.1 x64
One machine (LinuxMint) works perfectly.
The second machine (Ubuntu) works fine with MCC, but generates only invalid results for Zika (see also my post on this issue: https://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=609214) and now for SCC.
With SCC, only 10 of 221 WUs have been declared valid (updated from 1 of 125).
The machine does not report any memory issues. The machine does not crash. The OS and BIOS are up-to-date.
Before I try to exchange the complete system (MB, CPU, RAM), I would like to know whether the problem could be caused by a CPU failure (maybe some enhanced instructions are not well supported?).
Which CPU instructions are used only by Zika and SCC and NOT by MCC or MIP1?
Thanks in advance to everybody for your support.
Cheers,
Yves
----------------------------------------
[Edit 4 times, last edit by KerSamson at Jul 31, 2019 7:11:18 AM]
[Jul 19, 2019 9:09:34 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: CPU failure ?

It may not be sensible, or even doable, but could you shut both machines down, swap the drives, and boot them up the other way round? You'd soon know then whether it really is a problem with just that one system.
[Jul 19, 2019 9:16:17 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ?

Hi Apis,
I am considering installing LinuxMint on the trouble-causing machine in order to have identical configurations. However, the kernel and glibc versions are the same on the Ubuntu machine and on the LinuxMint machine.
Besides a possible issue with the CPU, my other thought is a possible RAM timing problem, although all the RAM parameters are set to auto in the BIOS. The machines are not overclocked.
Yves
----------------------------------------
[Jul 19, 2019 12:49:39 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: CPU failure ?

I agree. If it runs as expected under Mint, it's not a hardware problem; it would more likely be a difference in either libraries or firmware. I ran into a problem a couple of years ago where a single machine wouldn't run right (the CPUs wouldn't run at 100 percent even when fully loaded with work). It was my only machine with an Intel X5570 processor. It worked fine at a previous Ubuntu level but stopped after an upgrade. I shut the machine down and, after a couple of Linux firmware updates, it started working again. I just assume a firmware module for the X5570 was either dropped or changed and then corrected in a later update. If you are running an LTS release you might not have the latest updates needed to adequately support that processor.
[Jul 19, 2019 12:58:23 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ?

Hi Doneske,
I understand what you mean and I will try to find a couple of free hours to install LinuxMint beside Ubuntu. However, both OSes rely on the same LTS version.
In the end, it is very strange.
Yves
----------------------------------------
[Jul 19, 2019 1:08:58 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Re: CPU failure ?

Another option, besides installing an operating system, would be to swap the ~boinc directories between the two systems, i.e. copy each machine's ~boinc directory to the other machine. It involves some thinking because you don't want to overwrite anything: stop boinc and move ~boinc to a separate directory on both systems, copy each of those to the other system, move the copies into place as ~boinc, and restart boinc. IYKWIM. :D
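
The local half of that swap (move the data directory aside, later put the other machine's copy in its place) could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `move_aside` and `install_copy` and the `.local-backup` suffix are made up for this example, stopping and restarting boinc and transferring the directory between machines are left to you, and file ownership is not handled.

```python
import shutil
from pathlib import Path


def move_aside(data_dir, suffix=".local-backup"):
    """Rename data_dir to data_dir + suffix, refusing to overwrite a backup.

    Stop boinc before calling this (not done here).
    """
    data_dir = Path(data_dir)
    backup = data_dir.with_name(data_dir.name + suffix)
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; not overwriting")
    shutil.move(str(data_dir), str(backup))
    return backup


def install_copy(incoming_dir, data_dir):
    """Copy a directory brought over from the other machine into place.

    Refuses to run while data_dir still exists, so nothing is overwritten.
    Ownership is not restored; fix it by hand if boinc runs as another user.
    """
    data_dir = Path(data_dir)
    if data_dir.exists():
        raise FileExistsError(f"{data_dir} still exists; move it aside first")
    shutil.copytree(incoming_dir, data_dir, symlinks=True)
    return data_dir
```

Because both helpers raise instead of overwriting, the original directory stays available under the backup name, so the swap can be undone by reversing the steps.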
[Jul 19, 2019 3:43:33 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ?

Hi adriverhoef,
IKWYM ;)
However, I do not see why the problem should be related to the content of the boinc data.
Nevertheless, if nothing else helps, I will try it.
Cheers,
Yves
----------------------------------------
[Jul 19, 2019 10:57:16 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1322
Re: CPU failure ?

Yves,

VINA tasks (Zika and SCC at present) seem to be extremely CPU-intensive, with the lowest "cache references" statistics reported by any WCG projects, which tends to suggest short code paths and probably quite good register usage (I use Linux perf stat to look at various metrics.) I doubt that they use any exotic instructions, just that what they do use is efficient in the "instructions per second" area!

VINA tasks were the ones that would cause my Intel chips to throttle if I wasn't keeping an eye on temperatures at the hottest time of year; I wonder if you are having temperature issues on one system but not the other.

I don't (yet) have a Ryzen myself - I've been waiting for the 3xxx machines to be around for long enough for the rough edges to get knocked off! However, because I've been considering a switch I've been following the "Ryzen and Threadripper" discussion (and other similar threads) in the SETI@Home Number Crunching forum, and there are comments in there about the effects of excess temperature (and bad memory voltages and timings.) Mind you, a lot of that is about overclocking, but if the ambient temperature is high enough the default clocks might be too high...

On the assumption that both motherboards have the same BIOS version and all the critical performance settings are the same, the suggestion that Apis made about simply swapping the drives between your two systems and seeing what happens might help work out whether it's a Linux version hardware tuning issue or an actual CPU, memory or board issue, as would your own "side by side install" (provided you ensure they don't use the same root filestore!). But if there are differences in the BIOS settings, all bets are off!

I hope you manage to resolve this in a permanent fashion (and I know you'll report what you find!!!) - I really would like a Ryzen for the cost-per-thread benefits, but if they're difficult to keep tuned I'm not so sure!

Good luck - Al.

[Edited to fix an obvious typo...]
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jul 20, 2019 5:18:53 AM]
[Jul 20, 2019 5:16:04 AM]
cadbane
Cruncher
Denmark
Joined: Jun 11, 2013
Post Count: 7
Re: CPU failure ?

@Yves, when you say it doesn't report memory issues, does that mean you've run memtest86 for some hours or more?

Other than that, there is some good input in this thread, and as Alanb1951 points out, it could also be a heat issue on that PC.

To answer alanb's concerns over the Ryzen's stability, I haven't had any problems with my 2 Ryzen 5 2600's myself. I don't overclock though. But they do have some nice Noctua coolers to keep them cool, mostly for the sake of quiet cooling too.
[Jul 20, 2019 8:08:48 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: CPU failure ?

Hi Ryle,
both machines are located in the same room and I use Noctua CPU coolers.
There is a slight difference in CPU die temperature between the two machines: 60°C vs. 65°C (currently the ambient temperature is about 27°C to 30°C: hot summer in Switzerland).
Since the OS versions (LinuxMint vs Ubuntu) are different, I am not sure about the accuracy of the reported die temperatures. Maybe the difference is only caused by different sensor software.
The cases as well as the hardware builds are identical for both machines.
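
One way to take the sensor software out of the comparison is to read the kernel's raw hwmon values the same way on both machines. The sketch below assumes the standard Linux hwmon sysfs layout (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius); the function name is made up for this example, and which chip and label names appear (for a Ryzen, probably something from the k10temp driver) depends on the kernel, so treat the output as indicative only.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees Celsius} for each temp*_input under root."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            # temp1_input may have a matching temp1_label; fall back to the file name.
            label_file = temp_file.with_name(temp_file.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else temp_file.name
            try:
                # The kernel reports integer millidegrees Celsius.
                temps[f"{chip}/{label}"] = int(temp_file.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # some sensors cannot be read; skip them
    return temps


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temps().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Running the same script under full load on both machines gives readings that at least come from the same code path, even if the two distributions ship different monitoring tools.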

I really appreciate all the feedback received :) and obviously I will continue to report if I am able to solve the problem.

Since the machine is not generating errored WUs but only invalid WUs, the question is: why would identical computations generate binary-different result files without generating any errors?
Are we really dealing with a CPU instability caused by the temperature?
What would be the real impact of such instability (if any) if the CPU were used to control a mission-critical application, e.g. process control?
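
If the same input can be processed on both machines and the result files copied to one place, a byte-level comparison shows whether the outputs really differ and where. This is a minimal sketch with made-up helper names; it assumes the result files are meant to be bit-identical for identical input, and it says nothing about how WCG itself validates results.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a prefix of the other, the offset is the shorter length.
    """
    data_a = Path(path_a).read_bytes()
    data_b = Path(path_b).read_bytes()
    for offset, (byte_a, byte_b) in enumerate(zip(data_a, data_b)):
        if byte_a != byte_b:
            return offset
    if len(data_a) != len(data_b):
        return min(len(data_a), len(data_b))
    return None
```

Matching digests mean the two machines produced the same bytes; a mismatch with a stable offset across repeated runs points more toward software differences, while an offset that moves from run to run looks more like a hardware instability.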

Overnight, I had a new idea.
My office (4th floor) is located directly opposite another building (on the other side of the street) that has a mobile phone antenna at the same level as my office.
Could the machine be suffering from mobile phone radiation?
In order to exclude this possible cause, I will move the machine to another place in my office.
Anyway, why would only VINA-based projects be affected by such disturbances?

Cheers,
Yves
----------------------------------------
[Jul 20, 2019 9:10:05 AM]