| World Community Grid Forums
Thread Status: Active Total posts in this thread: 27
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
I am currently facing an interesting but disturbing problem.

Five months ago, I bought two identical sets of hardware: an Asus mainboard, a Ryzen 2700 CPU, and 16 GB RAM for each machine. Both machines run Linux:
- Ubuntu 18.04 x64
- LinuxMint 19.1 x64

One machine (LinuxMint) works perfectly. The second machine (Ubuntu) works fine with MCC, but generates only invalid results for Zika (see also my post on this issue: https://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=609214) and now SCC. With SCC, only

The machine does not report any memory issues. The machine does not crash. The OS and BIOS are up to date.

Before I try to exchange the complete system (MB, CPU, RAM), I would like to know whether the problem could be caused by a CPU failure (maybe some enhanced instructions are not well supported?). Which CPU instructions are used only by Zika and SCC and NOT by MCC or MIP1?

Thanks in advance to everybody for your support.

Cheers, Yves

[Edit 4 times, last edit by KerSamson at Jul 31, 2019 7:11:18 AM]
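One rough way to check whether the two CPUs at least advertise the same instruction-set extensions is to compare the `flags` line of `/proc/cpuinfo` from each machine. The following is only a sketch, assuming x86 Linux on both machines; the helper names `parse_flags` and `flag_diff` and the sample texts are made up for illustration, and it says nothing about which instructions Zika or SCC actually execute.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # The first "flags" entry is enough: all cores of one CPU normally agree.
            return set(value.split())
    return set()


def flag_diff(text_a, text_b):
    """Return (flags only in A, flags only in B), each as a sorted list."""
    a, b = parse_flags(text_a), parse_flags(text_b)
    return sorted(a - b), sorted(b - a)


# Made-up sample data standing in for the two machines (hypothetical values).
machine_a = "model name\t: Example CPU\nflags\t\t: fpu sse2 avx avx2 fma\n"
machine_b = "model name\t: Example CPU\nflags\t\t: fpu sse2 avx\n"

only_a, only_b = flag_diff(machine_a, machine_b)
print("only on A:", only_a)
print("only on B:", only_b)
```

On the real machines, the text would come from reading `/proc/cpuinfo` on each system and copying one of the two outputs across. Two identical CPUs would normally be expected to report identical flags, so any difference would point toward firmware, microcode, or kernel handling rather than toward the project applications.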
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
It may not be sensible, or even doable, but could you shut both machines down, swap the drives, and boot them up the other way round? You'd soon know then whether it really is a problem with just that one system.
|
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Hi Apis,

I am considering installing LinuxMint on the trouble-causing machine in order to have identical configurations. However, the kernel and glibc versions are the same on both the Ubuntu machine and the LinuxMint machine.

Besides a possible issue with the CPU, my other thought is a possible RAM timing problem, although all the RAM parameters are set to auto in the BIOS. The machines are not overclocked.

Yves
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I agree. If it runs as expected under Mint, it's not a hardware problem. It would possibly be a difference in either libraries or firmware. I ran into a problem a couple of years ago where a single machine wouldn't run right (the CPUs wouldn't run at 100 percent even when fully loaded with work). It was the only machine with an Intel X5570 processor. It worked fine at a previous Ubuntu level, but after an upgrade it stopped. I shut down the machine and, after a couple of Linux firmware updates, it started working again. I just assume a firmware module was either dropped or changed for the X5570 and was corrected with a later update. If you are running an LTS release, you might not have the latest updates to adequately support that processor.
|
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Hi Doneske,

I understand what you mean, and I will try to find a couple of free hours to install LinuxMint alongside Ubuntu. However, both OSes rely on the same LTS version. In the end, it is very strange.

Yves
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Another option, besides installing an operating system, would be to swap the ~boinc directories between the two systems. It takes some thinking because you don't want to overwrite anything: stop boinc and move ~boinc to a separate directory on both systems, copy each saved directory to the other system, move the copies into place as ~boinc, and restart boinc. IYKWIM.
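The "set aside, then install the other machine's copy" part of that procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a tested migration tool: the function names `set_aside` and `install_copy` and all directory names are hypothetical, stopping and restarting boinc and transferring the directory between machines are left out, and file ownership is not handled.

```python
import shutil
import tempfile
from pathlib import Path


def set_aside(data_dir, saved_dir):
    """Move data_dir to saved_dir, refusing to overwrite an existing saved_dir."""
    data_dir, saved_dir = Path(data_dir), Path(saved_dir)
    if saved_dir.exists():
        raise FileExistsError(f"{saved_dir} already exists")
    shutil.move(str(data_dir), str(saved_dir))
    return saved_dir


def install_copy(received_dir, data_dir):
    """Copy a directory received from the other machine into place as data_dir."""
    received_dir, data_dir = Path(received_dir), Path(data_dir)
    if data_dir.exists():
        raise FileExistsError(f"{data_dir} still exists; set it aside first")
    shutil.copytree(received_dir, data_dir)
    return data_dir


# Demonstration inside a throwaway temporary directory (no real boinc data touched).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    local = root / "boinc"                # stands in for ~boinc on this machine
    received = root / "boinc.from-other"  # stands in for the copy from the other machine
    local.mkdir()
    (local / "marker.txt").write_text("machine A")
    received.mkdir()
    (received / "marker.txt").write_text("machine B")

    set_aside(local, root / "boinc.saved")
    install_copy(received, local)
    swapped_marker = (local / "marker.txt").read_text()
    saved_marker = (root / "boinc.saved" / "marker.txt").read_text()

print(swapped_marker, "/", saved_marker)
```

Both helpers refuse to overwrite an existing directory, which is the main risk mentioned above; the saved copy stays available so the swap can be undone afterwards.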
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Hi adriverhoef,

IKWYM. However, I do not see why the problem should be related to the content of the boinc data. Nevertheless, if nothing else helps, I will try it.

Cheers, Yves
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1322 Status: Offline Project Badges:
|
Yves,

VINA tasks (Zika and SCC at present) seem to be extremely CPU-intensive, with the lowest "cache references" statistics reported by any WCG projects, which tends to suggest short code paths and probably quite good register usage (I use Linux perf stat to look at various metrics.) I doubt that they use any exotic instructions, just that what they do use is efficient in the "instructions per second" area!

VINA tasks were the ones that would cause my Intel chips to throttle if I wasn't keeping an eye on temperatures at the hottest time of year; I wonder if you are having temperature issues on one system but not the other.

I don't (yet) have a Ryzen myself - I've been waiting for the 3xxx machines to be around for long enough for the rough edges to get knocked off! However, because I've been considering a switch I've been following the "Ryzen and Threadripper" discussion (and other similar threads) in the SETI@Home Number Crunching forum, and there are comments in there about the effects of excess temperature (and bad memory voltages and timings.) Mind you, a lot of that is about overclocking, but if the ambient temperature is high enough the default clocks might be too high...

On the assumption that both motherboards have the same BIOS version and all the critical performance settings are the same, the suggestion that Apis made about simply swapping the drives between your two systems and seeing what happens might help work out whether it's a Linux version hardware tuning issue or an actual CPU, memory or board issue, as would your own "side by side install" (provided you ensure they don't use the same root filestore!). But if there are differences in the BIOS settings, all bets are off!

I hope you manage to resolve this in a permanent fashion (and I know you'll report what you find!!!) - I really would like a Ryzen for the cost-per-thread benefits, but if they're difficult to keep tuned I'm not so sure!

Good luck - Al.

[Edited to fix an obvious typo...]
[Edit 1 times, last edit by alanb1951 at Jul 20, 2019 5:18:53 AM]
||
|
|
cadbane
Cruncher Denmark Joined: Jun 11, 2013 Post Count: 7 Status: Offline Project Badges:
|
@Yves, when you say it doesn't report memory issues, does that mean you've run memtest86 for some hours or more?

Other than that, there is some good input in this thread, and as alanb1951 points out, it could also be a heat issue on that PC. To answer alanb's concerns over the Ryzens' stability: I haven't had any problems with my two Ryzen 5 2600s myself. I don't overclock, though. But they do have some nice Noctua coolers to keep them cool, mostly for the sake of quiet operation.
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Hi Ryle,

Both machines are located in the same room and I use Noctua CPU coolers. There is a slight CPU die temperature difference between the two machines: 60°C vs. 65°C (currently the ambient temperature is about 27°C to 30°C: hot summer in Switzerland). Since the OS versions (LinuxMint vs. Ubuntu) are different, I am not sure about the accuracy of the reported die temperatures. Maybe the difference is only caused by different sensor software. The cases as well as the hardware builds are identical for both machines.

I really appreciate all the feedback received, and of course I will continue to report if I am able to solve the problem.

Since the machine is not generating errored WUs but only invalid WUs, the question is: why would identical computations generate binary-different result files without generating any errors? Are we really dealing with a CPU instability caused by temperature? What would be the real impact of such instability (if any) if the CPU were used for controlling a mission-critical application, e.g. process control?

Overnight, I had a new idea. My office (4th floor) is located directly opposite another building (on the other side of the street) with a mobile phone antenna at the same level as my office. Could the machine be suffering from mobile phone radiation? In order to exclude this possible cause, I will move the machine to another place in my office. Anyway, why would only VINA-based projects be affected by such disturbances?

Cheers, Yves
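To make "binary-different result files" concrete, two files can be compared by hash and by the offset of the first differing byte. The sketch below assumes that result files for the same computation could be obtained from both machines, which this thread does not confirm is practical; the function names and sample contents are hypothetical, and it is not a description of how the WCG validator actually compares results.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=65536):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk_size), fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file is a prefix of the other.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended with no difference found
            offset += len(a)


# Made-up sample contents standing in for two result files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "result_good.txt"
    bad = Path(tmp) / "result_bad.txt"
    good.write_bytes(b"result: 1.2345\n")
    bad.write_bytes(b"result: 1.2346\n")

    hashes_match = sha256_of(good) == sha256_of(bad)
    diff_offset = first_difference(good, bad)
    same_offset = first_difference(good, good)

print(hashes_match, diff_offset, same_offset)
```

A difference confined to the last digits of a few numbers would look quite different from large corrupted regions, which might help separate rare arithmetic glitches from memory or storage problems.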
||
|
|
|