| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 24
|
|
| Author |
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
KerSamson,
I looked at those workunits, and it doesn't look like a workunit issue. Has anything on your machine changed, or have you always had these issues? Thanks, armstrdj |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Hi armstrdj,
The host is running well, with no changes and no negative interaction with other applications, since it is devoted to WCG 24/7/365. I usually reboot the system after kernel and glibc updates. The machine is not too hot, since the room is well ventilated, and there were no power problems. The host is a Phenom II x6 at 3 GHz with 16 GB RAM, running an up-to-date Ubuntu 14.04 x64. A couple of days ago, the same host also returned an invalid result (HST1_007022_000063_AC0032_T325_F00077_S00008) for a 17.5-hour WU. I have no idea what causes these random crunching troubles (invalid results are a recurrent problem on AMD/Linux-based hosts). Cheers, Yves |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
On another host, after 14+ hours, the following error occurred (HST1_ 007237_ 000015_ AC0024_ T300_ F00050_ S00009_ 1-- ):
step 45388: Water molecule starting at atom 124032 can not be settled.
Cheers, Yves |
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
KerSamson, I have not seen that error before, and the other runs did not have it. The beta currently running has some changes that can affect variation across different processors. I will look through those results to see if this error shows up.
Thanks, armstrdj |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Hi armstrdj,
Feel free to contact me directly if you have any news or need more background info. Yves |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Bad news!
Again an error right at the end of a WU computation (99.xx%), after 17+ hours:
HST1_007810_000068_AC0021_T300_F00013_S00010
SIGSEGV: segmentation violation
Cheers, Yves
PS: In the meantime, this host successfully computed several HST1 WUs. |
||
|
|
Eric_Kaiser
Veteran Cruncher Germany (Hessen) Joined: May 7, 2013 Post Count: 1047 Status: Offline Project Badges:
|
I had some issues today too, which caused BOINC to stop completely. I had no chance to restart BOINC on my server, and a reboot of the server failed too. Only a hardware reset brought the server back online and BOINC up again. I had made no changes on the server, and it had been up for over a year.
These WUs errored out:
HST1_ 007772_ 000058_ MC0019_ T325_ F00080_ S00010_ 0-- with a "finish file present too long" error
HST1_ 007768_ 000096_ AT0016_ T325_ F00071_ S00009_ 0-- with a "finish file present too long" error
HST1_ 007766_ 000048_ MC0019_ T400_ F00046_ S00009_ 0-- with SIGSEGV
HST1_ 007766_ 000053_ MC0019_ T400_ F00052_ S00009_ 0-- with SIGSEGV |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Recently Active Project Badges:
|
Hi Yves, did you install an applet measuring the temperature of the cores?
Help is here: https://help.ubuntu.com/community/SensorInstallHowto Just trying to be helpful. Adri |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Hi Adri,
The CPU temperature is monitored and is OK (<60°C). The host runs Ubuntu 14.04 x64; CPU: Phenom II x6 @ 3 GHz, 16 GB RAM. Everything is fine with the host. During the same period, some other long HST1 WUs were computed without any incident.
@Eric: I experienced a case similar to yours about 6 or 8 weeks ago with HST1. In 9 years of contributing, I think it was the first time a BOINC project (HST1) fully crashed a system. Cheers, Yves |
||
|
|
Eric_Kaiser
Veteran Cruncher Germany (Hessen) Joined: May 7, 2013 Post Count: 1047 Status: Offline Project Badges:
|
Yves, a few weeks ago there was an issue with the memory allocation of the WUs, which prevented BOINC from starting new WUs.
Only WUs from wuprop and the like were still running. The server/BOINC was still controllable via boinctask or the command line from my computer at home. Cancelling the troublemaking WUs did the trick.
This time the server/BOINC didn't show up in boinctask or, to be precise, it was not reachable. Even after logging in to the server via ssh and running boinccmd, I had no chance. Even a shutdown -r now got stuck. After that, the server was completely unreachable, even by ping. Only a hardware reset from the hosting provider's management console brought the server back online. This was the first time in 1.5 years with my rented servers that this happened. |
||
|
|
|