Thread Status: Active
Total posts in this thread: 24
This topic has been viewed 5905 times and has 23 replies
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: 2 HST tasks reported as error today

KerSamson,

I looked at those workunits and it doesn't look like a workunit issue. Has anything on your machine changed, or have you always had these issues?

Thanks,
armstrdj
[Aug 22, 2016 2:20:55 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline

Hi armstrdj,
The host is running well: no changes and no negative interaction with other applications, since the host is devoted to WCG 24/7/365.
I usually reboot the system after kernel and glibc updates.
The machine is not running too hot, since the room is well ventilated. There were no power problems.
The host is a Phenom II x6 at 3 GHz with 16 GB RAM, running an up-to-date Ubuntu 14.04 x64.
A couple of days ago, the same host also returned an invalid result (HST1_007022_000063_AC0032_T325_F00077_S00008) for a 17.5-hour WU.
I have no idea what causes these random crunching troubles (invalid results are a recurrent problem for AMD/Linux-based hosts).
Cheers,
Yves
----------------------------------------
[Aug 24, 2016 6:55:16 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline

On another host, after 14+ hours, the following error occurred (HST1_ 007237_ 000015_ AC0024_ T300_ F00050_ S00009_ 1-- ):
step 45388: Water molecule starting at atom 124032 can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Wrote pdb files with previous and current coordinates
SIGSEGV: segmentation violation
Stack trace (12 frames):

Cheers,
Yves
----------------------------------------
[Aug 27, 2016 1:38:15 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

KerSamson, I have not seen that error before, and the other runs did not have it. The beta currently running has some changes that can cause some variation across different processors. I will look through those results to see if this error shows up.

Thanks,
armstrdj
[Sep 1, 2016 2:41:58 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline

Hi armstrdj,
Feel free to contact me directly if you have any news or need more background info.
Yves
----------------------------------------
[Sep 4, 2016 9:43:52 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline

Bad news!
Another error right at the end of a WU computation (99.xx%), after 17+ hours :(
HST1_007810_000068_AC0021_T300_F00013_S00010
SIGSEGV: segmentation violation
Stack trace (12 frames):
...

Cheers,
Yves
--
PS: In the meantime, this host has successfully computed several HST1 WUs.
----------------------------------------
[Sep 17, 2016 7:53:49 AM]
Eric_Kaiser
Veteran Cruncher
Germany (Hessen)
Joined: May 7, 2013
Post Count: 1047
Status: Offline

I had some issues today too, which caused boinc to stop completely. I had no chance to restart boinc on my server, and a reboot of the server failed too. Only a hardware reset brought the server back online and boinc up again. I had made no changes on the server, and it had been up for over a year.
These wu errored out:
HST1_ 007772_ 000058_ MC0019_ T325_ F00080_ S00010_ 0-- with a "finish file present too long" error
HST1_ 007768_ 000096_ AT0016_ T325_ F00071_ S00009_ 0-- with a "finish file present too long" error
HST1_ 007766_ 000048_ MC0019_ T400_ F00046_ S00009_ 0-- with SIGSEGV
HST1_ 007766_ 000053_ MC0019_ T400_ F00052_ S00009_ 0-- with SIGSEGV
----------------------------------------

[Sep 17, 2016 12:27:09 PM]
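
Three failure signatures have been quoted in this thread so far: "can not be settled", "SIGSEGV", and "finish file present too long". For anyone wanting to tally them across several results, here is a minimal sketch. It assumes the stderr text of each result has already been saved locally; the function names and labels are made up for illustration and are not part of BOINC or the HST1 application.

```python
from collections import Counter

# Substrings quoted in this thread, mapped to illustrative labels.
SIGNATURES = {
    "can not be settled": "water molecule could not be settled",
    "SIGSEGV": "segmentation violation",
    "finish file present too long": "finish file present too long",
}


def classify(stderr_text):
    """Return the labels of all known signatures found in one task's output."""
    return [label for needle, label in SIGNATURES.items() if needle in stderr_text]


def tally(outputs):
    """Count signature occurrences across a {task_name: stderr_text} mapping."""
    counts = Counter()
    for text in outputs.values():
        counts.update(classify(text))
    return counts
```

A task whose output matches none of the signatures simply yields an empty list, so unknown failures are not miscounted as one of the three.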
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Recently Active

Hi Yves, did you install an applet measuring the temperature of the cores?
Help is here: https://help.ubuntu.com/community/SensorInstallHowto
Just trying to be helpful. :D

Adri
[Sep 17, 2016 12:28:11 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline

Hi Adri,
CPU temperature is monitored and OK <60°C.
The host runs Ubuntu 14.04 x64; CPU: Phenom II x6 @ 3 GHz, 16 GB RAM.
Everything is fine with the host.
At the same time, some other long HST1 WUs have been computed without any incident.

@Eric: I experienced a case similar to yours about 6 or 8 weeks ago with HST1. In 9 years of contributing, I think it was the first time a boinc project (HST1) fully crashed a system.

Cheers,
Yves
----------------------------------------
[Sep 18, 2016 6:41:12 AM]
Eric_Kaiser
Veteran Cruncher
Germany (Hessen)
Joined: May 7, 2013
Post Count: 1047
Status: Offline

Yves, a few weeks ago there was an issue with the memory allocation of the wu that prevented boinc from starting new wu.
Only wu from wuprop and the like were still running. The server/boinc was still controllable by boinctask or the command line from my computer at home.
Cancelling the troublemaking wu did the trick.

This time the server/boinc didn't show up in boinctask, or to be precise, it was not reachable. Even after I logged in to the server via ssh and ran boinccmd, I had no chance.
Even a shutdown -r now got stuck. The server was completely unreachable, even with a ping.
Only a hardware reset from the hoster's management console brought the server back online.

This was the first time in 1.5 years with my rented servers that this happened.
----------------------------------------

[Sep 18, 2016 11:27:55 AM]