Thread Status: Active
Total posts in this thread: 21
This topic has been viewed 3238 times and has 20 replies
LAZA74
Advanced Cruncher
Germany
Joined: Sep 28, 2008
Post Count: 56
more Errors...

Hi everybody,

I posted some months ago about random errors from different WCG work units (WUs).
Because of these errors I checked the whole machine and set all timings and settings back to their standard values; the hardware is listed in my signature.

Today I got another error from MCM (and this time had the time to hunt it down):

MCM1_ 0007100_ 9182_ 0--

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message> process got signal 8 </message>
<stderr_txt> Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.32_x86_64-pc-linux-gnu -SettingsFile MCM1_0007100_9182.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Settings File
DateOfDesign = 08/05/2014
Designer = PMCC_OCI_0.1
WorkOrderID = 0007100_9182
DatasetID = 17_72_SDG_v1
NumberOfGenesInStartingSignature = 19
NumberOfGenesInSignatureMin = 19
NumberOfGenesInSignatureMax = 19
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 54959
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 0
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 0
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
FitnessFn = 0
MinFitness = 0.37
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"
SvmLearnLimit = 500000
RSeed = 27319183

[13:26:20] Initializing
[13:26:22] Running
[13:26:23] EvaluateFitnessOfStartingGeneSignatures 54959

</stderr_txt> ]]>


I cannot see why this WU failed (maybe because I started a VM?).
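
When going through many result logs, the signal number can be pulled out of the `<message>` line and mapped to its symbolic name. Below is a minimal Python sketch, assuming the log format shown above; `extract_signal` is a hypothetical helper for illustration, not part of BOINC.

```python
import re
import signal


def extract_signal(log_text):
    """Return the number from a 'process got signal N' line, or None if absent."""
    match = re.search(r"process got signal\s+(\d+)", log_text)
    return int(match.group(1)) if match else None


# Sample line copied from the result log above.
sample = "<message> process got signal 8 </message>"
signum = extract_signal(sample)

# signal.Signals maps a signal number to the platform's symbolic name.
print(signum, signal.Signals(signum).name)
```

On Linux, number 8 corresponds to SIGFPE, which matches the floating point exception discussed later in the thread.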

This also has the effect that my virtual machines still start but show graphical errors, and I then have to reboot the physical machine!

Any help will be appreciated.
Thanks in advance
LAZA
----------------------------------------
NAS - self-built
Xiaomi Mi 10T
[Aug 27, 2014 5:01:08 PM]
LAZA74
Re: more Errors...

As I see now, I got another one the day before:

MCM1_0007083_7388

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message> process got signal 8 </message>
<stderr_txt> Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.32_x86_64-pc-linux-gnu -SettingsFile MCM1_0007083_7388.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Settings File
DateOfDesign = 08/05/2014
Designer = PMCC_OCI_0.1
WorkOrderID = 0007083_7388
DatasetID = 17_72_SDG_v1
NumberOfGenesInStartingSignature = 18
NumberOfGenesInSignatureMin = 18
NumberOfGenesInSignatureMax = 18
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 58274
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 0
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 0
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
FitnessFn = 0
MinFitness = 0.37
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"
SvmLearnLimit = 500000
RSeed = 27147389

[21:35:43] Initializing
[21:35:46] Running
[21:35:46] EvaluateFitnessOfStartingGeneSignatures 58274

</stderr_txt> ]]>


Maybe it is a side effect, because another WU from WCG (FAHV_ x1MRX-AS_ 0877453_ 0052_ 3-- ) crashed at the same time...
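
To see what the two failed MCM work units have in common, their settings can be compared key by key. The following is a rough Python sketch, assuming one `Key = Value` pair per line; `parse_settings` and `diff_settings` are hypothetical helper names, and only a few of the settings from the two logs are included as sample data.

```python
def parse_settings(text):
    """Collect 'Key = Value' pairs, one per line, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        # Skip lines without ' = ' or whose left side is not a plain setting name.
        if sep and key.strip().isidentifier():
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Excerpts from the two result logs posted in this thread.
wu_9182 = """WorkOrderID = 0007100_9182
DatasetID = 17_72_SDG_v1
NumberOfGenesInStartingSignature = 19
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"
RSeed = 27319183"""

wu_7388 = """WorkOrderID = 0007083_7388
DatasetID = 17_72_SDG_v1
NumberOfGenesInStartingSignature = 18
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"
RSeed = 27147389"""

differences = diff_settings(parse_settings(wu_9182), parse_settings(wu_7388))
for key, (first, second) in differences.items():
    print(f"{key}: {first} -> {second}")
```

In these excerpts the dataset and the SVM arguments are identical; only the work order, the signature length and the random seed differ.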
[Edit 1 times, last edit by LAZA74 at Aug 27, 2014 5:06:29 PM]
[Aug 27, 2014 5:05:13 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: more Errors...

This FAQ entry on signal 8, http://boincfaq.mundayweb.com/index.php?view=377 , which is the signal the log reports, may have been posted before. Hits for this error on these forums are, ahem, extremely rare. You mention a VM environment and a simultaneous crash of a FAHV task. What does that one say in the result log and in the message/event log (stored in the stdoutdae.txt/.old files)? Was the VM encroaching on memory, pushing other processes aside?
[Aug 27, 2014 5:15:49 PM]
Former Member
Re: more Errors...

And a hit from another project, UT chemistry: http://theory.cm.utexas.edu/forum/viewtopic.php?f=9&t=1503

Whatever it means, it may interest armstrdj or seippel.
[Aug 27, 2014 5:19:08 PM]
LAZA74
Re: more Errors...

And a hit from another project, UT chemistry: http://theory.cm.utexas.edu/forum/viewtopic.php?f=9&t=1503

Whatever it means, it may interest armstrdj or seippel.


The answer to the problem there is:
"The signal 8 error means " SIGFPE 8 Core Floating point exception" and is project-related. I have edited my input parameters handle this error and have not had such an error has not occurred since."

And I'm not crunching for that project anymore (besides, there is a successor ;-)



Edit:

I dug through my system a bit and found this:


laza@xubuntu:/usr/src/linux-headers-3.13.0-34-generic$ grep PREEMPT .config
# CONFIG_PREEMPT_RCU is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
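
In a kernel `.config`, a disabled option appears as a comment of the form `# CONFIG_X is not set`, while an enabled one appears as `CONFIG_X=y`. A small Python sketch for reading such output follows; `parse_kernel_config` is a hypothetical helper, and the sample is the set of PREEMPT options from the grep above, written one per line.

```python
def parse_kernel_config(text):
    """Map option names to their value, or to None when marked 'is not set'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Drop the leading '# ' and the trailing ' is not set'.
            options[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


sample = """# CONFIG_PREEMPT_RCU is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set"""

opts = parse_kernel_config(sample)
enabled = sorted(name for name, value in opts.items() if value == "y")
print(enabled)
```

Read this way, the kernel headers describe a build with voluntary preemption (CONFIG_PREEMPT_VOLUNTARY=y) rather than full preemption.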

[Edit 2 times, last edit by LAZA74 at Aug 27, 2014 6:08:11 PM]
[Aug 27, 2014 5:30:18 PM]
LAZA74
Re: more Errors...

This FAQ entry on signal 8, http://boincfaq.mundayweb.com/index.php?view=377 , which is the signal the log reports, may have been posted before. Hits for this error on these forums are, ahem, extremely rare.


Maybe because the BOINC FAQ Service entry only addresses this case: "and your operating system is Linux with a Kernel version of between 2.6.20 and 2.6.27, then read on."

You mention a VM environment and a simultaneous crash of a FAHV task. What does that one say in the result log and in the message/event log (stored in the stdoutdae.txt/.old files)? Was the VM encroaching on memory, pushing other processes aside?


Maybe this happened, but the kernel should stop/pause the WU if the VM needs RAM, CPU time, ... (in an ideal world).

Logs are here:
https://www.dropbox.com/sh/yvzc1eaap485m6u/AAAnaapDHy5m5m0Lrh9dsifka?dl=0
[Aug 27, 2014 6:01:14 PM]
seippel
Former World Community Grid Tech
Joined: Apr 16, 2009
Post Count: 392
Re: more Errors...

LAZA74, Signal 8 is a floating point exception. In general this could be an application problem or a problem with the specific computer. In this case, since other copies of the same work unit completed successfully on other computers, that would indicate a problem with this specific computer. As a starting point, I'd suggest hardware checks. Also, keep in mind that a computer with WCG installed is being utilized to a much larger extent than a normal computer, so hardware problems that might otherwise go unnoticed are more likely to show up.
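
To illustrate what the client is reporting: on POSIX systems a parent process can tell that a child was terminated by a signal, and by which one. The sketch below is a hedged Python illustration, not BOINC's actual code. The child sends SIGFPE to itself artificially instead of performing a faulting arithmetic operation, and the printed message only imitates the log line. Despite its name, SIGFPE is also delivered for integer faults such as division by zero.

```python
import signal
import subprocess
import sys

# Child program: terminate itself with SIGFPE (stand-in for a real arithmetic fault).
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGFPE)"

result = subprocess.run([sys.executable, "-c", child_code])

# On POSIX, a negative return code -N means the child was terminated by signal N.
if result.returncode < 0:
    signum = -result.returncode
    print(f"process got signal {signum} ({signal.Signals(signum).name})")
```

A parent that only sees the signal number cannot tell whether the fault came from the application's input or from the hardware, which is why comparing against other computers' results for the same work unit is informative.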

Seippel
[Sep 3, 2014 6:06:35 PM]
LAZA74
Re: more Errors...

Thanks for your advice, Seippel.

I ran a RAM check some months ago, when I had these problems for the first time.
I also set all BIOS settings back to "normal" or "standard", so that overclocking and similar side effects can be ruled out.
I attached an SSD and doubled the size of the root partition, because some WUs needed more space.

This machine has been crunching for years, so to me it looks like a problem with some(!) WUs from MCM, or a bad side effect of another WU such as FAAH (where I also get errors and have posted about them).

It seems I have to live with it and hope that this sort of WU runs out...
[Sep 4, 2014 4:50:48 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7668
Re: more Errors...

This machine has been crunching for years, so to me it looks like a problem with some(!) WUs from MCM, or a bad side effect of another WU such as FAAH (where I also get errors and have posted about them).


You may have an overheating problem. Since you have been crunching a long time, when was the last time you cleaned or blew out your heatsinks?

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 4, 2014 6:18:43 PM]
LAZA74
Re: more Errors...


You may have an overheating problem. Since you have been crunching a long time, when was the last time you cleaned or blew out your heatsinks?

Cheers


The sensor plugin is running and shows less than 60°C (140°F).
I crunch on only 3 cores, so the system load is around 80% (with some other programs running as well!).

I don't think it is an overheating problem...

Cheers
[Sep 6, 2014 8:28:15 AM]