World Community Grid Forums
Thread Status: Active Total posts in this thread: 21
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline
Hi everybody,
I posted some months ago about random errors in different WUs from WCG. Because of these errors I checked the whole machine and set all timings and settings to the standard/normal level; the hardware is in the signature. Today I got another error from MCM (and had the time to hunt it down):
MCM1_0007100_9182_0
<core_client_version>7.2.42</core_client_version>
I cannot see why this WU got an error (maybe because I started a VM?). The same problem also means that my virtual machines start, but with graphical errors, and I MUST reboot the physical machine. Any help will be appreciated.
Thanks in advance
LAZA
NAS - Eigenbau
Xiaomi Mi 10T
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline
As I see now, I got another one the day before:
MCM1_0007083_7388
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 8
</message>
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.32_x86_64-pc-linux-gnu -SettingsFile MCM1_0007083_7388.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Settings File
DateOfDesign = 08/05/2014
Designer = PMCC_OCI_0.1
WorkOrderID = 0007083_7388
DatasetID = 17_72_SDG_v1
NumberOfGenesInStartingSignature = 18
NumberOfGenesInSignatureMin = 18
NumberOfGenesInSignatureMax = 18
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 58274
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 0
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 0
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
FitnessFn = 0
MinFitness = 0.37
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"
SvmLearnLimit = 500000
RSeed = 27147389
[21:35:43] Initializing
[21:35:46] Running
[21:35:46] EvaluateFitnessOfStartingGeneSignatures 58274
</stderr_txt>
]]>
Maybe it is a side effect, because another WU from WCG (FAHV_x1MRX-AS_0877453_0052_3) crashed at the same time...
[Edit 1 times, last edit by LAZA74 at Aug 27, 2014 5:06:29 PM]
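For anyone collecting several failed results like the one above, the signal number can be pulled out of the result text and mapped to its name. This is only a rough sketch: the function name `signal_from_result` is made up for illustration, and the only thing it relies on is the "process got signal N" wording that appears in the result shown in this thread.

```python
import re
import signal


def signal_from_result(text):
    """Return (number, name) for a 'process got signal N' message, or None.

    Hypothetical helper: it only looks for the wording seen in the
    result text posted in this thread.
    """
    match = re.search(r"process got signal (\d+)", text)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # Map the number to the platform's signal name.
        name = signal.Signals(number).name
    except ValueError:
        name = "unknown"
    return number, name


sample = "<message> process got signal 8 </message>"
print(signal_from_result(sample))  # on Linux this prints (8, 'SIGFPE')
```

The name lookup depends on the platform the script runs on, so it is best run on the same Linux machine that produced the error.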
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
This FAQ on signal 8, which the log reports as having occurred, may have been posted before: http://boincfaq.mundayweb.com/index.php?view=377 . Hits for this error on these forums are, ahem, extremely rare. You mention a VM environment and a simultaneous crash of a FAHV task. What does that one say in the result log and in the message/event log (stored in the stdoutdae.txt/.old files)? Was the VM encroaching on memory, pushing other processes aside?
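To see what the event log says about a particular task, one option is to filter the log for lines that mention it. The sketch below assumes nothing about the log format beyond it being plain text; the helper names and the example path are hypothetical, and the real location of stdoutdae.txt depends on the installation.

```python
from pathlib import Path


def filter_lines(lines, needle):
    """Keep only the lines that contain the given text (e.g. a task name)."""
    return [line.rstrip("\n") for line in lines if needle in line]


def lines_mentioning(log_path, needle):
    """Return matching lines from a log file, or [] if the file is missing.

    log_path is whatever path the local BOINC data directory uses;
    this sketch does not guess it.
    """
    path = Path(log_path)
    if not path.is_file():
        return []
    with path.open(encoding="utf-8", errors="replace") as handle:
        return filter_lines(handle, needle)


# Example call with a placeholder path; adjust to the real data directory.
for line in lines_mentioning("stdoutdae.txt", "FAHV_"):
    print(line)
```

Running the same filter over the .old file as well covers the period before the last log rotation.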
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
And a hit from another project, UT chemistry: http://theory.cm.utexas.edu/forum/viewtopic.php?f=9&t=1503
Whatever it means, it may interest armstrdj or seippel.
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline
"And a find at another project, ut chemistry, http://theory.cm.utexas.edu/forum/viewtopic.php?f=9&t=1503 Whatever the meaning is, maybe it can interest armstrdj or seippel."
The answer to the problem there is: "The signal 8 error means " SIGFPE 8 Core Floating point exception" and is project-related. I have edited my input parameters handle this error and have not had such an error has not occurred since."
And I'm not crunching for that project anymore (there is also a successor ;-)
Edit: I crawled a bit through my system and found this:
[Edit 2 times, last edit by LAZA74 at Aug 27, 2014 6:08:11 PM]
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline
"Reference to this faq on signal 8 http://boincfaq.mundayweb.com/index.php?view=377 , which it says to have occurred in the log, may have been posted before. Finds on these forums of this error are, ahum, extremely rare."
Maybe that is because the BOINC FAQ Service entry only applies under this condition: "and your operating system is Linux with a Kernel version of between 2.6.20 and 2.6.27, then read on."
"You mention a VM environment and a parallel simultaneous crash of a fahv task. What does that one say in the result log and the message/event log (data stored in the stdoutdae.txt/.old files). Was the VM encroaching on memory, pushing other processes aside?"
Maybe this happened, but the kernel should stop/pause the WU if the VM needs RAM, CPU time, ... (in an ideal world). Logs are here: https://www.dropbox.com/sh/yvzc1eaap485m6u/AAAnaapDHy5m5m0Lrh9dsifka?dl=0
seippel
Former World Community Grid Tech Joined: Apr 16, 2009 Post Count: 392 Status: Offline
LAZA74, Signal 8 is a floating point exception. In general this could be an application problem or a problem with the specific computer. In this case, since other copies of the same work unit completed successfully on other computers, that would indicate a problem with this specific computer. As a starting point, I'd suggest hardware checks. Also, keep in mind that a computer with WCG installed is being utilized to a much larger extent than a normal computer, so hardware problems that might otherwise go unnoticed are more likely to show up.
Seippel
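As a small illustration of what "process got signal 8" means at the operating system level, the sketch below starts a child process that sends SIGFPE to itself and then inspects how the parent sees the exit. It is an assumption-laden demo for Linux/POSIX only, not how the MCM application or the BOINC client works internally; the function name is made up.

```python
import signal
import subprocess
import sys


def run_child_killed_by_sigfpe():
    """Start a child that sends SIGFPE to itself; return its return code.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGFPE)"
    result = subprocess.run([sys.executable, "-c", child_code])
    return result.returncode


code = run_child_killed_by_sigfpe()
if code < 0:
    # A negative code means the child was terminated by a signal.
    print("child got signal", -code, signal.Signals(-code).name)
else:
    print("child exited normally with code", code)
```

Here the signal is sent on purpose. In a real crash the kernel raises it when the program executes an arithmetic operation the CPU traps on, for example an integer division by zero in native code.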
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline
Thanks Seippel for your advice.
I did a RAM check some months ago when I had the problems for the first time. I also set all BIOS settings back to "normal" or "standard" so that overclocking and other side effects can be ruled out. I attached an SSD and doubled the space for the root partition because some WUs needed more space. This machine has been crunching for years, so to me it looks like a problem with some(!) WUs from MCM, or a bad side effect of another WU such as FAAH (where I also get errors and have posted about them). It seems that I have to live with it and hope that this sort of WU runs out...
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7668 Status: Offline
"This machine has been crunching for years, so to me it looks like a problem with some(!) WUs from MCM, or a bad side effect of another WU such as FAAH (where I also get errors and have posted about them)."
You may have an overheating problem. Since you have been crunching a long time, when was the last time you cleaned or blew out your heatsinks?
Cheers
Sgt. Joe
*Minnesota Crunchers* |
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline
"You may have an overheating problem. Since you have been crunching a long time, when was the last time you cleaned or blew out your heatsinks ? Cheers"
The sensor plugin is running and shows not even 60°C (= 140 Fahrenheit). I crunch with only 3 cores, so the system load is roughly 80% (with some other programs running as well!). I do not think it is an overheating problem...
Cheers