| World Community Grid Forums
Thread Status: Active Total posts in this thread: 13
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline Project Badges:
|
I sometimes get WUs that never finish, like this one:
FAHV_x3NF9_A_IN_LEDGFa_rig_0213178_2033
estimated calculation time: 32443 GFLOPS
runtime till last checkpoint: 2 h 21 min
calculation time: 3 h 6 min
runtime: 13 hours 32 minutes !!!
progress: 83.333%
This is one of about 10(?) WUs I have had to kill because they were running endlessly... I'm surely not the only person with such problems, but I could not find another thread about this behavior.
Xubuntu 14.04 (freshly installed)
BOINC 7.2.42
----------------------------------------
NAS - Eigenbau
Xiaomi Mi 10T |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Your reporting in both the MCM and FAAH forums simultaneously suggests the problem is on your end. Rule one: reboot.
On signal 8 in your other post, this help is available: http://boincfaq.mundayweb.com/index.php?view=377
All Linux signals are discussed here: http://boincfaq.mundayweb.com/index.php?view=165
Of course, the boring 'no, I don't see any of that' reply applies too. Maybe that is the reason you don't see any previous thread. |
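If you just want to turn a signal number from a BOINC message into its name without looking it up in a table, Python's standard `signal` module can do that. This is only a small sketch; the helper name `signal_name` is made up for illustration, and the numbering it reports is that of the machine it runs on.

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number, or None if unknown."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known on this platform.
        return None


# The two signal numbers mentioned in this thread.
for number in (8, 11):
    print(number, signal_name(number))
```

On a Linux box this should report SIGFPE for 8, matching the FAQ entry linked above.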
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7844 Status: Offline Project Badges:
|
I mentioned this behavior in a thread a long time ago in a galaxy far away. The problem was indeed on my end. I am running Linux Mint. There is a two-pronged approach that alleviates the problem 99% of the time.
First: Make sure your connection to the internet is stable. It appears that if connectivity is broken, even briefly, the WU will display a "signal 11 client is dead" message. The work units are pretty resilient and do recover from these, but there appears to be a hard limit after which BOINC will kill the job.
Second: For jobs that appear to be running in an endless loop, suspend the job, wait about a minute so another job can get a good start, and then resume the unit. Once it begins running again when a slot opens up, it will take about a minute to get its bearings, and should then run normally.
I used to have both of these occur quite regularly until I gave my range extender a static IP address instead of letting the router give it a DHCP address. Since then I have had only an isolated incident of a running unit spinning its wheels and very few intermittent connectivity glitches.
Hope this helps.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at May 1, 2014 12:57:36 AM] |
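For anyone who would rather script the suspend, wait, resume routine than click through the manager, here is a rough sketch built on `boinccmd --task <url> <task name> suspend|resume`. The project URL constant and the helper names `task_command` and `bounce_task` are assumptions for illustration; check the URL and the exact task name with `boinccmd --get_tasks` on your own machine first.

```python
import subprocess
import time

# Assumption: the project URL exactly as your BOINC client lists it.
PROJECT_URL = "http://www.worldcommunitygrid.org/"


def task_command(task_name, operation, project_url=PROJECT_URL):
    """Build the boinccmd argument list for one task operation."""
    if operation not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {operation}")
    return ["boinccmd", "--task", project_url, task_name, operation]


def bounce_task(task_name, wait_seconds=60, run=subprocess.run, sleep=time.sleep):
    """Suspend a task, wait so another task can start, then resume it."""
    run(task_command(task_name, "suspend"), check=True)
    sleep(wait_seconds)
    run(task_command(task_name, "resume"), check=True)
```

Calling `bounce_task("<task name from boinccmd --get_tasks>")` from the BOINC data directory would then do the one-minute bounce described above. The `run` and `sleep` parameters exist only so the function can be tried without a live client.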
||
|
|
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline Project Badges:
|
Thanks for your responses, lavaflow and Sgt. Joe, and also for the useful link to the signals!
Signal 8 = SIGFPE: Floating point exception
So it seems to me that something really bad is happening! Or could this be a hardware problem?
@lavaflow: I'm on the just-released Xubuntu 14.04, which ships with the generic 3.13 kernel. Both options in the BOINC FAQ address kernels between 2.6.20 and 2.6.27, and I get no output from the checks either. The FAQ also says: "Recently this bug was fixed in the Kernel. You will need a kernel of 2.6.25.6 or higher for this fix." So this would be a bad regression and would affect more people!
As a non-native English speaker it is indeed a problem to search for something you don't know about (and don't understand, btw) when you surely know no more than two or three buzzwords...
@Sgt. Joe:
1. I checked the router and could not find any problems or instability with the internet connection. Some days it is not the fastest (6 MBit/sec), but it has worked for years without problems. All machines in my network have a static address, but the ISP's address is not fixed!
2. If I suspend a FAAH Vina 7.20 task (or reboot, restart the BOINC client, ...), no checkpoint is made and WUs start from the beginning even if they are ~30% finished...
2a. I played with this a bit and now have a WU which restarted at 36.666% finished. I will keep this in mind and try it if more irregular things come up.
Very strange is this WU: https://secure.worldcommunitygrid.org/ms/devi....do?workunitId=1075421658
CPU Time / Elapsed Time (hours): 3.14 / 13.58 (after 13.58 hours I aborted the WU)
The valid WU took: 16.72 hours!
----------------------------------------
NAS - Eigenbau
Xiaomi Mi 10T |
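To put a number on how stuck that WU was, one can divide CPU time by elapsed time. A quick sketch using the 3.14 / 13.58 hours quoted above; the function name `cpu_efficiency` is just an illustrative helper, not anything from BOINC.

```python
def cpu_efficiency(cpu_hours, elapsed_hours):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_hours / elapsed_hours


ratio = cpu_efficiency(3.14, 13.58)
idle_hours = 13.58 - 3.14

print(f"CPU efficiency: {ratio:.1%}")
print(f"Hours without CPU use: {idle_hours:.2f}")
```

That works out to roughly 23%, meaning the task sat without using the CPU for more than ten of its thirteen and a half hours. A healthy CPU-bound task would normally be close to 100%.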
||
|
|
LAZA74
Advanced Cruncher Germany Joined: Sep 28, 2008 Post Count: 56 Status: Offline Project Badges:
|
I reset all my BIOS changes back to "Auto" and "Normal", but got two more WUs with problems!
I paused them both and the scheduler switched to another project; then they began to go crazy: the percentage jumped from 83.333% to 1234% and back (and to other numbers too)... Really bad.
----------------------------------------
NAS - Eigenbau
Xiaomi Mi 10T |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have also been having this issue for the past 2 days on Windows. I'm opting out of this specific project; I'll crunch the other 2 until someone bothers looking into it.
It started happening around the same time as all the downtime and appears to be a workunit heartbeat issue, as if it sits waiting for a server response before progressing. Units that used to take a few hours now take over 10 hours. The clean energy and cancer projects are working fine. |
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
LAZA74, when you have one of these workunits, does it use CPU time in top or ps? Also, can you post the stderr log from one you have aborted?
Thanks, armstrdj |
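If watching top is awkward, the same check can be scripted on Linux by reading the process's accumulated CPU time from `/proc/<pid>/stat` twice and comparing. This is a Linux-only sketch; `cpu_seconds` and `cpu_delta` are hypothetical helper names, and you would take the PID of the science application from top or ps.

```python
import os
import time


def cpu_seconds(pid):
    """Return user + system CPU seconds consumed so far by a process (Linux)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        data = stat_file.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split only on the last closing parenthesis.
    fields = data.rsplit(")", 1)[1].split()
    # After the command name, utime and stime are the 12th and 13th fields.
    ticks = int(fields[11]) + int(fields[12])
    return ticks / os.sysconf("SC_CLK_TCK")


def cpu_delta(pid, interval=10.0):
    """CPU seconds the process used during the sampling interval."""
    before = cpu_seconds(pid)
    time.sleep(interval)
    return cpu_seconds(pid) - before
```

A result close to the interval means the task is still crunching; a result near zero while the elapsed time keeps climbing matches the stuck behavior described in this thread.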
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I am having this happen on multiple rigs (Windows and Linux).
It is only happening with the Vina units. Did we go through this once before, and did it turn out to be something with the WUs themselves? |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7844 Status: Offline Project Badges:
|
I have these very occasionally. I finally caught one on MCM1. It was running on Linux Mint 14 32 bit on a Core2 Duo. Here is the stderr_txt file.
It is valid. When I saw it was running (accruing time) but not showing any progress, and the one core was idle, I suspended the WU and then resumed it. It then started running normally. I presume that if I had not caught it, it might have run indefinitely.
Result Log
Result Name: MCM1_ 0004153_ 1465_ 2--
<core_client_version>6.10.59</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.32_x86_64-pc-linux-gnu -SettingsFile MCM1_0004153_1465.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Settings File
DateOfDesign = 11/08/2013
Designer = PMCC_OCI
WorkOrderID = 4153_1465
DatasetID = 17_72_SDG_v1
NumberOfGenesInStartingSignature = 19
NumberOfGenesInSignatureMin = 10
NumberOfGenesInSignatureMax = 20
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 1
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 1
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 54959
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
FitnessFn = 0
MinFitness = 0.61
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"
SvmLearnLimit = 500000
RSeed = 400891465
[22:29:00] Initializing
[22:29:03] Running
[22:29:03] EvaluateFitnessOfStartingGeneSignatures 1
[22:29:03]: Computing pass 0
04:22:01 (14403): No heartbeat from client for 30 sec - exiting
04:22:01 (14403): timer handler: client dead, exiting
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.32_x86_64-pc-linux-gnu -SettingsFile MCM1_0004153_1465.txt -DatabaseFile dataset-17_72_SDG_v1.txt
[20:01:15] Initializing
[20:01:18] Running
[20:01:18] EvaluateFitnessOfStartingGeneSignatures 1
[20:01:19]: Computing pass 0
[21:26:14] Exiting PermutateGeneSignature
[21:26:14] Writing final output
[21:26:15] Closing Output Stream
[21:26:15] Cleaning up
Result.out = 2444912.000000
Run complete, CPU time: 26090.620000
21:26:15 (16178): called boinc_finish
</stderr_txt>
]]>
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
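When comparing several result logs, it can help to pull out just the heartbeat losses instead of reading each log by eye. Below is a small sketch that scans stderr text for the "No heartbeat from client" line seen in the log above. The regular expression is modeled only on that one line, so other client versions may word it differently, and `heartbeat_losses` is an illustrative name.

```python
import re

# Pattern modeled on the line in the log above:
# "04:22:01 (14403): No heartbeat from client for 30 sec - exiting"
HEARTBEAT = re.compile(
    r"(\d{2}:\d{2}:\d{2}) \((\d+)\): No heartbeat from client for (\d+) sec"
)


def heartbeat_losses(stderr_text):
    """Return (time, pid, seconds) for every heartbeat-loss line found."""
    return [
        (match.group(1), int(match.group(2)), int(match.group(3)))
        for match in HEARTBEAT.finditer(stderr_text)
    ]


sample = (
    "[22:29:03]: Computing pass 0\n"
    "04:22:01 (14403): No heartbeat from client for 30 sec - exiting\n"
    "04:22:01 (14403): timer handler: client dead, exiting\n"
)
print(heartbeat_losses(sample))
```

Each tuple marks a point where the science application gave up waiting for the client and exited, which is where the log above restarts with a fresh Commandline line.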
||
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges:
|
For users having issues with VINA only, this could be due to the workunits currently being run. We are seeing some larger than normal runtimes. Some of these workunits also experience large gaps between checkpoints. We are investigating whether we need to modify the checkpointing code to take more frequent checkpoints, or whether this is a temporary issue with only a small number of batches.
I would recommend that any users seeing long-running workunits allow them to continue as long as they show CPU being used. While we are investigating, it may be beneficial to use the setting "Leave Applications in Memory". This leaves the application in memory when it is swapped out for another task or user work, so it starts exactly where it left off when the task is started again.
Sgt. Joe, your issue is unrelated to this. Have you tried the latest client available for download?
Thanks, armstrdj |
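The setting is normally changed in the device profile on the website or in the BOINC Manager preferences. For headless machines, a possible alternative is the local override file. The sketch below assumes the file is `global_prefs_override.xml` in the BOINC data directory, with a `<global_preferences>` root and a `<leave_apps_in_memory>` element; please verify both names against the BOINC client documentation for your version. The function `set_leave_apps_in_memory` is a made-up helper that only edits XML text and touches no files.

```python
import xml.etree.ElementTree as ET


def set_leave_apps_in_memory(override_xml, enabled=True):
    """Return override XML with leave_apps_in_memory set, keeping other prefs."""
    if override_xml.strip():
        root = ET.fromstring(override_xml)
    else:
        # No override file yet: start from an empty preferences element.
        root = ET.Element("global_preferences")
    node = root.find("leave_apps_in_memory")
    if node is None:
        node = ET.SubElement(root, "leave_apps_in_memory")
    node.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


print(set_leave_apps_in_memory(""))
```

After writing the result back to the override file, `boinccmd --read_global_prefs_override` should make a running client pick it up. Keep a copy of the original file before changing it.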
||
|
|
|