World Community Grid - View Thread - problems with unexpected long runtimes / no end(?)

World Community Grid Forums

Category: Completed Research

Forum: FightAIDS@Home

Thread: problems with unexpected long runtimes / no end(?)


Thread Status: Active
Total posts in this thread: 13


This topic has been viewed 3895 times and has 12 replies

LAZA74
Advanced Cruncher
Germany
Joined: Sep 28, 2008
Post Count: 56
Status: Offline

problems with unexpected long runtimes / no end(?)

I sometimes have WUs that do not end, like the current one:

FAHV_x3NF9_A_IN_LEDGFa_rig_0213178_2033

estimated calculation time: 32443 GFLOPS
runtime till last checkpoint: 2 h 21 min
calculation time: 3 h 6 min
runtime: 13 hours 32 minutes !!!
progress: 83,333%

This is one of about 10(?) WUs I have had to kill because they were running endlessly...

I'm surely not the only person with such problems, but I could not find another thread about this behavior.

Xubuntu 14.04 (fresh installed)
BOINC 7.2.42
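
To put the figures above in proportion, here is a minimal sketch that compares the reported calculation time with the reported runtime. It assumes that "calculation time" means CPU time and "runtime" means elapsed wall-clock time; the function names are made up for illustration.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H h M min' reading into whole minutes."""
    return hours * 60 + minutes


def cpu_efficiency(cpu_minutes: float, elapsed_minutes: float) -> float:
    """Fraction of the elapsed time that was actually spent computing."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_minutes / elapsed_minutes


# Values taken from the post: 3 h 6 min calculation time, 13 h 32 min runtime.
cpu = to_minutes(3, 6)
elapsed = to_minutes(13, 32)
ratio = cpu_efficiency(cpu, elapsed)
print(f"CPU time is {ratio:.1%} of elapsed time")
```

A healthy CPU-bound task would normally sit close to 100%, so a ratio this low suggests the task spent most of the elapsed time not computing.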

----------------------------------------

NAS - Eigenbau
Xiaomi Mi 10T

[Apr 30, 2014 7:26:38 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: problems with unexpected long runtimes / no end(?)

Your reporting in both the MCM and FAAH forums simultaneously suggests the problem is on your end.

Rule one: reboot.

On signal 8 in your other post, this help is available: http://boincfaq.mundayweb.com/index.php?view=377 All Linux signals are discussed here: http://boincfaq.mundayweb.com/index.php?view=165

Of course, the boring 'no, I don't see any of that' reply applies too. Maybe that is the reason you don't see any previous thread.

[Apr 30, 2014 7:52:19 AM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline

Re: problems with unexpected long runtimes / no end(?)

I mentioned this behavior in a thread a long time ago in a galaxy far away. The problem indeed was on my end. I am running Linux Mint. There is a two-pronged approach that helps alleviate the problem 99% of the time.
First: Make sure your connection to the internet is stable. It appears that if connectivity is broken, even for a short time, the WU will display a "signal 11 client is dead" message. The work units are pretty resilient and do recover from these, but there appears to be a hard limit after which BOINC will kill the job.
Second: For those jobs which appear to be running in an endless loop, suspend the job, wait about a minute so another job can get a good start, and then resume the unit. Once a slot opens up and it begins running again, it will take about a minute to get its bearings and should then run normally.
I used to have these items both occur quite regularly until I gave my range extender a static IP address instead of letting the router give it a DHCP address. Since then I have only had an isolated incident of a running unit spinning its wheels and very little problem with intermittent connectivity glitches.
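
The suspend, wait, resume routine described above could be scripted roughly as follows. This is only a sketch: it assumes the `boinccmd` command-line tool is installed and accepts the form `boinccmd --task <project URL> <task name> suspend|resume`. The project URL and task name shown are placeholders, not values from this thread.

```python
import subprocess
import time


def task_command(project_url: str, task_name: str, operation: str) -> list:
    """Build the boinccmd argument list for one task operation."""
    if operation not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {operation}")
    return ["boinccmd", "--task", project_url, task_name, operation]


def suspend_then_resume(project_url, task_name, wait_seconds=60.0,
                        run=subprocess.run, sleep=time.sleep):
    """Suspend a task, wait so another task can start, then resume it."""
    run(task_command(project_url, task_name, "suspend"), check=True)
    sleep(wait_seconds)
    run(task_command(project_url, task_name, "resume"), check=True)


# Placeholder values: substitute the real project URL and task name.
EXAMPLE_URL = "http://project.example/"
EXAMPLE_TASK = "EXAMPLE_TASK_NAME"
print(" ".join(task_command(EXAMPLE_URL, EXAMPLE_TASK, "suspend")))
```

Calling `suspend_then_resume(EXAMPLE_URL, EXAMPLE_TASK)` would run both commands with a one-minute pause in between. The `run` and `sleep` parameters exist only so the routine can be tried without touching a real client.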
Hope this helps
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

----------------------------------------
[Edit 1 times, last edit by Sgt.Joe at May 1, 2014 12:57:36 AM]

[May 1, 2014 12:56:37 AM]

LAZA74
Advanced Cruncher
Germany
Joined: Sep 28, 2008
Post Count: 56
Status: Offline


Re: problems with unexpected long runtimes / no end(?)

Thanks for your response, lavaflow and Sgt. Joe, and also for the useful link to the signals!

Signal 8 = SIGFPE: Floating point exception
So it seems to me that something really bad is happening!

Or could this be a hardware problem?
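
For anyone who wants to check a signal number without the FAQ, a small sketch using Python's standard `signal` module is below. The numbers are platform-dependent, so the mapping is only what the local system reports; `signal_name` is a made-up helper. Note that SIGFPE is also raised for integer division by zero, so the name is broader than "floating point" suggests.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name the local platform gives a signal number."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


# The two signal numbers mentioned in this thread.
for number in (8, 11):
    print(number, signal_name(number))
```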

@lavaflow:
I'm on the just-released Xubuntu 14.04, which is distributed with generic kernel 3.13. Both options in the BOINC FAQ address kernels between 2.6.20 and 2.6.27, and I don't get any output from the checks either.
The FAQ also mentions: "Recently this bug was fixed in the Kernel. You will need a kernel of 2.6.25.6 or higher for this fix."
So this would be a bad regression and would affect more people!
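
The version comparison above can be made explicit with a short sketch. The three boundary versions come from the FAQ text quoted in this post; the helper names and the longer release string are illustrative assumptions.

```python
import re


def parse_kernel(release: str) -> tuple:
    """Turn a release string such as '3.13.0-24-generic' into (3, 13, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError(f"no version number in {release!r}")
    return tuple(int(part) for part in match.group(0).split("."))


AFFECTED_FROM = (2, 6, 20)   # lower bound named in the FAQ
AFFECTED_TO = (2, 6, 27)     # upper bound named in the FAQ
FIXED_IN = (2, 6, 25, 6)     # version the FAQ says contains the fix


def in_faq_range(release: str) -> bool:
    """True if the kernel falls inside the range the FAQ talks about."""
    return AFFECTED_FROM <= parse_kernel(release)[:3] <= AFFECTED_TO


def has_fix(release: str) -> bool:
    """True if the kernel is at least the version said to carry the fix."""
    return parse_kernel(release) >= FIXED_IN


# "3.13" is the kernel named in the post; the second string is a made-up example.
for release in ("3.13", "2.6.24-19-generic"):
    print(release, "in FAQ range:", in_faq_range(release), "has fix:", has_fix(release))
```

By this comparison a 3.13 kernel is outside the FAQ's range and newer than the fixed version, which is consistent with the FAQ checks producing no output.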

As a non-native English speaker it is indeed a problem to search for something you don't know about (and don't understand, btw) when you surely don't know more than two or three buzzwords...

@Sgt. Joe:
1. I checked the router and could not find any problems or instability with the internet connection. Some days it is not the fastest (6 MBit/sec), but it has worked for years without problems.
All machines in my network have a static address, but the ISP's address is not fixed!

2. If I suspend a FAAH Vina 7.20 task (or reboot, restart the BOINC client, ...) no checkpoint is made and WUs start from the beginning even if they are ~30% finished...
2a. I played a bit with this and now got a WU which restarted at 36,666% finished.
I will keep this in mind and try it if more irregular things come up.

Very strange is this WU:
https://secure.worldcommunitygrid.org/ms/devi....do?workunitId=1075421658

CPU Time / Elapsed Time (hours):
3.14 / 13.58 (after 13.58 hours I aborted the WU)
Valid WU took: 16.72 hours!

----------------------------------------

NAS - Eigenbau
Xiaomi Mi 10T

[May 3, 2014 4:31:42 PM]

LAZA74
Advanced Cruncher
Germany
Joined: Sep 28, 2008
Post Count: 56
Status: Offline


Re: problems with unexpected long runtimes / no end(?)

I reverted all my BIOS changes back to "Auto" and "Normal", but got two more WUs with problems!

I paused them both and the scheduler switched to another project; then they began to go crazy: the percentage jumped from 83,333% to 1234% and back (and also to other numbers)...
Really bad.

----------------------------------------

NAS - Eigenbau
Xiaomi Mi 10T

[May 4, 2014 2:26:11 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: problems with unexpected long runtimes / no end(?)

Having this issue also for the past 2 days on Windows. I'm opting out of this specific project, I'll crunch the other 2 until someone bothers looking into it.

It started happening around the same time as all the downtime and appears to be a workunit heartbeat issue, as if the task sits waiting for a server response before progressing. Units that used to take a few hours now take over 10 hours.

The clean energy and cancer projects are working fine.

[May 8, 2014 12:38:07 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

Re: problems with unexpected long runtimes / no end(?)

LAZA74, when you have one of these workunits, does it show CPU time being used in top or ps? Also, can you post the stderr log from one of those you have aborted?
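
As an alternative to watching top or ps by hand, the accumulated CPU time of a process can be read from `/proc/<pid>/stat` on Linux. The sketch below assumes a Linux system; the function names and the sample line are made up for illustration. Sampling twice a few seconds apart and seeing no increase would indicate the task is not using the CPU.

```python
import os


def cpu_seconds_from_stat(stat_line: str, ticks_per_second: int) -> float:
    """Extract user + system CPU seconds from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    utime_ticks = int(fields[11])  # field 14 of the full line: user time
    stime_ticks = int(fields[12])  # field 15 of the full line: system time
    return (utime_ticks + stime_ticks) / ticks_per_second


def read_cpu_seconds(pid: int) -> float:
    """Read the CPU seconds used so far by a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return cpu_seconds_from_stat(handle.read(), os.sysconf("SC_CLK_TCK"))


# Synthetic sample line, not taken from a real task.
SAMPLE = "4242 (example app) R 1 4242 4242 0 -1 4194304 500 0 0 0 12000 300 0 0 39 19 1 0"
print(cpu_seconds_from_stat(SAMPLE, 100))
```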

Thanks,
armstrdj

[May 9, 2014 2:36:10 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: problems with unexpected long runtimes / no end(?)

I am having this happen on multiple rigs (Windows and Linux).
It is only happening with the Vina units. Did we go through this once before, and was it something with the WUs themselves?

[May 11, 2014 3:26:12 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline


Re: problems with unexpected long runtimes / no end(?)

I have these very occasionally. I finally caught one on MCM1. It was running on Linux Mint 14 32-bit on a Core2 Duo. Here is the stderr_txt file.
The result is valid. When I saw it was running (accruing time) but not showing any progress, and that one core was idle, I suspended the WU and then resumed it. It then started running normally. I presume that if I had not caught it, it might have run indefinitely.
Result Log

Result Name: MCM1_0004153_1465_2--
<core_client_version>6.10.59</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.32_x86_64-pc-linux-gnu -SettingsFile MCM1_0004153_1465.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Settings File
DateOfDesign = 11/08/2013
Designer = PMCC_OCI
WorkOrderID = 4153_1465
DatasetID = 17_72_SDG_v1
NumberOfGenesInStartingSignature = 19
NumberOfGenesInSignatureMin = 10
NumberOfGenesInSignatureMax = 20
GroupVectorValues = {A}{B}{C}{D}{E}{F}
ExplicitStartingGeneSignatures = A B D F
StartingGeneSignatureAlgorithm = randomFixedLengthSearch
SearchAlgorithmNumberToCreate = 1
SearchAlgorithmSequentialStartPosition = 5
RunPermutationAlgorithm = 1
PermutationGroups = A
PermutationGroupsForReplacement = G
PermutationAlgorithm = replaceFromRandomlyToRandomlyGreedy
PermutationsNumIterations = 54959
OptimizationAlgorithmFrequency = 0 0 1
FBeta = 1.5
SimAnnealIMax = 20000
SimAnnealAlpha = 0.9996
NReps = 10
TrainFrac = 0.7
NFolds = 10
VMethod = LOO
ModelType = SVM
FitnessFn = 0
MinFitness = 0.61
SvmArgs = "-v 0 -c 0.1 -t 1 -d 2 -r 0"
SvmLearnLimit = 500000
RSeed = 400891465

[22:29:00] Initializing
[22:29:03] Running
[22:29:03] EvaluateFitnessOfStartingGeneSignatures 1
[22:29:03]: Computing pass 0
04:22:01 (14403): No heartbeat from client for 30 sec - exiting
04:22:01 (14403): timer handler: client dead, exiting
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.32_x86_64-pc-linux-gnu -SettingsFile MCM1_0004153_1465.txt -DatabaseFile dataset-17_72_SDG_v1.txt
[20:01:15] Initializing
[20:01:18] Running
[20:01:18] EvaluateFitnessOfStartingGeneSignatures 1
[20:01:19]: Computing pass 0
[21:26:14] Exiting PermutateGeneSignature
[21:26:14] Writing final output
[21:26:15] Closing Output Stream
[21:26:15] Cleaning up
Result.out = 2444912.000000
Run complete, CPU time: 26090.620000
21:26:15 (16178): called boinc_finish

</stderr_txt>
]]>
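
A log like the one above can be summarized mechanically. The sketch below counts how often the application was started, how often it exited for lack of a client heartbeat, and whether it eventually finished. The marker strings are taken from the log lines in this post; the function name and the abridged sample are illustrative.

```python
import re

HEARTBEAT_LOST = re.compile(r"No heartbeat from client for (\d+) sec")


def summarize_stderr(text: str) -> dict:
    """Count application starts, heartbeat exits, and detect a clean finish."""
    lines = text.splitlines()
    return {
        "starts": sum(1 for line in lines if line.startswith("Commandline =")),
        "heartbeat_exits": sum(1 for line in lines if HEARTBEAT_LOST.search(line)),
        "finished": any("called boinc_finish" in line for line in lines),
    }


# Abridged copy of the log above (settings block and paths shortened).
SAMPLE = """Commandline = wcgrid_mcm1 -SettingsFile MCM1_0004153_1465.txt
[22:29:03]: Computing pass 0
04:22:01 (14403): No heartbeat from client for 30 sec - exiting
04:22:01 (14403): timer handler: client dead, exiting
Commandline = wcgrid_mcm1 -SettingsFile MCM1_0004153_1465.txt
[20:01:19]: Computing pass 0
[21:26:15] Cleaning up
21:26:15 (16178): called boinc_finish
"""
print(summarize_stderr(SAMPLE))
```

For this log the summary shows two starts, one heartbeat exit, and a clean finish, which matches the description of a task that stalled once and completed after being suspended and resumed.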

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[May 12, 2014 1:11:30 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline


Re: problems with unexpected long runtimes / no end(?)

For users having issues with VINA only, this could be due to the workunits currently being run. We are seeing some larger-than-normal runtimes. Some of these workunits also experience large gaps between checkpoints. We are investigating whether we need to modify the checkpointing code to take more frequent checkpoints, or whether this will be a temporary issue affecting only a small number of batches. I would recommend that any users seeing long-running workunits allow them to continue as long as they show CPU being used. While we are investigating, it may be beneficial for users to enable the setting "Leave Applications in Memory". This leaves the application in memory when it is swapped out for another task or user work, so that it starts exactly where it left off when the task is started again.
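
For users who manage preferences through a local file rather than the website, the memory setting might be expressed as below. This is a sketch under assumptions: that the BOINC client reads a `global_prefs_override.xml` file, and that the tag `leave_apps_in_memory` corresponds to the setting named above. Both should be checked against the documentation for the installed client version. The code only builds the XML text and writes nothing to disk.

```python
import xml.etree.ElementTree as ET


def override_prefs(leave_in_memory: bool) -> str:
    """Build an override-preferences fragment (tag name is an assumption)."""
    root = ET.Element("global_preferences")
    setting = ET.SubElement(root, "leave_apps_in_memory")
    setting.text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


print(override_prefs(True))
```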

Sgt. Joe, your issue is unrelated to this. Have you tried the latest client available for download?

Thanks,
armstrdj

[May 13, 2014 2:47:09 PM]
