World Community Grid - View Thread - task restarted after reboot of a system

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: task restarted after reboot of a system


Thread Status: Active
Total posts in this thread: 12

This topic has been viewed 2676 times and has 11 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


task restarted after reboot of a system

I have seen this more than once on a 64-bit system (i5, no GPU): after a task has run for a few hours and I reboot the RH system, the task starts again from the beginning.
:-(

----------------------------------------
[Edit 2 times, last edit by Former Member at Jan 19, 2013 2:40:16 PM]

[Jan 18, 2013 12:20:42 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


CEP2: Task restarted after reboot of a system

If you go to the slots subdirectory of the BOINC data dir and find that task, please copy its stderr.txt and post it here. Here is an example of what an interrupted task looks like (in this case, a host that was shut down improperly):

INFO: No state to restore. Start from the beginning.
20:31:51 (6160): No heartbeat from core client for 30 sec - exiting
20:31:52 (6160): No heartbeat from core client for 30 sec - exiting
20:31:53 (6160): No heartbeat from core client for 30 sec - exiting
20:31:54 (6160): No heartbeat from core client for 30 sec - exiting
20:31:55 (6160): No heartbeat from core client for 30 sec - exiting
20:31:56 (6160): No heartbeat from core client for 30 sec - exiting
20:31:57 (6160): No heartbeat from core client for 30 sec - exiting
20:31:58 (6160): No heartbeat from core client for 30 sec - exiting
20:31:59 (6160): No heartbeat from core client for 30 sec - exiting
20:32:00 (6160): No heartbeat from core client for 30 sec - exiting
[20:33:59] Number of jobs = 16
[20:33:59] Starting job 0,CPU time has been restored to 0.000000.
No heartbeat: Exiting
INFO: No state to restore. Start from the beginning.
[11:11:01] Number of jobs = 16
[11:11:01] Starting job 0,CPU time has been restored to 0.000000.
[11:16:08] Finished Job #0
[11:16:08] Starting job 1,CPU time has been restored to 286.651838.
[11:31:10] Finished Job #1
[11:31:10] Starting job 2,CPU time has been restored to 1128.152432.

[Jan 18, 2013 12:33:47 PM]

l_mckeon
Senior Cruncher
Joined: Oct 20, 2007
Post Count: 439
Status: Offline

Re: task restarted after reboot of a system

Isn't this just another case of Clean Energy rarely checkpointing?

Download BoincTasks and set it to show the time since the last checkpoint, or find the slots directory and see when it was last updated.
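For the second option, here is a minimal Python sketch that reports how long ago anything was written in each slot. The function names and the `data_dir` argument are made up for illustration, and you have to supply your own BOINC data dir path. Note that a file's modification time is only a rough proxy: the task also writes other files, such as stderr.txt, so a recent write does not prove that a checkpoint was taken.

```python
import time
from pathlib import Path


def newest_mtime(slot_dir):
    """Return the newest modification time of any file under slot_dir, or None."""
    times = [p.stat().st_mtime for p in Path(slot_dir).rglob("*") if p.is_file()]
    return max(times) if times else None


def seconds_since_last_write(data_dir, now=None):
    """Map each slot name to the seconds elapsed since its most recent file write.

    data_dir is the BOINC data directory (an assumption: pass in your own path).
    Slots without files are skipped; a missing slots directory gives an empty dict.
    """
    now = time.time() if now is None else now
    slots = Path(data_dir) / "slots"
    report = {}
    if not slots.is_dir():
        return report
    for slot in sorted(slots.iterdir()):
        if slot.is_dir():
            newest = newest_mtime(slot)
            if newest is not None:
                report[slot.name] = now - newest
    return report
```

Call `seconds_since_last_write(...)` with your data dir and divide the values by 3600 to get hours. A slot that has not been touched for hours is a hint that a reboot now would cost a lot of work.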

[Jan 19, 2013 12:48:33 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: task restarted after reboot of a system

l_mckeon,

We will know much more precisely what happened, and where the task resumed, once the result's stderr.txt is posted. "that after few hours of runtime..." should have carried the task well into at least job #2 (the 3rd job), yet the report says it "... started from the beginning.", which to me means 0:00:00.

I just hope that LAIM [Leave application in memory when suspended] is on. Otherwise, if BOINC is set to pause while the machine is in use, the task will make little to no progress.

Till then...

[Jan 19, 2013 8:29:03 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: task restarted after reboot of a system

LAIM is on.

>"Isn't this just another case of Clean Energy rarely check pointing"
Yes, but that's what I do not understand: after about 3 hours of runtime a checkpoint should already have been made, shouldn't it?

[Jan 19, 2013 2:20:49 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: task restarted after reboot of a system

Any chance of getting that stderr.txt file from the job slot? Until we see it, we can't even begin to understand what the job has been doing in those hours.

Also open the message/event log file, stdoutdae.txt, and look through it from where that CEP2 job originally started to where it restarted after the reboot. Post a copy if you want us to read and analyze it.

Where the files are: look at the start of the message log, where the BOINC data dir path is printed. Slots\x is a subdirectory of that.

[Jan 19, 2013 2:28:36 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: task restarted after reboot of a system

FWIW I think this is the stderr.txt : http://bpaste.net/show/71362/

[Jan 19, 2013 2:42:26 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


CEP2: Task restarted after reboot of a system

That's what I thought. You wrote that the job went back to the beginning, which is 00:00:00. The log is clear: it did not. It went back to the beginning of job #2 (the 3rd job), which is the longest of the 16. That one takes very long, several hours or more [I saw one yesterday on my octo take over 4 hours], so if you reboot during that 3rd job, that is what you lose.

[00:06:04] Qink name = drvman
Quit requested: Exiting
[18:17:48] Number of jobs = 16
[18:17:48] Starting job 2,CPU time has been restored to 1716.514000.

Because CEP2 has these very long checkpoint intervals [there are 2, the second one somewhere around job #11/12], it is worth checking where the job is before rebooting. You can do that in the BOINC Manager Tasks view: select the task and hit the Properties button on the left. It tells you when the last checkpoint was and how much time has passed since.
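To read a stderr.txt like the one above without scanning it by eye, here is a small Python sketch. It assumes the log lines look exactly like the excerpts quoted in this thread ("Number of jobs = ..." followed by "Starting job N,CPU time has been restored to X."). The function names are made up for illustration, and nothing here is an official BOINC or CEP2 tool.

```python
import re

# Matches lines such as:
# [18:17:48] Starting job 2,CPU time has been restored to 1716.514000.
START_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Starting job (\d+),"
    r"CPU time has been restored to (\d+(?:\.\d+)?)"
)


def resume_points(stderr_text):
    """Return (clock, job_index, cpu_seconds) for the first job after each (re)start.

    A "Number of jobs =" line is treated as the marker of a fresh start or restart.
    """
    points = []
    expecting = False
    for line in stderr_text.splitlines():
        line = line.strip()
        if "Number of jobs =" in line:
            expecting = True
            continue
        match = START_RE.match(line)
        if match and expecting:
            points.append((match.group(1), int(match.group(2)), float(match.group(3))))
            expecting = False
    return points


def hms(seconds):
    """Format a number of seconds as h:mm:ss (fractions are dropped)."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


sample = """[00:06:04] Qink name = drvman
Quit requested: Exiting
[18:17:48] Number of jobs = 16
[18:17:48] Starting job 2,CPU time has been restored to 1716.514000.
"""

for clock, job, cpu in resume_points(sample):
    print(f"{clock}: resumed at job {job} with {hms(cpu)} CPU time kept")
# prints: 18:17:48: resumed at job 2 with 0:28:36 CPU time kept
```

For the log above this shows the task resuming at job #2 with about 28 minutes of CPU time kept, not at 0:00:00. Everything spent inside job #2 before the reboot is what was lost.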

[Jan 19, 2013 2:51:52 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2: Task restarted after reboot of a system

ok - thx for that explanation

[Jan 21, 2013 8:47:43 AM]

linguistian
Cruncher
Joined: Jul 3, 2008
Post Count: 4
Status: Offline

Re: CEP2: Task restarted after reboot of a system

I have the same problem - BOINC never saves the checkpoints on CEP2. I understand the problem is in that stderr.txt file, but how can it be fixed?

[Mar 25, 2013 6:51:24 AM]
