Thread Status: Active
Total posts in this thread: 12
This topic has been viewed 2676 times and has 11 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
task restarted after reboot of a system

I've observed more than once on a 64-bit system (i5, no GPU) that after a task had run for a few hours and I rebooted the RH system, the task started again from the beginning.
:-(
----------------------------------------
[Edit 2 times, last edit by Former Member at Jan 19, 2013 2:40:16 PM]
[Jan 18, 2013 12:20:42 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
CEP2: Task restarted from after reboot of a system

If you go to the slots subdirectory of the BOINC data directory and find that task, please post a copy of its stderr.txt. Here is an example of what an interrupted task looks like (in this case, a host that was shut down improperly):

INFO: No state to restore. Start from the beginning.
20:31:51 (6160): No heartbeat from core client for 30 sec - exiting
20:31:52 (6160): No heartbeat from core client for 30 sec - exiting
20:31:53 (6160): No heartbeat from core client for 30 sec - exiting
20:31:54 (6160): No heartbeat from core client for 30 sec - exiting
20:31:55 (6160): No heartbeat from core client for 30 sec - exiting
20:31:56 (6160): No heartbeat from core client for 30 sec - exiting
20:31:57 (6160): No heartbeat from core client for 30 sec - exiting
20:31:58 (6160): No heartbeat from core client for 30 sec - exiting
20:31:59 (6160): No heartbeat from core client for 30 sec - exiting
20:32:00 (6160): No heartbeat from core client for 30 sec - exiting
[20:33:59] Number of jobs = 16
[20:33:59] Starting job 0,CPU time has been restored to 0.000000.
No heartbeat: Exiting
INFO: No state to restore. Start from the beginning.
[11:11:01] Number of jobs = 16
[11:11:01] Starting job 0,CPU time has been restored to 0.000000.
[11:16:08] Finished Job #0
[11:16:08] Starting job 1,CPU time has been restored to 286.651838.
[11:31:10] Finished Job #1
[11:31:10] Starting job 2,CPU time has been restored to 1128.152432.
[Jan 18, 2013 12:33:47 PM]
l_mckeon
Senior Cruncher
Joined: Oct 20, 2007
Post Count: 439
Status: Offline
Re: task restarted from after reboot of a system

Isn't this just another case of Clean Energy rarely checkpointing?

Download BoincTasks and set it to show the time since the last checkpoint, or find the slots directory and see when it was last updated.
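
For the "see when last updated" route, here is a minimal Python sketch that reports how long ago anything in a slot directory was last written. The function name is made up for illustration, and the slot path has to be taken from your own BOINC data directory. Treat the result as a rough hint only: stderr.txt lives in the same directory and is written between checkpoints too, so a recent write does not prove a recent checkpoint.

```python
import time
from pathlib import Path


def seconds_since_last_write(slot_dir, now=None):
    """Return seconds since the newest file in slot_dir was modified.

    Returns None when the directory contains no files.
    `now` can be passed in explicitly (as a Unix timestamp) for testing.
    """
    now = time.time() if now is None else now
    # Collect modification times of all regular files below the slot directory.
    mtimes = [p.stat().st_mtime for p in Path(slot_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)
```

Call it with the path of the slot that holds the task, for example `seconds_since_last_write(slot_path) / 60` to get minutes, where `slot_path` is whatever slots subdirectory the task is using on your machine.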
[Jan 19, 2013 12:48:33 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: task restarted from after reboot of a system

l_mckeon,

We'll know much more precisely what happened and where the task resumed once the stderr.txt for that result is posted. "that after few hours of runtime..." should have carried the task well into at least job #2 (the 3rd job), but you say it "... started from the beginning", which to me means 0:00:00.

I just hope that LAIM [Leave application in memory when suspended] is on. Otherwise, if BOINC is set to pause while the machine is in use, the task will make little to no progress.

Till then...
[Jan 19, 2013 8:29:03 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: task restarted from after reboot of a system

LAIM is on.

>"Isn't this just another case of Clean Energy rarely check pointing"
Yes, but that's what I don't understand: after about 3 hours of runtime a checkpoint should already have been made, shouldn't it?
[Jan 19, 2013 2:20:49 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: task restarted from after reboot of a system

Any chance of getting that stderr.txt file from the job slot? Until that is seen we can't even begin to understand what the job has been doing in those hours.

Additionally, open the message/event log file stdoutdae.txt and search through it from where that CEP2 job originally started through to where it restarted after the reboot. Post a copy for us to read and analyze, if you want us to.

Where the files are: look at the start of the message log, where the BOINC data directory path is printed. slots\x is a subdirectory of that.
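
If scrolling through stdoutdae.txt by hand is tedious, a small Python sketch like the following pulls out every line that mentions a given piece of text, such as part of the task name. It assumes nothing about the log's line format and simply does a case-insensitive substring match; the function name is made up for illustration.

```python
def lines_mentioning(log_path, needle):
    """Return (line_number, line) pairs from the log that contain `needle`.

    The match is a plain case-insensitive substring test, so it works
    regardless of how the log lines are laid out.
    """
    hits = []
    wanted = needle.lower()
    # errors="replace" keeps going even if the log has odd bytes in it.
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if wanted in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits
```

Pass it the full path to stdoutdae.txt in your BOINC data directory and the text to look for, then post the matching lines together with a few lines around the reboot.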
[Jan 19, 2013 2:28:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: task restarted from after reboot of a system

FWIW I think this is the stderr.txt : http://bpaste.net/show/71362/
[Jan 19, 2013 2:42:26 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
CEP2: Task restarted from after reboot of a system

That's what I thought. You wrote that the job went back to the beginning, which would be 00:00:00. The log is clear that it did not: it went back to the beginning of job #2 (the 3rd job), which is the longest of the 16. That one takes a very long time, several hours or more [I saw one yesterday on my octo taking over 4 hours], so if you reboot during that 3rd job, that's what you lose.

[00:06:04] Qink name = drvman
Quit requested: Exiting
[18:17:48] Number of jobs = 16
[18:17:48] Starting job 2,CPU time has been restored to 1716.514000.

The fact that CEP2 has these very long checkpoint intervals [there are 2, the second one somewhere around job #11/12] is a reason to check where the job is before rebooting. You can do that in the BOINC Manager Tasks view: select the task and hit the Properties button on the left. It tells you when the last checkpoint was and how much time has passed since.
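
To read the resume point out of a stderr.txt without eyeballing it, here is a rough Python sketch. It assumes the "Starting job N,CPU time has been restored to X." line format seen in the log excerpt above; the function names are made up for illustration and this is not part of BOINC or CEP2.

```python
import re

# Matches lines such as:
# [18:17:48] Starting job 2,CPU time has been restored to 1716.514000.
START_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Starting job (\d+),\s*"
    r"CPU time has been restored to (\d+(?:\.\d+)?)\.?\s*$"
)


def job_starts(stderr_text):
    """Return a list of (clock_time, job_index, restored_cpu_seconds) tuples."""
    starts = []
    for line in stderr_text.splitlines():
        match = START_RE.match(line.strip())
        if match:
            starts.append((match.group(1), int(match.group(2)), float(match.group(3))))
    return starts


def resume_point(stderr_text):
    """Return (job_index, restored_cpu_seconds) for the last job start, or None."""
    starts = job_starts(stderr_text)
    if not starts:
        return None
    _, job, cpu = starts[-1]
    return job, cpu
```

Fed the four log lines quoted above, `resume_point` gives job 2 with about 1716.5 seconds of CPU time restored, meaning the work done on jobs 0 and 1 was kept and only the progress inside job 2 was lost.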
[Jan 19, 2013 2:51:52 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2: Task restarted from after reboot of a system

ok - thx for that explanation
[Jan 21, 2013 8:47:43 AM]
linguistian
Cruncher
Joined: Jul 3, 2008
Post Count: 4
Status: Offline
Re: CEP2: Task restarted from after reboot of a system

I have the same problem - BOINC never saves the checkpoints on CEP2. I understand the problem is in that stderr.txt file, but how can it be fixed?
[Mar 25, 2013 6:51:24 AM]