| World Community Grid Forums
Thread Status: Active Total posts in this thread: 12
|
|
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have observed more than once on a 64-bit system (i5, no GPU) that when the RH system is rebooted after a task has run for a few hours, that task starts again from the beginning.
----------------------------------------:-( [Edit 2 times, last edit by Former Member at Jan 19, 2013 2:40:16 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
If you go into the slots subdirectory of the BOINC data dir and find that task, please copy its stderr.txt and post it here. Here is an example of what an interrupted task looks like (in this case, a host that was shut down improperly):
INFO: No state to restore. Start from the beginning.
20:31:51 (6160): No heartbeat from core client for 30 sec - exiting
20:31:52 (6160): No heartbeat from core client for 30 sec - exiting
20:31:53 (6160): No heartbeat from core client for 30 sec - exiting
20:31:54 (6160): No heartbeat from core client for 30 sec - exiting
20:31:55 (6160): No heartbeat from core client for 30 sec - exiting
20:31:56 (6160): No heartbeat from core client for 30 sec - exiting
20:31:57 (6160): No heartbeat from core client for 30 sec - exiting
20:31:58 (6160): No heartbeat from core client for 30 sec - exiting
20:31:59 (6160): No heartbeat from core client for 30 sec - exiting
20:32:00 (6160): No heartbeat from core client for 30 sec - exiting
[20:33:59] Number of jobs = 16
[20:33:59] Starting job 0,CPU time has been restored to 0.000000.
No heartbeat: Exiting
INFO: No state to restore. Start from the beginning.
[11:11:01] Number of jobs = 16
[11:11:01] Starting job 0,CPU time has been restored to 0.000000.
[11:16:08] Finished Job #0
[11:16:08] Starting job 1,CPU time has been restored to 286.651838.
[11:31:10] Finished Job #1
[11:31:10] Starting job 2,CPU time has been restored to 1128.152432. |
||
|
|
l_mckeon
Senior Cruncher Joined: Oct 20, 2007 Post Count: 439 Status: Offline Project Badges:
|
Isn't this just another case of Clean Energy rarely checkpointing?
Download BoincTasks and set it to show the time since the last checkpoint, or find the slots directory and see when it was last updated. |
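For anyone who prefers a script to browsing the folder by hand, the "see when last updated" check could be sketched roughly as below. This is only an illustration: the function names are made up for this post, the data directory path has to be supplied by you, and a file's modification time is just a rough proxy, since stderr.txt and other files in the slot are also written between checkpoints.

```python
import time
from pathlib import Path
from typing import Dict, Optional


def newest_mtime(slot_dir: Path) -> float:
    """Return the most recent modification time of any file under slot_dir."""
    times = [p.stat().st_mtime for p in slot_dir.rglob("*") if p.is_file()]
    # Fall back to the directory's own mtime if the slot holds no files.
    return max(times, default=slot_dir.stat().st_mtime)


def slot_ages(data_dir: Path, now: Optional[float] = None) -> Dict[str, float]:
    """Map each slot name to the seconds since its newest file was modified.

    data_dir is the BOINC data directory (an assumption: pass whatever
    path your own message log prints at startup).
    """
    now = time.time() if now is None else now
    slots = data_dir / "slots"
    return {
        d.name: now - newest_mtime(d)
        for d in sorted(slots.iterdir())
        if d.is_dir()
    }
```

Calling slot_ages(Path(...)) with your own data dir returns one age in seconds per slot; a large value on a running CEP2 task suggests nothing has been written to the slot for a long while.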
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
l_mckeon,
We will know much more precisely what this is and where the task resumes once the result's stderr.txt is posted. "that after few hours of runtime..." should have carried the task well into at least job #2 (the 3rd job), but the report says "... started from the beginning.", which to me means 0:00:00. I just hope that LAIM [Leave application in memory when suspended] is on; otherwise, when the machine is in use and BOINC is set to pause during use, the task will make little to no progress. Till then... |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
LAIM is on.
>"Isn't this just another case of Clean Energy rarely checkpointing" Yes, but that's what I do not understand: after about 3 hours of runtime, shouldn't a checkpoint have been made already? |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Any chance of getting that stderr.txt file from the job slot? Until we see it, we can't even begin to understand what the job has been doing in those hours.
Additionally, open the message/event log file stdoutdae.txt and search through it from where that CEP2 job originally started to where it restarted after the boot. Post a copy for us to read and analyze, if you want us to. To find the files, look at the start of the message log, where the BOINC data dir path is printed. Slots\x is a subdirectory structure under that path. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
FWIW I think this is the stderr.txt : http://bpaste.net/show/71362/
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
That's what I thought. You wrote that the job went back to the beginning, which is 00:00:00. The log is clear that it did not: it went back to the beginning of job #2 (the 3rd job), which is the longest of the 16. That one takes very long, several hours or more [I saw one yesterday on my octo taking over 4 hours], so if you reboot during that 3rd job, that's what you lose.
[00:06:04] Qink name = drvman
Quit requested: Exiting
[18:17:48] Number of jobs = 16
[18:17:48] Starting job 2,CPU time has been restored to 1716.514000.
The fact that CEP2 has these very long checkpoint intervals [there are 2, the second one somewhere around job #11/12] is reason to check where the job is before rebooting. You can do that in the BOINC Manager Tasks view: select the task and hit the Properties button on the left. It tells you when the last checkpoint was and how much time has passed since. |
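To read the resume point out of a stderr.txt without scanning it by eye, something like the following sketch might help. It assumes the "Starting job N,CPU time has been restored to X." wording seen in the logs quoted in this thread; other CEP2 versions may word it differently, and the function names are made up for illustration.

```python
import re

# Matches entries such as:
#   [18:17:48] Starting job 2,CPU time has been restored to 1716.514000.
RESTORE_RE = re.compile(
    r"Starting job (\d+),\s*CPU time has been restored to (\d+(?:\.\d+)?)"
)


def restore_points(log_text):
    """Return (job_index, cpu_seconds) for every 'Starting job' entry, in order."""
    return [(int(job), float(secs)) for job, secs in RESTORE_RE.findall(log_text)]


def last_resume(log_text):
    """Return the most recent (job_index, cpu_seconds), or None if there is none."""
    points = restore_points(log_text)
    return points[-1] if points else None


sample = (
    "[00:06:04] Qink name = drvman\n"
    "Quit requested: Exiting\n"
    "[18:17:48] Number of jobs = 16\n"
    "[18:17:48] Starting job 2,CPU time has been restored to 1716.514000.\n"
)
print(last_resume(sample))
```

A last entry with job index 0 and 0 CPU seconds would mean a true restart from the beginning; any higher job index means the task resumed from a checkpoint at the start of that job.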
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
ok - thx for that explanation
|
||
|
|
linguistian
Cruncher Joined: Jul 3, 2008 Post Count: 4 Status: Offline Project Badges:
|
I have the same problem - BOINC never saves the checkpoints on CEP2. I understand the problem is in that stderr.txt file, but how can it be fixed?
|
||
|
|
|