Thread Status: Active
Total posts in this thread: 14
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Uneffective work: saving problems

Thanks SekeRob,

Just to clarify: the machine is put into hibernation, not rebooted. The "Keep the WUs in memory" option is enabled as well; however, at the moment I am running only WCG on the CPU, so this task is not being switched off and on.

I have been crunching many BOINC projects for many years (albeit with pauses), and I think I can tell the difference between proper WU behaviour and misbehaviour when I spot it. If an i5 Haswell laptop (a pretty modern machine, though not a desktop) does not qualify, then I guess that could be better communicated.

The other cruncher's result log is as follows:


Result Log

Result Name: E228006_406_S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_s1_14_1--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[11:18:48] Number of jobs = 8
[11:18:48] Starting job 0,CPU time has been restored to 0.000000.
[19:42:27] Finished Job #0
[19:42:27] Starting job 1,CPU time has been restored to 17690.638201.
[20:15:36] Finished Job #1
[20:15:36] Starting job 2,CPU time has been restored to 18871.440970.
[20:46:53] Finished Job #2
[20:46:53] Starting job 3,CPU time has been restored to 19959.345144.
[21:29:15] Finished Job #3
[21:29:15] Starting job 4,CPU time has been restored to 21474.504856.
[21:59:52] Finished Job #4
[21:59:52] Starting job 5,CPU time has been restored to 22559.398211.
[22:26:01] Finished Job #5
[22:26:01] Starting job 6,CPU time has been restored to 23502.751858.
Application exited with RC = 0x1
[09:50:32] Finished Job #6
[09:50:32] Starting job 7,CPU time has been restored to 37075.057259.
[09:50:32] Skipping Job #7
09:50:41 (4884): called boinc_finish

</stderr_txt>
]]>
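
For anyone who wants to compare the two result logs, here is a minimal Python sketch that estimates the CPU seconds spent on each job. It rests on an assumption about the log, not on anything documented: that the value in "CPU time has been restored to" at the start of job N+1 is the CPU time accumulated up to the end of job N, and that job numbers are consecutive. The names `cpu_seconds_per_job` and `SAMPLE_LOG` are made up for this example.

```python
import re

# Matches lines like:
#   [19:42:27] Starting job 1,CPU time has been restored to 17690.638201.
START_RE = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)"
)

# Excerpt copied from the result log above.
SAMPLE_LOG = """\
[11:18:48] Starting job 0,CPU time has been restored to 0.000000.
[19:42:27] Finished Job #0
[19:42:27] Starting job 1,CPU time has been restored to 17690.638201.
[20:15:36] Finished Job #1
[20:15:36] Starting job 2,CPU time has been restored to 18871.440970.
"""


def cpu_seconds_per_job(stderr_text):
    """Return {job_number: cpu_seconds} for every job that has a successor.

    Assumption: the restored value logged when job N+1 starts equals the
    CPU time accumulated by the end of job N.
    """
    restored = {}
    for line in stderr_text.splitlines():
        match = START_RE.search(line)
        if match:
            # A restarted job logs the same value again; keep the first one.
            restored.setdefault(int(match.group(1)), float(match.group(2)))
    jobs = sorted(restored)
    # The last job seen has no successor, so it gets no entry.
    return {a: restored[b] - restored[a] for a, b in zip(jobs, jobs[1:])}
```

Calling `cpu_seconds_per_job(SAMPLE_LOG)` gives one entry for job 0 and one for job 1; pasting in a whole `<stderr_txt>` section works the same way.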

Unfortunately, to add insult to injury, it looks like this WU likes to restart itself as well :/

2015-01-31 13:05:43 | World Community Grid | Task E228006_406_S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_s1_14_0 exited with zero status but no 'finished' file
2015-01-31 13:05:43 | World Community Grid | If this happens repeatedly you may need to reset the project.

Any ideas what I could do? Thanks for the help.
[Jan 31, 2015 12:51:10 PM]
cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Re: Uneffective work: saving problems

Aegis... yes, CEP2 is very difficult to crunch (that's putting it nicely :D). Unfortunately, it is not a project where everybody can participate (without jumping through hoops to get WUs out the door). You really need a dedicated computer that runs 24x7, or you should only crunch CEP2 when you know that you aren't going to turn off your computer (along with some babysitting).
> The fact that BOINC manager sucks and does not inform you when the last checkpoint has actually been done does not help either.
Actually, you can see when the WU hit its latest checkpoint: in BOINC Manager, select the desired WU and then select "Properties" in the left-side menu. The field you are looking for (in the popup) is "CPU Time at last checkpoint".
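
If you prefer a script to clicking through the popup, the same figure can probably be read from the command-line tool. This is only a sketch under assumptions: that the text printed by `boinccmd --get_tasks` contains per-task lines of the form `name: ...`, `checkpoint CPU time: ...` and `current CPU time: ...` (check the output of your own client version before relying on it). The function name `unsaved_cpu_seconds`, the task name and the numbers in `SAMPLE_OUTPUT` are invented for illustration.

```python
# Hypothetical sample shaped like the ASSUMED `boinccmd --get_tasks` output.
# The task name and the values are invented for illustration only.
SAMPLE_OUTPUT = """\
======== Tasks ========
1) -----------
   name: example_task_0
   WU name: example_task
   project URL: http://www.worldcommunitygrid.org/
   checkpoint CPU time: 1200.000000
   current CPU time: 1800.000000
"""


def unsaved_cpu_seconds(get_tasks_text):
    """Return {task_name: current CPU time - checkpoint CPU time}.

    This is the CPU time that would be lost if the task were stopped now
    and resumed from its last checkpoint.
    """
    wanted = ("checkpoint CPU time", "current CPU time")
    tasks = {}
    current_task = None
    for raw_line in get_tasks_text.splitlines():
        key, sep, value = raw_line.strip().partition(":")
        if not sep:
            continue  # separator lines such as "1) -----------"
        key, value = key.strip(), value.strip()
        if key == "name":  # "WU name" is a different key and is ignored
            current_task = value
            tasks[current_task] = {}
        elif current_task is not None and key in wanted:
            tasks[current_task][key] = float(value)
    return {
        name: fields["current CPU time"] - fields["checkpoint CPU time"]
        for name, fields in tasks.items()
        if len(fields) == len(wanted)
    }
```

With the made-up sample, `unsaved_cpu_seconds(SAMPLE_OUTPUT)` reports 600 seconds of work not yet covered by a checkpoint for `example_task_0`.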

CJSL

Crunching like there's no tomorrow...
----------------------------------------
I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team


[Jan 31, 2015 12:58:17 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Uneffective work: saving problems


> > The fact that BOINC manager sucks and does not inform you when the last checkpoint has actually been done does not help either.
> Actually you can see when the latest checkpoint was hit by the WU: In BOINC Manager select the desired WU and then select "Properties" on the left side menu. The field (in the popup) you would be looking for is "CPU Time at last checkpoint".


Thank you, CJSL! How did I miss it? How long have we had this feature? I remember times when it was recommended to actually look at the modification time of the files... :)

Anyway, after one more restart the task finally finished the simulation successfully and decided that was enough. Looking at the log, it had such restarts before, after varying amounts of elapsed time. Well, I guess it is not a stable simulation.

Result Name: E228006_406_S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_s1_14_0--
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[18:15:27] Number of jobs = 8
[18:15:27] Starting job 0,CPU time has been restored to 0.000000.
23:14:10 (8980): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:14:21] Number of jobs = 8
[23:14:21] Starting job 0,CPU time has been restored to 0.000000.
09:38:47 (9108): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:39:09] Number of jobs = 8
[09:39:09] Starting job 0,CPU time has been restored to 0.000000.
09:39:54 (9676): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:40:17] Number of jobs = 8
[09:40:17] Starting job 0,CPU time has been restored to 0.000000.
13:39:43 (5460): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:39:52] Number of jobs = 8
[13:39:52] Starting job 0,CPU time has been restored to 0.000000.
[20:44:35] Finished Job #0
[20:44:35] Starting job 1,CPU time has been restored to 17415.656250.
[21:10:28] Finished Job #1
[21:10:28] Starting job 2,CPU time has been restored to 18890.718750.
[21:33:36] Finished Job #2
[21:33:36] Starting job 3,CPU time has been restored to 20209.812500.
[22:20:15] Finished Job #3
[22:20:15] Starting job 4,CPU time has been restored to 22740.093750.
[01:51:07] Finished Job #4
[01:51:07] Starting job 5,CPU time has been restored to 24750.359375.
12:21:04 (8608): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:21:11] Number of jobs = 8
[12:21:11] Starting job 5,CPU time has been restored to 24750.359375.
[12:58:44] Finished Job #5
[12:58:44] Starting job 6,CPU time has been restored to 26381.281250.
17:02:37 (9140): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[17:08:15] Number of jobs = 8
[17:08:15] Starting job 6,CPU time has been restored to 26381.281250.
18:42:36 (4248): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[18:42:43] Number of jobs = 8
[18:42:43] Starting job 6,CPU time has been restored to 26381.281250.
21:44:57 (10584): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[21:45:05] Number of jobs = 8
[21:45:05] Starting job 6,CPU time has been restored to 26381.281250.
01:18:37 (10288): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:18:51] Number of jobs = 8
[01:18:51] Starting job 6,CPU time has been restored to 26381.281250.
03:18:54 (10580): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[03:19:09] Number of jobs = 8
[03:19:09] Starting job 6,CPU time has been restored to 26381.281250.
12:04:40 (10384): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:04:53] Number of jobs = 8
[12:04:53] Starting job 6,CPU time has been restored to 26381.281250.
13:05:35 (10840): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:06:06] Number of jobs = 8
[13:06:06] Starting job 6,CPU time has been restored to 26381.281250.
13:34:45 (11496): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:35:23] Number of jobs = 8
[13:35:23] Starting job 6,CPU time has been restored to 26381.281250.
[14:05:29] Number of jobs = 8
[14:05:29] Starting job 6,CPU time has been restored to 26381.281250.
Application exited with RC = 0x1
[16:40:34] Finished Job #6
[16:40:34] Starting job 7,CPU time has been restored to 35171.781250.
[16:40:34] Skipping Job #7
16:40:43 (4316): called boinc_finish

</stderr_txt>
]]>
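
To put numbers on the restarts in a log like this one, here is a rough Python sketch that counts the "No heartbeat from core client" exits and how many times each job was started again. It is just a line counter built around the two message formats visible above; the names `summarize_restarts` and `SAMPLE_LOG` are invented for this example.

```python
import re
from collections import Counter

HEARTBEAT_MARKER = "No heartbeat from core client"
START_RE = re.compile(r"Starting job (\d+),")

# Excerpt copied from the beginning of the result log above.
SAMPLE_LOG = """\
[18:15:27] Number of jobs = 8
[18:15:27] Starting job 0,CPU time has been restored to 0.000000.
23:14:10 (8980): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:14:21] Number of jobs = 8
[23:14:21] Starting job 0,CPU time has been restored to 0.000000.
09:38:47 (9108): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:39:09] Number of jobs = 8
[09:39:09] Starting job 0,CPU time has been restored to 0.000000.
"""


def summarize_restarts(stderr_text):
    """Return (heartbeat_exits, repeated_starts).

    repeated_starts maps a job number to how many times it was started
    again after its first start; jobs started only once are left out.
    """
    heartbeat_exits = 0
    starts = Counter()
    for line in stderr_text.splitlines():
        if HEARTBEAT_MARKER in line:
            heartbeat_exits += 1
        match = START_RE.search(line)
        if match:
            starts[int(match.group(1))] += 1
    repeated_starts = {job: n - 1 for job, n in starts.items() if n > 1}
    return heartbeat_exits, repeated_starts
```

Run against the full log above, this should report 13 heartbeat exits and repeated starts for jobs 0, 5 and 6. Note that job 0 restarts from a CPU time of 0.000000 every time, which suggests (but does not prove) that no checkpoint had been written before those exits.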

Warm regards from snowy Warsaw and happy crunching everyone!
[Jan 31, 2015 4:49:20 PM]
AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 376
Status: Offline
Re: Uneffective work: saving problems

I was getting heartbeat errors running CEP2 work on Win XP machines with slow hard drives (e.g. WD800AAJS). I had no problems running CEP2 work on Ubuntu machines with older/slower drives (e.g. WD2500KS). I ended up removing CEP2 from my Win XP machines and running all CEP2 work on Ubuntu, where I was able to run 4 units together with no problems.

Try setting your profile to allow only 1 CEP2 unit per host, if you have not done so already.

If the heartbeat errors still persist and jobs still get reset, unfortunately CEP2 may not be the project for you.
----------------------------------------
[Edit 5 times, last edit by AgrFan at Feb 1, 2015 1:30:08 AM]
[Feb 1, 2015 1:17:00 AM]