World Community Grid Forums
Thread Status: Active Total posts in this thread: 14
Author |
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Thanks SekeRob,
just to clarify: the machine is put into hibernation, not rebooted. The "Keep the WUs in memory" option is selected as well; however, at the moment I am running only WCG on the CPU, so this task is not being switched off and on. I have been crunching many BOINC projects for many years (albeit with pauses), and I think I can tell the difference between proper WU behaviour and misbehaviour when I spot it. If an i5 Haswell laptop (a pretty modern machine, though not a desktop) does not qualify, then I guess that could be better communicated. The other cruncher's result log is as follows:
Result Log
Result Name: E228006_406_S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_s1_14_1--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[11:18:48] Number of jobs = 8
[11:18:48] Starting job 0,CPU time has been restored to 0.000000.
[19:42:27] Finished Job #0
[19:42:27] Starting job 1,CPU time has been restored to 17690.638201.
[20:15:36] Finished Job #1
[20:15:36] Starting job 2,CPU time has been restored to 18871.440970.
[20:46:53] Finished Job #2
[20:46:53] Starting job 3,CPU time has been restored to 19959.345144.
[21:29:15] Finished Job #3
[21:29:15] Starting job 4,CPU time has been restored to 21474.504856.
[21:59:52] Finished Job #4
[21:59:52] Starting job 5,CPU time has been restored to 22559.398211.
[22:26:01] Finished Job #5
[22:26:01] Starting job 6,CPU time has been restored to 23502.751858.
Application exited with RC = 0x1
[09:50:32] Finished Job #6
[09:50:32] Starting job 7,CPU time has been restored to 37075.057259.
[09:50:32] Skipping Job #7
09:50:41 (4884): called boinc_finish
</stderr_txt>
]]>
Unfortunately, to add insult to injury, it looks like this WU likes to restart itself as well :/
2015-01-31 13:05:43 | World Community Grid | Task E228006_406_S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_s1_14_0 exited with zero status but no 'finished' file
2015-01-31 13:05:43 | World Community Grid | If this happens repeatedly you may need to reset the project.
Any ideas what I could do? Thanks for the help. |
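For anyone who wants to compare logs like the one above, here is a minimal Python sketch that estimates how much CPU time each job took. It assumes (this is my reading of the log, not something the project documents here) that the value after "CPU time has been restored to" is the cumulative CPU time at the moment a job starts, so the difference between two consecutive jobs approximates the CPU seconds of the earlier one. The function name `cpu_seconds_per_job` is made up for illustration.

```python
import re

# Matches entries such as:
#   "Starting job 1,CPU time has been restored to 17690.638201."
START = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+(?:\.\d+)?)"
)

def cpu_seconds_per_job(stderr_text):
    """Return {job_number: CPU seconds} for every job that has a successor.

    Assumption: the restored value is the cumulative CPU time when the
    job starts, so consecutive differences give per-job CPU time.
    """
    restored = {}
    for job, cpu in START.findall(stderr_text):
        # A restarted job repeats its "Starting job" entry; keep the first.
        restored.setdefault(int(job), float(cpu))
    jobs = sorted(restored)
    return {a: restored[b] - restored[a] for a, b in zip(jobs, jobs[1:])}
```

Pasting the stderr text above into this function puts job 0 at roughly 17,690 CPU seconds, which matches the value restored at the start of job 1. The last job never gets an entry because nothing follows it.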
||
|
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline |
Aegis... yes, CEP2 is very difficult to crunch (that's putting it nicely).
"The fact that BOINC manager sucks and does not inform you when the last checkpoint has actually been done does not help either."
Actually you can see when the latest checkpoint was hit by the WU: In BOINC Manager select the desired WU and then select "Properties" on the left side menu. The field (in the popup) you would be looking for is "CPU Time at last checkpoint".
CJSL
Crunching like there's no tomorrow... |
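The same checkpoint figure can probably be read without the Manager. The sketch below is only a rough illustration: it assumes you have saved the text printed by `boinccmd --get_tasks` and that each task block contains `name:`, `checkpoint CPU time:` and `current CPU time:` lines. Those labels are from memory and may differ between client versions, and `checkpoint_times` is a hypothetical helper name.

```python
def checkpoint_times(tasks_text):
    """Map task name -> {"checkpoint CPU time": s, "current CPU time": s}.

    Assumption: tasks_text looks like `boinccmd --get_tasks` output,
    with one "label: value" pair per line (labels may vary by version).
    """
    wanted = ("checkpoint CPU time", "current CPU time")
    tasks = {}
    name = None
    for line in tasks_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "label: value" line
        value = value.strip()
        if key == "name":
            # A "name:" line opens a new task block.
            name = value
            tasks[name] = {}
        elif name is not None and key in wanted:
            tasks[name][key] = float(value)
    return tasks
```

The gap between the two values for a task is the CPU time that would be lost if the task were killed right now, which is the number that matters before hibernating.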
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The fact that BOINC manager sucks and does not inform you when the last checkpoint has actually been done does not help either. Actually you can see when the latest checkpoint was hit by the WU: In BOINC Manager select the desired WU and then select "Properties" on the left side menu. The field (in the popup) you would be looking for is "CPU Time at last checkpoint".
Thank you, CJSL! How did I miss it? How long have we had this feature? I remember times when it was recommended to actually look at the modification time of the files... :) Anyway, after one more restart the task finally finished the simulation successfully and decided enough was enough. Looking back, it had such restarts before, after various amounts of elapsed time. Well, it is not a stable simulation, I guess.
Result Name: E228006_406_S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_s1_14_0--
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[18:15:27] Number of jobs = 8
[18:15:27] Starting job 0,CPU time has been restored to 0.000000.
23:14:10 (8980): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:14:21] Number of jobs = 8
[23:14:21] Starting job 0,CPU time has been restored to 0.000000.
09:38:47 (9108): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:39:09] Number of jobs = 8
[09:39:09] Starting job 0,CPU time has been restored to 0.000000.
09:39:54 (9676): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:40:17] Number of jobs = 8
[09:40:17] Starting job 0,CPU time has been restored to 0.000000.
13:39:43 (5460): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:39:52] Number of jobs = 8
[13:39:52] Starting job 0,CPU time has been restored to 0.000000.
[20:44:35] Finished Job #0
[20:44:35] Starting job 1,CPU time has been restored to 17415.656250.
[21:10:28] Finished Job #1
[21:10:28] Starting job 2,CPU time has been restored to 18890.718750.
[21:33:36] Finished Job #2
[21:33:36] Starting job 3,CPU time has been restored to 20209.812500.
[22:20:15] Finished Job #3
[22:20:15] Starting job 4,CPU time has been restored to 22740.093750.
[01:51:07] Finished Job #4
[01:51:07] Starting job 5,CPU time has been restored to 24750.359375.
12:21:04 (8608): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:21:11] Number of jobs = 8
[12:21:11] Starting job 5,CPU time has been restored to 24750.359375.
[12:58:44] Finished Job #5
[12:58:44] Starting job 6,CPU time has been restored to 26381.281250.
17:02:37 (9140): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[17:08:15] Number of jobs = 8
[17:08:15] Starting job 6,CPU time has been restored to 26381.281250.
18:42:36 (4248): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[18:42:43] Number of jobs = 8
[18:42:43] Starting job 6,CPU time has been restored to 26381.281250.
21:44:57 (10584): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[21:45:05] Number of jobs = 8
[21:45:05] Starting job 6,CPU time has been restored to 26381.281250.
01:18:37 (10288): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:18:51] Number of jobs = 8
[01:18:51] Starting job 6,CPU time has been restored to 26381.281250.
03:18:54 (10580): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[03:19:09] Number of jobs = 8
[03:19:09] Starting job 6,CPU time has been restored to 26381.281250.
12:04:40 (10384): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:04:53] Number of jobs = 8
[12:04:53] Starting job 6,CPU time has been restored to 26381.281250.
13:05:35 (10840): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:06:06] Number of jobs = 8
[13:06:06] Starting job 6,CPU time has been restored to 26381.281250.
13:34:45 (11496): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:35:23] Number of jobs = 8
[13:35:23] Starting job 6,CPU time has been restored to 26381.281250.
[14:05:29] Number of jobs = 8
[14:05:29] Starting job 6,CPU time has been restored to 26381.281250.
Application exited with RC = 0x1
[16:40:34] Finished Job #6
[16:40:34] Starting job 7,CPU time has been restored to 35171.781250.
[16:40:34] Skipping Job #7
16:40:43 (4316): called boinc_finish
</stderr_txt>
]]>
Warm regards from snowy Warsaw and happy crunching everyone! |
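To see at a glance which jobs keep getting thrown back by the heartbeat timeout, a log like the one above can be tallied with a few lines of Python. This is just a sketch: it attributes each "No heartbeat from core client" entry to the most recent "Starting job N" entry before it, and `heartbeat_exits_per_job` is a name invented for this example.

```python
import re
from collections import Counter

# Either a job start (captures the job number) or a heartbeat-timeout exit.
EVENT = re.compile(r"Starting job (\d+),|No heartbeat from core client")

def heartbeat_exits_per_job(stderr_text):
    """Count heartbeat-timeout exits, keyed by the job that was running."""
    counts = Counter()
    current_job = None
    for match in EVENT.finditer(stderr_text):
        if match.group(1) is not None:
            current_job = int(match.group(1))
        elif current_job is not None:
            # Timeout seen while current_job was the last job started.
            counts[current_job] += 1
    return dict(counts)
```

Jobs with no timeouts do not appear in the result. Since every restart resumes from the same restored CPU time, a high count for one job means all the work done between those restarts was thrown away each time.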
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
I was getting heartbeat errors running CEP2 work on Win XP machines with slow hard drives (e.g. WD800AAJS). I had no problems running CEP2 work on Ubuntu machines with older/slower drives (e.g. WD2500KS). I ended up removing CEP2 from my Win XP machines and running all CEP2 work on Ubuntu, where I was able to run 4 units together with no problems.
Try setting your profile to allow only 1 CEP2 unit per host, if you have not done so already. If the heartbeat errors persist and jobs still get reset, unfortunately CEP2 may not be the project for you.
[Edit 5 times, last edit by AgrFan at Feb 1, 2015 1:30:08 AM] |
||
|