World Community Grid - View Thread - Uneffective work: saving problems

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: Uneffective work: saving problems

Thread Status: Active
Total posts in this thread: 14

This topic has been viewed 3449 times and has 13 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Uneffective work: saving problems

Thanks SekeRob,

just to clarify: the machine is put into hibernation, not rebooted. The "Keep the WUs in memory" option is selected as well; however, at the moment I am running only WCG on the CPU, so this task is not being switched off and on.

I have been crunching many BOINC projects for many years (albeit with pauses), and I think I can tell the difference between proper WU behaviour and misbehaviour when I spot it. If an i5 Haswell laptop (a pretty modern machine, though not a desktop) does not qualify, then I guess that could be communicated better.

The other cruncher's result log is as follows:

Result Log

Result Name: E228006_ 406_ S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_ s1_ 14_ 1--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[11:18:48] Number of jobs = 8
[11:18:48] Starting job 0,CPU time has been restored to 0.000000.
[19:42:27] Finished Job #0
[19:42:27] Starting job 1,CPU time has been restored to 17690.638201.
[20:15:36] Finished Job #1
[20:15:36] Starting job 2,CPU time has been restored to 18871.440970.
[20:46:53] Finished Job #2
[20:46:53] Starting job 3,CPU time has been restored to 19959.345144.
[21:29:15] Finished Job #3
[21:29:15] Starting job 4,CPU time has been restored to 21474.504856.
[21:59:52] Finished Job #4
[21:59:52] Starting job 5,CPU time has been restored to 22559.398211.
[22:26:01] Finished Job #5
[22:26:01] Starting job 6,CPU time has been restored to 23502.751858.
Application exited with RC = 0x1
[09:50:32] Finished Job #6
[09:50:32] Starting job 7,CPU time has been restored to 37075.057259.
[09:50:32] Skipping Job #7
09:50:41 (4884): called boinc_finish

</stderr_txt>
]]>
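The "CPU time has been restored to" values in a log like the one above appear to be cumulative, so differencing consecutive values gives a rough per-job CPU cost. A minimal Python sketch, assuming the line format shown in this thread; the function name `job_cpu_seconds` is made up for illustration and is not part of BOINC or CEP2:

```python
import re

# Matches lines such as:
# [19:42:27] Starting job 1,CPU time has been restored to 17690.638201.
START_RE = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+(?:\.\d+)?)"
)


def job_cpu_seconds(stderr_text):
    """Return {job: CPU seconds} by differencing consecutive restored values.

    Assumption: the value logged when job n+1 starts is the cumulative
    CPU time accumulated up to the end of job n.
    """
    restored = {}
    for line in stderr_text.splitlines():
        m = START_RE.search(line)
        if m:
            # A restarted job logs its start line again; keep the last value seen.
            restored[int(m.group(1))] = float(m.group(2))
    jobs = sorted(restored)
    return {a: restored[b] - restored[a] for a, b in zip(jobs, jobs[1:])}
```

Applied to the log above, job 0 accounts for about 17,690 CPU seconds and job 6 for about 13,572, while jobs 1 to 5 each take roughly 900 to 1,500. The last job has no following start line, so it gets no entry.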

Unfortunately, to add insult to injury, it looks like this WU likes to restart itself as well :/

2015-01-31 13:05:43 | World Community Grid | Task E228006_406_S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_s1_14_0 exited with zero status but no 'finished' file
2015-01-31 13:05:43 | World Community Grid | If this happens repeatedly you may need to reset the project.

Any ideas what I could do? Thanks for your help.

[Jan 31, 2015 12:51:10 PM]

cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline

Re: Uneffective work: saving problems

Aegis... yes, CEP2 is very difficult to crunch (and that's putting it nicely). Unfortunately, it is not a project where everybody can participate (without jumping through hoops to get WUs out the door). You really need a dedicated computer that runs 24x7, or you should only crunch CEP2 when you know that you aren't going to turn off your computer (along with some babysitting).

The fact that BOINC manager sucks and does not inform you when the last checkpoint has actually been done does not help either.

Actually, you can see when the WU last hit a checkpoint: in BOINC Manager, select the desired WU and then select "Properties" in the left-side menu. The field to look for in the popup is "CPU Time at last checkpoint".
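Assuming the same Properties popup also lists the task's current CPU time, the work that would be redone after an interruption is the difference between the two fields. A tiny Python sketch of that arithmetic; the function names and the example numbers are illustrative only, and the values have to be read off the dialog by hand:

```python
def unsaved_cpu_seconds(cpu_time, cpu_time_at_last_checkpoint):
    """CPU seconds that would have to be recomputed after an interruption."""
    if cpu_time_at_last_checkpoint > cpu_time:
        raise ValueError("checkpoint time cannot exceed current CPU time")
    return cpu_time - cpu_time_at_last_checkpoint


def format_hms(seconds):
    """Render a number of seconds as H:MM:SS for easier reading."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Hypothetical readings: 9000 s of CPU time, last checkpoint at 5400 s.
print(format_hms(unsaved_cpu_seconds(9000.0, 5400.0)))  # prints 1:00:00
```

If the difference is large, it may be worth waiting for the next checkpoint before hibernating the machine.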

CJSL

Crunching like there's no tomorrow...

----------------------------------------

I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team

[Jan 31, 2015 12:58:17 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Uneffective work: saving problems

The fact that BOINC manager sucks and does not inform you when the last checkpoint has actually been done does not help either.

Thank you, CJSL! How have I missed it? How long have we had this feature? I remember times when it was recommended to actually look at the modification time of the files... :)

Anyway, after one more restart the task finally finished the simulation successfully and decided that was enough. Looking at the log, it had such restarts before, after varying amounts of elapsed time. Well, it is not a stable simulation, I guess.

Result Name: E228006_ 406_ S.296.C36H22N6O2.WPJOEQODJJXVCT-UHFFFAOYSA-N.12_ s1_ 14_ 0--
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[18:15:27] Number of jobs = 8
[18:15:27] Starting job 0,CPU time has been restored to 0.000000.
23:14:10 (8980): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:14:21] Number of jobs = 8
[23:14:21] Starting job 0,CPU time has been restored to 0.000000.
09:38:47 (9108): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:39:09] Number of jobs = 8
[09:39:09] Starting job 0,CPU time has been restored to 0.000000.
09:39:54 (9676): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:40:17] Number of jobs = 8
[09:40:17] Starting job 0,CPU time has been restored to 0.000000.
13:39:43 (5460): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:39:52] Number of jobs = 8
[13:39:52] Starting job 0,CPU time has been restored to 0.000000.
[20:44:35] Finished Job #0
[20:44:35] Starting job 1,CPU time has been restored to 17415.656250.
[21:10:28] Finished Job #1
[21:10:28] Starting job 2,CPU time has been restored to 18890.718750.
[21:33:36] Finished Job #2
[21:33:36] Starting job 3,CPU time has been restored to 20209.812500.
[22:20:15] Finished Job #3
[22:20:15] Starting job 4,CPU time has been restored to 22740.093750.
[01:51:07] Finished Job #4
[01:51:07] Starting job 5,CPU time has been restored to 24750.359375.
12:21:04 (8608): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:21:11] Number of jobs = 8
[12:21:11] Starting job 5,CPU time has been restored to 24750.359375.
[12:58:44] Finished Job #5
[12:58:44] Starting job 6,CPU time has been restored to 26381.281250.
17:02:37 (9140): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[17:08:15] Number of jobs = 8
[17:08:15] Starting job 6,CPU time has been restored to 26381.281250.
18:42:36 (4248): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[18:42:43] Number of jobs = 8
[18:42:43] Starting job 6,CPU time has been restored to 26381.281250.
21:44:57 (10584): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[21:45:05] Number of jobs = 8
[21:45:05] Starting job 6,CPU time has been restored to 26381.281250.
01:18:37 (10288): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:18:51] Number of jobs = 8
[01:18:51] Starting job 6,CPU time has been restored to 26381.281250.
03:18:54 (10580): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[03:19:09] Number of jobs = 8
[03:19:09] Starting job 6,CPU time has been restored to 26381.281250.
12:04:40 (10384): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:04:53] Number of jobs = 8
[12:04:53] Starting job 6,CPU time has been restored to 26381.281250.
13:05:35 (10840): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:06:06] Number of jobs = 8
[13:06:06] Starting job 6,CPU time has been restored to 26381.281250.
13:34:45 (11496): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[13:35:23] Number of jobs = 8
[13:35:23] Starting job 6,CPU time has been restored to 26381.281250.
[14:05:29] Number of jobs = 8
[14:05:29] Starting job 6,CPU time has been restored to 26381.281250.
Application exited with RC = 0x1
[16:40:34] Finished Job #6
[16:40:34] Starting job 7,CPU time has been restored to 35171.781250.
[16:40:34] Skipping Job #7
16:40:43 (4316): called boinc_finish

</stderr_txt>
]]>
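To put numbers on the restarts in a log like this one, the heartbeat exits and the repeated "Starting job" lines can be counted. A minimal Python sketch, assuming the line formats shown above; `summarize_restarts` is a hypothetical helper, not a BOINC tool:

```python
import re
from collections import Counter

START_RE = re.compile(r"Starting job (\d+),CPU time has been restored to")
HEARTBEAT_MARK = "No heartbeat from core client"


def summarize_restarts(stderr_text):
    """Return (heartbeat_exits, extra_starts) for a CEP2-style stderr log.

    extra_starts maps a job number to how many times it was started again
    after its first start, i.e. how often it fell back to its checkpoint.
    """
    heartbeat_exits = 0
    starts = Counter()
    for line in stderr_text.splitlines():
        if HEARTBEAT_MARK in line:
            heartbeat_exits += 1
        m = START_RE.search(line)
        if m:
            starts[int(m.group(1))] += 1
    extra_starts = {job: n - 1 for job, n in sorted(starts.items()) if n > 1}
    return heartbeat_exits, extra_starts
```

Run on the log above, this gives 13 heartbeat exits, with job 0 started again 4 times, job 5 once and job 6 nine times. One of the job 6 restarts has no heartbeat message in front of it. Since job 0 and job 6 restore to the same CPU time on every restart, each of those interruptions appears to have discarded all progress made since the job began.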

Warm regards from snowy Warsaw and happy crunching everyone!

[Jan 31, 2015 4:49:20 PM]

AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 396
Status: Offline

Re: Uneffective work: saving problems

I was getting heartbeat errors running CEP2 work on Win XP machines with slow hard drives (e.g. WD800AAJS). I had no problems running CEP2 work on Ubuntu machines with older/slower drives (e.g. WD2500KS). I ended up removing CEP2 from my Win XP machines and running all CEP2 work on Ubuntu, where I was able to run 4 units at once with no problems.

Try setting your profile to allow only 1 CEP2 unit per host, if you have not done so already.

If the heartbeat errors still persist and jobs still get reset, unfortunately CEP2 may not be the project for you.

----------------------------------------

i5-10400 (Comet Lake, 6C/12T) @ 2.9 GHz
i5-7400 (Kaby Lake, 4C/4T) @ 3.0 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3330 (Ivy Bridge, 4C/4T) @ 3.0 GHz

----------------------------------------
[Edit 5 times, last edit by AgrFan at Feb 1, 2015 1:30:08 AM]

[Feb 1, 2015 1:17:00 AM]
