World Community Grid - View Thread - Clean Energy Project - Phase 2 BETA test new workunits

Thread: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]


Thread Status: Active
Total posts in this thread: 118


This topic has been viewed 26679 times and has 117 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Another encore for, yes, staggered starting of heavy I/O apps such as CEP2. How to do that: maybe have the agent read a 'heavy' flag, hold all but one of the flagged tasks, and then release them one at a time, waiting 5-10 minutes between each. This would apply to both block starts and restarts, after a power-up for instance. It's opt-in science, so who would be confused by this?
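A rough sketch of the staggering idea, in Python. This is not how the BOINC agent works today; the `heavy_io` flag, the `Task` class and the `staggered_start_times` helper are all hypothetical names made up for illustration. It only computes when each task would be released, assuming a 5-minute gap between heavy tasks:

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    heavy_io: bool  # hypothetical 'heavy' flag the agent would read


def staggered_start_times(tasks, gap_seconds=300):
    """Return {task name: start offset in seconds}.

    Light tasks start at once. Heavy tasks are released one at a time,
    each gap_seconds after the previous heavy task.
    """
    offsets = {}
    heavy_seen = 0
    for task in tasks:
        if task.heavy_io:
            offsets[task.name] = heavy_seen * gap_seconds
            heavy_seen += 1
        else:
            offsets[task.name] = 0
    return offsets


# Made-up task names: three heavy tasks and one light one.
tasks = [
    Task("cep2_a", True),
    Task("other_b", False),
    Task("cep2_c", True),
    Task("cep2_d", True),
]
schedule = staggered_start_times(tasks, gap_seconds=300)
for name, offset in schedule.items():
    print(f"{name}: start after {offset} s")
```

The same calculation would cover a restart after power-up: feed in every task waiting to resume and release them by offset.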

Of course Linux suffers much more from this particular 'heavy I/O' issue than Windows. Yes, I did write that! Efficiency on Linux is multiple percentage points worse than on Windows when it involves this science.

[Aug 19, 2014 2:03:28 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I've just had another example of a unit restarting, but I'm puzzled by the detailed timings. Anyone care to explain?

Event Log has these lines (I have the checkpoint_debug log flag on):

19/08/2014 12:14:48 | World Community Grid | [checkpoint] result BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 checkpointed
19/08/2014 12:19:25 | World Community Grid | Task BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 exited with zero status but no 'finished' file
19/08/2014 12:19:25 | World Community Grid | If this happens repeatedly you may need to reset the project.
19/08/2014 12:19:25 | World Community Grid | Computation for task BETA_E225108_701_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.10_s1_14_1 finished
19/08/2014 12:19:56 | World Community Grid | Starting task BETA_E225108_694_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.3_s1_14_1
19/08/2014 12:19:57 | World Community Grid | [checkpoint] result BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 checkpointed
19/08/2014 12:19:57 | World Community Grid | Started upload of BETA_E225108_701_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.10_s1_14_1_0

So, once again, one unit finishing (BETA_E225108_701_...) seems to cause the exit of another unit (BETA_E225108_20_...), at 12:19:25 (times are GMT+1).

The Result Log for _20_ has these lines:

[12:14:47] Finished Job #3
[12:14:47] Starting job 4,CPU time has been restored to 15872.742948.
12:19:54 (9208): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:19:56] Number of jobs = 8
[12:19:56] Starting job 4,CPU time has been restored to 15872.742948.

Note that the No heartbeat line is timed at 12:19:54, which is 29 seconds AFTER unit _20_ exited. Well, you might say that was reasonable; once the unit has exited it can't send the heartbeat signal, BOINC takes 29 or 30 seconds to detect that before reporting it; a couple of seconds later, it restarts the unit ([12:19:56] Starting job 4). BUT, it means that the No heartbeat warning is caused by the unit's earlier exit, not the other way round (e.g. BOINC forcing an exit because of failing to receive the heartbeat - although the heartbeat warning seems to suggest that, maybe as a fallback).
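For anyone who wants to check the arithmetic, here is a small Python sketch that works out the gaps between the timestamps quoted above. The variable names are mine; the times are taken from the Event Log and Result Log lines in this post:

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start, end):
    """Whole seconds from start to end, both given as HH:MM:SS on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


exit_logged = "12:19:25"     # Event Log: _20_ exited, _701_ finished
no_heartbeat = "12:19:54"    # Result Log: "No heartbeat from core client"
restart = "12:19:56"         # Result Log: "Starting job 4" again
upload_started = "12:19:57"  # Event Log: upload of the _701_ result

exit_to_heartbeat = seconds_between(exit_logged, no_heartbeat)
heartbeat_to_restart = seconds_between(no_heartbeat, restart)
exit_to_upload = seconds_between(exit_logged, upload_started)

print(exit_to_heartbeat, heartbeat_to_restart, exit_to_upload)  # 29 2 32
```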

I'm left with no satisfactory explanation as to why _20_ exited at 12:19:25. OK, another unit finished and probably kicked off loads of I/O activity (note 32 seconds before the upload started), but why should that cause another Windows process to exit?

I also think this is a case where staggered starts wouldn't have helped - all 4 cores were running well-staggered Beta CEP2 units when this happened.

FWIW, _20_ went on to complete successfully (now in PVal).

[Aug 19, 2014 3:34:06 PM]

Mamajuanauk
Master Cruncher
United Kingdom
Joined: Dec 15, 2012
Post Count: 1900
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

So does my error indicate there were too many tasks all running that started at the same time, causing a bottleneck with writes to the hdd?

Or am I reading this wrong?

----------------------------------------

Mamajuanauk is the Name! Crunching is the Game!

[Aug 19, 2014 3:46:16 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I also think this is a case where staggered starts wouldn't have helped - all 4 cores were running well-staggered Beta CEP2 units when this happened.

I agree with you tonyh205.

From what I've observed there can be enough lockout between sub-jobs to cause this problem, not just at the start of the WU. This is why I suggested the need to either multi-task (thread) the heartbeat code (so that it can't lock-out because of an I/O wait) or extend the 30 seconds by some considerable margin (but that would be just sticky tape and not a proper fix).
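To illustrate what I mean by threading the heartbeat, here is a generic Python sketch. It says nothing about how BOINC or the CEP2 wrapper are actually written; `HeartbeatThread` and `blocking_io` are hypothetical names, and the `time.sleep` call merely stands in for a long disk wait. The point is only that a heartbeat on its own thread keeps ticking while the main thread is stuck:

```python
import threading
import time


class HeartbeatThread:
    """Ticks a heartbeat on a fixed interval from its own thread."""

    def __init__(self, interval):
        self.interval = interval
        self.beats = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.interval):
            self.beats += 1  # stand-in for the real heartbeat handling

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def blocking_io(seconds):
    """Stand-in for a main-thread I/O wait (assumption: it blocks only this thread)."""
    time.sleep(seconds)


hb = HeartbeatThread(interval=0.05)
hb.start()
blocking_io(0.5)  # main thread is "stuck" here
hb.stop()
print(f"heartbeats handled during the blocked I/O: {hb.beats}")
```

Whether the real lock-out is in the application, the client or the OS is exactly what we don't know, so treat this as a picture of the idea and not a diagnosis.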

[Aug 19, 2014 5:12:33 PM]

littlepeaks
Veteran Cruncher
USA
Joined: Apr 28, 2007
Post Count: 748
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

So does my error indicate there were too many tasks all running that started at the same time, causing a bottleneck with writes to the hdd?

I posted a similar problem to the CEP2 forum last summer.

The main cause, in my case, seemed to be that I was running an AV program called "Immunet 3.0" which seemed to be doing a lot of its own reads and writes to the HDD at the same time CEP2 was "doing its thing" at the beginning of a WU.

BTW, received one beta last night -- no problems -- now in PV status, but ran for about 7.5 hours.

[Aug 19, 2014 5:15:00 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I had read your earlier post, Apis T, but it still doesn't explain the unit exit message in the Event Log at the same second as the completion of another unit. It looks there as if the heartbeat code or the 30 second wait was a consequence and not the problem.

[Aug 19, 2014 5:24:35 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Agreed. I can't explain that unless the message time also hangs at the start of the 30 seconds.

One for the techs to comment on (though I doubt they will).

[Aug 19, 2014 5:32:46 PM]

Mamajuanauk
Master Cruncher
United Kingdom
Joined: Dec 15, 2012
Post Count: 1900
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

So does my error indicate there were too many tasks all running that started at the same time, causing a bottleneck with writes to the hdd?

Nothing else was running on this machine, so with a large number of WUs all starting at the same time, it sounds likely that this caused the problem...

I'll remember that for next time...

Thanks

----------------------------------------

Mamajuanauk is the Name! Crunching is the Game!

[Aug 19, 2014 7:10:19 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Nothing else was running on this machine, so with a large number of WUs all starting at the same time, it sounds likely that this caused the problem...

However, your Result Log suggests that the exit occurred in Job#6, about 14 hours into the workunit's processing. The number of (Beta) CEP2 units running simultaneously may well be a factor, but it's unlikely after 14 hours that their starting at the same time has much influence. If you can still check the BOINC Event Log for messages at that time (09:08:06 or soon after), you might find that another CEP2 WU did start or finish then and caused the exit.
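If scrolling through the Event Log by hand is a pain, something like the following Python sketch could pick out task starts, finishes and exits near a given time. It assumes the log has been saved as text in the `date time | project | message` layout shown earlier in this thread; `parse_line` and `events_near` are names I made up, and the sample lines are the ones from my earlier post, not from your machine:

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Split one Event Log line into (timestamp, project, message)."""
    stamp, project, message = [part.strip() for part in line.split("|", 2)]
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def events_near(lines, when, window_seconds=60,
                keywords=("Starting task", "finished", "exited")):
    """Return (timestamp, message) pairs within the window that mention a keyword."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        stamp, _project, message = parse_line(line)
        if abs(stamp - when) <= window and any(k in message for k in keywords):
            hits.append((stamp, message))
    return hits


sample = [
    "19/08/2014 12:14:48 | World Community Grid | [checkpoint] result BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 checkpointed",
    "19/08/2014 12:19:25 | World Community Grid | Task BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 exited with zero status but no 'finished' file",
    "19/08/2014 12:19:25 | World Community Grid | If this happens repeatedly you may need to reset the project.",
    "19/08/2014 12:19:25 | World Community Grid | Computation for task BETA_E225108_701_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.10_s1_14_1 finished",
    "19/08/2014 12:19:56 | World Community Grid | Starting task BETA_E225108_694_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.3_s1_14_1",
]

hits = events_near(sample, datetime(2014, 8, 19, 12, 19, 25))
for stamp, message in hits:
    print(stamp.time(), message)
```

For your case you would pass the time of the exit from your Result Log in place of 12:19:25.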

[Aug 19, 2014 7:52:20 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

The main cause, in my case, seemed to be that I was running an AV program called "Immunet 3.0" which seemed to be doing a lot of its own reads and writes to the HDD at the same time CEP2 was "doing its thing" at the beginning of a WU.

The security built into BOINC makes it perfectly acceptable to exclude the BOINC directory from the AV scan, if your tool allows that.

[Aug 19, 2014 7:54:19 PM]
