| World Community Grid Forums
Thread Status: Active Total posts in this thread: 118
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
And another encore for, yes, staggered starting of heavy I/O apps such as CEP2. How to do that: maybe get the agent to read a 'heavy' flag, then put a hold on all but one of these tasks and release the rest serially, waiting 5-10 minutes before each one. This applies to both block starting and restarting, after a power-up for instance. It's opt-in science, so who would be confused by this?
Of course Linux suffers much more from this particular 'heavy I/O' issue than Windows. Yes, I did write that! Efficiency on Linux is multiple percentage points worse than on Windows when it involves this science. |
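The staggering idea above can be sketched roughly as follows. This is only an illustration of the scheduling logic, not anything the BOINC agent actually does: the `Task` class, the `heavy_io` flag and `staggered_start_offsets` are all made-up names, and the 300-second gap is simply the low end of the suggested 5-10 minutes.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    heavy_io: bool  # hypothetical 'heavy' flag; not an actual BOINC field


def staggered_start_offsets(tasks, gap_seconds=300):
    """Return {task name: delay in seconds before the task may start}.

    Light tasks start immediately. Heavy-I/O tasks are released one at a
    time, gap_seconds apart, so only the first heavy task starts at once.
    """
    offsets = {}
    heavy_seen = 0
    for task in tasks:
        if task.heavy_io:
            offsets[task.name] = heavy_seen * gap_seconds
            heavy_seen += 1
        else:
            offsets[task.name] = 0
    return offsets


# Example: three heavy CEP2-style tasks and one light task on four cores.
tasks = [
    Task("cep2_a", True),
    Task("other_1", False),
    Task("cep2_b", True),
    Task("cep2_c", True),
]
offsets = staggered_start_offsets(tasks, gap_seconds=300)
for name, delay in offsets.items():
    print(f"{name}: start after {delay} s")
```

The same function would cover the restart-after-power-up case, since it only looks at which tasks are waiting to run, not at why they are waiting.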
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I've just had another example of a unit restarting, but I'm puzzled by the detailed timings. Anyone care to explain?
Event Log has these lines (I have the checkpoint_debug log flag on):
19/08/2014 12:14:48 | World Community Grid | [checkpoint] result BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 checkpointed
19/08/2014 12:19:25 | World Community Grid | Task BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 exited with zero status but no 'finished' file
19/08/2014 12:19:25 | World Community Grid | If this happens repeatedly you may need to reset the project.
19/08/2014 12:19:25 | World Community Grid | Computation for task BETA_E225108_701_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.10_s1_14_1 finished
19/08/2014 12:19:56 | World Community Grid | Starting task BETA_E225108_694_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.3_s1_14_1
19/08/2014 12:19:57 | World Community Grid | [checkpoint] result BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 checkpointed
19/08/2014 12:19:57 | World Community Grid | Started upload of BETA_E225108_701_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.10_s1_14_1_0
So, once again, one unit finishing (BETA_E225108_701_...) seems to cause the exit of another unit (BETA_E225108_20_...), at 12:19:25 (times are GMT+1).
The Result Log for _20_ has these lines:
[12:14:47] Finished Job #3
[12:14:47] Starting job 4,CPU time has been restored to 15872.742948.
12:19:54 (9208): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:19:56] Number of jobs = 8
[12:19:56] Starting job 4,CPU time has been restored to 15872.742948.
Note that the No heartbeat line is timed at 12:19:54, which is 29 seconds AFTER unit _20_ exited. Well, you might say that was reasonable; once the unit has exited it can't send the heartbeat signal, BOINC takes 29 or 30 seconds to detect that before reporting it; a couple of seconds later, it restarts the unit ([12:19:56] Starting job 4). BUT, it means that the No heartbeat warning is caused by the unit's earlier exit, not the other way round (e.g. BOINC forcing an exit because of failing to receive the heartbeat - although the heartbeat warning seems to suggest that, maybe as a fallback).
I'm left with no satisfactory explanation as to why _20_ exited at 12:19:25. OK, another unit finished and probably kicked off loads of I/O activity (note 32 seconds before the upload started), but why should that cause another Windows process to exit? I also think this is a case where staggered starts wouldn't have helped - all 4 cores were running well-staggered Beta CEP2 units when this happened. FWIW, _20_ went on to complete successfully (now in PVal). |
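For anyone wanting to check the arithmetic in the post above, the gaps between the quoted timestamps can be worked out like this. The times are copied from the two logs; the dictionary keys and the `seconds_between` helper are just labels made up for this sketch.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(earlier, later):
    """Whole seconds from one HH:MM:SS time to a later one on the same day."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())


# Times as quoted from the Event Log and the Result Log for unit _20_.
events = {
    "exit_logged": "12:19:25",     # Event Log: exited with zero status
    "no_heartbeat": "12:19:54",    # Result Log: No heartbeat ... exiting
    "restart": "12:19:56",         # Result Log: Starting job 4
    "upload_started": "12:19:57",  # Event Log: Started upload
}

exit_to_no_heartbeat = seconds_between(events["exit_logged"], events["no_heartbeat"])
no_heartbeat_to_restart = seconds_between(events["no_heartbeat"], events["restart"])
exit_to_upload = seconds_between(events["exit_logged"], events["upload_started"])

print(exit_to_no_heartbeat, no_heartbeat_to_restart, exit_to_upload)
```

This only confirms the ordering and the gaps; it says nothing about why the Event Log stamped the exit at 12:19:25.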
||
|
|
Mamajuanauk
Master Cruncher United Kingdom Joined: Dec 15, 2012 Post Count: 1900 Status: Offline Project Badges:
|
And another encore for, yes, staggered starting of heavy I/O apps such as CEP2. How to do that: maybe get the agent to read a 'heavy' flag, then put a hold on all but one of these tasks and release the rest serially, waiting 5-10 minutes before each one. This applies to both block starting and restarting, after a power-up for instance. It's opt-in science, so who would be confused by this?
So does my error indicate there were too many tasks all running that started at the same time, causing a bottleneck with writes to the hdd?
Of course Linux suffers much more from this particular 'heavy I/O' issue than Windows. Yes, I did write that! Efficiency on Linux is multiple percentage points worse than on Windows when it involves this science.
Or am I reading this wrong?
Mamajuanauk is the Name! Crunching is the Game!
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I also think this is a case where staggered starts wouldn't have helped - all 4 cores were running well-staggered Beta CEP2 units when this happened.
I agree with you tonyh205. From what I've observed there can be enough lockout between sub-jobs to cause this problem, not just at the start of the WU. This is why I suggested the need to either multi-task (thread) the heartbeat code (so that it can't lock out because of an I/O wait) or extend the 30 seconds by some considerable margin (but that would be just sticky tape and not a proper fix). |
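The "put the heartbeat code on its own thread" suggestion can be illustrated with a toy example. This is not how BOINC or the CEP2 wrapper is written; `HeartbeatWatch` and `listen` are made-up names, the listener merely stands in for whatever receives heartbeats from the core client, and the 30-second timeout is shrunk to half a second so the demo runs quickly.

```python
import threading
import time


class HeartbeatWatch:
    """Thread-safe record of when the last heartbeat was seen."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def beat(self):
        with self._lock:
            self._last = time.monotonic()

    def expired(self):
        with self._lock:
            return time.monotonic() - self._last > self.timeout_seconds


def listen(watch, stop, interval):
    """Stand-in for receiving heartbeats; runs on its own thread."""
    while not stop.is_set():
        watch.beat()
        stop.wait(interval)


watch = HeartbeatWatch(timeout_seconds=0.5)
stop = threading.Event()
listener = threading.Thread(target=listen, args=(watch, stop, 0.05), daemon=True)
listener.start()

# Stand-in for the worker being stuck in an I/O wait longer than the timeout.
time.sleep(1.0)
still_alive = not watch.expired()
print("heartbeat still fresh after the stall:", still_alive)

stop.set()
listener.join()
```

Because the heartbeat bookkeeping does not share a thread with the stalled work, the stall alone does not make the heartbeat look stale. Whether that is the actual cause of the exits discussed here is a separate question.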
||
|
|
littlepeaks
Veteran Cruncher USA Joined: Apr 28, 2007 Post Count: 748 Status: Offline Project Badges:
|
So does my error indicate there were too many tasks all running that started at the same time, causing a bottleneck with writes to the hdd?
I posted a similar problem to the CEP2 forum last summer. The main cause, in my case, seemed to be that I was running an AV program called "Immunet 3.0" which seemed to be doing a lot of its own reads and writes to the HDD at the same time CEP2 was "doing its thing" at the beginning of a WU. BTW, received one beta last night -- no problems -- now in PV status, but ran for about 7.5 hours. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I had read your earlier post, Apis T, but it still doesn't explain the unit exit message in the Event Log at the same second as the completion of another unit. It looks there as if the heartbeat code or the 30 second wait was a consequence and not the problem.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I had read your earlier post, Apis T, but it still doesn't explain the unit exit message in the Event Log at the same second as the completion of another unit. It looks there as if the heartbeat code or the 30 second wait was a consequence and not the problem.
Agreed. I can't explain that unless the message time also hangs at the start of the 30 seconds. One for the techs to comment on (though I doubt they will). |
||
|
|
Mamajuanauk
Master Cruncher United Kingdom Joined: Dec 15, 2012 Post Count: 1900 Status: Offline Project Badges:
|
So does my error indicate there were too many tasks all running that started at the same time, causing a bottleneck with writes to the hdd?
I posted a similar problem to the CEP2 forum last summer. The main cause, in my case, seemed to be that I was running an AV program called "Immunet 3.0" which seemed to be doing a lot of its own reads and writes to the HDD at the same time CEP2 was "doing its thing" at the beginning of a WU. BTW, received one beta last night -- no problems -- now in PV status, but ran for about 7.5 hours.
I'll remember that for next time... Thanks
Mamajuanauk is the Name! Crunching is the Game!
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Nothing else running on this machine, so with a large number of WUs all starting at the same time, sounds likely it caused the problem...
However, your Result Log suggests that the exit occurred in Job#6, about 14 hours into the workunit's processing. The number of (Beta) CEP2 units running simultaneously may well be a factor, but it's unlikely after 14 hours that their starting at the same time has much influence. If you can still check the BOINC Event Log for messages at that time (09:08:06 or soon after), you might find that another CEP2 WU did start or finish then and caused the exit. |
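Checking the Event Log around a given moment can be sketched as below. The 09:08:06 log is not shown in this thread, so the sample lines are ones quoted earlier in the thread, and the dd/mm/yyyy timestamp layout is assumed from them (it may differ with other locale settings). The function name `lines_near` is made up for this sketch.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%d/%m/%Y %H:%M:%S"  # assumed from the lines quoted in this thread


def lines_near(log_lines, target, window_seconds=60):
    """Return the log lines stamped within window_seconds of target."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in log_lines:
        stamp_text = line.split(" | ", 1)[0]
        try:
            stamp = datetime.strptime(stamp_text, STAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - target) <= window:
            hits.append(line)
    return hits


sample = [
    "19/08/2014 12:14:48 | World Community Grid | [checkpoint] result "
    "BETA_E225108_20_S.328.C44H28N4O1.RLFMUDRFIZQDNP-UHFFFAOYSA-N.3_s1_14_0 checkpointed",
    "19/08/2014 12:19:25 | World Community Grid | "
    "If this happens repeatedly you may need to reset the project.",
    "19/08/2014 12:19:57 | World Community Grid | Started upload of "
    "BETA_E225108_701_S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.10_s1_14_1_0",
]

hits = lines_near(sample, datetime(2014, 8, 19, 12, 19, 25), window_seconds=60)
for line in hits:
    print(line)
```

Any "Starting task" or "Computation for task ... finished" line that turns up in the window would be the thing to look for.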
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The main cause, in my case, seemed to be that I was running an AV program called "Immunet 3.0" which seemed to be doing a lot of its own reads and writes to the HDD at the same time CEP2 was "doing its thing" at the beginning of a WU.
The security built into BOINC makes it perfectly acceptable to remove the BOINC directory from the AV scan, if your tool allows that. |
||
|
|
|