| World Community Grid Forums
Thread Status: Active Total posts in this thread: 118
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Two observations are still unexplained.
1) All 3 "exits with zero status but no finish file" occurred with the same time stamp, to the second, as another unit finishing or starting. That sounds like a BOINC or local system issue; it would be too much of a coincidence otherwise.
2) All 3 of the units that exited then restarted on the same machine. Not a normal situation! |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I'm not sure whether this is specific to the beta, but I've noticed more "No heartbeat from core client for 30 sec - exiting" messages than I used to get on one of my machines. So this evening I spent some time watching it.
I observed both active beta tasks reset themselves while I was copying a large video file from the system disk (the one that BOINC uses) to a removable disk. It occurred to me that, even though the CPU was hardly being used, the BOINC disk was obviously being read continuously. If the task communication is not threaded and one of the beta tasks tries to do some I/O, its low priority would put it into a wait state until the file has finished copying (something that took several minutes). I therefore conclude that either (a) the inter-task "heartbeat" communication needs to be threaded so that it doesn't wait on I/O, or (b, and not as good) the 30 seconds needs to be increased quite considerably. Just my 2p'th. |
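Option (a) in the post above can be sketched in Python roughly as follows. This is only an illustration and not BOINC's code: `HeartbeatMonitor`, `run_listener`, `start_listener` and the `receive` callable are hypothetical names, and the 30-second default simply mirrors the timeout quoted in the log message. The idea is that a dedicated thread records heartbeats, so a worker that is stuck waiting on disk I/O does not stop them from being noticed.

```python
import threading
import time


class HeartbeatMonitor:
    """Remembers when the last heartbeat arrived (hypothetical helper)."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout          # seconds allowed between heartbeats
        self._clock = clock             # injectable so it can be tested
        self._lock = threading.Lock()
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat has just been received."""
        with self._lock:
            self._last_beat = self._clock()

    def expired(self):
        """Return True if no heartbeat arrived within the timeout."""
        with self._lock:
            return self._clock() - self._last_beat > self.timeout


def run_listener(monitor, receive, stop):
    """Loop until `stop` is set, recording every heartbeat.

    `receive` is an assumed callable that waits briefly and returns
    True when a heartbeat message was seen.
    """
    while not stop.is_set():
        if receive():
            monitor.beat()


def start_listener(monitor, receive):
    """Run the listener on its own thread, separate from the worker."""
    stop = threading.Event()
    thread = threading.Thread(
        target=run_listener, args=(monitor, receive, stop), daemon=True
    )
    thread.start()
    return thread, stop
```

With this split, the compute/I-O worker only needs to call `monitor.expired()` between work steps; a slow file copy delays the worker but not the thread that records heartbeats.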
||
|
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 823 Status: Offline Project Badges:
|
Mine errored out at around 1 hr 20 min, but the log says:
Computation for task BETA_E225106_546_S.326.C37H26N2O4S2.KRRWRYCZFPKTNW-UHFFFAOYSA-N.20_s1_14_4 finished
Mac laptop, OS 10.9.4, 2.4 GHz, i5 |
||
|
|
KWSN - A Shrubbery
Master Cruncher Joined: Jan 8, 2006 Post Count: 1585 Status: Offline |
No idea how many my systems grabbed. The Results Status page shows 14 pages' worth. A few have ended rather early.
Hopefully you'll get some useful data off the results.
Distributed computing volunteer since September 27, 2000 |
||
|
|
KLiK
Master Cruncher Croatia Joined: Nov 13, 2006 Post Count: 3108 Status: Offline Project Badges:
|
Still waiting for the results:
BETA_E225108_236_S.328.C46H31N3.OZIJTKWPTJUCKQ-UHFFFAOYSA-N.7_s1_14_1-- p4l-fsc1410a In Progress 8/18/14 17:49:35 8/22/14 17:49:35 0.00 / 0.00 0.0 / 0.0
BETA_E225108_70_S.328.C41H25N7O1.JFONBYKDRGSVKP-UHFFFAOYSA-N.12_s1_14_1-- VS4 In Progress 8/18/14 17:46:52 8/22/14 17:46:52 0.00 / 0.00 0.0 / 0.0
BETA_E225106_551_S.326.C37H25N1O5S2.KZXWWNDVNPIUHY-UHFFFAOYSA-N.5_s1_14_0-- p4l-fsc1410a In Progress 8/15/14 16:42:48 8/19/14 16:42:48 0.00 / 0.00 0.0 / 0.0
One is on a laptop, and one is on a Xeon server machine I use at home. |
||
|
|
astroWX
Advanced Cruncher USA Joined: Sep 1, 2007 Post Count: 56 Status: Offline Project Badges:
|
My farm caught a grand total of, count 'em, one task.
Nothing to add to what has already been posted. Task ran 5:20:46 on i5-3550 (Ivy Bridge), no problems. Four upload files, no problems. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Now that I can easily see the Result Log for the units that exited with zero status but no finish file, those exits were all caused by "No heartbeat from core client for 30 sec". In all 3 cases, they were at the same time as another unit starting or finishing. After restarting, all 3 continued successfully to finish during Job#6 with RC = 0x1 and are now in PVal state.
Units that ran on the i5-750 completed either during Job#0 in 1.2 hours or during Job#6 in 5.2 to 8.9 hours (with first checkpoint after 2.5 to 4 hours). On an i7-4770K, completions took 4.9 to 6.8 hours, with similar first checkpoints. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The heartbeat issue is a classic case of point 2 in your previous post: the zero status is an indicator, and too many of them cause a task to abort as well. Again, yes again, we need staggered starting. Any time a device starts up, or has been out of work and then pulls a new set of cep2 tasks, that is the number one cause of heartbeat failure and/or dreadful efficiency, with all the tasks competing for access to the storage area.
This I will repeat until I'm sick of it! The development ticket has been in for a long time now, but they would rather spend an amazing amount of effort on transmitting video presentations in the agent notices than focus on getting science computed with the least possible failure. Something to bring up at the SZTAKI workshop! The projects advocated for climate change mitigation, which cep2 is related to, are not going to be light; they are data hungry and their models keep growing.
[Edit 1 times, last edit by Former Member at Aug 19, 2014 9:32:47 AM] |
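The staggered starting asked for above could look something like this minimal Python sketch. It is an illustration only, not a BOINC feature or setting: `staggered_offsets`, `launch_staggered` and the 60-second interval in the usage line are hypothetical, and each "task" is just a callable standing in for starting one work unit.

```python
import time


def staggered_offsets(task_count, interval_seconds):
    """Return the start delay (in seconds) for each of `task_count` tasks."""
    if task_count < 0 or interval_seconds < 0:
        raise ValueError("task_count and interval_seconds must be >= 0")
    return [index * interval_seconds for index in range(task_count)]


def launch_staggered(tasks, interval_seconds, sleep=time.sleep):
    """Start each callable in `tasks` in order, pausing between launches.

    The pause keeps freshly started tasks from all hitting the disk at
    the same moment. `sleep` is injectable so the pause can be faked.
    """
    results = []
    for index, task in enumerate(tasks):
        if index > 0:
            sleep(interval_seconds)   # wait before every launch but the first
        results.append(task())
    return results
```

For example, `staggered_offsets(4, 60.0)` spreads four task starts over three minutes instead of launching them all at once.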
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
lavaflow, I agree completely.
|
||
|
|
Mamajuanauk
Master Cruncher United Kingdom Joined: Dec 15, 2012 Post Count: 1900 Status: Offline Project Badges:
|
I don't know if this has already been said, but all my Betas on one machine (Ubuntu 12.04/server) have errored! The error from one is below; let me know if you want more info...
Result Log
Mamajuanauk is the Name! Crunching is the Game! |
||
|
|
|