| World Community Grid Forums
Thread Status: Active Total posts in this thread: 99
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Got 4 of the batch 929 rigid units. Switched LAIM off and suspended each at a checkpoint after they'd all checkpointed at least once. One is validated already so they're looking good.
Having seen how these run, and if they're typical of what we might get in production, I've concluded that I'm not so bothered about the occasional long checkpoint or the iffy progress display. It's not good, and we might have to remind a few people that things are not perfect, but the techs have got it working and it's far more important that we get on with the science. Well done, guys! |
||
|
|
deltavee
Ace Cruncher Texas Hill Country Joined: Nov 17, 2004 Post Count: 4894 Status: Offline Project Badges:
|
The 308s are checkpointing and finishing, so let's get this going.
|
||
|
|
KWSN-A Shrubbery
Senior Cruncher Joined: Jan 8, 2006 Post Count: 476 Status: Offline Project Badges:
|
Was away when these went out so I wasn't able to battle test them. Nine pages valid or pending.
Looks like a go. |
||
|
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges:
|
I thought it was a bit strange: after suspending and restarting a work unit I happened to check the work unit properties and found that the CPU time and checkpoint time were greater than the elapsed time. No other issues.
Beta_OET1_0000299_xEBGP-L_rig.004
CPU time at last checkpoint 00:12:21
CPU time 00:13:04
Elapsed time 00:05:03
Estimated time remaining 00:34:31
Fraction Done 60%
I didn't get an opportunity to see if this happened on any other units. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The Result Log is still disconcerting at a restart. This one ran for 38min before I restarted it (LAIM off), said "Starting task 0,CPU time is 0.000000" at the restart, then ran just 18min to completion (so I assume it actually restarted correctly from a checkpoint). It ran as Quorum 1 and went Valid. The research result may be good, but this behaviour is likely to cause more forum queries from crunchers who come across it.
Result Name: BETA_OET1_0000299_xEBGP-L_rig_0277_0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[08:02:15] Number of tasks = 1
[08:02:15] Starting task 0,CPU time is 0.000000
[08:02:15] ./ZINC01563325.pdbqt size = 32 7
../../projects/www.worldcommunitygrid.org/beta20.xEBGP-L_rig.pdbqt size = 2470 0
[08:40:31] Number of tasks = 1
[08:40:31] Starting task 0,CPU time is 0.000000
[08:40:31] ./ZINC01563325.pdbqt size = 32 7
../../projects/www.worldcommunitygrid.org/beta20.xEBGP-L_rig.pdbqt size = 2470 0
[08:58:15] Finished task #0 cpu time used 3222.906750
08:58:15 (5804): called boinc_finish
</stderr_txt> |
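For anyone who wants to check a log like the one above, here is a rough Python sketch that compares the final CPU time against the wall-clock length of the last run segment. It is only an illustration: the function names are made up for this example, the log excerpt is copied from the post, and it assumes all timestamps fall on the same day and that the task uses a single core (so CPU time cannot exceed wall time within one segment).

```python
import re
from datetime import datetime

# Excerpt of the stderr log quoted in the post (timestamps assumed same day).
LOG = """\
[08:02:15] Starting task 0,CPU time is 0.000000
[08:40:31] Starting task 0,CPU time is 0.000000
[08:58:15] Finished task #0 cpu time used 3222.906750
"""

def parse_time(stamp):
    """Parse an HH:MM:SS stamp (hypothetical helper for this sketch)."""
    return datetime.strptime(stamp, "%H:%M:%S")

def summarize(log):
    """Return segment lengths and whether CPU time exceeds the last segment."""
    starts = [parse_time(s)
              for s in re.findall(r"\[(\d\d:\d\d:\d\d)\] Starting task", log)]
    end = re.search(
        r"\[(\d\d:\d\d:\d\d)\] Finished task #\d+ cpu time used ([\d.]+)", log)
    finished = parse_time(end.group(1))
    cpu_s = float(end.group(2))
    last_segment_s = (finished - starts[-1]).total_seconds()
    total_wall_s = (finished - starts[0]).total_seconds()
    return {
        "last_segment_s": last_segment_s,
        "total_wall_s": total_wall_s,
        "cpu_s": cpu_s,
        # Under the single-core assumption, more CPU time than the last
        # segment lasted means earlier work was carried over.
        "cpu_exceeds_last_segment": cpu_s > last_segment_s,
    }

if __name__ == "__main__":
    print(summarize(LOG))
```

Here the last segment lasted 1064 seconds (about 18 minutes) while the reported CPU time is about 3223 seconds, which fits the assumption that the task resumed from a checkpoint even though the restart line prints a CPU time of 0.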
||
|
|
Falconet
Master Cruncher Portugal Joined: Mar 9, 2009 Post Count: 3315 Status: Offline Project Badges:
|
My 298 finished and validated but checkpoints are simply too far apart.
45 minutes is way too much.
----------------------------------------
- AMD Ryzen 5 1600AF 6C/12T 3.2 GHz - 85W
- AMD Ryzen 5 2500U 4C/8T 2.0 GHz - 28W
- AMD Ryzen 7 7730U 8C/16T 3.0 GHz |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"My 298 finished and validated but checkpoints are simply too far apart. 45 minutes is way too much."
In an ideal world I would agree with you, but at least this isn't as bad as CEP2 which can go for hours between checkpoints. It would be good if it was more often, but at this level I doubt we're going to get the situation where machines are turned on and off again before a checkpoint occurs.
If the WUs get longer, or if the techs feel it's not too onerous to do, I'd like to see more checkpoints as well. But I personally feel that we can live with it at this level.
Just my 2p'th.
Edit: spelling/grammar.
[Edit 1 times, last edit by Former Member at Jan 9, 2015 10:43:52 AM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Eventually, in this situation of one-job flex units with forced checkpointing, a choice will have to be made about how often to checkpoint [or whether there are still technically opportune moments in the simulation], so as not to inundate a 4-, 8- or 16-core machine with checkpoint saves. OTOH, if the checkpointing for flex jobs listens to the write-to-disk setting, skipping a 'forced' checkpoint by the fast machines would be ideal. 'Too much' at 45 minutes is of course massively better than never in 48 hours.
Anyway, no issue here, since I use hibernation extensively, with only the monthly boot for Windows and no reboots on Linux thanks to KSplice. |
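To make the "listen to the write-to-disk setting" idea concrete, here is a small Python sketch of one possible throttling policy. It is not the project's or BOINC's actual code: the class, the function and the numbers are hypothetical, and it simply assumes the application reaches a possible checkpoint moment at known times and saves only if a minimum interval has passed since the last save.

```python
class CheckpointThrottle:
    """Hypothetical helper: permit a save only after a minimum interval."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_checkpoint_s = 0.0  # treat task start as the last save

    def should_checkpoint(self, now_s):
        """Return True (and record the save) if enough time has passed."""
        if now_s - self.last_checkpoint_s >= self.min_interval_s:
            self.last_checkpoint_s = now_s
            return True
        return False


def count_checkpoints(opportunity_times_s, min_interval_s):
    """Count how many checkpoint opportunities actually result in a save."""
    throttle = CheckpointThrottle(min_interval_s)
    return sum(throttle.should_checkpoint(t) for t in opportunity_times_s)


if __name__ == "__main__":
    # Fast machine: an opportunity every 20 s, setting of 60 s between writes.
    print(count_checkpoints(range(20, 201, 20), 60))
    # Slow machine: an opportunity every 45 min (2700 s), same setting.
    print(count_checkpoints([2700, 5400], 60))
```

With these made-up numbers the fast machine skips most of its opportunities, while the slow machine saves at every one, which is the behaviour described above.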
||
|
|
Falconet
Master Cruncher Portugal Joined: Mar 9, 2009 Post Count: 3315 Status: Offline Project Badges:
|
Indeed, 45 minutes is better than CEP2 or no checkpoint at all.
I may hibernate but many don't. I am just afraid there may be lots of hours lost across the grid with this kind of task.
----------------------------------------
- AMD Ryzen 5 1600AF 6C/12T 3.2 GHz - 85W
- AMD Ryzen 5 2500U 4C/8T 2.0 GHz - 28W
- AMD Ryzen 7 7730U 8C/16T 3.0 GHz |
||
|
|
I need a bath
Senior Cruncher USA Joined: Apr 12, 2007 Post Count: 347 Status: Offline Project Badges:
|
I have noticed that Beta units take up nearly as much CPU resources while they are "waiting to run" as when they are actually running.
Is this a problem? |
||
|
|
|