Thread Status: Active
Total posts in this thread: 99
This topic has been viewed 354326 times and has 98 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ]

Got 4 of the batch 929 rigid units. Switched LAIM off and suspended each at a checkpoint after they'd all checkpointed at least once. One is validated already so they're looking good.

Having seen how these run, and if they're typical of what we might get in production, I've concluded that I'm not so bothered about the occasional long checkpoint or the iffy progress display. It's not good, and we might have to remind a few people that things are not perfect, but the techs have got it working and it's far more important that we get on with the science.

Well done, guys!
[Jan 8, 2015 11:51:48 PM]
deltavee
Ace Cruncher
Texas Hill Country
Joined: Nov 17, 2004
Post Count: 4894
Status: Offline

The 308s are checkpointing and finishing, so let's get this going.
[Jan 9, 2015 1:54:18 AM]
KWSN-A Shrubbery
Senior Cruncher
Joined: Jan 8, 2006
Post Count: 476
Status: Offline

Was away when these went out so I wasn't able to battle test them. Nine pages valid or pending.

Looks like a go.
[Jan 9, 2015 2:03:05 AM]
slakin
Advanced Cruncher
Joined: Jul 4, 2008
Post Count: 79
Status: Offline

I thought it was a bit strange: after suspending and restarting a work unit, I happened to check the work unit properties and found that the CPU time and the CPU time at last checkpoint were both greater than the elapsed time. No other issues.

Beta_OET1_0000299_xEBGP-L_rig.004
CPU time at last checkpoint 00:12:21
CPU time 00:13:04
Elapsed time 00:05:03
Estimated time remaining 00:34:31
Fraction Done 60%

I didn't get an opportunity to see if this happened on any other units.
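
For what it's worth, the gap is easy to quantify. A minimal Python sketch follows; the names are made up for illustration, and only the three times come from the properties listed above.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Values copied from the work unit properties (keys are hypothetical labels).
reported = {
    "cpu_at_checkpoint": "00:12:21",
    "cpu": "00:13:04",
    "elapsed": "00:05:03",
}

secs = {name: to_seconds(value) for name, value in reported.items()}

# How far the CPU time runs ahead of the elapsed time, in seconds.
excess = secs["cpu"] - secs["elapsed"]
print(secs, excess)
```

That puts the CPU time 481 seconds (about 8 minutes) ahead of the elapsed time.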
[Jan 9, 2015 2:23:43 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

The Result Log is still disconcerting at a restart. This one ran for 38min before I restarted it (LAIM off), said "Starting task 0,CPU time is 0.000000" at the restart, then ran just 18min to completion (so I assume it actually restarted correctly from a checkpoint). It ran as Quorum 1 and went Valid. The research result may be good, but this behaviour is likely to cause more forum queries from crunchers who come across it.

Result Name: BETA_OET1_0000299_xEBGP-L_rig_0277_0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[08:02:15] Number of tasks = 1
[08:02:15] Starting task 0,CPU time is 0.000000
[08:02:15] ./ZINC01563325.pdbqt size = 32 7 ../../projects/www.worldcommunitygrid.org/beta20.xEBGP-L_rig.pdbqt size = 2470 0
[08:40:31] Number of tasks = 1
[08:40:31] Starting task 0,CPU time is 0.000000
[08:40:31] ./ZINC01563325.pdbqt size = 32 7 ../../projects/www.worldcommunitygrid.org/beta20.xEBGP-L_rig.pdbqt size = 2470 0
[08:58:15] Finished task #0 cpu time used 3222.906750
08:58:15 (5804): called boinc_finish

</stderr_txt>
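
For anyone wanting to check the arithmetic in a log like this, here is a rough Python sketch. It assumes only the line formats visible above and that the run does not cross midnight; the function and variable names are my own.

```python
import re
from datetime import datetime

# The timestamped lines of interest, copied from the result log above.
LOG = """\
[08:02:15] Number of tasks = 1
[08:02:15] Starting task 0,CPU time is 0.000000
[08:40:31] Number of tasks = 1
[08:40:31] Starting task 0,CPU time is 0.000000
[08:58:15] Finished task #0 cpu time used 3222.906750
"""

LINE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s+(.*)")

def summarize(log: str) -> dict:
    """Return wall-clock seconds per run and the reported CPU seconds."""
    starts, finish, cpu = [], None, None
    for line in log.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip lines without a [HH:MM:SS] prefix
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(2)
        if message.startswith("Starting task"):
            starts.append(stamp)
        elif message.startswith("Finished task"):
            finish = stamp
            cpu = float(message.rsplit(" ", 1)[1])
    # Each run lasts from its start until the next start (or the finish).
    bounds = starts + [finish]
    runs = [(b - a).total_seconds() for a, b in zip(bounds, bounds[1:])]
    return {"runs": runs, "wall_total": sum(runs), "cpu": cpu}

summary = summarize(LOG)
print(summary)
```

For this log the two runs come to 2296 s and 1064 s of wall-clock time (3360 s in total), against 3222.9 s of reported CPU time. That would be consistent with the CPU time being carried across the restart even though the log line says 0.000000, but it is only an inference from these few lines.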

[Jan 9, 2015 9:57:23 AM]
Falconet
Master Cruncher
Portugal
Joined: Mar 9, 2009
Post Count: 3315
Status: Offline

My 298 finished and validated but checkpoints are simply too far apart.
45 minutes is way too much.
----------------------------------------


- AMD Ryzen 5 1600AF 6C/12T 3.2 GHz - 85W
- AMD Ryzen 5 2500U 4C/8T 2.0 GHz - 28W
- AMD Ryzen 7 7730U 8C/16T 3.0 GHz
[Jan 9, 2015 10:35:55 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

"My 298 finished and validated but checkpoints are simply too far apart.
45 minutes is way too much."


In an ideal world I would agree with you, but at least this isn't as bad as CEP2, which can go for hours between checkpoints.

It would be good if checkpoints came more often, but at this level I doubt we're going to get the situation where machines are turned off and on again before a checkpoint occurs. If the WUs get longer, or if the techs feel it's not too onerous to do, I'd like to see more checkpoints as well. But I personally feel that we can live with it at this level.

Just my 2p'th.

Edit: spelling/grammar.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jan 9, 2015 10:43:52 AM]
[Jan 9, 2015 10:42:25 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline

Eventually, in this situation of single-job flex units with forced checkpointing, a choice will have to be made about how often to checkpoint [or whether there are still technically opportune moments in the simulation], so as not to inundate a 4-, 8-, or 16-core machine with checkpoint saves. OTOH, if the checkpointing for flex jobs listens to the write-to-disk setting, having the fast machines skip a 'forced' checkpoint would be ideal. 'Too much' at 45 minutes is of course massively better than never in 48 hours :) Anyway, no issue here, since I use hibernation extensively, with only the monthly boot for Windows, and am bootless on Linux using Ksplice. :D
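
To make the "skip a forced checkpoint" idea concrete, here is a minimal Python sketch. It is not how the science application or BOINC does it; the class name, the method name, and the 60-second interval are all assumptions for illustration.

```python
class CheckpointThrottle:
    """Hypothetical helper: save at most once per min_interval_s seconds."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self.last_saved_at = None  # time of the last checkpoint actually written

    def offer(self, now_s: float) -> bool:
        """Call at every point where a checkpoint could be written.

        Returns True if the checkpoint should be saved now, or False if the
        previous save was too recent and this one should be skipped.
        """
        if (self.last_saved_at is not None
                and now_s - self.last_saved_at < self.min_interval_s):
            return False
        self.last_saved_at = now_s
        return True


# A fast machine reaching a possible checkpoint every 20 s, with an assumed
# write-to-disk interval of 60 s.
throttle = CheckpointThrottle(min_interval_s=60)
decisions = [throttle.offer(t) for t in range(0, 121, 20)]
print(decisions)
```

With those numbers only every third opportunity is written to disk, so a many-core machine is not saving constantly, while a slow machine that reaches an opportunity less often than the interval would save every time.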
[Jan 9, 2015 10:52:49 AM]
Falconet
Master Cruncher
Portugal
Joined: Mar 9, 2009
Post Count: 3315
Status: Offline

Indeed, 45 minutes is better than CEP2 or no checkpoint at all.
I may hibernate but many don't. I am just afraid there may be lots of hours lost across the grid with these kinds of tasks.
[Jan 9, 2015 12:11:40 PM]
I need a bath
Senior Cruncher
USA
Joined: Apr 12, 2007
Post Count: 347
Status: Offline

I have noticed that Beta units take up nearly as much CPU while they are "waiting to run" as when they are actually running.
Is this a problem?
[Jan 9, 2015 7:05:40 PM]