World Community Grid Forums
Category: Beta Testing Forum: Beta Test Support Forum Thread: Beta Test - Outsmart Ebola Together - v7.14 - Jan 7, 2015 [ Issues Thread ] |
Thread Status: Active Total posts in this thread: 99
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,37626
Please post your issues/comments/questions for this beta test here. Yes, if you can, please suspend these work units with LAIM turned off. This will allow the application to show that it can properly restore from a checkpoint. Thanks, -Uplinger |
||
|
Falconet
Master Cruncher Portugal Joined: Mar 9, 2009 Post Count: 3295 Status: Offline Project Badges: |
I got one. Ran it for just over two minutes for it to checkpoint at 1% or so. Restarted it with LAIM off and it went back to 0.204%.
----------------------------------------
AMD Ryzen 5 1600AF 6C/12T 3.2 GHz - 85W AMD Ryzen 5 2500U 4C/8T 2.0 GHz - 28W AMD Ryzen 7 7730U 8C/16T 3.0 GHz |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
Falconet,
Thanks for the info, that sounds like it restored from mid checkpoint. I may send more work units for rigid here soon. Thanks, -Uplinger |
||
|
dango
Senior Cruncher Joined: Jul 27, 2009 Post Count: 307 Status: Offline Project Badges: |
got 6...
|
||
|
Falconet
Master Cruncher Portugal Joined: Mar 9, 2009 Post Count: 3295 Status: Offline Project Badges: |
You are welcome. I tried it again and it went from 3.265% to 2.244%.
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
Work units with 298 in them are the start of the flexible work units.
295, 296, and 297 are rigid work units with more than one job per work unit. 298 will only have 1 job per work unit. The flexible ones are more complicated and may run longer. Thanks, -Uplinger |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
All of the initial results have been sent out. If we decide to send out more results I'll let you know.
Thanks, -Uplinger |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I got 4 on two machines: 3 flex and one rigid. The flex ones are running well and do seem to be checkpointing a little more often, and closer to the "Write to disk at most every" time, which I have set to 180 sec. The rigid one ran for nearly 28 minutes before it checkpointed (at 20% progress), but shortly thereafter I suspended and resumed it with LAIM off and it seemed to correctly restart from that checkpoint. (I'm monitoring checkpoints with BoincTasks 1.66, which may not be fully reliable in that regard.)
However (if the times are accurate) I think 28 minutes is way too long, and now at 37 min progress appears stuck at 20%.
Edit: At 38 minutes progress appears to have retreated to 10%. This is similar to the strange numbers that appear on long-running tasks that don't checkpoint, so maybe not too surprising but certainly unwelcome (and very off-putting for newbies). No further checkpoints have occurred either.
Edit 2: Checked stderr.txt in the slot directory and, yes, it does look like it correctly restarted. Also, looking at the times of the .ckp files, it does appear that the checkpoints are at extended intervals. It just checkpointed for a second time at 55 minutes.
[Edit 3 times, last edit by Former Member at Jan 8, 2015 12:12:01 AM] |
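For anyone who wants to check checkpoint spacing from the .ckp file times rather than by eye, here is a minimal Python sketch. It assumes each checkpoint leaves its own `.ckp` file directly in the slot directory, so that the modification times mark when each checkpoint happened; if files are overwritten in place, only the latest time survives and this will not help. The function name and the `pattern` parameter are made up for illustration and are not part of BOINC or the beta application.

```python
from pathlib import Path


def checkpoint_intervals(slot_dir, pattern="*.ckp"):
    """Return the gaps, in seconds, between successive checkpoint files.

    Assumption: every checkpoint writes a separate file matching `pattern`
    in `slot_dir`, so sorted modification times approximate checkpoint times.
    """
    mtimes = sorted(
        p.stat().st_mtime for p in Path(slot_dir).glob(pattern) if p.is_file()
    )
    # Pair each timestamp with the next one and take the difference.
    return [later - earlier for earlier, later in zip(mtimes, mtimes[1:])]
```

Point it at the slot directory of the running task (the path depends on your BOINC data directory) and divide the values by 60 to compare against the "Write to disk at most every" setting.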
||
|
widdershins
Veteran Cruncher Scotland Joined: Apr 30, 2007 Post Count: 674 Status: Offline Project Badges: |
I've tried suspending and restarting a selection of units with LAIM off and all restarted without any problems. Work lost suspending and restarting was minimal for all the units tested when compared against units with similar run times and completion percentages that weren't suspended and restarted.
I wouldn't say no to a few more betas though, just to make sure. I'm still miles off my next badge. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
That rigid unit checkpointed a third time at 69 minutes. This time I noticed that the number of checkpoint files is growing with each checkpoint, both in the slot directory and in the contained vina_checkpoint directory. Is that supposed to happen?
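To track whether the file count really grows with each checkpoint, a small Python sketch like the one below could be run after each checkpoint and the numbers compared. It is only an illustration: the `.ckp` pattern and the `vina_checkpoint` directory name come from the posts above, while the function name and the choice to count every file inside `vina_checkpoint` (whose file names are not known here) are assumptions.

```python
from pathlib import Path


def count_checkpoint_files(slot_dir, pattern="*.ckp", subdir="vina_checkpoint"):
    """Count checkpoint files in a slot directory and in its checkpoint subdirectory.

    Assumptions: top-level checkpoint files match `pattern`; every regular
    file inside `subdir` is counted, since its naming scheme is unknown.
    """
    slot = Path(slot_dir)
    top_level = sum(1 for p in slot.glob(pattern) if p.is_file())
    nested_dir = slot / subdir
    nested = (
        sum(1 for p in nested_dir.iterdir() if p.is_file())
        if nested_dir.is_dir()
        else 0
    )
    return {"slot": top_level, subdir: nested}
```

Calling it once after each checkpoint and noting the two counts would show whether old checkpoint files are being cleaned up or left to accumulate.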
|
||
|
|