World Community Grid Forums
Category: Beta Testing Forum: Beta Test Support Forum Thread: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ] |
Thread Status: Active Total posts in this thread: 179
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
That would have to be a major hmmm. The techs may be in the know as to whether an extra routine kicks in after an unload / resume. An alternative test would be to untick LAIM and suspend a running Beta, which would unload it, then resume it and see whether the checkpoint intervals then roughly double too.
Anyway, now I'm going to suspend the one beta I have and see if I can pump up the completion time.

7.10 beta21 BETA_avx101118-015_r9_1_wcgfahb00400000_0 07:39:05 (07:34:39) 99,04 36,367 13:35:14 03d,14:05:11 8/31/2015 7:24:26 AM [0] 00:00:19 Running 32.33 MB 57.39 MB

(First step done: suspending the task with LAIM off did unload it per the TM; I then set LAIM again and forced a resume, and the task reappeared in the TM. It was at 7:34 for 36%, equal to 36 checkpoints.)

Will get back when done, tomorrow, if I don't forget of course, but it should become visible within a few hours, as the time per percent would then increase.

Reminds me, I had one on Linux, the only one that was restarted. It went from a linear projection of 48+ hours to a run of 55. Here's the snip around the break (the last column, SecondsChkpntInterval, is the CPU seconds since the previous checkpoint):

[11:21:54] INFO: Checkpointed. Progress 81000 of 100000 steps complete CPU time 140431.097665
[11:50:51] INFO: Checkpointed. Progress 82000 of 100000 steps complete CPU time 142163.434065 1732.336400
[12:19:38] INFO: Checkpointed. Progress 83000 of 100000 steps complete CPU time 143890.449300 1727.015235
[12:48:07] INFO: Checkpointed. Progress 84000 of 100000 steps complete CPU time 145599.502948 1709.053648
[13:16:42] INFO: Checkpointed. Progress 85000 of 100000 steps complete CPU time 147314.599345 1715.096397
[13:50:19] INFO:Turning trickle messaging on.
[13:50:19] INFO:Turning intermediate uploads on.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.85000
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[14:46:24] INFO: Checkpointed. Progress 86000 of 100000 steps complete CPU time 150648.582080 3333.982735
[15:43:44] INFO: Checkpointed. Progress 87000 of 100000 steps complete CPU time 154068.058018 3419.475938
[16:40:44] INFO: Checkpointed. Progress 88000 of 100000 steps complete CPU time 157468.356532 3400.298514
[17:38:05] INFO: Checkpointed. Progress 89000 of 100000 steps complete CPU time 160892.862115 3424.505583
[18:35:14] INFO: Sending trickle message to server.

@Techs, I concur with what was observed previously: Toronto, we could have a problem! Seconds per checkpoint practically double after a break.

[Edit 1 times, last edit by Former Member at Aug 31, 2015 3:36:54 PM] |
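The "48+ hours linear, 55 hours actual" figures can be roughly reproduced from the CPU times in the Linux log. This is only a back-of-the-envelope sketch: it assumes the cost per 1000 steps is constant before the restart and constant (at the average of the four post-restart checkpoints shown) afterwards. The variable names are made up for illustration.

```python
# CPU-time values (seconds) copied from the Linux log excerpt in the post.
cpu_at_85k = 147314.599345  # last checkpoint before the restart (85000 steps)
cpu_at_89k = 160892.862115  # last checkpoint shown after the restart (89000 steps)
TOTAL_K = 100               # the task runs 100000 steps, i.e. 100 blocks of 1000

# Average CPU seconds per 1000 steps before and after the restart.
per_k_before = cpu_at_85k / 85
per_k_after = (cpu_at_89k - cpu_at_85k) / 4

# Projection if the pre-restart pace had held for the whole task.
linear_hours = per_k_before * TOTAL_K / 3600

# Projection if the remaining 15 blocks run at the post-restart pace (assumption).
projected_hours = (cpu_at_85k + per_k_after * (TOTAL_K - 85)) / 3600

print(f"pace before restart: {per_k_before:.0f} s per 1000 steps")
print(f"pace after restart:  {per_k_after:.0f} s per 1000 steps")
print(f"linear projection:   {linear_hours:.1f} h")
print(f"with slowdown:       {projected_hours:.1f} h")
```

Under these assumptions the two projections come out at roughly 48 and 55 hours, which matches what was reported for the restarted Linux task.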
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The restarted task on W8.1-64 displayed the same... the checkpoint interval runtime effectively doubled, from about 760 seconds to 1500 or so:

[16:46:07] INFO: Checkpointed. Progress 34000 of 100000 steps complete CPU time 25735.328125
[16:58:57] INFO: Checkpointed. Progress 35000 of 100000 steps complete CPU time 26497.546875
[17:11:46] INFO: Checkpointed. Progress 36000 of 100000 steps complete CPU time 27259.687500
[17:18:49] INFO:Turning trickle messaging on.
[17:18:49] INFO:Turning intermediate uploads on.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.07000
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[17:44:49] INFO: Checkpointed. Progress 37000 of 100000 steps complete CPU time 28787.359375
[18:10:07] INFO: Checkpointed. Progress 38000 of 100000 steps complete CPU time 30291.703125

How to double your Beta runtime ;?

[Edit 2 times, last edit by Former Member at Aug 31, 2015 4:20:29 PM] |
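For anyone who wants to check their own result logs the same way, here is a minimal sketch that pulls the CPU times out of the "Checkpointed" lines and groups the checkpoint intervals by restart. It assumes that the "Turning trickle messaging on" line marks a (re)start of the application, which is what the excerpts in this thread suggest but is not confirmed by the techs. The function and constant names are made up for illustration.

```python
import re

# Log excerpt from the W8.1-64 task quoted in the post.
LOG = """\
[16:46:07] INFO: Checkpointed. Progress 34000 of 100000 steps complete CPU time 25735.328125
[16:58:57] INFO: Checkpointed. Progress 35000 of 100000 steps complete CPU time 26497.546875
[17:11:46] INFO: Checkpointed. Progress 36000 of 100000 steps complete CPU time 27259.687500
[17:18:49] INFO:Turning trickle messaging on.
[17:18:49] INFO:Turning intermediate uploads on.
[17:44:49] INFO: Checkpointed. Progress 37000 of 100000 steps complete CPU time 28787.359375
[18:10:07] INFO: Checkpointed. Progress 38000 of 100000 steps complete CPU time 30291.703125
"""

CHECKPOINT = re.compile(r"Progress (\d+) of (\d+) steps complete CPU time ([\d.]+)")
RESTART_MARKER = "Turning trickle messaging on"  # assumed to appear once per (re)start


def intervals_by_segment(log_text):
    """Return one list of CPU-second checkpoint intervals per run segment.

    A new segment begins at each restart marker. The interval that spans a
    restart is counted in the segment that follows the restart.
    """
    segments = [[]]
    prev_cpu = None
    for line in log_text.splitlines():
        if RESTART_MARKER in line:
            segments.append([])
            continue
        match = CHECKPOINT.search(line)
        if match:
            cpu = float(match.group(3))
            if prev_cpu is not None:
                segments[-1].append(cpu - prev_cpu)
            prev_cpu = cpu
    return segments


segments = intervals_by_segment(LOG)
means = [sum(s) / len(s) for s in segments if s]
for i, mean in enumerate(means):
    print(f"segment {i}: mean interval {mean:.1f} s")
```

For the excerpt above this gives two segments, with the second mean close to twice the first.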
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Never seen an answer to this either: if the generation 1 task is assigned to Linux, will all the following tasks in the chain also get assigned to the same platform? Again, and unlike the runtime doubling after a restart, this is not the volunteers' problem; I'm just filling in the knowledge gaps.
|
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Offline Project Badges: |
I looked a little further into the apparent doubling of the interval between checkpoints after a restart. It only appears to happen after the first restart. It does not redouble after subsequent restarts, but just stays at double the interval of the initial series before the restart. Here is a spreadsheet showing the number of seconds between checkpoints.
----------------------------------------Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges: |
We are looking into the issue of CPU time doubling after a restart. Thanks to the beta testers for catching this.
Also, the continuations of the simulation in the second, third, fourth, etc. generations are not bound to the same platform; they can run anywhere. Thanks, armstrdj |
||
|
Rarusu
Advanced Cruncher Germany Joined: Feb 7, 2006 Post Count: 64 Status: Offline Project Badges: |
I observed something that is somewhat problematic in my opinion.
As mentioned above, I received the beta WU BETA_avx101118-049_r6_1_wcgfahb100000. It was a resend because the initial cruncher didn't make it in time. The deadline for the resend was pretty optimistic, not even 48 hours. Although the WU was running with high priority, my barebone didn't make it in time, and the WU was aborted after ~40 hours of crunching. A new resend was then sent out with a deadline of not even 1 hour! It seems that my wingman was not able to crunch the WU in that small amount of time, and a third resend was generated.

BETA_avx101118-049_r6_1_wcgfahb100000_3 -- - In Progress 01.09.15 00:53:28 02.09.15 10:29:27 6.75 116.6 / 0.0
BETA_avx101118-049_r6_1_wcgfahb100000_2 -- - No Reply 01.09.15 00:52:33 01.09.15 00:53:24 6.83 116.6 / 0.0
BETA_avx101118-049_r6_1_wcgfahb100000_1 -- - No Reply 30.08.15 15:16:26 01.09.15 00:52:25 40.47 272.0 / 0.0 <-- my machine
BETA_avx101118-049_r6_1_wcgfahb100000_0 -- - No Reply 26.08.15 15:16:01 30.08.15 15:16:01 11.76 310.8 / 0.0

If the WUs are not supposed to get smaller, I suggest adjusting the deadlines for resends in the future. Otherwise a lot of people will not be able to complete the WUs, and a lot of time and electricity will be wasted.
Cheers,
Rarusu |
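The time windows in the result list above can be checked with a little date arithmetic. This sketch assumes the two timestamps on each row are the sent time and the due (or return) time, in day.month.year order, which is how the post reads them. The names are made up for illustration.

```python
from datetime import datetime

FMT = "%d.%m.%y %H:%M:%S"  # day.month.year, as in the German-locale result list

# (result suffix, first timestamp, second timestamp) copied from the list in the post.
RESULTS = [
    ("_3", "01.09.15 00:53:28", "02.09.15 10:29:27"),
    ("_2", "01.09.15 00:52:33", "01.09.15 00:53:24"),
    ("_1", "30.08.15 15:16:26", "01.09.15 00:52:25"),
    ("_0", "26.08.15 15:16:01", "30.08.15 15:16:01"),
]


def window_hours(sent, due):
    """Hours between the two timestamps of one result row."""
    delta = datetime.strptime(due, FMT) - datetime.strptime(sent, FMT)
    return delta.total_seconds() / 3600


windows = {suffix: window_hours(sent, due) for suffix, sent, due in RESULTS}
for suffix, hours in windows.items():
    print(f"{suffix}: {hours:.2f} h")
```

On these numbers the original task had a four-day window, the first resend about a day and a half, and the second resend under a minute.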
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I agree with Rarusu's comment about resent WU deadlines, although I suspect that the short deadlines of the beta test might be contributing.
However, if a WU is server aborted as out of time, surely the "resend" should have had a new number as it should have started from the last successful trickle result (assuming that there was one). It might help if you could post the log from your run -- were there any trickle messages? Or maybe this isn't all sorted out yet ... ? |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
[quote]
I observed something that is somewhat problematic in my opinion. As mentioned above, I received the beta WU BETA_avx101118-049_r6_1_wcgfahb100000. It was a resend because the initial cruncher didn't make it in time. The deadline for the resend was pretty optimistic, not even 48 hours. Although the WU was running with high priority, my barebone didn't make it in time, and the WU was aborted after ~40 hours of crunching. A new resend was then sent out with a deadline of not even 1 hour! It seems that my wingman was not able to crunch the WU in that small amount of time, and a third resend was generated.

BETA_avx101118-049_r6_1_wcgfahb100000_3 -- - In Progress 01.09.15 00:53:28 02.09.15 10:29:27 6.75 116.6 / 0.0
BETA_avx101118-049_r6_1_wcgfahb100000_2 -- - No Reply 01.09.15 00:52:33 01.09.15 00:53:24 6.83 116.6 / 0.0
BETA_avx101118-049_r6_1_wcgfahb100000_1 -- - No Reply 30.08.15 15:16:26 01.09.15 00:52:25 40.47 272.0 / 0.0 <-- my machine
BETA_avx101118-049_r6_1_wcgfahb100000_0 -- - No Reply 26.08.15 15:16:01 30.08.15 15:16:01 11.76 310.8 / 0.0

If the WUs are not supposed to get smaller, I suggest adjusting the deadlines for resends in the future. Otherwise a lot of people will not be able to complete the WUs, and a lot of time and electricity will be wasted. Cheers Rarusu
[/quote]
Validator growing pains... Of course the result would have to be recognized through the last trickle it succeeded in completing, i.e. the one done at 40.47 hours. A re-send is only supposed to include the steps from the last good trickle onward, so re-issuing in full is pure waste, but as said, those are validator and work generation pains [rules to refine]. Yes, on Beta we see truly silly deadlines; we've even seen 0:00 hours. |
||
|
Rarusu
Advanced Cruncher Germany Joined: Feb 7, 2006 Post Count: 64 Status: Offline Project Badges: |
Thanks for the clarification.
I haven't caught that many beta WUs in the past and wasn't aware that this is a known problem in beta testing. As soon as I'm home I can look for a log and post its content. But the trickle messages shouldn't be the problem, as I could observe the elapsed time for the WU growing. In my understanding this can only mean that the trickle messages were received and validated by the server.
Cheers,
Rarusu |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Each science has its own verification and validation rules, and this being a different beast entirely, they need enhancing hands-on.
As for the trickles, the design is to verify each of them 'on-the-fly', and if one is invalid, the task is aborted by instruction of the server [with a little scheduling delay]. Then a new task is generated to start from the last good trickle [see uplinger's posts in this thread]. [Edit 2 times, last edit by Former Member at Sep 1, 2015 11:33:46 AM] |
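The trickle design described above can be pictured with a small sketch. This is purely illustrative: it is not WCG or BOINC server code, and the `Trickle` record and `resume_step_after_abort` function are hypothetical names. It assumes each trickle carries the step it reached and gets a simple valid/invalid verdict from the server.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trickle:
    """Hypothetical record of one intermediate result reported by a task."""
    step: int    # simulation step reached when the trickle was sent
    valid: bool  # outcome of the server-side check (assumed to be a plain yes/no)


def resume_step_after_abort(trickles: List[Trickle]) -> Optional[int]:
    """Return the step a replacement task should start from, or None.

    Trickles are checked in arrival order. The first invalid one aborts the
    task, and the replacement starts from the last good trickle (step 0 if
    there was none). None means every trickle passed, so no replacement is needed.
    """
    last_good = 0
    for trickle in trickles:
        if not trickle.valid:
            return last_good
        last_good = trickle.step
    return None


history = [Trickle(10000, True), Trickle(20000, True), Trickle(30000, False)]
print(resume_step_after_abort(history))
```

In this toy history the third trickle fails, so the replacement task would pick up from step 20000 rather than rerun the whole WU.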
||
|
|