Thread Status: Active
Total posts in this thread: 179
This topic has been viewed 470505 times and has 178 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

That would have to be a major hmmm; the techs may be in the know on whether an extra routine kicks in after an unload / resume. An alternate test would be to untick LAIM and suspend a running Beta, which would unload it, then resume it and see whether the checkpoint intervals then roughly double too.

Anyway, now I'm going to suspend the one beta I have and see if I can pump up the completion time.

7.10 beta21 BETA_avx101118-015_r9_1_wcgfahb00400000_0 07:39:05 (07:34:39) 99,04 36,367 13:35:14 03d,14:05:11 8/31/2015 7:24:26 AM [0] 00:00:19 Running 32.33 MB 57.39 MB

(First step done: suspending the task with LAIM off did unload it, per the TM. I then set LAIM again and forced a resume, and the task reappeared in the TM. It was at 7:34 for 36%, equal to 36 checkpoints.)

Will get back when done, tomorrow, if I don't forget of course, but it should become visible within a few hours, as the time per percent would then increase.

Reminds me, I had one on Linux, the only one that was restarted. It went from a linear projection of 48+ hours to a run of 55. Here's the snippet around the break (the last column is the CPU seconds since the previous checkpoint):

[11:21:54] INFO: Checkpointed. Progress 81000 of 100000 steps complete CPU time 140431.097665 SecondsChkpntInterval
[11:50:51] INFO: Checkpointed. Progress 82000 of 100000 steps complete CPU time 142163.434065 1732.336400
[12:19:38] INFO: Checkpointed. Progress 83000 of 100000 steps complete CPU time 143890.449300 1727.015235
[12:48:07] INFO: Checkpointed. Progress 84000 of 100000 steps complete CPU time 145599.502948 1709.053648
[13:16:42] INFO: Checkpointed. Progress 85000 of 100000 steps complete CPU time 147314.599345 1715.096397
[13:50:19] INFO:Turning trickle messaging on.
[13:50:19] INFO:Turning intermediate uploads on.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic
Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.85000
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[14:46:24] INFO: Checkpointed. Progress 86000 of 100000 steps complete CPU time 150648.582080 3333.982735
[15:43:44] INFO: Checkpointed. Progress 87000 of 100000 steps complete CPU time 154068.058018 3419.475938
[16:40:44] INFO: Checkpointed. Progress 88000 of 100000 steps complete CPU time 157468.356532 3400.298514
[17:38:05] INFO: Checkpointed. Progress 89000 of 100000 steps complete CPU time 160892.862115 3424.505583
[18:35:14] INFO: Sending trickle message to server.

@Techs, I concur with the previous observation: Toronto, we could have a problem! Seconds per checkpoint practically double after a break.
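For anyone wanting to redo that last column from their own log, here is a minimal sketch in Python. The function name and the regular expression are my own and only assume the "Checkpointed. Progress ... CPU time ..." line format quoted above; this is not anything shipped with the science application.

```python
import re

# Matches the checkpoint lines quoted above. A hand-added interval column
# after the CPU time, if present, is simply ignored.
CHECKPOINT_RE = re.compile(
    r"Checkpointed\. Progress (\d+) of (\d+) steps complete CPU time ([\d.]+)"
)


def checkpoint_intervals(log_text):
    """Return (step, cpu_seconds_since_previous_checkpoint) pairs."""
    points = [
        (int(m.group(1)), float(m.group(3)))
        for m in CHECKPOINT_RE.finditer(log_text)
    ]
    return [
        (step, cpu - prev_cpu)
        for (_, prev_cpu), (step, cpu) in zip(points, points[1:])
    ]


# Three checkpoint lines copied from the Linux log above.
sample = """\
[13:16:42] INFO: Checkpointed. Progress 85000 of 100000 steps complete CPU time 147314.599345 1715.096397
[13:50:19] INFO:Turning trickle messaging on.
[14:46:24] INFO: Checkpointed. Progress 86000 of 100000 steps complete CPU time 150648.582080 3333.982735
[15:43:44] INFO: Checkpointed. Progress 87000 of 100000 steps complete CPU time 154068.058018 3419.475938
"""

intervals = checkpoint_intervals(sample)
for step, seconds in intervals:
    print(f"step {step}: {seconds:.6f} CPU s since previous checkpoint")
```

The differences it prints should match the hand-computed column, since both are just the CPU time of one checkpoint minus that of the one before.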
----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 31, 2015 3:36:54 PM]
[Aug 31, 2015 3:34:34 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

The restarted task on W8.1-64 displayed the same: the checkpoint interval runtime effectively doubled, from about 760 seconds to 1500 or so:

[16:46:07] INFO: Checkpointed. Progress 34000 of 100000 steps complete CPU time 25735.328125
[16:58:57] INFO: Checkpointed. Progress 35000 of 100000 steps complete CPU time 26497.546875
[17:11:46] INFO: Checkpointed. Progress 36000 of 100000 steps complete CPU time 27259.687500
[17:18:49] INFO:Turning trickle messaging on.
[17:18:49] INFO:Turning intermediate uploads on.
%IMPACT-I: Softcore binding energy with umax = 1000.00000
%IMPACT-I: Using AGBNP2: Analytical Generalized Born Model + Analytic
Non-Polar Hydration Model
%IMPACT-I: Hybrid potential for binding with lambda = 0.07000
agbnpf_assign_parameters(): info: attempting to load from SQL tables.
[17:44:49] INFO: Checkpointed. Progress 37000 of 100000 steps complete CPU time 28787.359375
[18:10:07] INFO: Checkpointed. Progress 38000 of 100000 steps complete CPU time 30291.703125

How to double your Beta runtime ;?
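To put a number on the jump, here is a rough sketch in Python that splits the checkpoint intervals at the restart and compares the averages. It assumes the "Turning trickle messaging on" line marks the (re)start of the task, which is my reading of the log rather than documented behaviour, and the helper name is made up.

```python
import re
from statistics import mean

CHECKPOINT_RE = re.compile(
    r"Progress (\d+) of \d+ steps complete CPU time ([\d.]+)"
)
# Assumption: this line is written each time the task (re)starts.
RESTART_MARKER = "Turning trickle messaging on"


def interval_ratio(log_lines):
    """Mean CPU seconds per checkpoint after the restart, divided by before."""
    before, after = [], []
    prev_cpu = None
    restarted = False
    for line in log_lines:
        if RESTART_MARKER in line:
            restarted = True
            continue
        m = CHECKPOINT_RE.search(line)
        if not m:
            continue
        cpu = float(m.group(2))
        if prev_cpu is not None:
            # The interval that spans the restart is counted as "after".
            (after if restarted else before).append(cpu - prev_cpu)
        prev_cpu = cpu
    return mean(after) / mean(before)


# Lines copied from the W8.1-64 log above.
lines = [
    "[16:46:07] INFO: Checkpointed. Progress 34000 of 100000 steps complete CPU time 25735.328125",
    "[16:58:57] INFO: Checkpointed. Progress 35000 of 100000 steps complete CPU time 26497.546875",
    "[17:11:46] INFO: Checkpointed. Progress 36000 of 100000 steps complete CPU time 27259.687500",
    "[17:18:49] INFO:Turning trickle messaging on.",
    "[17:44:49] INFO: Checkpointed. Progress 37000 of 100000 steps complete CPU time 28787.359375",
    "[18:10:07] INFO: Checkpointed. Progress 38000 of 100000 steps complete CPU time 30291.703125",
]

ratio = interval_ratio(lines)
print(f"after/before ratio: {ratio:.2f}")
```

On these five checkpoints the ratio comes out close to 2, in line with the 760 to 1500 seconds observed above.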
----------------------------------------
[Edit 2 times, last edit by Former Member at Aug 31, 2015 4:20:29 PM]
[Aug 31, 2015 4:18:18 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

Never seen an answer to this either: if the generation 1 task is assigned to Linux, will all the following tasks in the chain also get assigned to the same platform? Again, and as opposed to the runtime doubling after a restart, this is not the volunteers' problem, just filling in the knowledge gaps.
[Aug 31, 2015 5:22:09 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7579
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

I looked a little further into the apparent doubling of the interval between checkpoints after a restart. It only appears to happen after the first restart. It does not redouble after subsequent restarts, but just stays at double the interval of the initial series before the restart. Here is a spreadsheet showing the number of seconds between checkpoints.


Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Aug 31, 2015 6:11:13 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

We are looking into the issue of CPU time doubling after a restart. Thanks to the beta testers for catching this.

Also, the continuations of the simulation in the second, third, fourth, etc. generations are not bound to the same platform; they can run anywhere.

Thanks,
armstrdj
[Aug 31, 2015 9:19:03 PM]
Rarusu
Advanced Cruncher
Germany
Joined: Feb 7, 2006
Post Count: 64
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

I observed something that is somewhat problematic in my opinion.

As mentioned above I received the beta WU
BETA_avx101118-049_r6_1_wcgfahb100000

It was a resend because the initial cruncher didn't make it in time.
The deadline for the resend was pretty optimistic, not even 48 hours. Although the WU was running with high priority, my barebone didn't make it in time; the WU was aborted after ~40 hours of crunching and a new resend was sent out with a deadline of not even 1 hour!
It seems that my wingman was not able to crunch the WU in that small amount of time and a third resend was generated.

BETA_avx101118-049_r6_1_wcgfahb100000_3 -- - In Progress 01.09.15 00:53:28 02.09.15 10:29:27 6.75 116.6 / 0.0
BETA_avx101118-049_r6_1_wcgfahb100000_2 -- - No Reply 01.09.15 00:52:33 01.09.15 00:53:24 6.83 116.6 / 0.0
BETA_avx101118-049_r6_1_wcgfahb100000_1 -- - No Reply 30.08.15 15:16:26 01.09.15 00:52:25 40.47 272.0 / 0.0 <-- my machine
BETA_avx101118-049_r6_1_wcgfahb100000_0 -- - No Reply 26.08.15 15:16:01 30.08.15 15:16:01 11.76 310.8 / 0.0


If the WUs are not supposed to get smaller, I suggest adjusting the deadlines for resends in the future. Otherwise a lot of people will not be able to complete the WUs and a lot of time and electricity will be wasted.
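For reference, the time windows in the rows above can be worked out with a few lines of Python. This is only a sketch: it assumes the two timestamps on each row are the sent time and the due (or return) time, in day.month.year order, and the short suffix labels are mine.

```python
from datetime import datetime

FMT = "%d.%m.%y %H:%M:%S"  # day.month.year, as shown in the rows above

# (result suffix, first timestamp, second timestamp) copied from the rows above.
# Assumption: first = sent time, second = due or return time.
rows = [
    ("_3", "01.09.15 00:53:28", "02.09.15 10:29:27"),
    ("_2", "01.09.15 00:52:33", "01.09.15 00:53:24"),
    ("_1", "30.08.15 15:16:26", "01.09.15 00:52:25"),
    ("_0", "26.08.15 15:16:01", "30.08.15 15:16:01"),
]


def window_hours(sent, due):
    """Hours between the two timestamps of one row."""
    delta = datetime.strptime(due, FMT) - datetime.strptime(sent, FMT)
    return delta.total_seconds() / 3600


windows = {name: window_hours(sent, due) for name, sent, due in rows}
for name, hours in windows.items():
    print(f"{name}: {hours:.2f} h")
```

Under that reading of the columns, the original task had a four-day window, the two longer resends well under 48 hours, and the second resend less than a minute.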

Cheers
Rarusu
----------------------------------------
Cheers,
Rarusu


[Sep 1, 2015 9:41:17 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

I agree with Rarusu's comment about resent WU deadlines, although I suspect that the short deadlines of the beta test might be contributing.

However, if a WU is server aborted as out of time, surely the "resend" should have had a new number as it should have started from the last successful trickle result (assuming that there was one). It might help if you could post the log from your run -- were there any trickle messages?

Or maybe this isn't all sorted out yet ... ?
[Sep 1, 2015 10:05:19 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

Validator growing pains... of course the result would have to be recognized through the last trickle it managed to complete, i.e. the one done at 40.47 hours. A re-send is only supposed to include the steps from the last good trickle onward, so re-issuing in full is pure waste, but as said, those are validator and work generation pains [rules to refine].

Yes, on Beta we see truly silly deadlines; we've even seen 0:00 hours.
[Sep 1, 2015 10:15:00 AM]
Rarusu
Advanced Cruncher
Germany
Joined: Feb 7, 2006
Post Count: 64
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

Thanks for the clarification.
I didn't catch that many beta WUs in the past and wasn't aware that this is a known problem in beta testing.

As soon as I'm home I can look for a log and post the content. But the trickle messages shouldn't be the problem, as I could see the elapsed time for the WU growing. As I understand it, this can only mean that the trickle messages were received and validated by the server.
----------------------------------------
Cheers,
Rarusu


[Sep 1, 2015 10:50:02 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: New Beta Test for PC v7.10 - August 25, 2015 [ Issues Thread ]

Each science has its own verification and validation rules, and this being a different beast entirely, they need enhancing hands-on.

As for the trickles, the design is to verify each of them 'on-the-fly', and if one is invalid, the task is aborted by instruction of the server [with a little scheduling delay]. Then a new task is generated to start from the last good trickle [See uplinger's posts in this thread.]
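As a toy illustration of that flow, and emphatically not the actual server code, here is a minimal Python sketch. The function name, the (step, valid) pairs and the step numbers are all made up for the example; it only shows the idea of stopping at the first invalid trickle and restarting from the last good one.

```python
def resume_step(trickles):
    """Return the step of the last valid trickle before the first invalid one.

    `trickles` is a list of (step, is_valid) pairs in the order received.
    Returns 0 when no valid trickle precedes the first invalid one, meaning
    the hypothetical replacement task would have to start from scratch.
    """
    last_good = 0
    for step, is_valid in trickles:
        if not is_valid:
            break  # the server would abort the task at this point
        last_good = step
    return last_good


# Hypothetical example: two good trickles, then one that fails validation.
received = [(10000, True), (20000, True), (30000, False)]
print(resume_step(received))
```

In this made-up case the replacement task would pick up at step 20000 rather than repeating the whole run.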
----------------------------------------
[Edit 2 times, last edit by Former Member at Sep 1, 2015 11:33:46 AM]
[Sep 1, 2015 11:21:16 AM]