World Community Grid Forums
Category: Completed Research Forum: Help Cure Muscular Dystrophy - Phase 2 Forum Thread: WUs looping under Linux |
Thread Status: Active Total posts in this thread: 13
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3715 Status: Offline Project Badges: |
I have had several WUs looping on my quad running Linux. They all belong to the same RADIA/RADIA group, but other WUs of this group have completed fine and validated. The looping WUs are as follows:

CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_3458_1 at 8.805 %
CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_5489_0 at 13.836 %
CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_6405_1 at 68.553 %
CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_6648_1 at 42.767 %
CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_6989_1 at 22.012 %
CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_8895_0 at 8.176 %
CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_8911_0 at 11.320 %

These WUs seem to run correctly (runtime and TTC change normally, and CPU is at 100 %), but the percentage does not change, no checkpoint is taken, and the processor temperature is a few degrees below its usual level at 100 % use. In case it would help, I rebooted the machine, but they restarted from their last checkpoint and looped again at the same percentage. For the time being I have kept these WUs suspended in case there is something to try or some test to run, although I am not too hopeful.
Cheers. Jean.
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
If a checkpoint corresponds to a completed position, and therefore to a jump in the percentage, it's not unthinkable that the task has gone from short positions (no, this is not the financial market) to very long ones. I'd let at least one run and see if it cuts out at some point.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All! |
||
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3715 Status: Offline Project Badges: |
You are right, and I forgot to mention times...
Thinking like you, I gave the first three looping tasks plenty of time, to give them a chance to compute tougher positions. Below are the runtimes when I stopped them and the runtimes they displayed after restarting them:
I think they had enough time to reach the next position... At this point I noticed the lower processor temperature, indicating that one (or more) was no longer computing anything (they are probably running a short logic-only loop), so I stopped the other ones after they had missed a few checkpoints.
Cheers. Jean.
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Oddly, I noticed from the fan kicking in that the start-up of CMD2 jobs does indeed cause some temperature rise.
I'm using Coretemp, so I can quickly see in the system tray which of the 4 threads is bumming. Still, the RosettaView util does the job of launching an audio alert or pop-up if a job makes no progress after 1 hour.
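For anyone without such a utility, the "no progress after 1 hour" rule can be sketched in a few lines. This is only an illustration and not how RosettaView works internally: the names `TaskProgress`, `update` and `stalled` are made up here, and the samples (task name, fraction done, timestamp in seconds) are assumed to come from wherever you read your task list. Keep in mind that a frozen percentage only tells you the display has not moved, not that the task is dead.

```python
from dataclasses import dataclass


@dataclass
class TaskProgress:
    """Last observed fraction done and the time (seconds) it last changed."""
    fraction_done: float
    last_change: float


def update(state, name, fraction_done, now):
    """Record a sample; the timer is reset only when the fraction moves."""
    prev = state.get(name)
    if prev is None or fraction_done != prev.fraction_done:
        state[name] = TaskProgress(fraction_done, now)


def stalled(state, now, limit_s=3600.0):
    """Names of tasks whose fraction has not changed for at least limit_s."""
    return sorted(n for n, p in state.items() if now - p.last_change >= limit_s)


if __name__ == "__main__":
    state = {}
    update(state, "task_a", 0.08805, now=0.0)
    update(state, "task_a", 0.08805, now=3600.0)  # same percentage an hour later
    print(stalled(state, now=3600.0))
```

Calling `update` on every poll and `stalled` once per poll is enough to drive an alert of your choice.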
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"If a checkpoint is equal a position completion, equal to a hike in percent, it's not unthinkable the task has gone from short-positions (no this is not the financial market), to very long."

I have one WU which has been going for over 17 hours since it last checkpointed. I wonder if that's "too" long on a P4?

CMD2_0002-RADIA.clustersOccur-RADIA.clustersOccur_2648
checkpoint CPU time: 374.855400
current CPU time: 63999.130000

[edited to give latest figures before I go to bed]
[Edit 1 times, last edit by Former Member at May 28, 2009 4:56:41 PM]
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
At each checkpoint/position completion, the routine checks whether it has reached the target maximum time, and then considers whether it should continue if there are more positions. My understanding is that a position under way won't be cut off, and I have not come across tasks that never reach that point eventually. Per knreed, if the task happens to start with a long position, it will be run to the end, i.e. the task has at least one complete position.
Maybe the science app could get some logic to indicate, through a mimicked %, that it is actually making progress when working on the big ones, as we only see it use CPU time... which makes members nervous.
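To make that behaviour concrete, here is a small simulation of a limit that is only consulted between positions. It is a sketch of the description above, not the actual science app code: `run_positions`, the per-position costs and the `target_max_s` value are all invented for illustration.

```python
def run_positions(position_costs, target_max_s):
    """Simulate a task whose time limit is checked only between positions.

    position_costs: CPU seconds each position needs (hypothetical figures).
    Returns (positions_completed, cpu_seconds_used).
    """
    elapsed = 0.0
    done = 0
    for cost in position_costs:
        # The limit is looked at only at a checkpoint, and never before
        # the first position, so at least one position always completes.
        if done > 0 and elapsed >= target_max_s:
            break
        elapsed += cost  # a position under way is never interrupted
        done += 1
    return done, elapsed


if __name__ == "__main__":
    # A short position followed by a very long one, with a 1-hour target.
    print(run_positions([100, 50000, 100], target_max_s=3600))
```

With these made-up numbers the long second position runs to completion well past the target, and the task only stops at the next checkpoint, which is the pattern that looks like a hang from the outside.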
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges: |
Sek, good point. There are some workunits that can take hours to complete a position. I will modify the app to show progress during these workunits. Look for this update soon. Until the fix is in production, as long as the CPU time is increasing the application should be making progress, and it is not necessary to abort.
Thanks, armstrdj
[Edit 1 times, last edit by armstrdj at May 28, 2009 3:21:09 PM]
||
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3715 Status: Offline Project Badges: |
armstrdj,
If you say so, I will switch back to Linux (I had no more "good" WUs in the queue because of this morning's shortage) and reactivate them. However, I am still worried that the processor runs 5 or 6 °C below its normal full-speed temperature when all four cores are running these "looping" tasks. I have the feeling that they are simply doing what I would call "looping on a branch" and not using any of the computation modules of the processor.
Cheers. Jean.
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The one I listed above is now at:

checkpoint CPU time: 78189.820000
current CPU time: 98633.300000
fraction done: 0.025157

At that rate, it'll take over 1000 hours. Having checkpointed at over 21 hours with 2.5% done, shouldn't it have ended?
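For reference, the "over 1000 hours" figure is a straight-line extrapolation from the numbers above, which can be reproduced as follows. The function name `projected_total_hours` is made up, and the estimate assumes every remaining position costs the same as the ones already done, which is a doubtful assumption when position times vary this much.

```python
def projected_total_hours(cpu_seconds, fraction_done):
    """Linear extrapolation of total CPU hours from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return cpu_seconds / fraction_done / 3600.0


if __name__ == "__main__":
    # Figures quoted in the post above.
    print(projected_total_hours(98633.30, 0.025157))  # from current CPU time
    print(projected_total_hours(78189.82, 0.025157))  # from checkpoint CPU time
```

Using the current CPU time gives a little under 1,100 hours; using the checkpoint CPU time, which is the moment the 2.5% was actually reached, gives a somewhat lower figure.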
||
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3715 Status: Offline Project Badges: |
Well, I have done as armstrdj said, and he was right: my 13 "looping" WUs have finally passed their "blocking" positions and completed.
The good point is that there was no loop; my feeling was wrong. The bad points are:
1. that it is very puzzling to see a task blocked for more than 4 hours on the same percentage on a fast machine, and when there are two such long positions in the same WU it is very uncomfortable...
2. the time limit feature does not seem to apply at all to those WUs under Linux, although they are said to be version 6.13, the same as those which have the limiting feature under Windows. Is it broken, not activated, or has the limit been set to a higher time?
Cheers. Jean.
||
|
|