| World Community Grid Forums
Thread Status: Active Total posts in this thread: 46
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just checked a machine that I have running. It was stuck at 100% for 159 hours. UD agent 3.0 (2844), device ID 464498.
I killed the ud_99xxxx process so that the result would be sent back. [Edit 1 times, last edit by Former Member at Nov 16, 2006 9:48:09 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My machine also froze at 100% on a work unit last night. Device ID 459578
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Be sure that you are killing the HPF2 application program and not the UD client. If you kill the client, it simply pops up again at the next boot. If you kill HPF2, then the client will request a new copy and a new work unit.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Be sure that you are killing the HPF2 application program and not the UD client. If you kill the client, it simply pops up again at the next boot. If you kill HPF2, then the client will request a new copy and a new work unit.
What happens if you 'log off' Windows and then 'log on' again? That's what I've been doing when I see the 100% problem, and then something starts at 0%. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello halfcard,
I suspect that the 100% error units are not writing the work files properly, so they cannot create a result file to return and just loop until you grow tired and shut them down. Then they just start over again from the beginning, endlessly repeating whenever you reboot / login. This is just an unconfirmed guess on my part.
You can confirm it by selecting My Grid - Device Manager - Device Statistics and looking at the Last Results Returned column. If you returned any sort of result (even an error) and downloaded a new work unit, it will show up there. If that shows that you have not downloaded a new work unit, then use Task Manager to kill the WCGrid_Rosetta application. (I assume that is the name for HPF2.)
Please tell us what you see in Device Statistics.
Lawrence |
||
|
|
Alther
Former World Community Grid Tech United States of America Joined: Sep 30, 2004 Post Count: 414 Status: Offline Project Badges:
|
Hi folks. I don't have an explanation yet, but I am looking into the problem. The big problem is that I haven't been able to reproduce it yet, so some additional information would be great if you know it.
* Was the workunit restarted from a checkpoint at any time during the WU run (e.g. did you stop/restart UD, log off/on Windows, reboot, etc.)?
* ...or did the WU run straight through?
* Do the graphics and the UD percent complete match up?
* If you watch the graphics, how does the percent complete increment throughout the run?
* When it's sitting at 100%, is the wcg_hpf2_rosetta.exe program taking 100% CPU still or is it a ud_* process, or is the CPU not being utilized at all and it's just sitting there?
* Instead of killing the wcg_hpf2_rosetta.exe process, what happens if you instead cleanly shut down UD and restart it? What do you observe?
Thanks,
Rick Alther
Former World Community Grid Developer |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi folks. I don't have an explanation yet, but I am looking into the problem. The big problem is that I haven't been able to reproduce it yet, so some additional information would be great if you know it.
* Was the workunit restarted from a checkpoint at any time during the WU run (e.g. did you stop/restart UD, log off/on Windows, reboot, etc.)?
* ...or did the WU run straight through?
Ran straight through. The machine only does World Community Grid 24x7.
* Do the graphics and the UD percent complete match up?
Sorry, didn't check.
* If you watch the graphics, how does the percent complete increment throughout the run?
Sorry, didn't check.
* When it's sitting at 100%, is the wcg_hpf2_rosetta.exe program taking 100% CPU still or is it a ud_* process, or is the CPU not being utilized at all and it's just sitting there?
Fairly certain it was ud_* that was at 100% CPU usage. That's what I killed in order to get it to send the results back.
* Instead of killing the wcg_hpf2_rosetta.exe process, what happens if you instead cleanly shut down UD and restart it? What do you observe?
Killed ud_* and sent the work unit back.
Thanks,
[Edit 1 times, last edit by Former Member at Nov 17, 2006 3:26:34 PM] |
||
|
|
davidhobbs
Senior Cruncher England Joined: Dec 30, 2004 Post Count: 152 Status: Offline Project Badges:
|
I now have a HPF2 UD work unit stuck at 100%, so I hope the following info might help Rick.
Device ID 179553.
Task run time almost 33 hours.
No result returned during this time.
No graphics viewable (get the "wait a few seconds then return" message).
The PC will have been re-started at least four times, but using hibernation so the work unit itself should not have been interrupted.
The process UD_9930506.exe is using 98% CPU time. I can see UD.exe also listed in task manager, but I can't see any other WCG processes. Shouldn't there be three of them?
Aha! If I look at one of my other machines running the same project I see that ud_9930506 is listed but not consuming any significant processor time, and wcg_hpf2_rosetta.exe is the one using 98% CPU time. Perhaps this will give you a clue?
I shut down the agent (by choosing EXIT from the sys tray icon) and then re-started it. The agent started again from 0% without contacting the grid server.
Shall I kill this process and get a new work unit or do you want to see what happens when this one reaches 100% again?
David. |
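For anyone who wants to report the same comparison from their own machine, here is a rough sketch of the check described above: given process names and CPU percentages copied by hand from Task Manager, it says whether the HPF2 application or a ud_* process is the busy one. The function name, the return labels, and the 50% threshold are illustrative assumptions, not part of the UD agent, and the near-zero figures in the usage example are placeholders for "not consuming any significant processor time".

```python
from typing import Dict

# Process name as reported in this thread; compared case-insensitively.
SCIENCE_APP = "wcg_hpf2_rosetta.exe"


def busiest_wcg_process(snapshot: Dict[str, float],
                        busy_threshold: float = 50.0) -> str:
    """Classify a {process name: CPU percent} snapshot (hypothetical helper).

    Returns "science_app" if the HPF2 application is the busy process,
    "ud_process" if a ud_* process is, and "idle" if neither reaches
    the (assumed) busy threshold.
    """
    science_cpu = 0.0
    ud_cpu = 0.0
    for name, cpu in snapshot.items():
        lowered = name.lower()
        if lowered == SCIENCE_APP:
            science_cpu = max(science_cpu, cpu)
        elif lowered.startswith("ud_"):
            # Matches e.g. UD_9930506.exe, but not the agent UD.exe.
            ud_cpu = max(ud_cpu, cpu)
    if science_cpu >= busy_threshold and science_cpu >= ud_cpu:
        return "science_app"
    if ud_cpu >= busy_threshold:
        return "ud_process"
    return "idle"


# Values typed in by hand; only the 98% figures come from the post above.
stuck_machine = {"UD_9930506.exe": 98.0, "UD.exe": 0.0}
other_machine = {"ud_9930506.exe": 1.0, "wcg_hpf2_rosetta.exe": 98.0}
print(busiest_wcg_process(stuck_machine))
print(busiest_wcg_process(other_machine))
```

With the two snapshots above, the stuck machine is classified as having a busy ud_* process and the other machine as having a busy HPF2 application, which is the difference being pointed out here.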
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello davidhobbs,
Thanks! The problem is not - directly - with the HPF2 application but with the UD thread when the HPF2 result should be returned. This is new information and is very important to know in tracking down the bug. Thanks and please don't wipe it until Rick Alther gets a chance to comment. He might want a copy of the work file or something. I am dropping him a note about this. Lawrence |
||
|
|
davidhobbs
Senior Cruncher England Joined: Dec 30, 2004 Post Count: 152 Status: Offline Project Badges:
|
OK,
It's now 2 hours and 4 minutes into the work unit, showing it as 35% completed. I'll keep an eye on the forums. David. |
||