Thread Status: Active
Total posts in this thread: 46
This topic has been viewed 6825 times and has 45 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

Just checked a machine that I have running. It has been stuck at 100% for 159 hours. UD agent 3.0 (2844), device ID 464498.

Killed the ud_99xxxx process so that the work unit would be sent back.
----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 16, 2006 9:48:09 PM]
[Nov 16, 2006 9:47:16 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

My machine did the same thing last night: it reached 100% and froze on that work unit. Device ID 459578.
[Nov 16, 2006 10:24:19 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

Be sure that you are killing the HPF2 application program and not the UD client. If you kill the client, it simply pops up again at the next boot. If you kill HPF2, then the client will request a new copy and a new work unit.
[Nov 17, 2006 12:25:27 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

Quote: Be sure that you are killing the HPF2 application program and not the UD client. If you kill the client, it simply pops up again at the next boot. If you kill HPF2, then the client will request a new copy and a new work unit.

What happens if you 'log off' Windows and then 'log on' again? That's what I've been doing when I see the 100% problem, and then something starts at 0%.
[Nov 17, 2006 4:27:27 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

Hello halfcard,
I suspect that the 100% error units are not writing the work files properly, so they cannot create a result file to return and just loop until you grow tired and shut them down. Then they just start over again from the beginning, endlessly repeating whenever you reboot / login. This is just an unconfirmed guess on my part. You can confirm it by selecting My Grid - Device Manager - Device Statistics and looking at the Last Results Returned column. If you returned any sort of result (even an error) and downloaded a new work unit, it will show up there.

If that shows that you have not downloaded a new work unit, then use Task Manager to kill the WCGrid_Rosetta application. (I assume that is the name for HPF2.)

Please tell us what you see in Device Statistics.

Lawrence
[Nov 17, 2006 5:13:36 AM]
Alther
Former World Community Grid Tech
United States of America
Joined: Sep 30, 2004
Post Count: 414
Status: Offline
Re: Problem with program?

Hi folks. I don't have an explanation yet, but I am looking into the problem. The big problem is that I haven't been able to reproduce it yet, so some additional information would be great if you know it.

* Was the workunit restarted from a checkpoint at any time during the WU run (e.g. did you stop/restart UD, log off/on Windows, reboot, etc.)?
* ...or did the WU run straight through?
* Do the graphics and the UD percent complete match up?
* If you watch the graphics, how does the percent complete increment throughout the run?
* When it's sitting at 100%, is the wcg_hpf2_rosetta.exe program taking 100% CPU still or is it a ud_* process, or is the CPU not being utilized at all and it's just sitting there?
* Instead of killing the wcg_hpf2_rosetta.exe process, what happens if you instead cleanly shut down UD and restart it? What do you observe?

Thanks,
----------------------------------------
Rick Alther
Former World Community Grid Developer
[Nov 17, 2006 1:39:04 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

Quote (Alther): Hi folks. I don't have an explanation yet, but I am looking into the problem. The big problem is that I haven't been able to reproduce it yet, so some additional information would be great if you know it.

* Was the workunit restarted from a checkpoint at any time during the WU run (e.g. did you stop/restart UD, log off/on Windows, reboot, etc.)?

* ...or did the WU run straight through?
Ran straight through. The machine only does World Community Grid, 24x7.

* Do the graphics and the UD percent complete match up?
Sorry, didn't check.

* If you watch the graphics, how does the percent complete increment throughout the run?
Sorry, didn't check.

* When it's sitting at 100%, is the wcg_hpf2_rosetta.exe program taking 100% CPU still or is it a ud_* process, or is the CPU not being utilized at all and it's just sitting there?
Fairly certain it was ud_* that was at 100% CPU usage. That's what I killed in order to get it to send the results back.

* Instead of killing the wcg_hpf2_rosetta.exe process, what happens if you instead cleanly shut down UD and restart it? What do you observe?
Killed ud_* and sent the work unit back.

Thanks,

----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 17, 2006 3:26:34 PM]
[Nov 17, 2006 2:11:29 PM]
davidhobbs
Senior Cruncher
England
Joined: Dec 30, 2004
Post Count: 152
Status: Offline
Re: Problem with program?

I now have an HPF2 UD work unit stuck at 100%, so I hope the following info might help Rick.

Device ID 179553.
Task run time almost 33 hours
No result returned during this time
No graphics viewable (get the "wait a few seconds then return" message)
The PC will have been re-started at least four times, but using hibernation so the work unit itself should not have been interrupted.
The process UD_9930506.exe is using 98% CPU time.
I can see UD.exe also listed in Task Manager, but I can't see any other WCG processes. Shouldn't there be three of them? Aha! If I look at one of my other machines running the same project, I see that ud_9930506 is listed but not consuming any significant processor time, and wcg_hpf2_rosetta.exe is the one using 98% CPU time. Perhaps this will give you a clue?
I shut down the agent (by choosing EXIT from the sys tray icon) and then re-started it. The agent started again from 0% without contacting the grid server.
Shall I kill this process and get a new work unit or do you want to see what happens when this one reaches 100% again?

David.
[Nov 26, 2006 11:56:57 AM]
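
The observations in the post above map onto the three cases Rick asked about: wcg_hpf2_rosetta.exe busy, a ud_* process busy, or nothing busy. A minimal Python sketch of that triage follows. The function name, the 50% threshold, and the idea of passing in a mapping of process name to CPU percent are assumptions for illustration. Only the process names and the 98% figures come from the thread, and nothing here reads real Task Manager data.

```python
def classify_stuck_state(cpu_by_process, busy_threshold=50.0):
    """Classify a machine into one of the three cases from Rick's question.

    cpu_by_process: mapping of process name -> CPU percent (hypothetical input,
    e.g. copied by hand from Task Manager).
    busy_threshold: assumed cut-off for calling a process "busy".
    """
    # Normalise names so "UD_9930506.exe" and "ud_9930506.exe" compare equal.
    busy = {
        name.lower()
        for name, percent in cpu_by_process.items()
        if percent >= busy_threshold
    }
    if "wcg_hpf2_rosetta.exe" in busy:
        return "hpf2_app_busy"
    # "ud_" matches the per-task ud_* processes but not the UD.exe client itself.
    if any(name.startswith("ud_") for name in busy):
        return "ud_process_busy"
    return "cpu_idle"


# The stuck machine: UD_9930506.exe at 98%. The UD.exe value is a placeholder,
# since the post does not report it.
stuck = {"UD_9930506.exe": 98.0, "UD.exe": 0.0}
# The other machine: wcg_hpf2_rosetta.exe at 98%. The ud_9930506 value is a
# placeholder for "no significant processor time".
healthy = {"ud_9930506": 0.0, "wcg_hpf2_rosetta.exe": 98.0}

print(classify_stuck_state(stuck))    # ud_process_busy
print(classify_stuck_state(healthy))  # hpf2_app_busy
```

The two example readings land in different cases, which is the contrast the post points out.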
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem with program?

Hello davidhobbs,
Thanks! The problem is not - directly - with the HPF2 application but with the UD thread when the HPF2 result should be returned. This is new information and is very important to know in tracking down the bug. Thanks and please don't wipe it until Rick Alther gets a chance to comment. He might want a copy of the work file or something. I am dropping him a note about this.

Lawrence
[Nov 26, 2006 1:21:21 PM]
davidhobbs
Senior Cruncher
England
Joined: Dec 30, 2004
Post Count: 152
Status: Offline
Re: Problem with program?

OK,

It's now 2 hours and 4 minutes into the work unit, showing it as 35% completed. I'll keep an eye on the forums.

David.
[Nov 26, 2006 1:58:28 PM]
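
For a rough idea of when this work unit would reach 100% again, the figures in the post above (2 hours 4 minutes elapsed, 35% shown) can be extrapolated. This sketch assumes the displayed percentage advances linearly with run time, which the thread does not confirm. The function name is hypothetical.

```python
from datetime import timedelta


def estimate_total_runtime(elapsed, fraction_done):
    """Extrapolate total run time, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return timedelta(seconds=elapsed.total_seconds() / fraction_done)


elapsed = timedelta(hours=2, minutes=4)   # from the post
total = estimate_total_runtime(elapsed, 0.35)
remaining = total - elapsed

print(f"estimated total:     {total.total_seconds() / 3600:.1f} h")      # 5.9 h
print(f"estimated remaining: {remaining.total_seconds() / 3600:.1f} h")  # 3.8 h
```

Under that assumption the unit would hit 100% a little under four hours after the post.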