| World Community Grid Forums
Thread Status: Active Total posts in this thread: 9
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I may potentially have a "bad" work unit and am not quite sure how to proceed to determine the best course of action from here, so I thought I'd post this up and see if a Tech or CA has any advice.
System is pretty much a dedicated cruncher. Specs: Q6600 running Windows XP Pro 32-bit, BOINC client 6.2.18. This system (Basement) has completed 1482 WUs without a single error in almost 3 months of crunching. The task was faah_4641_000105_MC_xMut_md02630_07_0.
Symptom: I noticed this morning (10:00 EST) that only 3 cores were working even though 4 units were loaded. System Idle Process was at 25%. The questionable WU started at 07:17 EST, showed 00:30:34 of run time, and was not moving, after almost 3 hours of "wall" run time. I flushed/uploaded all completed work (5 units) and suspended the suspect WU. The system immediately picked up the next unit and started working, now on all 4 cores. There are no errors in the log; the task just appeared hung even though it showed as "running" on the Tasks tab. Suspending the task did not show up in the log either. You can see where it picked up the new task at 10:10:18, though.
Log: 24-Nov-2008 07:17:21 [World Community Grid] Starting faah4641_000105_MC_xMut_md02630_07_0
At this point, I'm not sure how to proceed. I resumed the suspect task after the new one started, so I assume the suspect task, which shows "waiting to run", will attempt to resume after the first of the 4 running tasks completes. If it does not run successfully, my plan is to:
1. Stop/restart the service and see if kicking the client does the trick.
2. Abort the suspect WU, since I'm not sure there are any other options left.
I'm trying to avoid aborting the WU if possible but don't know what else to try. I'd like the system not to be penalized for a "bad WU", if that is indeed what it is. Any advice or suggestions would be appreciated. TIA. |
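For anyone who wants to put a number on the symptom above, here is a minimal sketch that compares a task's CPU time with the wall-clock time since it started. The times are the ones quoted in the post; the function names and the 0.5 cutoff are made up for illustration and are not BOINC settings.

```python
from datetime import datetime, timedelta


def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def cpu_efficiency(started, observed, cpu_time):
    """Return CPU time divided by the wall-clock time since the task started."""
    wall = observed - started
    if wall <= timedelta(0):
        raise ValueError("observation time must be after the start time")
    return cpu_time / wall  # timedelta / timedelta gives a float


def looks_stalled(started, observed, cpu_time, threshold=0.5):
    """Flag a task whose CPU share is below an arbitrary illustrative cutoff."""
    return cpu_efficiency(started, observed, cpu_time) < threshold


# Values taken from the post: task start from the log line, the time the
# next task was picked up, and the CPU time shown for the suspect task.
started = datetime(2008, 11, 24, 7, 17, 21)
observed = datetime(2008, 11, 24, 10, 10, 18)
cpu_time = parse_hms("00:30:34")

ratio = cpu_efficiency(started, observed, cpu_time)
print(f"CPU share of wall time: {ratio:.0%}")
print("Looks stalled:", looks_stalled(started, observed, cpu_time))
```

A low ratio alone does not prove a hang, since a task that was preempted or suspended for a while will also show one. It is only a hint that the task deserves a closer look.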
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Your plan is probably the best thing you can do.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Thanks Didactylos,
I'm not sure if this helps determine anything, but in the "slot" folder in the BOINC data path, I found this in "stderr.txt":
Failed to get VersionInfo size: 1812
The "stderrdae.txt" in the main BOINC data folder is empty. So maybe there was an error. I'm not familiar enough with WCG yet to know where else to look or which logs to check. Cheers. |
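If it helps anyone checking the same thing, here is a rough sketch that gathers every non-empty "stderr.txt" from the slot folders in one pass. It assumes the data folder holds a "slots" directory with one subfolder per running task; the function name is made up, and the demo runs against a throwaway directory rather than a real BOINC install.

```python
import tempfile
from pathlib import Path


def collect_slot_stderr(data_dir):
    """Map each slot folder name to the non-empty text of its stderr.txt."""
    found = {}
    slots = Path(data_dir) / "slots"  # assumed layout: <data_dir>/slots/<n>/
    if not slots.is_dir():
        return found
    for slot in sorted(slots.iterdir()):
        log = slot / "stderr.txt"
        if log.is_file():
            text = log.read_text(errors="replace").strip()
            if text:  # skip empty logs
                found[slot.name] = text
    return found


# Demo on a temporary directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    slot0 = Path(tmp) / "slots" / "0"
    slot1 = Path(tmp) / "slots" / "1"
    slot0.mkdir(parents=True)
    slot1.mkdir(parents=True)
    (slot0 / "stderr.txt").write_text("Failed to get VersionInfo size: 1812\n")
    (slot1 / "stderr.txt").write_text("")
    report = collect_slot_stderr(tmp)

for name, text in report.items():
    print(f"slot {name}: {text}")
```

To try it on a real machine, pass your own BOINC data folder to collect_slot_stderr in place of the temporary directory.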
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Normally if there is a fatal error, computation for the task stops and the work unit is reported as an error.
It is highly probable that the crash reporting code crashed or hung, and that is why the wheels fell off. Have a look at your Results Status page. Did anyone else complete this task successfully? |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quote: "Normally if there is a fatal error, computation for the task stops and the work unit is reported as an error. It is highly probable that the crash reporting code crashed or hung, and that is why the wheels fell off. Have a look at your Results Status page. Did anyone else complete this task successfully?"
Good explanation, thanks. I had already checked the Results Status page, and no one else has run the WU. Since this appears to be a single quorum project, I assume that means this WU hasn't been sent because of a previous error report. I'll proceed with the direction I was headed. Thanks for your help and explanations. Very quick response! I'll follow up on this post with either a success or fail report just for posterity... Cheers. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just wanted to post a follow-up on my issue. The work unit restarted and ran to completion without further problems.
While it was running (it was the only faah unit, alongside 3 rice units), I did notice in Task Manager that the original thread that hung was still sitting there idle. After the work unit finished, I kicked the client, and while it was stopped I deleted the dead thread. No further issues. Thanks again, Didactylos, for your explanation and response. Cheers. |
||
|
|
cosmo_vk
Cruncher Russian Federation Joined: Jan 31, 2008 Post Count: 7 Status: Offline Project Badges:
|
I had a problem with 4 results. Each of them has the status Inconclusive.
In the Inconclusive state: <core_client_version>6.2.28</core_client_version>
Is this bad for me or not? |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Hi,
Remember, this is zero-redundancy work, so Inconclusive initially means: "Hey, this machine had a problem, let's do some extra verification by sending out an extra result to confirm that it has returned to producing valid work."
Happy Thanksgiving
WCG
Please help to make the Forums an enjoyable experience for All! |
||
|
|
cosmo_vk
Cruncher Russian Federation Joined: Jan 31, 2008 Post Count: 7 Status: Offline Project Badges:
|
All tasks have the Valid status. It's very good! |
||
|
|
|