Thread Status: Active
Total posts in this thread: 9
This topic has been viewed 1679 times and has 8 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Potential problem with a 4641 WU

I may have a "bad" work unit and am not sure how best to proceed, so I thought I'd post this and see if a Tech or CA has any advice.

The system is pretty much a dedicated cruncher. Specs: Q6600 running Windows XP Pro 32-bit, BOINC client 6.2.18.

This system (Basement) has completed 1482 WU without a single error in almost 3 months of crunching.

The task was faah4641_000105_MC_xMut_md02630_07_0.

Symptom: I noticed this morning (10:00 EST) that only 3 cores were working even though 4 units were loaded. System Idle Process was at 25%. The questionable WU started at 07:17 EST, showed 00:30:34 of run time, and was not advancing, despite almost 3 hours of "wall" run time.

I flushed/uploaded all completed work (5 units) and suspended the suspect WU. The system immediately picked up the next unit and started working, now on all 4 cores. No errors are indicated in the log; the task just appeared hung even though it showed as "running" on the Tasks tab. Suspending the task did not show up in the log either. You can see where it picked up the new task at 10:10:18, though.

Log:
24-Nov-2008 07:17:21 [World Community Grid] Starting faah4641_000105_MC_xMut_md02630_07_0
24-Nov-2008 07:17:21 [World Community Grid] Starting task faah4641_000105_MC_xMut_md02630_07_0 using faah version 606
24-Nov-2008 07:17:23 [World Community Grid] Started upload of faah4640_001936_MC_xMut_md02500_01_0_0
24-Nov-2008 07:17:23 [World Community Grid] Started upload of faah4640_001936_MC_xMut_md02500_01_0_1
24-Nov-2008 07:17:27 [World Community Grid] Finished upload of faah4640_001936_MC_xMut_md02500_01_0_0
24-Nov-2008 07:17:27 [World Community Grid] Started upload of faah4640_001936_MC_xMut_md02500_01_0_2
24-Nov-2008 07:17:29 [World Community Grid] Finished upload of faah4640_001936_MC_xMut_md02500_01_0_1
24-Nov-2008 07:17:29 [World Community Grid] Started upload of faah4640_001936_MC_xMut_md02500_01_0_3
24-Nov-2008 07:17:31 [World Community Grid] Finished upload of faah4640_001936_MC_xMut_md02500_01_0_2
24-Nov-2008 07:17:33 [World Community Grid] Finished upload of faah4640_001936_MC_xMut_md02500_01_0_3
24-Nov-2008 08:21:36 [World Community Grid] Computation for task faah4640_001874_MC_xMut_md02500_02_1 finished
24-Nov-2008 08:21:36 [World Community Grid] Starting faah4641_000689_MC_xMut_md02630_02_0
24-Nov-2008 08:21:36 [World Community Grid] Starting task faah4641_000689_MC_xMut_md02630_02_0 using faah version 606
24-Nov-2008 08:21:38 [World Community Grid] Started upload of faah4640_001874_MC_xMut_md02500_02_1_0
24-Nov-2008 08:21:38 [World Community Grid] Started upload of faah4640_001874_MC_xMut_md02500_02_1_1
24-Nov-2008 08:21:42 [World Community Grid] Finished upload of faah4640_001874_MC_xMut_md02500_02_1_0
24-Nov-2008 08:21:42 [World Community Grid] Started upload of faah4640_001874_MC_xMut_md02500_02_1_2
24-Nov-2008 08:21:43 [World Community Grid] Finished upload of faah4640_001874_MC_xMut_md02500_02_1_1
24-Nov-2008 08:21:43 [World Community Grid] Started upload of faah4640_001874_MC_xMut_md02500_02_1_3
24-Nov-2008 08:21:45 [World Community Grid] Finished upload of faah4640_001874_MC_xMut_md02500_02_1_2
24-Nov-2008 08:21:48 [World Community Grid] Finished upload of faah4640_001874_MC_xMut_md02500_02_1_3
24-Nov-2008 08:42:17 [World Community Grid] Computation for task R00219_c3520802980fc6cbf7e51df12c0f35e3_03_004_14 finished
24-Nov-2008 08:42:17 [World Community Grid] Starting R00220_af9ba26927fb386a00e617674274ca5d_02_004_14
24-Nov-2008 08:42:17 [World Community Grid] Starting task R00220_af9ba26927fb386a00e617674274ca5d_02_004_14 using rice version 617
24-Nov-2008 08:42:19 [World Community Grid] Started upload of R00219_c3520802980fc6cbf7e51df12c0f35e3_03_004_14_0
24-Nov-2008 08:42:35 [World Community Grid] Finished upload of R00219_c3520802980fc6cbf7e51df12c0f35e3_03_004_14_0
24-Nov-2008 10:09:29 [World Community Grid] Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 5 completed tasks
24-Nov-2008 10:09:34 [World Community Grid] Scheduler request succeeded: got 0 new tasks
24-Nov-2008 10:10:18 [World Community Grid] Starting R00220_f4c149eb524b2d6dc9f83a9169ddc2cb_01_006_18
24-Nov-2008 10:10:18 [World Community Grid] Starting task R00220_f4c149eb524b2d6dc9f83a9169ddc2cb_01_006_18 using rice version 617
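For anyone who wants to check a log like this without reading it line by line, here is a rough Python sketch. It assumes the line layout seen in the excerpt above (timestamp, project name in brackets, then the message) and only looks at the "Starting task ..." and "Computation for task ... finished" messages. The function name `unfinished_tasks` is made up for illustration; it is not part of BOINC.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above: "DD-Mon-YYYY HH:MM:SS [project] message"
LINE_RE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]*\] (.*)$")
START_RE = re.compile(r"^Starting task (\S+) using ")
DONE_RE = re.compile(r"^Computation for task (\S+) finished")


def unfinished_tasks(log_text):
    """Return (start_time, task_name) pairs for tasks that were started
    but have no matching 'finished' line, oldest start first."""
    started = {}
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        # %b expects English month abbreviations under the default locale
        stamp = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S")
        message = m.group(2)
        s = START_RE.match(message)
        if s:
            started[s.group(1)] = stamp
            continue
        d = DONE_RE.match(message)
        if d:
            # A task may finish without its start being in the excerpt
            started.pop(d.group(1), None)
    return sorted((stamp, name) for name, stamp in started.items())
```

A task that sits at the top of the returned list for hours while its run time stays frozen is only a candidate for a hung task; a long-running but healthy task would look the same in the log.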


At this point, I'm not sure how to proceed. I resumed the suspect task after the new one started, so I assume the suspect task, which shows "waiting to run", will attempt to resume after the first of the 4 running tasks completes. If it does not run successfully, my plan is to:
1. Stop/restart the service and see if kicking the client does the trick.
2. Abort the suspect WU; I'm not sure there are any options left except that.

I'm trying to avoid aborting the WU if possible but don't know what else to try. I'd like to not get the system penalized for a "bad WU" if that is indeed what it is.

Any advice or suggestions would be appreciated. TIA.
[Nov 24, 2008 4:56:38 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Potential problem with a 4641 WU

Your plan is probably the best thing you can do.
[Nov 24, 2008 5:14:37 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Potential problem with a 4641 WU

Thanks Didactylos,

I'm not sure whether this helps determine anything, but in the "slot" folder under the BOINC data path, I found the following in "stderr.txt". The "stderrdae.txt" in the main BOINC data folder is empty.

Failed to get VersionInfo size: 1812
INFO:[07:17:21] Start AutoGrid...

autogrid: autogrid4: Successful Completion.
INFO:[07:17:55] End AutoGrid...
Beginning AutoDock...

autodock4: *** WARNING! Non-integral total charge (-1.02 e) on ligand! ***

INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 0)
_maxGenSeenSoFar changed: 2500
_maxGenSeenSoFar changed: 2626
_maxGenSeenSoFar changed: 2758
_maxGenSeenSoFar changed: 2896
_maxGenSeenSoFar changed: 3041
_maxGenSeenSoFar changed: 3194
_maxGenSeenSoFar changed: 3354
_maxGenSeenSoFar changed: 3522
_maxGenSeenSoFar changed: 3699
_maxGenSeenSoFar changed: 3885
_maxGenSeenSoFar changed: 4080
_maxGenSeenSoFar changed: 4285
_maxGenSeenSoFar changed: 4500
_maxGenSeenSoFar changed: 4726
_maxGenSeenSoFar changed: 4963
_maxGenSeenSoFar changed: 5212
_maxGenSeenSoFar changed: 5473
_maxGenSeenSoFar changed: 5747
_maxGenSeenSoFar changed: 6035
_maxGenSeenSoFar changed: 6337
_maxGenSeenSoFar changed: 6654
_maxGenSeenSoFar changed: 6987
_maxGenSeenSoFar changed: 7337
_maxGenSeenSoFar changed: 7704
_maxGenSeenSoFar changed: 8090
_maxGenSeenSoFar changed: 8495
_maxGenSeenSoFar changed: 8920
_maxGenSeenSoFar changed: 9367
_maxGenSeenSoFar changed: 9836
_maxGenSeenSoFar changed: 10328
Updating Best Energy for WU: 0.00
Finished Docking number 0
Updating Best Energy for WU: -14.03
Finished Docking number 1


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C911E58 read attempt to address 0x00000001

Engaging BOINC Windows Runtime Debugger...


So maybe there was an error. I'm not familiar enough with WCG yet to know where else or what logs to look for.
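In case it is useful to others, here is a small Python sketch for skimming a stderr.txt like the one above. It relies only on the strings visible in this excerpt ("Finished Docking number N", "Unhandled Exception Detected", and the "Reason: ..." line); whether other application versions print the same text is an assumption. The name `summarize_stderr` is made up for illustration.

```python
import re

# Patterns taken from the stderr.txt excerpt above; other versions may differ
DOCKING_RE = re.compile(r"^Finished Docking number (\d+)\s*$")
REASON_RE = re.compile(
    r"^Reason: (.+?) \((0x[0-9A-Fa-f]+)\) at address (0x[0-9A-Fa-f]+)"
)


def summarize_stderr(text):
    """Count 'Finished Docking' lines and report any unhandled exception."""
    summary = {"dockings_finished": 0, "crashed": False,
               "reason": None, "code": None}
    for raw in text.splitlines():
        line = raw.strip()
        if DOCKING_RE.match(line):
            summary["dockings_finished"] += 1
        elif line.startswith("Unhandled Exception Detected"):
            summary["crashed"] = True
        else:
            m = REASON_RE.match(line)
            if m:
                summary["reason"] = m.group(1)  # e.g. "Access Violation"
                summary["code"] = m.group(2)    # exception code as printed
    return summary
```

Usage would be something like `summarize_stderr(open("stderr.txt").read())`, run from inside the slot folder. It only reports what the file says; it cannot tell whether the crash handler itself hung afterwards.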

Cheers.
[Nov 24, 2008 5:23:06 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Potential problem with a 4641 WU

Normally if there is a fatal error, computation for the task stops and the work unit is reported as an error.

It is highly probable that the crash reporting code crashed or hung, and that is why the wheels fell off.

Have a look at your Results Status page. Did anyone else complete this task successfully?
[Nov 24, 2008 5:42:26 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Potential problem with a 4641 WU

Normally if there is a fatal error, computation for the task stops and the work unit is reported as an error.

It is highly probable that the crash reporting code crashed or hung, and that is why the wheels fell off.

Have a look at your Results Status page. Did anyone else complete this task successfully?


Good explanation, thanks.

I had already checked the result status page and no one else has run the WU. Since this appears to be a single quorum project, I assume that means this WU hasn't been sent because of a previous error report.

I'll proceed in the direction I was headed. Thanks for your help and explanations. Very quick response! :)

I'll follow up on this post with either a success or fail report just for posterity...

Cheers.
[Nov 24, 2008 5:51:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Potential problem with a 4641 WU

Just wanted to post a follow up on my issue. The work unit restarted and ran to completion without further problems.

I did notice while it was running (it was the only faah unit running with 3x rice units) that in task manager, the original thread that hung was still sitting there idle. After the work unit finished, I kicked the client and while it was stopped I deleted the dead thread. No further issues.

Thanks again Didactylos for your explanation and response.

Cheers.
[Nov 25, 2008 4:00:22 PM]
cosmo_vk
Cruncher
Russian Federation
Joined: Jan 31, 2008
Post Count: 7
Status: Offline
Re: Potential problem with a 4641 WU

I had a problem with 4 results. Each of them has the status Inconclusive.

  • faah4641_000190_MC_xMut_md02630_0C_0
  • faah4641_000278_MC_xMut_md02630_01_0
  • faah4641_000332_MC_xMut_md02630_0C_0
  • faah4641_000646_MC_xMut_md02630_00_0

Log of a result in the Inconclusive state:
<core_client_version>6.2.28</core_client_version>
<![CDATA[
<stderr_txt>
Failed to get VersionInfo size: 1812
INFO:[00:25:39] Start AutoGrid...

autogrid: autogrid4: Successful Completion.
INFO:[00:26:18] End AutoGrid...
Beginning AutoDock...

autodock4: *** WARNING! Non-integral total charge (-2.05 e) on ligand! ***

INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 0)
_maxGenSeenSoFar changed: 2500
_maxGenSeenSoFar changed: 2626
_maxGenSeenSoFar changed: 2758
_maxGenSeenSoFar changed: 2896
_maxGenSeenSoFar changed: 3041
_maxGenSeenSoFar changed: 3194
_maxGenSeenSoFar changed: 3354
_maxGenSeenSoFar changed: 3522
_maxGenSeenSoFar changed: 3699
_maxGenSeenSoFar changed: 3885
_maxGenSeenSoFar changed: 4080
_maxGenSeenSoFar changed: 4285
_maxGenSeenSoFar changed: 4500
_maxGenSeenSoFar changed: 4726
_maxGenSeenSoFar changed: 4963
_maxGenSeenSoFar changed: 5212
_maxGenSeenSoFar changed: 5473
_maxGenSeenSoFar changed: 5747
_maxGenSeenSoFar changed: 6035
_maxGenSeenSoFar changed: 6337
_maxGenSeenSoFar changed: 6654
_maxGenSeenSoFar changed: 6987
_maxGenSeenSoFar changed: 7337
_maxGenSeenSoFar changed: 7704
_maxGenSeenSoFar changed: 8090
_maxGenSeenSoFar changed: 8495
_maxGenSeenSoFar changed: 8920
_maxGenSeenSoFar changed: 9367
_maxGenSeenSoFar changed: 9836
_maxGenSeenSoFar changed: 10328
Updating Best Energy for WU: 0.00
Finished Docking number 0
Updating Best Energy for WU: -11.93
Finished Docking number 1
Updating Best Energy for WU: -12.55
Finished Docking number 2
Finished Docking number 3
Finished Docking number 4
Finished Docking number 5
Finished Docking number 6
Finished Docking number 7
Finished Docking number 8
Finished Docking number 9
Finished Docking number 10
Finished Docking number 11
Finished Docking number 12
Finished Docking number 13
Updating Best Energy for WU: -13.89
Finished Docking number 14
Finished Docking number 15
Finished Docking number 16
Finished Docking number 17
Finished Docking number 18
Finished Docking number 19

________________________________________________________________________________

autodock4: Successful Completion on "World Community Grid device"

________________________________________________________________________________

INFO:[05:36:28] Start AutoGrid...

autogrid: autogrid4: Successful Completion.
INFO:[05:37:04] End AutoGrid...
Beginning AutoDock...
INFO: Setting num_generations: 27000
About to enter main loop...(dockings already completed: 20)
Finished Docking number 0

________________________________________________________________________________

autodock4: Successful Completion on "World Community Grid device"

________________________________________________________________________________

called boinc_finish

</stderr_txt>
]]>
Is this bad for me or not?
[Nov 27, 2008 6:16:29 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Potential problem with a 4641 WU

Hi,

Remember, this is zero-redundancy work, so Inconclusive initially means: "Hey, this machine had a problem; let's do some extra verification by sending out an extra result to confirm whether it has returned to producing valid work."

Happy Thanksgiving
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Nov 27, 2008 6:24:57 AM]
cosmo_vk
Cruncher
Russian Federation
Joined: Jan 31, 2008
Post Count: 7
Status: Offline
Re: Potential problem with a 4641 WU

All tasks now have the Valid status. That's very good! :)
[Dec 4, 2008 5:24:21 PM]