Thread Status: Active
Total posts in this thread: 3
This topic has been viewed 1349 times and has 2 replies
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Recently Active
Work Units Hanging

In the last two weeks or so I have experienced a few instances of FAAH WUs hanging on machines I do not check very often. If I suspend the units and then resume them, they appear to finish normally. It has happened on three separate machines: two Core2 Duos and a Dell 2950 with two Xeon 5320s. They are all running Linux Mint 11 32-bit. This has only occurred with FAAH, not with any other project. There is nothing in the result file which looks out of the ordinary. I noticed because production on the machines went down, and I would find one WU that had been running for 35+ hours when they normally finish in 6-7 hours. If it keeps recurring sporadically I will update the OS to Linux Mint 14 64-bit. Just posting to see if anyone else has encountered a similar problem and/or found a solution.

Edit:
Result Log

Result Name: faah38913_ ZINC16472778_ xPR_ wC6_ 11_ 1ref9_ 00_ 0--
<core_client_version>6.12.33</core_client_version>
<![CDATA[
<stderr_txt>
INFO:[23:03:09] Start AutoGrid...

autogrid4: Successful Completion.
INFO:[23:07:01] End AutoGrid...
INFO:[23:07:01] Start AutoDock...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 0)
_maxGenSeenSoFar changed: 2500
_maxGenSeenSoFar changed: 2626
_maxGenSeenSoFar changed: 2758
_maxGenSeenSoFar changed: 2896
_maxGenSeenSoFar changed: 3041
_maxGenSeenSoFar changed: 3194
_maxGenSeenSoFar changed: 3354
_maxGenSeenSoFar changed: 3522
_maxGenSeenSoFar changed: 3699
_maxGenSeenSoFar changed: 3885
_maxGenSeenSoFar changed: 4080
_maxGenSeenSoFar changed: 4285
_maxGenSeenSoFar changed: 4500
_maxGenSeenSoFar changed: 4726
_maxGenSeenSoFar changed: 4963
_maxGenSeenSoFar changed: 5212
_maxGenSeenSoFar changed: 5473
_maxGenSeenSoFar changed: 5747
_maxGenSeenSoFar changed: 6035
_maxGenSeenSoFar changed: 6337
_maxGenSeenSoFar changed: 6654
_maxGenSeenSoFar changed: 6987
_maxGenSeenSoFar changed: 7337
_maxGenSeenSoFar changed: 7704
_maxGenSeenSoFar changed: 8090
_maxGenSeenSoFar changed: 8495
_maxGenSeenSoFar changed: 8920
_maxGenSeenSoFar changed: 9367
_maxGenSeenSoFar changed: 9836
_maxGenSeenSoFar changed: 10328
Updating Best Energy for WU: 0.00
Finished Docking number 0
Finished Docking number 1
Finished Docking number 2
Finished Docking number 3
Finished Docking number 4
Finished Docking number 5
Finished Docking number 6
Finished Docking number 7
Finished Docking number 8
Finished Docking number 9
Finished Docking number 10
02:49:51 (30662): No heartbeat from client for 30 sec - exiting
Restoring grahics. bestEnergy: -9.520640 maxGenSeen: 10328
AG Check: Found receptor.A.map
INFO:[02:49:59] Start AutoDock...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 11)
Finished Docking number 11
Finished Docking number 12
Finished Docking number 13
Finished Docking number 14
Finished Docking number 15
Finished Docking number 16
Finished Docking number 17
Finished Docking number 18
Finished Docking number 19
Finished Docking number 20
Finished Docking number 21
Finished Docking number 22
Finished Docking number 23
Finished Docking number 24
Finished Docking number 25
Updating Best Energy for WU: -9.52
Finished Docking number 26
07:49:00 (31408): No heartbeat from client for 30 sec - exiting
Restoring grahics. bestEnergy: -9.891410 maxGenSeen: 10328
AG Check: Found receptor.A.map
INFO:[08:47:46] Start AutoDock...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 27)
Finished Docking number 27
Finished Docking number 28

INFO:[09:25:26] End AutoDock...
INFO:[09:25:27] Start AutoGrid...

autogrid4: Successful Completion.
INFO:[09:26:19] End AutoGrid...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 27000
About to enter main loop...(dockings already completed: 29)
Finished Docking number 0
09:27:29 (7681): called boinc_finish

</stderr_txt>
I have highlighted in bold the possible areas of concern. After the first instance of no heartbeat the task does not appear to have hung; it resumed. After the second instance it appears to have hung for a considerable amount of time, but it resumed normally after I suspended and resumed it. The client was showing a run time of over 35 hours, but the result shows only 8+ hours of run time.

faah38913_ ZINC16472778_ xPR_ wC6_ 11_ 1ref9_ 00_ 1-- 715 Valid 4/4/13 03:43:07 4/4/13 18:12:29 5.67 123.3 / 121.5
faah38913_ ZINC16472778_ xPR_ wC6_ 11_ 1ref9_ 00_ 0-- 715 Valid 4/4/13 03:42:38 4/6/13 16:24:17 8.73 119.7 / 121.5 >mine
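
For anyone wanting to check their own result logs for the same pattern, here is a minimal sketch in Python that measures how long a task sat idle after each "No heartbeat" exit before the next "Start AutoDock" line. The function name `heartbeat_gaps` and the regular expressions are my own illustration, written against the log lines quoted above; they are not part of BOINC or the FAAH application. The stderr timestamps carry no date, so each gap is only known modulo 24 hours and should be read as a lower bound.

```python
import re
from datetime import timedelta

# Patterns written against the log lines quoted in this post (an assumption:
# other application versions may format these lines differently).
HEARTBEAT = re.compile(r"^(\d{2}):(\d{2}):(\d{2}) \(\d+\): No heartbeat from client")
RESTART = re.compile(r"^INFO:\[(\d{2}):(\d{2}):(\d{2})\] Start AutoDock")


def _seconds(match):
    """Convert an HH:MM:SS regex match into seconds since midnight."""
    hours, minutes, seconds = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def heartbeat_gaps(log_text):
    """Return one timedelta per heartbeat exit: time until the next restart.

    The log has no dates, so the gap is computed modulo 24 hours.
    """
    gaps = []
    pending_exit = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        exit_match = HEARTBEAT.match(line)
        if exit_match:
            pending_exit = _seconds(exit_match)
            continue
        restart_match = RESTART.match(line)
        if restart_match and pending_exit is not None:
            gap = (_seconds(restart_match) - pending_exit) % 86400
            gaps.append(timedelta(seconds=gap))
            pending_exit = None
    return gaps


sample = """\
02:49:51 (30662): No heartbeat from client for 30 sec - exiting
Restoring grahics. bestEnergy: -9.520640 maxGenSeen: 10328
AG Check: Found receptor.A.map
INFO:[02:49:59] Start AutoDock...
07:49:00 (31408): No heartbeat from client for 30 sec - exiting
Restoring grahics. bestEnergy: -9.891410 maxGenSeen: 10328
AG Check: Found receptor.A.map
INFO:[08:47:46] Start AutoDock...
"""

for gap in heartbeat_gaps(sample):
    print(gap)
# prints 0:00:08 and then 0:58:46
```

On the log above, the first restart came 8 seconds after the exit, while the second took at least 58 minutes 46 seconds, which matches the observation that only the second instance looked like a hang.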

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
[Edit 2 times, last edit by Sgt.Joe at Apr 6, 2013 4:40:19 PM]
[Apr 6, 2013 4:27:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Work Units Hanging

Hello Sgt.Joe,

"No heartbeat" is always cause for concern. BOINC is designed to avoid causing problems for the user. If the heartbeat signal cannot get through inter-process communications for 30 seconds, BOINC might go into modes designed to keep the project task from interfering with the user, even if that impacts BOINC throughput. Look at your software environment. What could block or hog inter-process communications?
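
To make the idea concrete, here is a rough sketch in Python of a heartbeat watchdog. It is an illustration of the general technique only, not BOINC's actual code: the class name `HeartbeatWatchdog` and its methods are hypothetical, and the only detail taken from the log is the 30 second timeout. The task records the time of the last heartbeat it received and treats the client as gone once that time is more than 30 seconds old.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds, the figure reported in the log message


class HeartbeatWatchdog:
    """Hypothetical watchdog: tracks when the last heartbeat was seen."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # injectable so the sketch can be tested
        self._last_beat = clock()

    def beat(self):
        """Called whenever a heartbeat arrives from the client."""
        self._last_beat = self._clock()

    def expired(self):
        """True once no heartbeat has arrived for longer than the timeout."""
        return self._clock() - self._last_beat > self._timeout


# Demonstration with a fake clock instead of waiting in real time.
now = [0.0]
watchdog = HeartbeatWatchdog(clock=lambda: now[0])

now[0] = 29.0
print(watchdog.expired())  # False: still inside the 30 second window

now[0] = 31.0
print(watchdog.expired())  # True: a real task would exit at this point

watchdog.beat()
print(watchdog.expired())  # False: a fresh heartbeat resets the timer
```

Anything that delays the heartbeat for longer than the timeout, whether the sender stalls or the message cannot get through, looks the same from the task's side, which is why the software environment is the first place to look.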

Lawrence
[Apr 6, 2013 7:58:09 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Recently Active
Re: Work Units Hanging

BOINC is the only thing running on these systems. I am sure the OS is running background tasks, but that should be it. I do not even have any antivirus running.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Apr 6, 2013 8:15:39 PM]