World Community Grid Forums
Thread Status: Active Total posts in this thread: 3
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Recently Active Project Badges:
|
In the last two weeks or so I have experienced a few instances of FAAH WUs hanging. They are on machines I do not check very often. If I suspend the units and then resume them, they appear to finish normally. It has happened on three separate machines: two Core 2 Duos and a Dell 2950 with two Xeon 5320s. They are all running Linux Mint 11 32-bit. This has only occurred with FAAH, not with any other project. There is nothing in the result file which looks out of the ordinary. I noticed because production on the machines went down, and I would find one WU that had been running for 35+ hours when they normally finish in 6-7 hours. If it keeps recurring sporadically I will update the OS to Linux Mint 14 64-bit. Just posting to see if anyone else has encountered a similar problem and/or a solution.
----------------------------------------
Edit: Result Log

Result Name: faah38913_ ZINC16472778_ xPR_ wC6_ 11_ 1ref9_ 00_ 0--

<core_client_version>6.12.33</core_client_version>
<![CDATA[
<stderr_txt>
INFO:[23:03:09] Start AutoGrid...
autogrid4: Successful Completion.
INFO:[23:07:01] End AutoGrid...
INFO:[23:07:01] Start AutoDock...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 0)
_maxGenSeenSoFar changed: 2500
_maxGenSeenSoFar changed: 2626
_maxGenSeenSoFar changed: 2758
_maxGenSeenSoFar changed: 2896
_maxGenSeenSoFar changed: 3041
_maxGenSeenSoFar changed: 3194
_maxGenSeenSoFar changed: 3354
_maxGenSeenSoFar changed: 3522
_maxGenSeenSoFar changed: 3699
_maxGenSeenSoFar changed: 3885
_maxGenSeenSoFar changed: 4080
_maxGenSeenSoFar changed: 4285
_maxGenSeenSoFar changed: 4500
_maxGenSeenSoFar changed: 4726
_maxGenSeenSoFar changed: 4963
_maxGenSeenSoFar changed: 5212
_maxGenSeenSoFar changed: 5473
_maxGenSeenSoFar changed: 5747
_maxGenSeenSoFar changed: 6035
_maxGenSeenSoFar changed: 6337
_maxGenSeenSoFar changed: 6654
_maxGenSeenSoFar changed: 6987
_maxGenSeenSoFar changed: 7337
_maxGenSeenSoFar changed: 7704
_maxGenSeenSoFar changed: 8090
_maxGenSeenSoFar changed: 8495
_maxGenSeenSoFar changed: 8920
_maxGenSeenSoFar changed: 9367
_maxGenSeenSoFar changed: 9836
_maxGenSeenSoFar changed: 10328
Updating Best Energy for WU: 0.00
Finished Docking number 0
Finished Docking number 1
Finished Docking number 2
Finished Docking number 3
Finished Docking number 4
Finished Docking number 5
Finished Docking number 6
Finished Docking number 7
Finished Docking number 8
Finished Docking number 9
Finished Docking number 10
02:49:51 (30662): No heartbeat from client for 30 sec - exiting
Restoring grahics.
bestEnergy: -9.520640
maxGenSeen: 10328
AG Check: Found receptor.A.map
INFO:[02:49:59] Start AutoDock...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 11)
Finished Docking number 11
Finished Docking number 12
Finished Docking number 13
Finished Docking number 14
Finished Docking number 15
Finished Docking number 16
Finished Docking number 17
Finished Docking number 18
Finished Docking number 19
Finished Docking number 20
Finished Docking number 21
Finished Docking number 22
Finished Docking number 23
Finished Docking number 24
Finished Docking number 25
Updating Best Energy for WU: -9.52
Finished Docking number 26
07:49:00 (31408): No heartbeat from client for 30 sec - exiting
Restoring grahics.
bestEnergy: -9.891410
maxGenSeen: 10328
AG Check: Found receptor.A.map
INFO:[08:47:46] Start AutoDock...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 10000
About to enter main loop...(dockings already completed: 27)
Finished Docking number 27
Finished Docking number 28
INFO:[09:25:26] End AutoDock...
INFO:[09:25:27] Start AutoGrid...
autogrid4: Successful Completion.
INFO:[09:26:19] End AutoGrid...
INFO: In AutoDock main_autodock()
Beginning AutoDock...
INFO: Setting num_generations: 27000
About to enter main loop...(dockings already completed: 29)
Finished Docking number 0
09:27:29 (7681): called boinc_finish
</stderr_txt>

I have highlighted in bold the possible areas of concern (the two "No heartbeat from client" exits). On the first instance of no heartbeat it does not appear to have hung; it resumed within seconds. On the second instance it appears to have hung for a considerable amount of time, but it resumed normally after the suspend/resume action I took. The WU was showing a run time of over 35 hours, but the result only shows 8+ hours of run time.

faah38913_ ZINC16472778_ xPR_ wC6_ 11_ 1ref9_ 00_ 1-- 715 Valid 4/4/13 03:43:07 4/4/13 18:12:29 5.67 123.3 / 121.5
faah38913_ ZINC16472778_ xPR_ wC6_ 11_ 1ref9_ 00_ 0-- 715 Valid 4/4/13 03:42:38 4/6/13 16:24:17 8.73 119.7 / 121.5 (mine)

Cheers
Sgt. Joe
----------------------------------------
*Minnesota Crunchers*
[Edit 2 times, last edit by Sgt.Joe at Apr 6, 2013 4:40:19 PM]
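For anyone wanting to check their own result logs for the same pattern, here is a rough Python sketch that pairs each "No heartbeat" exit with the next "Start AutoDock" line and reports the gap between them. The function name `restart_gaps` and the `SAMPLE` excerpt are made up for illustration; the excerpt contains only the four relevant lines from the log above. The timestamps carry no date, so the script assumes each gap is under 24 hours, and a stall that actually lasted a day or more would be under-reported.

```python
import re
from datetime import datetime, timedelta

# Patterns match the line formats seen in the result log above.
EXIT_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) \(\d+\): No heartbeat from client")
START_RE = re.compile(r"INFO:\[(\d{2}:\d{2}:\d{2})\] Start AutoDock")


def restart_gaps(stderr_text):
    """Return (exit_time, restart_time, gap) for each 'No heartbeat' exit."""
    events = [(m.start(), "exit", m.group(1)) for m in EXIT_RE.finditer(stderr_text)]
    events += [(m.start(), "start", m.group(1)) for m in START_RE.finditer(stderr_text)]
    events.sort()  # order events by their position in the log

    gaps = []
    pending_exit = None
    for _, kind, stamp in events:
        if kind == "exit":
            pending_exit = stamp
        elif pending_exit is not None:
            t0 = datetime.strptime(pending_exit, "%H:%M:%S")
            t1 = datetime.strptime(stamp, "%H:%M:%S")
            if t1 < t0:
                # Assumption: the restart happened on the following day.
                t1 += timedelta(days=1)
            gaps.append((pending_exit, stamp, t1 - t0))
            pending_exit = None
    return gaps


# Excerpt of the log above, reduced to the exit and restart lines.
SAMPLE = """\
02:49:51 (30662): No heartbeat from client for 30 sec - exiting
INFO:[02:49:59] Start AutoDock...
07:49:00 (31408): No heartbeat from client for 30 sec - exiting
INFO:[08:47:46] Start AutoDock...
"""

for exit_time, restart_time, gap in restart_gaps(SAMPLE):
    print(exit_time, "->", restart_time, "gap", gap)
```

On this excerpt the first exit restarts after 8 seconds and the second after at least 58 minutes 46 seconds, which matches the impression that only the second exit led to a real stall.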
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello Sgt.Joe,
"No heartbeat" is always cause for concern. BOINC is designed to avoid causing problems for the user. If the heartbeat signal cannot get through inter-process communications for 30 seconds, BOINC might go into modes designed to keep the project task from interfering with the user, even if that impacts BOINC throughput. Look at your software environment. What could block or hog inter-process communications?
Lawrence
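To illustrate the general idea of a heartbeat timeout, here is a minimal Python sketch. It is not BOINC's actual code: the class `HeartbeatWatchdog` and its methods are hypothetical, and the only detail taken from this thread is the 30-second limit seen in the log message. The worker records when it last heard from the client and treats silence longer than the limit as a reason to exit.

```python
import time


class HeartbeatWatchdog:
    """Tracks the time of the last heartbeat and reports a timeout.

    Hypothetical illustration only; not the BOINC implementation.
    """

    def __init__(self, timeout_sec=30.0, clock=time.monotonic):
        self.timeout_sec = timeout_sec
        self.clock = clock            # injectable so the demo can fake time
        self.last_beat = clock()

    def beat(self):
        """Call whenever a heartbeat message arrives from the client."""
        self.last_beat = self.clock()

    def timed_out(self):
        """True if no heartbeat has arrived within the timeout."""
        return self.clock() - self.last_beat > self.timeout_sec


# Demo with a fake clock so nothing actually sleeps.
now = [0.0]
watchdog = HeartbeatWatchdog(timeout_sec=30.0, clock=lambda: now[0])

now[0] = 10.0
watchdog.beat()                # heartbeat received at t = 10 s
now[0] = 35.0
print(watchdog.timed_out())    # 25 s of silence: still within the limit
now[0] = 41.0
print(watchdog.timed_out())    # 31 s of silence: the worker would exit here
```

The point of the sketch is that anything that delays the heartbeat messages for longer than the limit looks, from the worker's side, the same as the client having gone away.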
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Recently Active Project Badges:
|
Quote:
Hello Sgt.Joe, "No heartbeat" is always cause for concern. BOINC is designed to avoid causing problems for the user. If the heartbeat signal cannot get through inter-process communications for 30 seconds, BOINC might go into modes designed to keep the project task from interfering with the user, even if that impacts BOINC throughput. Look at your software environment. What could block or hog inter-process communications? Lawrence

BOINC is the only thing running on these systems. I am sure there are background tasks the OS is running, but that should be it. I do not even have any antivirus running.
Cheers
Sgt. Joe
*Minnesota Crunchers* |