| World Community Grid Forums
Thread Status: Active Total posts in this thread: 5
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Is this considered normal behavior and should I let it continue?
[03:22:58] Finished Job #13
[03:22:58] Starting job 14,CPU time has been restored to 22654.640625.
[04:55:30] Finished Job #14
[04:55:30] Starting job 15,CPU time has been restored to 28115.718750.
06:46:42 (3288): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[00:06:00] Number of jobs = 16
[00:06:00] Starting job 15,CPU time has been restored to 28115.718750.
00:08:30 (3436): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[00:08:47] Number of jobs = 16
[00:08:47] Starting job 15,CPU time has been restored to 28115.718750.
01:39:47 (2336): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:39:50] Number of jobs = 16
[01:39:50] Starting job 15,CPU time has been restored to 28115.718750.
02:40:36 (1524): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[02:40:38] Number of jobs = 16
[02:40:38] Starting job 15,CPU time has been restored to 28115.718750.
04:11:47 (2340): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[04:11:56] Number of jobs = 16
[04:11:56] Starting job 15,CPU time has been restored to 28115.718750.
05:12:38 (2576): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[05:12:39] Number of jobs = 16
[05:12:39] Starting job 15,CPU time has been restored to 28115.718750.
06:43:48 (3048): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[06:43:50] Number of jobs = 16
[06:43:50] Starting job 15,CPU time has been restored to 28115.718750.
07:44:37 (164): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[07:45:03] Number of jobs = 16
[07:45:03] Starting job 15,CPU time has been restored to 28115.718750.
08:45:26 (2380): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[08:45:37] Number of jobs = 16
[08:45:37] Starting job 15,CPU time has been restored to 28115.718750.
09:46:15 (4048): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[09:46:19] Number of jobs = 16
[09:46:19] Starting job 15,CPU time has been restored to 28115.718750.
10:47:02 (1352): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[10:47:04] Number of jobs = 16
[10:47:04] Starting job 15,CPU time has been restored to 28115.718750.
12:18:15 (4004): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:18:41] Number of jobs = 16
[12:18:41] Starting job 15,CPU time has been restored to 28115.718750.
12:48:41 (3688): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:49:04] Number of jobs = 16
[12:49:04] Starting job 15,CPU time has been restored to 28115.718750.

Windows XP SP3
BOINC Client 6.10.58 32bit
Quad core
2GB memory |
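For anyone who wants to check a result log for this pattern without reading it line by line, here is a rough sketch in Python. It assumes the log lines look exactly like the ones pasted above; the names `summarize`, `stuck_jobs`, and the `min_restarts` threshold are made up for illustration and are not part of BOINC. A job is flagged when it has been started several times and the restored CPU time never changes, which suggests no progress is being checkpointed between restarts.

```python
import re
from collections import Counter

# Patterns assume the exact wording seen in the log pasted in this thread.
START_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] Starting job (\d+),"
    r"CPU time has been restored to ([\d.]+)\."
)
EXIT_RE = re.compile(r"No heartbeat from core client for (\d+) sec - exiting")

# Short excerpt copied from the log above.
SAMPLE = """\
[03:22:58] Starting job 14,CPU time has been restored to 22654.640625.
[04:55:30] Starting job 15,CPU time has been restored to 28115.718750.
06:46:42 (3288): No heartbeat from core client for 30 sec - exiting
[00:06:00] Starting job 15,CPU time has been restored to 28115.718750.
00:08:30 (3436): No heartbeat from core client for 30 sec - exiting
[00:08:47] Starting job 15,CPU time has been restored to 28115.718750.
"""


def summarize(log):
    """Return (starts per job, restored CPU times per job, heartbeat exits)."""
    starts = Counter()
    restored = {}
    for _timestamp, job, cpu in START_RE.findall(log):
        job_id = int(job)
        starts[job_id] += 1
        restored.setdefault(job_id, set()).add(cpu)
    exits = len(EXIT_RE.findall(log))
    return starts, restored, exits


def stuck_jobs(log, min_restarts=3):
    """Jobs started at least min_restarts times with one unchanged CPU time."""
    starts, restored, _exits = summarize(log)
    return [
        job
        for job, count in sorted(starts.items())
        if count >= min_restarts and len(restored[job]) == 1
    ]


if __name__ == "__main__":
    print("heartbeat exits:", summarize(SAMPLE)[2])
    print("jobs that look stuck:", stuck_jobs(SAMPLE))
```

The threshold of three restarts is arbitrary; a single restart with the same CPU time is normal after a reboot, so a higher value only avoids false alarms.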
||
|
|
deltavee
Ace Cruncher Texas Hill Country Joined: Nov 17, 2004 Post Count: 4894 Status: Offline Project Badges:
|
It looks pretty hopeless.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
That sure looks like an endless loop to me. I would reboot. If the loop continued, I would then Abort the Task.
Lawrence |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I happened to catch it in the act once: the disk light comes on and stays on for several minutes. Task Manager shows that the only process doing I/O is BOINC.EXE. Evidently, during this heavy I/O period it doesn't talk to the science processes, and all four reset back to the beginning of the job they were working on. Task Manager shows ~26,000,000,000 I/O bytes and ~1.5 million I/O Reads for BOINC.EXE, making it the single largest process when sorted by I/O Reads.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Looks like a known issue, but with no resolution. Berkeley has ticket 336 open:

Problem with the heartbeat mechanism: if the client does something that blocks for > 30 secs (e.g. a synchronous DNS lookup, a disk-space scan, a debugger break) then all apps quit, producing confusing messages and possibly wasting CPU time.

Proposed solution: remove heartbeat mechanism. Include client process ID in the app_init_data file. The API periodically sees if that process is still alive, and exits if not.

Between not being able to upload results and not being able to run the science, this isn't a very productive project for me. I think I'm going to take my cue from the science and EXIT... |
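To make the difference between the two approaches in the ticket concrete, here is a minimal sketch in Python. It is not BOINC code: `heartbeat_expired` and `client_is_alive` are hypothetical names, the 30-second default simply mirrors the log message, and the liveness check uses the POSIX convention of sending signal 0. Do not run `client_is_alive` on Windows, where `os.kill` with signal 0 terminates the target process instead of probing it.

```python
import os


def heartbeat_expired(last_heartbeat, now, timeout=30.0):
    """Current behavior as described in the ticket: quit when the client
    has been silent for longer than the timeout, even if it is only busy."""
    return now - last_heartbeat > timeout


def client_is_alive(pid):
    """Proposed behavior: ask the OS whether the client process still exists.

    POSIX only. Signal 0 performs the existence and permission checks
    without delivering a signal.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive process ID")
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the client really is gone
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


if __name__ == "__main__":
    # A client blocked on disk I/O for 45 seconds is silent but still running.
    print("heartbeat says exit:", heartbeat_expired(0.0, 45.0))
    print("pid check says alive:", client_is_alive(os.getpid()))
```

Under the heartbeat rule, the busy client in this example causes the science apps to quit; under the process-ID rule they would keep running, which is the point of the proposed change.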
||
|
|
|