World Community Grid Forums
Thread Status: Active Total posts in this thread: 5
cqexbesd
Cruncher Joined: Oct 13, 2008 Post Count: 14 Status: Offline Project Badges:
|
I have BOINC, attached to WCG and other projects such as Rosetta and SIMAP, running on 2 FreeBSD 7 boxes (both with Linux emulation enabled). Sometimes I find that a WCG task has "fallen asleep" - that is, it is not consuming any CPU time - even though the boinc_gui claims it should be running. If I suspend and resume the task via the GUI it comes back to life. Both machines have 2 processors and I have only ever seen it on one processor at a time. I haven't found a definite pattern but it does seem to occur after something else has used a lot of processor time (e.g. a full CPU's worth - the longer, the more likely the problem is to occur). I only see this with WCG tasks. The current task is hcc. I don't recall seeing it with any other tasks but I can't be 100% sure there.
If I use ps I see the process is marked as sleeping. truss shows nothing (unsurprisingly). Interestingly the process seems to have a zombie...

USER PID PPID %CPU %MEM RSS TT STAT STARTED TIME COMMAND
boinc 8494 767 0.0 5.6 28924 ?? IN Wed11AM 812:35.32 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000053540352200507282103.jp2
boinc 8495 8494 0.0 0.0 0 ?? ZN Wed11AM 0:46.55 <defunct>

Obviously this leaves some CPU time idle that might otherwise be going to a good cause. I searched the forums to see if this had come up before but with no success. Any clues?
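For anyone who wants to spot this state without watching the GUI, here is a minimal sketch in Python. It assumes ps output with exactly the columns shown above and treats a process as stalled when its STAT starts with I or S (idle or sleeping on FreeBSD) and it has a child whose STAT starts with Z (zombie). The function name and the sample constant are made up for illustration; this is not part of BOINC.

```python
# Sample text copied from the ps listing in the post above.
SAMPLE_PS = """\
USER PID PPID %CPU %MEM RSS TT STAT STARTED TIME COMMAND
boinc 8494 767 0.0 5.6 28924 ?? IN Wed11AM 812:35.32 wcg_hcc1_img_6.03_i686-pc-linux-gnu X0000053540352200507282103.jp2
boinc 8495 8494 0.0 0.0 0 ?? ZN Wed11AM 0:46.55 <defunct>
"""


def find_stalled_parents(ps_text):
    """Return PIDs of idle/sleeping processes that have a zombie child.

    Assumes 11 whitespace-separated columns in this order:
    USER PID PPID %CPU %MEM RSS TT STAT STARTED TIME COMMAND
    """
    rows = []
    for line in ps_text.strip().splitlines():
        # Split into at most 11 fields so COMMAND keeps its embedded spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue  # skip the header and anything malformed
        rows.append({
            "pid": int(fields[1]),
            "ppid": int(fields[2]),
            "stat": fields[7],
        })

    # Parent PIDs of every zombie ("Z...") process in the listing.
    zombie_parents = {r["ppid"] for r in rows if r["stat"].startswith("Z")}

    # Keep only parents that are themselves idle ("I") or sleeping ("S").
    return [
        r["pid"]
        for r in rows
        if r["pid"] in zombie_parents and r["stat"][0] in "IS"
    ]


if __name__ == "__main__":
    print(find_stalled_parents(SAMPLE_PS))
```

Fed the listing above, this flags PID 8494. A sleeping parent with a zombie child is only a hint, not proof of a hang, so the output is something to check by hand before suspending and resuming a task.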
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The Mac version of BOINC had a problem with zombie processes. I'm not sure whether that was resolved, but at the time it was thought to be Mac-specific, not affecting other platforms.
My only conjecture is that something is interfering with the "heartbeat" mechanism, which is how BOINC tracks the health and status of its child processes. Please will you check the error log for one of these processes after you have restarted it? You will find it in stderr.txt in one of the slot directories. Post the contents here. Thank you. |
||
|
|
cqexbesd
Cruncher Joined: Oct 13, 2008 Post Count: 14 Status: Offline Project Badges:
|
"Please will you check the error log for one of these processes after you have restarted it? You will find it in stderr.txt in one of the slot directories. Post the contents here."

Yep... I have to wait for it to happen again as the process quoted above has finished now. I would expect it to happen again today or on Monday.

Thanks for your quick response!

Andrew
||
|
|
cqexbesd
Cruncher Joined: Oct 13, 2008 Post Count: 14 Status: Offline Project Badges:
|
"You will find it in stderr.txt in one of the slot directories. Post the contents here."

Well, I cheated and set off a big compile, and the freeze happened in about 10 minutes. stderr.txt looks like:

Unrecognized XML in parse_init_data_file: computation_deadline
Skipping: 1230080915.000000
Skipping: /computation_deadline
In ExtractGlcmFeatures: End of 0 iteration of outer loop.
In ExtractGlcmFeatures: End of 1 iteration of outer loop.
In ExtractGlcmFeatures: End of 2 iteration of outer loop.
In ExtractGlcmFeatures: End of 3 iteration of outer loop.
In ExtractGlcmFeatures: End of 4 iteration of outer loop.
In ExtractGlcmFeatures: End of 5 iteration of outer loop.
In ExtractGlcmFeatures: End of 6 iteration of outer loop.
In ExtractGlcmFeatures: End of 7 iteration of outer loop.
In ExtractGlcmFeatures: End of 8 iteration of outer loop.
In ExtractGlcmFeatures: End of 9 iteration of outer loop.
SIGILL: illegal instruction
Stack trace (10 frames):
[0x81d218b] [0x8238f64] [0xbfbfffbb] [0x8054b42] [0x805c411] [0x806d0fc] [0x807d741] [0x807dae8] [0x823b1da] [0x8048131]
Exiting...
Unrecognized XML in parse_init_data_file: computation_deadline
Skipping: 1230080915.000000
Skipping: /computation_deadline
--

This is an hcc task as well. I wonder if it's an imperfect emulation of a Linux system call that only gets called when available CPU is low?

Thanks,
Andrew
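To check every slot at once rather than opening each stderr.txt by hand, something like the following Python sketch could help. It assumes the slot logs sit at slots/<n>/stderr.txt under the BOINC data directory and that a crash is reported on its own line in the form seen above ("SIGILL: illegal instruction"). The function names are made up for illustration, and the data directory path is something you have to supply for your own install.

```python
import re
from pathlib import Path

# Matches lines such as "SIGILL: illegal instruction".
# The format is assumed from the single log posted above.
SIGNAL_LINE = re.compile(r"^(SIG[A-Z0-9]+): (.*)$")


def signals_in_log(text):
    """Return (signal, description) pairs found in one stderr.txt."""
    found = []
    for line in text.splitlines():
        match = SIGNAL_LINE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def scan_slots(data_dir):
    """Map each slots/<n>/stderr.txt under data_dir to the signals it reports.

    Logs without any signal line are left out of the result.
    """
    results = {}
    for log in sorted(Path(data_dir).glob("slots/*/stderr.txt")):
        hits = signals_in_log(log.read_text(errors="replace"))
        if hits:
            results[str(log)] = hits
    return results
```

Calling scan_slots with the path to your BOINC data directory returns a dictionary keyed by log path, so a non-empty result shows which slot held the task that crashed.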
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
You could very well be right.
The remaining question is why BOINC didn't handle the crash correctly. Judging from the error log, I don't think you have the latest BOINC version. Upgrading may not help, but it's the only option I can come up with. |
||
|
|
|