World Community Grid Forums
Category: Completed Research Forum: Help Defeat Cancer Thread: Units restarting |
Thread Status: Active Total posts in this thread: 22
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 721 Status: Offline Project Badges: |
I nearly posted this in my last thread but decided to start a new one in case these are unrelated.
I've had a number of units now, on different machines, exiting on a zero status without finishing, only to restart from scratch. Yesterday, after going through a fresh install, resetting projects and installing a separate HDD for more swap file area, my first unit in restarted twice before running through. That one's a P4 1.7 with 768Mb and a 2.4Gb dedicated swap drive. No applications running besides Ubuntu Linux 5.1 (fully updated) and BOINC 5.4.9.
I'm also getting them on my P4 2.4GHz with 1024Mb and a 541Mb swap partition, Ubuntu Linux 6.04 with daily mixed use and BOINC 5.4.9. An example from this machine:
- Task B01276_0170_CTMA1Aa-16-5-1_1 exited with zero status but no "finished" file.
then a few messages later, but with the same time stamp (so within a second):
- Restarting Task B01276_0170_CTMA1Aa-16-5-1_1 using hdc version 505
I've had a few of these do this several times before running to completion. This one is in my log restarting every sixty seconds from 4:14 this morning (as far back as the log goes) till 5:36am, when apparently my ISP came back online. I'm normally online constantly (DSL) but apparently was off for a while. Could this be related?
----------------------------------------
Currently being moderated under false pretences |
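To see how often each task is affected, the log can be scanned for the exit message quoted above. The following is only a rough sketch: the function name `count_unfinished_exits` is made up for illustration, and it assumes nothing about the log layout beyond the message text itself (timestamps or project prefixes around it are ignored).

```python
import re
from collections import Counter

# Matches the message text quoted in the post. Either quote style around
# "finished" is accepted, since the exact log formatting is an assumption.
EXIT_PATTERN = re.compile(
    r"Task (\S+) exited with zero status but no [\"']finished[\"'] file"
)


def count_unfinished_exits(log_lines):
    """Return a Counter mapping task name -> number of zero-status exits."""
    counts = Counter()
    for line in log_lines:
        match = EXIT_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines built from the messages quoted in the post.
    sample = [
        'Task B01276_0170_CTMA1Aa-16-5-1_1 exited with zero status but no "finished" file.',
        "Restarting Task B01276_0170_CTMA1Aa-16-5-1_1 using hdc version 505",
        'Task B01276_0170_CTMA1Aa-16-5-1_1 exited with zero status but no "finished" file.',
    ]
    for task, n in count_unfinished_exits(sample).most_common():
        print(task, n)
```

Feeding it the lines of the real message log (for example via `open(path)`) would show whether the restarts cluster on a few tasks or are spread across all of them.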
||
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 721 Status: Offline Project Badges: |
I've got another unit doing the same thing, but on a 30-minute cycle, on another machine: B04513_0299_CTMA3C2-6-7-4c1_0
This one is still doing it even though the network is fine now. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
When one person has so many errors, the easy answer is: your computer's fried.
This sort of computing will test the limits of your hardware. It's not enough that your memory and CPU perform well; they have to perform perfectly, constantly. Most computers can manage this, but if your computer has a minor problem, WCG throws it into sharp relief. |
||
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 721 Status: Offline Project Badges: |
Actually, you're telling me I have 2 fried computers, not one. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Could be, could be.
What does stderr say? |
||
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 721 Status: Offline Project Badges: |
The one in "slot 0" says:
"No heartbeat from core client for 31 sec - exiting"
The one in the BOINC main directory is full of:
"Another instance of BOINC is already running" |
||
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 721 Status: Offline Project Badges: |
The ones on my other machine have nothing since the 1st of this month, or nothing at all. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Could it be that you have misconfigured BOINC? That, or you have misconfigured your computers. If you always get the no heartbeat error 30 seconds into a work unit, then your inter-process communication is failing. BOINC uses RPC on port 31416, IIRC.
There's just more stuff that can go wrong on Linux. If you want further help, we're going to need complete logs. Seriously, though: if setting up Linux is beyond you, use a preconfigured distro. If you believe you know exactly what you are doing, then try working it out with the BOINC folk. You may have found a bug; who knows? |
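A quick way to test the port side of this is to try a TCP connection to it. This is a minimal sketch only: the helper name `port_is_open` is hypothetical, and 31416 is simply the port number recalled in the post above. A successful connection shows that something is listening there; it does not by itself confirm or rule out the heartbeat problem.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 31416 is the port mentioned in the post; adjust if your setup differs.
    print(port_is_open("127.0.0.1", 31416))
```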
||
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 721 Status: Offline Project Badges: |
I'll look into the configuration, but it hasn't been happening to every unit, and the restarting has only just started in the last 36-48 hrs as far as I can tell.
The distro I use is preconfigured, btw. I don't compile my own kernel. I did recently update my kernel (automatic update from the official repositories), so I'll boot back into my previous version and see what happens. |
||
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 721 Status: Offline Project Badges: |
OK, I tried a different kernel (precompiled, I'm not rolling my own) and freed up some RAM, and it's still the same, so I'm pulling this machine off hdc work.
Thanks to all who tried to help. |
||
|
|