Thread Status: Active
Total posts in this thread: 22
This topic has been viewed 3423 times and has 21 replies
Dark Angel
Veteran Cruncher
Australia
Joined: Nov 11, 2005
Post Count: 721
Status: Offline
Units restarting

I nearly posted this in my last thread but decided to start a new one in case these are unrelated.

I've now had a number of units on different machines exit with zero status without finishing, only to restart from scratch.

Yesterday, after going through a fresh install, resetting projects and installing a separate HDD for more swap space, the first unit I received restarted twice before running through.
That one's a P4 1.7 GHz with 768 MB RAM and a 2.4 GB dedicated swap drive. No applications running besides Ubuntu Linux 5.1 (fully updated) and BOINC 5.4.9.

I'm also getting them on my P4 2.4 GHz with 1024 MB RAM and a 541 MB swap partition, Ubuntu Linux 6.04 with daily mixed use and BOINC 5.4.9.

An example from this machine:

- Task B01276_0170_CTMA1Aa-16-5-1_1 exited with zero status but no "finished" file.

Then, a few messages later but with the same timestamp (so within a second):

- Restarting Task B01276_0170_CTMA1Aa-16-5-1_1 using hdc version 505

I've had a few of these do this several times before running to completion.
This one shows in my log restarting every sixty seconds from 4:14 this morning (as far back as the log goes) until 5:36 am, when my ISP apparently came back online. I'm normally online constantly (DSL) but was apparently off for a while. Could this be related?
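
For anyone wanting to tally how often each unit does this, here is a rough Python sketch. It assumes the messages have been saved as plain text and that they look like the two lines quoted above; the patterns are taken from those two lines only, and `count_restarts` is just a name made up for this example, not part of BOINC.

```python
import re
from collections import Counter

# Patterns copied from the two messages quoted in this post (an assumption:
# other BOINC versions may word these lines differently).
EXIT_RE = re.compile(r'Task (\S+) exited with zero status but no "finished" file')
RESTART_RE = re.compile(r"Restarting Task (\S+) using (\S+) version (\d+)")


def count_restarts(lines):
    """Return (premature exits per task, restarts per task) as two Counters."""
    exits, restarts = Counter(), Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            exits[match.group(1)] += 1
            continue
        match = RESTART_RE.search(line)
        if match:
            restarts[match.group(1)] += 1
    return exits, restarts


# The two lines from this post, used as sample input.
sample = [
    '- Task B01276_0170_CTMA1Aa-16-5-1_1 exited with zero status but no "finished" file.',
    "- Restarting Task B01276_0170_CTMA1Aa-16-5-1_1 using hdc version 505",
]
exits, restarts = count_restarts(sample)
for task, n in exits.items():
    print(f"{task}: {n} premature exit(s), {restarts[task]} restart(s)")
```

Feeding it the lines of a saved message log instead of `sample` would show whether the restarts cluster on a few units or are spread across all of them.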
----------------------------------------

Currently being moderated under false pretences
[Oct 4, 2006 10:24:47 PM]
Dark Angel
Re: Units restarting

I've got another unit doing the same thing but on a 30 minute cycle on another machine.

B04513_0299_CTMA3C2-6-7-4c1_0

This one is still doing it even though the network is fine now.
[Oct 4, 2006 10:29:03 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Units restarting

When one person has so many errors, the easy answer is: your computer's fried.

This sort of computing will test the limits of your hardware. It's not enough that your memory and CPU perform well; they have to perform perfectly, constantly. Most computers can manage this, but if your computer has a minor problem, WCG throws it into sharp relief.
[Oct 4, 2006 10:34:52 PM]
Dark Angel
Re: Units restarting

Actually, you're telling me I have 2 fried computers, not one.
[Oct 4, 2006 11:05:23 PM]
Former Member
Re: Units restarting

Could be, could be.

What does stderr say?
[Oct 4, 2006 11:32:46 PM]
Dark Angel
Re: Units restarting

The one in "slot 0" says:

"No heartbeat from core client for 31 sec - exiting"

The one in the BOINC main directory is full of:

"Another instance of BOINC is already running"
[Oct 4, 2006 11:49:01 PM]
Dark Angel
Re: Units restarting

The ones on my other machine have nothing since the 1st of this month, or nothing at all.
[Oct 4, 2006 11:52:07 PM]
Former Member
Re: Units restarting

Could it be that you have misconfigured BOINC? That, or you have misconfigured your computers. If you always get the no heartbeat error 30 seconds into a work unit, then your inter-process communication is failing. BOINC uses RPC on port 31416, IIRC.
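
A quick way to see whether anything is listening on that port is a plain TCP connection attempt. This is only a sketch: `port_is_open` is a made-up helper, the port number is the one recalled above, and a successful connection shows only that something accepts connections there. It says nothing about whether the heartbeat itself is getting through.

```python
import socket


def port_is_open(host="127.0.0.1", port=31416, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "open" if port_is_open() else "closed or unreachable"
    print(f"127.0.0.1:31416 is {state}")
```

Run it on the machine in question while BOINC is running, then again with BOINC stopped, and compare the two results.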

There's just more stuff that can go wrong on Linux. If you want further help, we're going to need complete logs. Seriously, though: if setting up Linux is beyond you, use a preconfigured distro. If you believe you know exactly what you are doing, then try working it out with the BOINC folk. You may have found a bug; who knows?
[Oct 5, 2006 12:07:13 AM]
Dark Angel
Re: Units restarting

I'll look into the configuration, but it hasn't been happening to every unit, and the restarting has only started in the last 36-48 hours as far as I can tell.
The distro I use is preconfigured btw. I don't compile my own kernel.

I did recently update my kernel (automatic update from the official repositories) so I'll boot back into my previous version and see what happens.
[Oct 5, 2006 12:15:46 AM]
Dark Angel
Re: Units restarting

OK, I tried a different kernel (precompiled; I'm not rolling my own) and freed up some RAM, and it's still the same, so I'm pulling this machine off hdc work.

Thanks to all who tried to help.
[Oct 5, 2006 3:16:24 AM]