World Community Grid - View Thread

World Community Grid Forums

Category: Completed Research

Forum: Discovering Dengue Drugs - Together

Thread: Frozen work units?


Thread Status: Active
Total posts in this thread: 53

This topic has been viewed 9343 times and has 52 replies

Lone-Wolf
Cruncher
Joined: Apr 10, 2007
Post Count: 33
Status: Offline


Re: Frozen work units?

Unfortunately, I'm starting to suspect the motherboard is the cause: I recently replaced the chipset fan because it was howling and running erratically.

Temps are great now but likely the damage is done. :(

----------------------------------------

[Nov 10, 2007 6:09:00 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Frozen work units?

Do you see both processes in Task Manager?

Normally, if a process dies prematurely or stops responding, then BOINC will detect this (it uses a heartbeat mechanism) and kill/restart the process, as well as logging the problem.
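To illustrate the general idea of a heartbeat check mentioned above, here is a minimal Python sketch of a watchdog that flags tasks which have gone silent. It is a generic illustration only and is not BOINC's actual implementation; the class name `HeartbeatMonitor`, its methods, and the 30-second timeout are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Generic heartbeat watchdog (illustrative only, not BOINC code)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        # timeout_seconds: how long a task may stay silent before it is flagged.
        # clock: injectable time source, so the logic can be tested without waiting.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def beat(self, task_name):
        """Record that task_name has just reported in."""
        self.last_seen[task_name] = self.clock()

    def stale_tasks(self):
        """Return the sorted names of tasks whose last heartbeat is too old."""
        now = self.clock()
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=30)  # hypothetical timeout
    monitor.beat("dddt")
    monitor.beat("hcc")
    print(monitor.stale_tasks())
```

A supervisor built on this idea would poll `stale_tasks()` periodically and restart or log whatever it returns. A process that is still alive but has stopped making progress, as described in this thread, may slip past such a check if it keeps answering.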

Just stopping is usually symptomatic of a problem with the science application or work unit. I wasn't aware of any such issue with AutoDock, and it is also unusual that only you are affected (so far).

A hardware problem is perfectly possible. Please keep us updated, and we will review this again if further reports come in.

[Nov 10, 2007 6:32:57 PM]

Lone-Wolf
Cruncher
Joined: Apr 10, 2007
Post Count: 33
Status: Offline


Re: Frozen work units?

Yes, in Task Manager I can see both processes. When operating properly, each displays about 48-50% CPU usage; currently one shows zero and the other is at 49%.

I see no visible clues as to the problem other than the counter stopping and task manager showing an idle core.

----------------------------------------

[Nov 10, 2007 8:11:47 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Frozen work units?

Hello Lone-Wolf,
The Useful Utilities thread in Start Here at http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=2490 offers many useful system diagnostic programs such as CPU-Z, Everest and HotCPU.

Lawrence

[Nov 10, 2007 9:01:34 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Frozen work units?

On my dual-core laptop running Linux, I saw a WU today that kept stopping. I had both a DDDT and an HCC WU crunching. The DDDT WU suddenly stopped incrementing its CPU time. I checked CPU usage using 'top', and indeed only the HCC WU was still running. I restarted BOINC and the DDDT WU resumed along with the HCC WU. There is something fishy going on. There weren't any messages in the message tab; it just seemed to stop completely, as if it had crashed. The DDDT app wasn't even present as far as I could tell. I'll be on the lookout for this to see if I can catch it happening again and find any interesting info anywhere else.
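The symptom described here (a work unit whose CPU time stops incrementing while the other keeps going) can be checked mechanically by comparing two samples of per-task CPU time. The sketch below is one way to do that in Python; the function name `find_stalled`, the `min_progress` threshold, and the sample numbers are hypothetical, and how the CPU times are collected (from 'top', Task Manager, or elsewhere) is left out.

```python
def find_stalled(prev_cpu_seconds, curr_cpu_seconds, min_progress=1.0):
    """Return the sorted names of tasks that made too little CPU progress.

    prev_cpu_seconds / curr_cpu_seconds: dicts mapping a task name to its
    accumulated CPU time (seconds) at two sampling moments.
    A task missing from the second sample is also reported, since its
    process is no longer present.
    """
    stalled = []
    for name, before in prev_cpu_seconds.items():
        after = curr_cpu_seconds.get(name)
        if after is None or after - before < min_progress:
            stalled.append(name)
    return sorted(stalled)


if __name__ == "__main__":
    # Made-up sample values, taken one minute apart.
    earlier = {"dddt": 1200.0, "hcc": 900.0}
    later = {"dddt": 1200.0, "hcc": 958.5}
    print(find_stalled(earlier, later))
```

Sampling over a minute or more avoids flagging a task that is merely paused briefly for checkpointing or disk I/O.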

[Nov 11, 2007 1:41:19 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Frozen work units?

Hi saccia,
An idea: check your Preferences to see if RAM limits might have forced BOINC to unload one work unit.

Lawrence

[Nov 11, 2007 1:45:25 AM]

Lone-Wolf
Cruncher
Joined: Apr 10, 2007
Post Count: 33
Status: Offline


Re: Frozen work units?

My scenario is exactly as saccia describes and I have 2GB of RAM in that machine.

For what it's worth, I have noticed that after a reboot the frozen work unit will run, but that unit often ends up with a "computation error", resulting in a new work unit starting.

Although there's much about the innards of a computer that I don't fully grasp, it baffles me a bit that one unit will run fine if I do indeed have a hardware failure.

I would be inclined to think that if I cooked the chip it would either work or not.

I may just set that machine to crunch something else and monitor what happens.

FWIW, I just checked the temperatures: under full load both CPU cores are operating at 35C and the chip is showing in the same range, so I highly doubt it's a heat issue.

----------------------------------------
[Edit 1 times, last edit by Lone-Wolf at Nov 11, 2007 1:59:57 AM]

[Nov 11, 2007 1:57:01 AM]

Lone-Wolf
Cruncher
Joined: Apr 10, 2007
Post Count: 33
Status: Offline


Re: Frozen work units?

Update: I've changed the profile on the machine in question to crunch only HPF and it's been running just fine since on both cores.

At this stage I'd have to say I'm a bit suspicious.

I had set my crunchers to participate in the DDD work units when they first came out and kept noticing that various computers required a reboot, but I didn't have time to investigate.

I simply changed them back to HPF and they kept running fine.

After some time had passed, I tried DDD again a couple of weeks ago, and my machines seemed to do better this time, except for this one. At the time I attributed that to the chipset fan needing attention. That has since been dealt with: cooling is now far better than the stock allowances for this equipment, as I've beefed it all up, and I still have not overclocked it.

I'll keep monitoring its progress and report back if it should freeze up again but it looks promising now.

----------------------------------------

[Nov 11, 2007 7:38:23 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Frozen work units?

Check your cc_config.xml file to see whether the default flags were set to >0< instead of >1<. That may explain why we see so few messages, other than start/completed/upload.

The flags are:

<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>

cheers.

This FAQ: How do I configure my client using the cc_config.xml file? explains the configuration file.
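For anyone who would rather check the flags with a script than by eye, here is a small Python sketch that reports which of the three flags above are explicitly set to 0. It assumes the flags sit inside a `<log_flags>` element under a `<cc_config>` root; the function name `flags_set_to_zero` and the sample XML are made up for illustration, so compare against your own file and the FAQ before relying on it.

```python
import xml.etree.ElementTree as ET

# The three flags named in the post above.
WANTED_FLAGS = ("task", "file_xfer", "sched_ops")


def flags_set_to_zero(config_xml):
    """Return the wanted flags that are explicitly set to 0 in the XML text.

    Flags that are absent are not reported; only an explicit 0 counts.
    Assumed layout: <cc_config><log_flags>...</log_flags></cc_config>.
    """
    root = ET.fromstring(config_xml)
    zeroed = []
    for flag in WANTED_FLAGS:
        element = root.find(f"log_flags/{flag}")
        if element is not None and (element.text or "").strip() == "0":
            zeroed.append(flag)
    return zeroed


if __name__ == "__main__":
    # Made-up sample content, not taken from anyone's real configuration.
    sample = """
    <cc_config>
      <log_flags>
        <task>0</task>
        <file_xfer>1</file_xfer>
        <sched_ops>0</sched_ops>
      </log_flags>
    </cc_config>
    """
    print(flags_set_to_zero(sample))
```

To check a real file, read its contents into a string and pass that to `flags_set_to_zero`; the file's location depends on your installation.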

----------------------------------------


[Nov 11, 2007 10:29:13 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Frozen work units?

Hello Lone-Wolf,

Update: I've changed the profile on the machine in question to crunch only HPF and it's been running just fine since on both cores.

At this stage I'd have to say I'm a bit suspicious.

BOINC lets you set a maximum RAM usage in Preferences. If two processes combined exceed that maximum, one of them will be suspended. Unless you have checked the 'Leave in memory' option, the suspended process will be removed from memory, meaning that it will have to start over from its last checkpoint. That is why I am suspicious of this limit.
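As a rough illustration of how a RAM cap can leave one core idle, here is a deliberately simplified Python model: tasks are admitted in order while their combined memory fits under the limit, and the rest are suspended. This is not BOINC's actual scheduling logic; the function name `split_by_ram_limit`, the percentage-based limit, and all memory figures are assumptions made up for the example.

```python
def split_by_ram_limit(tasks, total_ram_mb, max_percent):
    """Split tasks into (running, suspended) under a simple RAM cap.

    tasks: list of (name, ram_mb) pairs, considered in the given order.
    total_ram_mb: installed memory in megabytes.
    max_percent: share of memory the tasks may use, from 0 to 100.
    """
    limit_mb = total_ram_mb * max_percent / 100
    running, suspended = [], []
    used_mb = 0.0
    for name, ram_mb in tasks:
        if used_mb + ram_mb <= limit_mb:
            running.append(name)
            used_mb += ram_mb
        else:
            suspended.append(name)
    return running, suspended


if __name__ == "__main__":
    # Made-up working-set sizes on a 2 GB machine with a 50% cap.
    work = [("hcc", 300), ("dddt", 800)]
    print(split_by_ram_limit(work, total_ram_mb=2048, max_percent=50))
```

In this toy model, raising the cap or lowering a task's memory footprint lets both tasks run again, which is the kind of effect worth ruling out before blaming hardware.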

I suppose that I was too terse in my previous post mentioning this possibility.

Lawrence

[Nov 11, 2007 11:14:28 AM]
