World Community Grid - View Thread - work units not finishing

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: work units not finishing

Thread Status: Active
Total posts in this thread: 30

This topic has been viewed 302279 times and has 29 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: work units not finishing

cleanenergy, the point I was trying to make is that all of the failures appear to occur in job 12, and I was trying to find out whether anything specific to job 12 is causing this. These failures appear to have nothing to do with exceeding the maximum processing time allowed, and I wanted to know whether any work is being (or has been) done to determine the specific reason for the failure. If something is found that relates to WCG, BOINC or something else, I would think the Community Advisors should be clued in so they can handle messages of this sort.

[Feb 21, 2011 6:21:40 PM]

gb077492
Advanced Cruncher
Joined: Dec 24, 2004
Post Count: 96
Status: Offline


Re: work units not finishing

I've recently started seeing an occasional RC = 0xc0000005 in step 6, too. I've seen it more than once and on two separate machines. In all cases that I've checked the wingmen have got past step 6 so the WU has gone Inconclusive and then later my machine has been given Invalid with a low points score. That's fair enough if there's a hiccough in the processing, but I just wanted to express surprise that I'm seeing more of it and no obvious pattern as to why.

Some current ones if you want to check:
E201264_ 080_ A.29.C22H10O2S4Se.65.0.set1d06_ 0--
E201279_ 880_ A.30.C23H11NO2S4.158.4.set1d06_ 1--

Mike

[Feb 21, 2011 6:27:17 PM]

Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline

Re: work units not finishing

That is the way it has been with me for about the last month or so too. Almost all of them stop after job 12, at least if they stop before about 7 hours (Q8300 quad-core at 3 GHz, Win7 64-bit). Only about 1 in 10 go longer than that and complete. But if Harvard is not worried about it, why should I? They seem to be getting the science they want.

----------------------------------------
[Edit 1 times, last edit by Jim1348 at Feb 21, 2011 10:25:34 PM]

[Feb 21, 2011 10:24:29 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: work units not finishing

But, if Harvard is not worried about it, why do jobs 12 to 15 exist? There must be some valuable information returned for those jobs and, if we have the capability to increase the maximum CPU time beyond 12 hours, those who wish to increase the time to suit the processing power of their machines can normally run all jobs to completion. The question remains as to what job 12 is doing that causes the failure, resulting in jobs 13, 14, and 15 being skipped.

[Feb 23, 2011 2:57:10 AM]

Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline

Re: work units not finishing

There must be some valuable information returned for those jobs and, if we have the capability to increase the maximum cpu time beyond 12 hours, those that wish to increase the time as a result of the processing power of their machines can normally run all jobs to completion.

If you read the other discussions (I don't have the link), you will see that the first jobs are the most important scientifically. The time you would spend completing all the jobs in a task is in some cases better spent on a new task.

[Feb 24, 2011 4:42:14 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: work units not finishing

[Edit]: My similar problem was solved by replacing the disk that contained the BOINC DATA.

@ Legrandpiou, @cleanenergy, ...

Re: CEP2 (and DDDT2) WUs that crash with message "exited with zero status but no 'finished' file"
(This is the error reported by Legrandpiou at the top of this thread.
dkt raised a different issue, i.e. that the WCG website Results logs of almost all CEP2 WUs show "Application exited with RC = 0xnnn" after about Job 11.)

I get the "exited with zero status ..." events if I run CEP2 on one particular 4-core computer, but not on my other machines.
This computer had the same problem earlier on, when I tried to run 4 DDDT2 tasks simultaneously, and I spent lots of time then trying to find why.

First, though, let us examine what happens.
I have encountered the events running XP-64 and XP-32 but have not tried Linux.
Looking at Legrandpiou's messages log, you will see that the tasks which exit get restarted almost immediately.
They do not restart from "zero" but from their last checkpoints, unless they have not yet made their first checkpoint.
I run with LAIM ON and it does not affect checkpointing. CEP2 checkpoints at the end of each internal Job, and only then.
Here is an extract from the Result log for one of these WUs in my My Grid on WCG website:
...
Result Name: E201356_ 297_ A.29.C22H14N2S4Si.28.0.set1d06_ 0--
<core_client_version>6.2.19</core_client_version>
...
[00:59:44] Finished Job #4
[00:59:44] Starting job 5,CPU time has been restored to 8177.015625.
01:04:21 (2856): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:04:39] Number of jobs = 16
[01:04:39] Starting job 5,CPU time has been restored to 8177.015625.
01:05:39 (4004): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:05:52] Number of jobs = 16
[01:05:52] Starting job 5,CPU time has been restored to 8177.015625.
[01:10:01] Finished Job #5
...
This WU went on to a normal exit with RC=5 after Job 11 and is still PV.
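
A minimal sketch, in Python, of how a result log like the extract above could be scanned to count heartbeat exits per Job. The regular expressions are assumptions based only on the lines shown in this post, and `count_heartbeat_exits` is a hypothetical helper, not part of BOINC or the WCG website.

```python
import re
from collections import Counter

# Sample lines copied from the result log extract in this post.
SAMPLE_LOG = """\
[00:59:44] Finished Job #4
[00:59:44] Starting job 5,CPU time has been restored to 8177.015625.
01:04:21 (2856): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:04:39] Number of jobs = 16
[01:04:39] Starting job 5,CPU time has been restored to 8177.015625.
01:05:39 (4004): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:05:52] Number of jobs = 16
[01:05:52] Starting job 5,CPU time has been restored to 8177.015625.
[01:10:01] Finished Job #5
"""

# Assumed line formats, inferred from the sample above.
START_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Starting job (\d+)")
EXIT_RE = re.compile(r"^(\d\d:\d\d:\d\d) \(\d+\): No heartbeat from core client")


def count_heartbeat_exits(log_text):
    """Return a Counter mapping job number -> number of heartbeat exits."""
    exits = Counter()
    current_job = None
    for line in log_text.splitlines():
        start = START_RE.match(line)
        if start:
            # Remember which job was running when an exit is seen later.
            current_job = int(start.group(2))
            continue
        if EXIT_RE.match(line) and current_job is not None:
            exits[current_job] += 1
    return exits


if __name__ == "__main__":
    for job, n in sorted(count_heartbeat_exits(SAMPLE_LOG).items()):
        print(f"job {job}: {n} heartbeat exit(s)")
```

Run over many saved logs, a count like this would show whether the exits cluster in particular Jobs or on particular machines.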

@cleanenergy & WCG techs: If you trawl thru all the CEP2 results logs, you will be able to determine the prevalence of the problem.
Also, if you match up the names of the devices which crunched the crashers, I think you will find that it is only particular computers which do this, and they do it regularly. The majority of machines are unaffected.
Furthermore, and this is only my theory, if we knew hardware details of the problem machines, in particular the storage subsystems (HDD/SSD model names, amount of onboard cache, firmware version and perhaps controller details) of the devices hosting the pagefiles and BOINC Data directories, a pattern may emerge.

CEP2 sometimes runs for long periods between checkpoints. (Yesterday, a WU on my AMD went 6hr50m between 1st & 2nd, and I had to re-crunch about 5hrs of it when I shut down to shuffle RAM sticks :-( ). If the exits were happening at random times within the internal Jobs, significant amounts of work would need to be re-done. However, in the result logs that I have examined so far, the exits have been near the start of the Jobs, so not much crunching time is being lost there.
*** Clue? Is something special happening at this point in each Job? Heavy disc I/O?

On the other hand, the behaviour surrounding these crashes & restarts is costing significant crunch time. When I examine the CPU Usage traces in the Windows Task Manager on all machines that are running CEP2, I very often see dropouts in CPU usage happening about every 30 - 60 sec.
[You need to set the Core Affinities for each of the WCG tasks in order to tell whether processing on all cores drops out, or whether it's just the CEP2 ones. If BOINC is installed as a "protected application" as in all of my machines, you get "access denied" errors if you try to assign cores, so I need to re-install as an ordinary process to counter this]. All of these dropouts correspond to bursts of intense HDD activity, and drives with smaller onboard caches can be heard thrashing madly. Some of these dropouts are very severe, especially on the machine that is getting the exit/restart events, and they sometimes last for tens of seconds. At these times, the machine almost freezes, and this would be very annoying if the machine was being used with other interactive programs. Not all of these long dropouts/freezes lead to exits & restarts.

The sequences that lead to an exit/restart start with an extended burst of HDD activity, during which the CPU usage traces oscillate madly. I think the disc is being overwhelmed with I/O requests and basically locks up. A CEP2/DDDT2 process is blocked waiting for I/O and fails to communicate with BOINC for 30+ sec. The average CPU usage on a 4-core falls to 75%. BOINC kills the laggard process, and then tries to restart it, adding vastly to the number of outstanding system I/O requests. This often causes one or more other WCG processes, not necessarily CEP2 ones, to halt waiting for I/O, and 30 sec later, BOINC kills these and tries to restart them too. CPU usage drops to 50%, 25%, ... The machine freezes, with the HDD LED ON. Tens of seconds pass. Suddenly, the skies clear, the sun shines, CPU usage jumps to 100%, and everybody lives happily ever after. Until next time ...
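
A rough Python sketch of the 30-second heartbeat timeout that the log message "No heartbeat from core client for 30 sec - exiting" suggests. This is only an illustration under stated assumptions: `HeartbeatMonitor` and `first_timeout` are hypothetical names, the times are simulated seconds, and nothing here is taken from the actual BOINC source.

```python
HEARTBEAT_TIMEOUT_SEC = 30.0  # value taken from the log message in this thread


class HeartbeatMonitor:
    """Tracks the time of the last heartbeat seen (hypothetical model)."""

    def __init__(self, now, timeout=HEARTBEAT_TIMEOUT_SEC):
        self.timeout = timeout
        self.last_heartbeat = now

    def heartbeat(self, now):
        self.last_heartbeat = now

    def timed_out(self, now):
        # True once more than `timeout` seconds have passed without a heartbeat.
        return (now - self.last_heartbeat) > self.timeout


def first_timeout(heartbeat_times, check_times, timeout=HEARTBEAT_TIMEOUT_SEC):
    """Return the first check time at which the timeout fires, or None."""
    monitor = HeartbeatMonitor(now=0.0, timeout=timeout)
    events = sorted(
        [(t, "hb") for t in heartbeat_times] + [(t, "check") for t in check_times]
    )
    for t, kind in events:
        if kind == "hb":
            monitor.heartbeat(t)
        elif monitor.timed_out(t):
            return t
    return None


if __name__ == "__main__":
    # Heartbeats every second for 10 s, then a simulated I/O stall until t=60.
    stalled = list(range(0, 11)) + [60]
    checks = list(range(0, 61, 5))
    print("stall case, timeout detected at t =", first_timeout(stalled, checks))
    print("healthy case:", first_timeout(range(0, 61), checks))
```

In this model, a stall shorter than 30 seconds goes unnoticed, which would be consistent with the observation that not every long freeze leads to an exit and restart.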

My guess is that it's a storage device hardware/firmware problem. A large onboard cache may be implicated. Jim1348's report that changing his SSD cured the problem on his machine adds evidence.
[Edit]: Confirmed. See Q9650C, Configuration 4, next:

My "farm":
--------------------------------------------------------------------------------
CPU | Motherboard | HDD controller | O/S | HDD | HDD onboard cache | WCG Science
--------------------------------------------------------------------------------
Get freeze/exit/restart events:
* Q9650C, Configuration 1:
Q9650 | Asus P5Q Dlx | Intel ICH10 | XP-64 | WD5000AADS 500GB SATA, firmware 01.01A01 | 32MB | DDDT2
* Q9650C, Configuration 3 (recent, previous):
Q9650 | Asus P5Q Dlx | Intel ICH10 | XP-32 | System & pagefile: ST38410A 8.4GB IDE | 512kB /
... 2nd drive with BOINC Data: WD5000AADS 500GB SATA, firmware 01.01A01 | 32MB | CEP2
--------------------------
Unaffected:
* A64X2:
Athlon64x2 | Asus A8N5X | Nvidia nForce 4 | Win 2000 | WD3200JS 320GB SATA | 8MB
* Q9650A:
Q9650 | Gigabyte GA-EP45-UD3R | Intel ICH10 | XP-64 | WD7500AACS 750GB SATA, firmware 01.01A01 | 16MB
* Q9650B:
Q9650 | Asus P5K3 Dlx | Intel ICH9 | XP-32 | Samsung HD753LJ 750GB SATA | 32MB
* Q9650C, Configuration 2:
Q9650 | Asus P5Q Dlx | Intel ICH10 | XP-32 | ST38410A 8.4GB IDE | 512kB | DDDT2 (CEP2 won't fit)
[Edit - NEW] * Q9650C, Configuration 4 (current):
Q9650 | Asus P5Q Dlx | Intel ICH10 | XP-32 | System & pagefile: ST38410A 8.4GB IDE | 512kB /
... 2nd drive with BOINC Data: Samsung SP2014N 200GB IDE | 8MB | 4 x CEP2
-------------------------------------------------------------------------
The 2 WD SATA (Caviar Green) drives have very similar model nos and the same firmware revision numbers, but it's the one with the larger cache (32MB) that has the problem. The faster Samsung drive also has 32MB cache but it's good.

I tried disabling Windows disk caching but that made the machine totally unresponsive.
I installed the Intel ICH10 Matrix Storage/RAID driver, which lets the SATA drive run in AHCI mode so that the drive firmware can apply speed-enhancing optimisations (NCQ), but this did not help.

Links
BOINC Agent Support > exited with zero st...o 'finished' file
Beta Test Support Forum > Re: BETA Clean ...ject phase 2 version 6.35 - post by JollyJimmy
My thread @ XS > WCG - WCG tasks die "...;finished' file"
Will CEP2 units eventually NOT Bogart the host when they start up?

Suggestion for other crunchers who get these WU exit/restart events:
Please post details of your computer's hardware, particularly your HDD/SSD model no, firmware version if known, plus onboard cache, for the drive(s) containing the pagefile(s) and BOINC data.
Let us see whether there is a pattern.

If you don't know these details and you're running Windows, PC-Wizard (freeware, and I have no financial interest) may tell you.

----------------------------------------
[Edit 5 times, last edit by Rickjb at Mar 15, 2011 11:24:23 AM]

[Mar 2, 2011 1:47:08 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: work units not finishing

Snip

If BOINC is installed as a "protected application" as in all of my machines, you get "access denied" errors if you try to assign cores, so I need to re-install as an ordinary process to counter this].

Oh, well, sorry, I never had this issue with XP, Vista or W7. In the latter two OSes (I don't remember how it was in XP), the Task Manager has an All Processes / admin button. Then right-clicking a process allows changing pretty much anything: affinity, priority (ineffective, btw)...

In another thread a member brought up winAFC. Well, Process Lasso ** does it all, and automated at that if you like... It's not for the faint of heart, and not for those who don't know what they're doing or can't remember what they did and then report on the boards that the jobs are crumbling or running slow.

--//--

** A permanent companion on my Windows instances.

[Mar 2, 2011 2:25:00 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: work units not finishing

Dear dkt and others,

the point I was trying to make is that it appears that all of the failures are occurring in job 12 and I was trying to find out if there is anything specific to job 12 which is causing this.

You are right, there are a couple of points in the wu where a job is more likely to fail (e.g., job 8, job 12), because the projection from one job to the next is more complicated there; for most jobs it is relatively straightforward. If one job of the sequence fails, the successive jobs don't have an input and will be skipped. So the reasons for all this are well understood, but there isn't much one can do about it. It would be nice if the wus just ran without problems, but it is not the end of the world if they don't, because the host will simply start crunching on the next one. Thanks anyway for pointing this out.

Dear Mike,

It is somewhat mysterious when one wingman has an error and the other has no problems. This does not seem to be a science problem, so maybe the IBM team can chime in.

Dear Rickjb,
Maybe IBM can have a look for hardware patterns in cases of errors. The problem of checkpointing in CEP2 is known and discussed in a number of threads. For the time being, LAIM is the way to go, but we are working on improved checkpointing as well. It's not trivial, though.

However, in the result logs that I have examined so far, the exits have been near the start of the Jobs, so not much crunching time is being lost there.

I am not sure whether I understand you correctly. If there is a fundamental problem with a job, it more or less stops immediately and no time is lost. Sometimes a calc cannot converge to a result, and that takes a while to pan out, but that is the nature of science. In neither case will there be a restart, and checkpointing does not come into play.
If your machines get unresponsive, check that you don't run out of RAM and try not to run too many simultaneous CEP2 wus.

Best wishes

Your Harvard CEP team

[Mar 2, 2011 11:29:27 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: work units not finishing

@cleanenergy: Thanks very much for your attention.
Sorry, the result log that I posted was a bad example. I stopped running CEP2 on the machine and other logs have scrolled off my WCG Results pages.
In the example I posted above, the Job (#5) was very short, and the 2 timeout events could have been at random points of the job.
In another WU result that I examined, the 3 timeouts were within the first 5-10 min of the Job, but with such a small sample, these too could have been at random progress points.
I'm running another CEP2 WU on the device, and a while ago it seemed to exit after about 1 hr & restart to near zero. Its result log may be interesting.
If the exits are happening at random points of the Jobs, on average 1/2 Job per WU per exit/restart event needs to be re-done. If 12 Jobs are executed per WU, that's 4% per restart, plus the interruption time for WUs running on other cores.
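
The 4% estimate above can be checked with a short Python sketch. It assumes, as the estimate implicitly does, that all Jobs in a WU are of equal length and that an exit lands on average halfway through a Job; `expected_rework_fraction` is a hypothetical name used only for this illustration.

```python
def expected_rework_fraction(jobs_per_wu, mean_fraction_of_job_lost=0.5):
    """Fraction of a whole WU that must be re-done per exit/restart event.

    Assumes equal-length Jobs and restarts from the last per-Job checkpoint.
    """
    return mean_fraction_of_job_lost / jobs_per_wu


if __name__ == "__main__":
    # 12 Jobs executed per WU, half a Job lost on average per restart.
    print(f"{expected_rework_fraction(12):.1%} of the WU per restart")
```

With 12 Jobs this gives roughly 4.2%, which the post rounds to 4%. Real Jobs vary widely in length, so the true cost depends on which Job is interrupted.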

Here are the actual numbers for the WU posted above:
Starting job 5: 00:59:44
No h'beat exit: 01:04:21
* Job 5 1st attempt ran 04:37, timed out & was killed by BOINC
Starting job 5: 01:04:39 - Re-doing 04:37 work + 18 sec restart delay
No h'beat exit: 01:05:39
* Job 5 2nd attempt ran 01:00, timed out & was killed by BOINC
Starting job 5: 01:05:52 - Re-doing total of 06:08 since 00:59:44
Finished Job 5: 01:10:01
* Job 5 3rd attempt ran 04:09, normal exit
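
The durations listed above can be reproduced from the log timestamps with a few lines of Python. This is only a sketch: it assumes the times are on the same day and in the same clock, and `hms` is a hypothetical helper.

```python
from datetime import timedelta


def hms(text):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    h, m, s = (int(part) for part in text.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s)


# Timestamps copied from the result log for Job 5.
first_attempt = hms("01:04:21") - hms("00:59:44")   # start -> 1st heartbeat exit
restart_delay = hms("01:04:39") - hms("01:04:21")   # 1st exit -> restart
second_attempt = hms("01:05:39") - hms("01:04:39")  # restart -> 2nd heartbeat exit
lost_total = hms("01:05:52") - hms("00:59:44")      # first start -> final start
third_attempt = hms("01:10:01") - hms("01:05:52")   # final start -> finished

if __name__ == "__main__":
    print("1st attempt:  ", first_attempt)
    print("restart delay:", restart_delay)
    print("2nd attempt:  ", second_attempt)
    print("total lost:   ", lost_total)
    print("3rd attempt:  ", third_attempt)
```

The total of 06:08 lost before the final start is larger than the 04:09 the Job needed once it ran undisturbed.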
-----
"Maybe IBM can have a look for hardware patterns ...". I assume that by IBM you mean WCG techs.
-----
@Sekerob: I tried again to set core affinities of BOINC science processes with Task Mangler under XP and 2000 on different machines, but as always, I get a popup window titled "Unable to Access or Set Process Affinity" with text "The operation could not be completed. Access is denied." I was logged in as a user with admin privileges.
I have not tried Process Lasso or winAFC. In the short term, it would be quicker for me to reinstall BOINC as an ordinary process temporarily.
-----
I will continue experimenting with the machine over the next few weeks, excluding next week when I'll be away. I will try setting core affinities, getting TaskManager screenshots, other HDDs. - Rick

[Mar 3, 2011 8:25:30 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: work units not finishing

Dear Rickjb,
Sorry, I am afraid we don't have an answer for you. Again, maybe IBM (=WCG techs) can help out.
Best
Your Harvard CEP team

[Mar 3, 2011 4:10:00 PM]
