World Community Grid - View Thread - Still producing errors

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project Forum

Thread: Still producing errors

Quick Go »

No member browsing this thread

Thread Status: Active
Total posts in this thread: 118

This topic has been viewed 20020 times and has 117 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Still producing errors

Could you keep us informed about your findings regarding the possible correlation between the client version and the number of errors?

Yep, I'll keep you informed.
I was running 5.4.11 on Debian Etch, and most of my machines have now been upgraded to Lenny, which ships version 6.2.14.
More useful info after the weekend I think.

[Feb 13, 2009 7:02:07 PM]

mclaver
Veteran Cruncher
Joined: Dec 19, 2005
Post Count: 566
Status: Offline

Re: Still producing errors

There are still bad work units out there. It looks like the new version of CEP has not fixed all of the problems.

I have had three more failures with ERROR in the last two days.

Everyone who processed WU E000375_683A_00292k015 failed, so it is a bad WU, not my machine.

Result Name                 Device Name  Status  Sent Time         Time Due / Return Time  CPU Time (hours)  Claimed / Granted BOINC Credit
E000384_110A_002a2b00q_2--  ASUS-i7-965  Error   2/14/09 08:18:30  2/14/09 12:19:59        3.98              95.4 / 0.0
E000375_683A_00292k015_4--  ASUS-i7-965  Error   2/13/09 07:53:49  2/13/09 08:21:17        0.10              2.5 / 0.0
E000382_828A_002a1h004_2--  ASUS-i7-965  Error   2/13/09 05:15:51  2/13/09 13:46:06        7.77              186.9 / 0.0

Result Log for E000375_683A_00292k015

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.

</stderr_txt>
]]>
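For anyone comparing logs across client versions, as asked earlier in the thread, here is a minimal sketch that pulls the client version and exit code out of a pasted result log. The function name `summarize_result_log` and the variable `RESULT_LOG` are made up for illustration; the only assumption is that the log has the same shape as the ones posted in this thread.

```python
import re

# Result log text exactly as pasted in this thread (hypothetical variable name).
RESULT_LOG = """<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.

</stderr_txt>
]]>"""


def summarize_result_log(log_text):
    """Return (client_version, exit_code); either part is None if not found."""
    version = re.search(
        r"<core_client_version>([^<]+)</core_client_version>", log_text
    )
    # Covers both wordings seen in this thread:
    # "exit code 29" and "process exited with code 29".
    code = re.search(r"(?:exit code|exited with code)\s+(-?\d+)", log_text)
    return (
        version.group(1).strip() if version else None,
        int(code.group(1)) if code else None,
    )


print(summarize_result_log(RESULT_LOG))  # ('6.4.5', 29)
```

Running this over each posted log gives one (version, exit code) pair per result, which can then be counted per client version.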

Workunit Status

EVERYONE WHO PROCESSED THIS WU FAILED WITH AN ERROR

Project Name: The Clean Energy Project
Created: 2/9/09
Name: E000375_683A_00292k015
Minimum Quorum: 2
Initial Replication: 2

Result Name                 Status    Sent Time         Time Due / Return Time  CPU Time (hours)  Claimed / Granted BOINC Credit
E000375_683A_00292k015_6--  Error     2/13/09 09:07:30  2/14/09 06:44:12        0.14              1.7 / 0.0
E000375_683A_00292k015_5--  Error     2/13/09 08:25:11  2/13/09 09:04:12        0.19              2.9 / 0.0
E000375_683A_00292k015_4--  Error     2/13/09 07:53:49  2/13/09 08:21:17        0.10              2.5 / 0.0
E000375_683A_00292k015_3--  Error     2/13/09 07:22:55  2/13/09 07:47:26        0.14              1.8 / 0.0
E000375_683A_00292k015_2--  Error     2/13/09 05:03:57  2/13/09 07:05:40        0.11              1.7 / 0.0
E000375_683A_00292k015_1--  Error     2/10/09 22:14:18  2/13/09 05:00:43        0.27              2.8 / 0.0
E000375_683A_00292k015_0--  Too Late  2/10/09 22:12:49  2/11/09 17:47:13        0.69              8.3 / 0.0

----------------------------------------

[Feb 14, 2009 2:30:40 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Still producing errors

I've not had this particular "exit code 29 (0x1d)" yet, and I still think 6.4.5 should be sent to the gutter unless you need it for GPU crunching. Is there GPU crunching on the side during CEP jobs?

Went through my CEP list (client 6.2.28): the last 7 have clean logs, 2 had a heartbeat issue and validated, and 1 had the heartbeat issue several times and was invalid. 1 had the "Failed..." because the other probably had it too at the same time, a valid invalid so to speak.

I'm having a bunch of E000354 and so far they have all run swiftly, with clean logs, and validated in no time with quorum 2 partners that ran the result in very similar time and credit claim. Got 10 more of this batch in the queue.

Continuing to run them one at a time under manual control, letting HCC and RICE run as side jobs.

----------------------------------------

[Edit 1 times, last edit by Sekerob at Feb 14, 2009 2:46:06 PM]

[Feb 14, 2009 2:44:44 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Still producing errors

I too got the error mentioned below, but then I thought, "Why can't the system write to the specified device?"

<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.

</stderr_txt>
]]>

In my case I had opened the data directory with a file manager (Total Commander). I then closed Total Commander to be sure no lock would remain on the data directory and prevent data being written to it. For some reason the lock remained on the data directory anyway, resulting in the error. This happened several times, but I found a workaround.

After you open the data directory with a file manager, be sure to switch to another directory (any directory other than the data directory) before you close or exit the file manager.

So far I have not seen the same thing happen on an XP machine, so it could be OS related (Win2K SP4) or it might be due to Total Commander. It might also be chipset related (NForce2 ATA controller).

Anyway, I think this might be something to keep in mind when error 29 crops up.

Hope this helps to prevent error 29.
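To rule out a data directory that really cannot be written to, a quick probe like the one below may help. It is only a sketch: `can_write_to` is a made-up helper name, and the system temp directory stands in for the BOINC data directory, whose path depends on the installation. It reports whether a file can be created at the moment it runs; it cannot confirm that a lingering file-manager lock was the cause of an earlier error.

```python
import tempfile


def can_write_to(directory):
    """Try to create, write and delete a small temporary file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
        return True
    except OSError:
        # Covers missing directories, permission problems and similar failures.
        return False


# Placeholder path: substitute the BOINC data directory of the machine in question.
print(can_write_to(tempfile.gettempdir()))
```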

Cheers

Jos

[Feb 14, 2009 3:27:04 PM]

David_L6
Senior Cruncher
USA
Joined: Aug 24, 2006
Post Count: 296
Status: Offline

Re: Still producing errors

Still getting some errors. Not nearly as many as previously though.

Windows XP Pro 32-bit.

Project Name: The Clean Energy Project
Created: 2/12/09
Name: E000409_846A_002d1u00g

<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.

</stderr_txt>
]]>

I also had a strange result a few days ago (can't find the work unit in my results now???). The work unit ran for 48 hours but when finished got credit for only 9 hours or so. I saw that one in progress (at over 30 hours) and let it run just to see what would happen. It was valid, but it ran for a lot longer than the time it was given credit for. I guess I really should have posted something about it as soon as it happened but I work a 12 hour schedule (plus driving time to and from work) and just didn't feel like it at the time.

----------------------------------------

----------------------------------------
[Edit 1 times, last edit by David_L6 at Feb 15, 2009 10:47:53 AM]

[Feb 15, 2009 1:50:05 AM]

rkar22
Cruncher
Joined: Nov 17, 2004
Post Count: 48
Status: Offline

Re: Still producing errors

Project Name: The Clean Energy Project
Created: 09-02-08
Name: E000372_712A_00290m00a
Minimum Quorum: 2
Initial Replication: 2

Result Name                 Status    Sent Time          Time Due / Return Time  CPU Time (hours)  Claimed / Granted BOINC Credit
E000372_712A_00290m00a_6--  Too Late  09-02-14 11:35:17  09-02-15 11:55:47       15.53             177.0 / 0.0
E000372_712A_00290m00a_5--  Error     09-02-14 01:04:24  09-02-14 11:32:38       8.26              149.6 / 0.0
E000372_712A_00290m00a_4--  Error     09-02-13 23:43:34  09-02-14 13:51:11       12.77             152.1 / 0.0
E000372_712A_00290m00a_3--  Error     09-02-11 13:15:40  09-02-13 23:38:45       7.25              137.9 / 0.0
E000372_712A_00290m00a_2--  Error     09-02-10 22:24:59  09-02-11 13:11:57       9.20              174.9 / 0.0
E000372_712A_00290m00a_0--  Error     09-02-10 08:43:39  09-02-14 00:55:31       9.94              172.4 / 0.0
E000372_712A_00290m00a_1--  Error     09-02-10 08:39:13  09-02-10 22:24:06       8.63              136.9 / 0.0
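Adding up the CPU Time column for this work unit gives the total across all seven copies. A minimal sketch, with the figures copied by hand from the table above (the dictionary name is just for illustration):

```python
# CPU time in hours per result of WU E000372_712A_00290m00a,
# keyed by the result suffix (_0 to _6) shown in the table.
cpu_hours = {
    "0": 9.94,
    "1": 8.63,
    "2": 9.20,
    "3": 7.25,
    "4": 12.77,
    "5": 8.26,
    "6": 15.53,
}

total_hours = sum(cpu_hours.values())
print(f"{total_hours:.2f} hours = {total_hours / 24:.2f} days")
# 71.58 hours = 2.98 days
```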

For me this looks like 3 days of wasted CPU time. It would be interesting to know whether the other members' results logs differ from mine:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 29 (0x1d, -227)
</message>
<stderr_txt>
Calling gridPlatform.init()
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.

</stderr_txt>
]]>

I'm wondering why two successful copies are sufficient to confirm that the result of a WU is valid, while 6 or 7 are needed to confirm an error?!

----------------------------------------
[Edit 1 times, last edit by rkar22 at Feb 15, 2009 2:36:58 PM]

[Feb 15, 2009 2:29:48 PM]

mclaver
Veteran Cruncher
Joined: Dec 19, 2005
Post Count: 566
Status: Offline

Re: Still producing errors

The machine that had the three errors is brand new, which is why it is running 6.4.5: I went to the website and downloaded the current version. I do not think CUDA is being used, because I am only running World Community Grid. This is an i7 965 running Vista Ultimate, so it is running 8 WCG processes and nothing else. I do not know how to make sure only one CEP task is running, but I doubt that more than one is, because I usually do not see very many in the queue waiting to run.

----------------------------------------

[Feb 15, 2009 7:27:49 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Still producing errors

Hello rkar22,

I'm wondering why two successful copies are sufficient to confirm that the result of a WU is valid, while 6 or 7 are needed to confirm an error?!

If a Valid result cannot be produced in 7 tries, the work unit is withdrawn from the server and humans try to figure out what is wrong. The problem with CEP is that a molecule might not converge to a stable solution. If it does not converge, then CEP marks it as an error. This strikes me as sensible for a private run on a lab computer but extremely unfriendly to the members of a grid, who are used to being rewarded with credit for contributing computer time for scientific research.

It was probably never an option to ask that basic features of CHARMM be rewritten to be grid-friendly. I was already precociously working with computers when this giant FORTRAN program was first conceived in 1969. I think it is the most gigantic program we have ever run in terms of code length. There has been enough time spent writing this program to make it inconceivably difficult to read.

Lawrence

[Feb 16, 2009 12:31:41 AM]

widdershins
Veteran Cruncher
Scotland
Joined: Apr 30, 2007
Post Count: 677
Status: Offline

Re: Still producing errors

I understand what you are saying about a unit being marked as an error if the solution doesn't converge, and that it may be unrealistic to expect CHARMM to be rewritten. But is it unrealistic to ask that, once CEP has flagged a WU as errored twice, no more copies of that WU are issued? That doesn't require a rewrite of CHARMM, only a rewrite of the reissue rules on the WCG servers.

The scientists can then carry off the WU concerned and the two errored results and study it at their leisure. Meanwhile if the other 5 reissues hadn't been sent out all that crunching time saved could produce more valid results for CEP or other projects. In that way everyone wins and no-one loses except the first two to return the unit (who could be granted half credit perhaps).
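To put a rough number on that, here is a sketch applying a "stop after two errors" rule to WU E000372_712A_00290m00a from rkar22's post. It assumes the server only learns of an error when the result is returned, and that any copy sent after the second returned error would simply not have been issued. The server's real reissue logic is not described in this thread, so `hours_saved` and its `max_errors` parameter are illustrative only.

```python
from datetime import datetime

# (suffix, status, sent, returned, cpu_hours), copied from rkar22's table.
# Timestamps are in yy-mm-dd form.
RESULTS = [
    ("0", "Error",    "09-02-10 08:43:39", "09-02-14 00:55:31", 9.94),
    ("1", "Error",    "09-02-10 08:39:13", "09-02-10 22:24:06", 8.63),
    ("2", "Error",    "09-02-10 22:24:59", "09-02-11 13:11:57", 9.20),
    ("3", "Error",    "09-02-11 13:15:40", "09-02-13 23:38:45", 7.25),
    ("4", "Error",    "09-02-13 23:43:34", "09-02-14 13:51:11", 12.77),
    ("5", "Error",    "09-02-14 01:04:24", "09-02-14 11:32:38", 8.26),
    ("6", "Too Late", "09-02-14 11:35:17", "09-02-15 11:55:47", 15.53),
]

FMT = "%y-%m-%d %H:%M:%S"


def hours_saved(results, max_errors=2):
    """CPU hours spent on copies sent after the max_errors-th error came back."""
    rows = [
        (status, datetime.strptime(sent, FMT), datetime.strptime(returned, FMT), hours)
        for _, status, sent, returned, hours in results
    ]
    error_times = sorted(returned for status, _, returned, _ in rows if status == "Error")
    if len(error_times) < max_errors:
        return 0.0  # the rule would never have triggered
    cutoff = error_times[max_errors - 1]
    return sum(hours for _, sent, _, hours in rows if sent > cutoff)


print(round(hours_saved(RESULTS), 2))  # 43.81
```

Under these assumptions, copies _3 to _6 would not have gone out, which accounts for about 43.81 of the 71.58 CPU hours spent on this work unit.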

[Feb 16, 2009 1:36:58 AM]

Bearcat
Master Cruncher
USA
Joined: Jan 6, 2007
Post Count: 2803
Status: Offline

Re: Still producing errors

I've been getting about 1 error per batch on my Macs but haven't notified anyone. I figured the WCG folks would look at them when they get a chance. Is this right, or are we supposed to let you guys know when they happen? I just suck it up and keep on crunching unless I start getting a lot of them, in which case I would switch projects until they are fixed. I haven't had too many since I started in '07. Yeah, it sucks to lose points on bad WUs, but I just enjoy crunching for a good cause.

----------------------------------------

Crunching for humanity since 2007!

[Feb 16, 2009 3:03:54 AM]
