| World Community Grid Forums
Thread Status: Active Total posts in this thread: 118
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Could you keep us informed about your findings regarding the possible correlation between the client version and the number of errors?"

Yep, I'll keep you informed. I was running 5.4.11 on Debian Etch, and most of my machines have now been upgraded to Lenny, which ships version 6.2.14. I should have more useful info after the weekend. |
||
|
|
mclaver
Veteran Cruncher Joined: Dec 19, 2005 Post Count: 566 Status: Offline Project Badges:
|
There are still bad work units out there. It looks like the new version of CEP has not fixed all of the problems.

I have had three more failures with ERROR in the last two days. Everyone who processed WU E000375_683A_00292k015 failed, so the WU is bad, not my machines.

| Result Name | Device Name | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit |
|---|---|---|---|---|---|---|
| E000384_110A_002a2b00q_2-- | ASUS-i7-965 | Error | 2/14/09 08:18:30 | 2/14/09 12:19:59 | 3.98 | 95.4 / 0.0 |
| E000375_683A_00292k015_4-- | ASUS-i7-965 | Error | 2/13/09 07:53:49 | 2/13/09 08:21:17 | 0.10 | 2.5 / 0.0 |
| E000382_828A_002a1h004_2-- | ASUS-i7-965 | Error | 2/13/09 05:15:51 | 2/13/09 13:46:06 | 7.77 | 186.9 / 0.0 |

Result Log for E000375_683A_00292k015

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.
</stderr_txt>
]]>

Workunit Status
EVERYONE WHO PROCESSED THIS WU FAILED WITH AN ERROR

Project Name: The Clean Energy Project
Created: 2/9/09
Name: E000375_683A_00292k015
Minimum Quorum: 2
Initial Replication: 2

| Result Name | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit |
|---|---|---|---|---|---|
| E000375_683A_00292k015_6-- | Error | 2/13/09 09:07:30 | 2/14/09 06:44:12 | 0.14 | 1.7 / 0.0 |
| E000375_683A_00292k015_5-- | Error | 2/13/09 08:25:11 | 2/13/09 09:04:12 | 0.19 | 2.9 / 0.0 |
| E000375_683A_00292k015_4-- | Error | 2/13/09 07:53:49 | 2/13/09 08:21:17 | 0.10 | 2.5 / 0.0 |
| E000375_683A_00292k015_3-- | Error | 2/13/09 07:22:55 | 2/13/09 07:47:26 | 0.14 | 1.8 / 0.0 |
| E000375_683A_00292k015_2-- | Error | 2/13/09 05:03:57 | 2/13/09 07:05:40 | 0.11 | 1.7 / 0.0 |
| E000375_683A_00292k015_1-- | Error | 2/10/09 22:14:18 | 2/13/09 05:00:43 | 0.27 | 2.8 / 0.0 |
| E000375_683A_00292k015_0-- | Too Late | 2/10/09 22:12:49 | 2/11/09 17:47:13 | 0.69 | 8.3 / 0.0 |
 |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
I've not had this particular "exit code 29 (0x1d)" yet, and I still think 6.4.5 should be sent to the gutter unless you need it for GPU crunching... is there GPU crunching on the side during CEP jobs?

Went through my CEP list, client 6.2.28: the last 7 have clean logs, 2 had a heartbeat issue and validated, and 1 had the heartbeat issue several times and was invalid. 1 had the "Failed..." because the other probably had it too at the same time, a valid invalid so to speak. I'm having a bunch of E000354 and so far they have all run swiftly, with a clean log, and were valid in no time with quorum 2 partners that ran the result in very similar time and credit claim. Got 10 more of this batch in the queue. Continuing to run them one at a time under manual control, and letting HCC and RICE run as side jobs.
----------------------------------------
WCG
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Feb 14, 2009 2:46:06 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I too got the error mentioned below, but then I thought, "Why can't the system write to the specified device?"

<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.
</stderr_txt>
]]>

In my case I had opened the data directory with a file manager (Total Commander). I then closed Total Commander to be sure that no lock would stay on the data directory and prevent data from being written to it. For some reason the lock remained anyway, resulting in the error. This happened several times, but I found a workaround: after you open the data directory with a file manager, be sure to switch to another directory (anything other than the data directory) before you close or exit the file manager.

So far I have not seen the same thing happen on an XP machine, so it could be OS related (Win2K SP4), or it might be due to Total Commander. It might also be chipset related (NForce2 ATA controller). Anyway, I think this is something to keep in mind when error 29 creeps up.

Hope this helps to prevent error 29.

Cheers,
Jos |
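A quick way to test whether a directory can be written to at all is a small probe like the one below. This is only a minimal sketch, not part of BOINC or CEP: the function name `can_write_to` is hypothetical, and the path to pass in is whatever your own BOINC data directory is. It only checks that a file can be created and written, so it may not reproduce the lingering lock described above.

```python
import tempfile


def can_write_to(directory):
    """Return True if a temporary file can be created and written in `directory`.

    Hypothetical helper for illustration; it is not a BOINC or CEP function.
    """
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
            probe.flush()
        return True
    except OSError:
        # Covers a missing directory, denied permission and similar failures.
        return False
```

Calling `can_write_to(...)` with the data directory path before and after closing the file manager would show whether ordinary writes are being blocked.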
||
|
|
David_L6
Senior Cruncher USA Joined: Aug 24, 2006 Post Count: 296 Status: Offline Project Badges:
|
Still getting some errors. Not nearly as many as previously though.

Windows XP Pro 32 bit.

Project Name: The Clean Energy Project
Created: 2/12/09
Name: E000409_846A_002d1u00g

<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.
</stderr_txt>
]]>

I also had a strange result a few days ago (can't find the work unit in my results now???). The work unit ran for 48 hours, but when it finished it got credit for only 9 hours or so. I saw that one in progress (at over 30 hours) and let it run just to see what would happen. It was valid, but it ran for a lot longer than the time it was given credit for. I guess I really should have posted something about it as soon as it happened, but I work a 12 hour schedule (plus driving time to and from work) and just didn't feel like it at the time.

[Edit 1 times, last edit by David_L6 at Feb 15, 2009 10:47:53 AM] |
||
|
|
rkar22
Cruncher Joined: Nov 17, 2004 Post Count: 48 Status: Offline Project Badges:
|
Project Name: The Clean Energy Project
Created: 09-02-08
Name: E000372_712A_00290m00a
Minimum Quorum: 2
Initial Replication: 2

| Result Name | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit |
|---|---|---|---|---|---|
| E000372_712A_00290m00a_6-- | Too Late | 09-02-14 11:35:17 | 09-02-15 11:55:47 | 15.53 | 177.0 / 0.0 |
| E000372_712A_00290m00a_5-- | Error | 09-02-14 01:04:24 | 09-02-14 11:32:38 | 8.26 | 149.6 / 0.0 |
| E000372_712A_00290m00a_4-- | Error | 09-02-13 23:43:34 | 09-02-14 13:51:11 | 12.77 | 152.1 / 0.0 |
| E000372_712A_00290m00a_3-- | Error | 09-02-11 13:15:40 | 09-02-13 23:38:45 | 7.25 | 137.9 / 0.0 |
| E000372_712A_00290m00a_2-- | Error | 09-02-10 22:24:59 | 09-02-11 13:11:57 | 9.20 | 174.9 / 0.0 |
| E000372_712A_00290m00a_0-- | Error | 09-02-10 08:43:39 | 09-02-14 00:55:31 | 9.94 | 172.4 / 0.0 |
| E000372_712A_00290m00a_1-- | Error | 09-02-10 08:39:13 | 09-02-10 22:24:06 | 8.63 | 136.9 / 0.0 |

For me this looks like 3 days of wasted CPU time. It would be interesting to know whether the other members' result logs differ from mine:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 29 (0x1d, -227)
</message>
<stderr_txt>
Calling gridPlatform.init()
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.
</stderr_txt>
]]>

I'm wondering why two successful copies are sufficient to confirm that the result of a WU is valid, while 6 or 7 are needed to confirm an error?!

[Edit 1 times, last edit by rkar22 at Feb 15, 2009 2:36:58 PM] |
||
|
|
mclaver
Veteran Cruncher Joined: Dec 19, 2005 Post Count: 566 Status: Offline Project Badges:
|
The machine that had the three errors is brand new. This is why 6.4.5 is running: I went to the website and downloaded the current version. I do not think CUDA is being used, because I am only running World Community Grid, and I do not think. This is an i7 965 running Vista Ultimate, so it is running 8 WCG processes and nothing else. I do not know how to make sure only one CEP is running, but I doubt that more than one is running because I usually do not see very many in the queue waiting to run. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello rkar22,

"I'm wondering why two successful copies are sufficient to confirm that the result of a WU is valid, while 6 or 7 are needed to confirm an error?!"

If a Valid result cannot be produced in 7 tries, the work unit is withdrawn from the server and humans try to figure out what is wrong. The problem with CEP is that a molecule might not converge to a stable solution. If it does not converge, then CEP marks it as an error. This strikes me as sensible for a private run on a lab computer but extremely unfriendly to the members of a grid, who are used to being rewarded with credit for contributing computer time to scientific research.

It was probably never an option to ask that basic features of CHARMM be rewritten to be grid-friendly. I was already precociously working with computers when this giant FORTRAN program was first conceived in 1969. I think it is the most gigantic program we have ever run in terms of code length. Enough time has been spent writing this program to make it inconceivably difficult to read.

Lawrence |
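The asymmetry rkar22 asks about can be written out as a small decision rule. The sketch below is only an illustration of the rule as described in this thread, not the actual WCG server logic: the function name `workunit_state` and the returned labels are hypothetical, the quorum of 2 comes from the work unit status pages quoted above, and the limit of 7 tries comes from Lawrence's description.

```python
MIN_QUORUM = 2  # "Minimum Quorum: 2" on the status pages quoted in this thread
MAX_TRIES = 7   # "cannot be produced by 7 tries", per Lawrence's post


def workunit_state(statuses, min_quorum=MIN_QUORUM, max_tries=MAX_TRIES):
    """Classify a work unit from the statuses of its returned results.

    Hypothetical helper: `statuses` is a list such as ["Error", "Valid"].
    """
    valid = sum(1 for status in statuses if status == "Valid")
    if valid >= min_quorum:
        return "validated"   # enough agreeing results, so stop here
    if len(statuses) >= max_tries:
        return "withdrawn"   # tries exhausted, so humans take a look
    return "reissue"         # otherwise another copy goes out


# The failed work unit quoted earlier: six errors and one late result.
print(workunit_state(["Error"] * 6 + ["Too Late"]))
```

Success needs only two matching results, while failure is only declared once every allowed try has been used up, which is why an unsolvable work unit costs up to seven copies.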
||
|
|
widdershins
Veteran Cruncher Scotland Joined: Apr 30, 2007 Post Count: 677 Status: Offline Project Badges:
|
I understand what you are saying about it marking it as an error if the solutions didn't converge and how it may be unrealistic to expect CHARMM to be rewritten. Is it unrealistic to ask though that where CEP has flagged a unit as errored and this has happened twice no more copies are issued of that WU? That doesn't require a rewrite of CHARMM, only a rewrite of the reissue rules on the WCG servers.
The scientists can then carry off the WU concerned and the two errored results and study it at their leisure. Meanwhile if the other 5 reissues hadn't been sent out all that crunching time saved could produce more valid results for CEP or other projects. In that way everyone wins and no-one loses except the first two to return the unit (who could be granted half credit perhaps). |
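The saving from such a rule can be estimated with a few lines of code. This is a rough sketch under stated assumptions, not the WCG reissue logic: `copies_issued` is a hypothetical helper, copies are treated as going out one at a time in the order listed (the real servers send the first two together), and the sample outcomes are taken from rkar22's work unit E000372_712A_00290m00a above.

```python
def copies_issued(outcomes, error_limit):
    """Count copies sent out before `error_limit` errors have come back.

    Hypothetical helper: `outcomes` lists the result of each copy in the
    order the copies were sent.
    """
    errors = 0
    for issued, outcome in enumerate(outcomes, start=1):
        if outcome == "Error":
            errors += 1
            if errors >= error_limit:
                return issued  # stop reissuing at this point
    return len(outcomes)       # limit never reached, every copy was sent


# rkar22's work unit: six errors followed by one late result.
outcomes = ["Error"] * 6 + ["Too Late"]
stopped_at = copies_issued(outcomes, error_limit=2)
print(stopped_at, len(outcomes) - stopped_at)
```

For that work unit the rule stops after 2 copies instead of 7, which matches the "other 5 reissues" mentioned in the post.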
||
|
|
Bearcat
Master Cruncher USA Joined: Jan 6, 2007 Post Count: 2803 Status: Offline Project Badges:
|
I've been getting about 1 error per batch on my Macs but haven't notified anyone. Figured the WCG folks would look at them when they get a chance. Is this right, or are we supposed to let you guys know when they happen? I just suck it up and keep on crunching unless I start getting a lot of them; then I would switch projects until they are fixed. Haven't had too many since I started in 07. Yeah, it sucks to lose points on bad WUs, but I just enjoy crunching for a good cause.
----------------------------------------
Crunching for humanity since 2007!
|
||
|
|
|