| World Community Grid Forums
Thread Status: Active Total posts in this thread: 118
|
|
| Author |
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Failure rate isn't as bad as it was a few weeks ago but no other project produces errors on my computers. Possible exception is HPF2. I no longer run that project.

OK, I agree on HPF2; I also see quite a few errors there, especially in the last few weeks and on Windows only. Currently I have one Rice error on a Linux box and one CEP error, in total 8 errors in a timeframe of around 2 weeks with 29 device installations. Can you tell me what the actual failure rate is for CEP and if it occurs on a particular machine? |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Sorry about that, mclaver. About 20 have already reported "exit code 29 (0x1d)", and a Google search turns up nothing about what it is, other than our CEP project.

I've been explicitly testing CEP on my quad for about 5 days now, and all tasks do fine with this combination:
BOINC 6.2.28, in protected install, all user control, no graphics
Vista HP 32 bit with latest NVidia 181.22
Intel Q6600
Permitted 3GB RAM use, both in use and idle
LARGE 10GB work space permission for BOINC
AV set not to scan the BOINC Data directory and job slots
LARGE swap file, minimum of 4.5GB and free to grow

AND, I only permit them to run 1 by 1, which is not the idea, but they run about 20% faster that way in any combination of jobs on the other cores. There was a beta for CEP version 6.28, so something is in the works. Until then, do not hesitate to suspend participation in the CEP project.
WCG
Please help to make the Forums an enjoyable experience for All! |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I don't mind a large work unit, but I don't think that was the case with this one. I let that particular work unit run overnight after I suspected there was a problem. I also tried shutting it down and re-starting it and re-booting the computer. It did the same thing some previous work units did (nothing) so I aborted that one.

There was no progress at all after that night? How long did you wait after restarting? |
||
|
|
mclaver
Veteran Cruncher Joined: Dec 19, 2005 Post Count: 566 Status: Offline Project Badges:
|
During that same timeframe of three days, with one failure a day, I did process 28 WUs successfully. Both my machines are quads running Vista, and the i7 is running 8 tasks at once. Both machines are only doing WCG 24x7. Nothing else runs on them.
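
For scale, the figures in this post (one failure a day over three days, against 28 successful WUs) work out to a failure rate of just under 10%. A minimal sketch of that arithmetic, assuming the 3 failures and 28 successes are the only results from that window (the post does not break them down by project); `failure_rate` is a hypothetical helper name:

```python
def failure_rate(failed: int, succeeded: int) -> float:
    """Fraction of all returned results that ended in error."""
    total = failed + succeeded
    if total == 0:
        raise ValueError("no results to compute a rate from")
    return failed / total


# Figures taken from the post: 3 failures, 28 successes.
rate = failure_rate(failed=3, succeeded=28)
print(f"{rate:.1%}")  # 3 / 31
```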
|
||
|
|
Dark Angel
Veteran Cruncher Australia Joined: Nov 11, 2005 Post Count: 728 Status: Offline Project Badges:
|
******* ****** After your first post reporting problems with CEP, **********. You are having legitimate problems with a project ************. You are running multi-core machines (known problem with CEP). You are overclocking (might be a problem). You are running CEP on Vista machines (known problem). By looking at your machines, I can tell you are going to have problems with CEP. *********. **********

I for one am glad David posts. ****** I understand his frustrations, both with the errors and with certain people. David is trying to better the project by reporting the errors he gets, which is what we all should be doing on every project. Hanging work units in particular are a big problem for those members that have remote machines or very large numbers of machines. Having to go around and manually reset, in some cases literally hundreds of machines, is really quite a large problem.

As for your jibe that David should "FIX YOUR PROBLEM", his hardware is quite likely more stable and better maintained than most. David and his team mates are well aware of the accuracy requirements of the project and work to achieve 24/7/365 stability ("five nines" might ring a bell if you're in the business) from their hardware. I've seen plenty of brand new, stock machines that can't boast the hardware stability they have. Consistent errors across machines that are running stock as well as those he has identified as over-clocked should indicate there is a different issue here.

As for him having to fix Vista (which your post implies) ... that's not his problem, that's Microsoft's ... and besides, nothing but a complete reformat and install of either XP or Linux can fix that. Unfortunately, Vista will be with us for a while yet, bugs and all, so it's left up to the WCG techs to make the projects play nice with it. (Sorry guys. You have my sympathy.)

As for your comments on multi-core machines ... perhaps you haven't noticed but the majority of machines sold these days are multi-core units. Saying the project has issues with them is hardly an excuse.

**Edited for intolerance**tkh
Currently being moderated under false pretences
[Edit 3 times, last edit by TKH at Feb 6, 2009 1:41:25 PM] |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
I've been running rosettaview and it does an excellent job, even monitoring remote hosts and alerting and/or auto-cancelling stalled jobs. Just make sure that its check times are offset from the BOINC disk write timing. Set the BOINC write interval to e.g. 357 seconds and rv to e.g. exactly 30 minutes, so there is the least chance of a client_state.xml access conflict. This only seems to happen on remote hosts, not the local host; I suppose it is due to slow permissions in the networking.
http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=20318#213042
[Edit 1 times, last edit by Sekerob at Feb 5, 2009 10:02:52 PM] |
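
The reasoning behind the odd 357-second value can be illustrated with a small calculation. This is only a sketch under assumptions the post does not state: that both timers are strictly periodic and start at the same moment, in which case they fire together once every least common multiple of the two periods. The function name is hypothetical.

```python
from math import gcd


def seconds_between_coincidences(period_a: int, period_b: int) -> int:
    """Least common multiple of two timer periods, in seconds."""
    return period_a * period_b // gcd(period_a, period_b)


boinc_write_s = 357      # BOINC disk write interval suggested in the post
rv_check_s = 30 * 60     # rosettaview check interval suggested in the post

overlap_s = seconds_between_coincidences(boinc_write_s, rv_check_s)
print(overlap_s, overlap_s / 3600)   # seconds, hours

# For comparison, a "round" 360-second write interval would line up
# with every single 30-minute check:
print(seconds_between_coincidences(360, rv_check_s))
```

With 357 seconds the two timers coincide only about once every 59.5 hours under these assumptions, whereas a round 360 seconds would coincide at every check.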
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Version 6.28 is just out for all 3 main platforms: Linux, Mac, Windows
http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=24464
I suggest opening a clean thread to report issues with this "improved" version. Please make sure to report the related platform and client version information. cheers
[Edit 1 times, last edit by Sekerob at Feb 5, 2009 10:26:20 PM] |
||
|
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline Project Badges:
|
CEP (Windows) v6.28 is still producing errors:

E000328_ 820A_ 001x0r008_ 5-- | In Progress | 6/02/09 19:41:40 | 10/02/09 18:44:04 | 0.00 | 0.0 / 0.0
E000328_ 820A_ 001x0r008_ 4-- | Error | 6/02/09 11:05:48 | 6/02/09 19:41:03 | 5.07 | 136.6 / 0.0 <== mine
E000328_ 820A_ 001x0r008_ 3-- | Pending Validation | 6/02/09 05:10:46 | 6/02/09 22:54:27 | 9.62 | 173.5 / 0.0
E000328_ 820A_ 001x0r008_ 2-- | Error | 5/02/09 16:06:14 | 6/02/09 11:03:55 | 7.83 | 153.4 / 0.0
E000328_ 820A_ 001x0r008_ 1-- | Error | 5/02/09 05:03:35 | 5/02/09 16:02:42 | 8.06 | 146.0 / 0.0
E000328_ 820A_ 001x0r008_ 0-- | Error | 5/02/09 05:02:45 | 6/02/09 05:07:52 | 12.42 | 116.1 / 0.0

Task Manager showed that wcgrid_cep1_6.28_windows_intelx86 ran this WU. Device: Intel quad, Win XP-32 SP3, probably running 1 x CEP, 3 x faah when this WU stopped. The log file shows that the infamous Error Code 29 (0x1d) problem occurred.

Sek, I too Googled for Windows error codes, and found a free MS utility called err.exe. "The system cannot write to the specified device" seems to be the interpretation of the error code ERROR_WRITE_FAULT defined in winerror.h. The fact that the device that crunched copy _3 seems to have proceeded to the end while 4 others (update: 5) stopped could be the result of a system timing problem (or using uninitialised data, etc, etc). The error also occurs on Linux (see widdershins' post below), so it's unlikely that the problem is due to the way that CEP interacts with the operating system.

It has been many weeks since I crunched any CEP WUs, and I see that things don't seem to have improved much. 1 error out of 1 ain't good.

Update: My 2nd recent WU has been declared valid, even though it too got an error "[ERROR] Failed to open either source or destination files while copying wcgrestart.rst to ..." after 64 min. The other quorum member crunched for similar points & time, so he probably got the same error at the same place. (Update: This error has been described in other posts). 2 out of 2 (Update: 3) ain't good, either.

Good post, Haris Dublas (below, next). Thanks.
[Edit 6 times, last edit by Rickjb at Feb 11, 2009 2:20:20 AM] |
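
The lookup described in the post above (exit code 29, shown as 0x1d, mapping to ERROR_WRITE_FAULT) can be sketched in a few lines. This is an illustration only: the table holds just the one entry quoted in this thread, and `describe_exit_code` is a hypothetical helper, not part of BOINC or err.exe.

```python
# Single entry taken from the thread: winerror.h defines
# ERROR_WRITE_FAULT as 29, with the message quoted in the post.
WINERROR_NAMES = {
    29: ("ERROR_WRITE_FAULT",
         "The system cannot write to the specified device."),
}


def describe_exit_code(code: int) -> str:
    """Format a code the way the BOINC message shows it, plus its name."""
    name, text = WINERROR_NAMES.get(code, ("UNKNOWN", "no entry in this table"))
    return f"exit code {code} (0x{code:x}) = {name}: {text}"


print(describe_exit_code(29))
# On Windows only, ctypes.FormatError(29) asks the OS for the same
# message text; it is not called here so the sketch runs anywhere.
```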
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
We cannot always say our systems are rock solid. I have an Athlon 64 3000 (1.8GHz) that was overclocked to 2.2GHz and runs COD4, NBA 2k9, and my other games without problems. It also passed a few stress test programs at that setting, but for some reason I got computation errors on some WUs. I fiddled with some settings in the BIOS (memory timings, etc.) and it now seems to be running fine so far.
I operate an internet cafe and most of the WU errors I get are caused by my cafe management program. Even if we think we have rock solid systems, sometimes there is some flaky software that just doesn't get along with the science apps. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Project Name: The Clean Energy Project
Created: 2/5/09
Name: E000346_294A_00260n00a
Minimum Quorum: 2
Initial Replication: 2

Result Name | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit
E000346_ 294A_ 00260n00a_ 6-- | In Progress | 2/8/09 06:47:45 | 2/12/09 05:50:09 | 0.00 | 0.0 / 0.0
E000346_ 294A_ 00260n00a_ 5-- | Error | 2/8/09 00:57:57 | 2/8/09 06:17:43 | 5.06 | 101.3 / 0.0
E000346_ 294A_ 00260n00a_ 4-- | In Progress | 2/8/09 00:52:45 | 2/11/09 23:55:09 | 0.00 | 0.0 / 0.0
E000346_ 294A_ 00260n00a_ 3-- | Error | 2/7/09 14:04:55 | 2/8/09 00:57:49 | 7.22 | 76.2 / 0.0
E000346_ 294A_ 00260n00a_ 2-- | Error | 2/7/09 09:27:45 | 2/8/09 00:36:34 | 10.96 | 88.0 / 0.0
E000346_ 294A_ 00260n00a_ 1-- | Error | 2/6/09 18:06:16 | 2/7/09 14:04:40 | 5.33 | 82.1 / 0.0
E000346_ 294A_ 00260n00a_ 0-- | Error | 2/6/09 18:04:29 | 2/7/09 09:23:35 | 7.13 | 61.8 / 0.0

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.
|
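
For anyone collecting these reports across many machines, the exit code can be pulled out of the message text mechanically. A minimal sketch, assuming the message always follows the "exit code N (0xH)" pattern seen in this post; the regular expression and function name are illustrative, not part of BOINC:

```python
import re
from typing import Optional

# Message text copied from the result log in the post above.
MESSAGE = ("The system cannot write to the specified device. (0x1d) "
           "- exit code 29 (0x1d)")

# Assumed pattern: decimal code followed by the same value in hex.
EXIT_CODE_RE = re.compile(r"exit code (\d+) \(0x([0-9a-fA-F]+)\)")


def parse_exit_code(message: str) -> Optional[int]:
    """Return the decimal exit code, or None if the pattern is absent."""
    match = EXIT_CODE_RE.search(message)
    if match is None:
        return None
    decimal = int(match.group(1))
    from_hex = int(match.group(2), 16)
    if decimal != from_hex:
        raise ValueError("decimal and hex exit codes disagree")
    return decimal


print(parse_exit_code(MESSAGE))
```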
||
|
|
|