Thread Status: Active
Total posts in this thread: 118
Posts: 118   Pages: 12   [ Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page ]
This topic has been viewed 19712 times and has 117 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Still producing errors

Failure rate isn't as bad as it was a few weeks ago but no other project produces errors on my computers. Possible exception is HPF2. I no longer run that project.

OK, I agree on HPF2; I also see quite a few errors there, especially in the last few weeks and on Windows only. Currently I have one Rice error on a Linux box and one CEP error; in total, 8 errors over roughly 2 weeks across 29 device installations.
Can you tell me what the actual failure rate is for CEP and if it occurs on a particular machine?
[Feb 5, 2009 3:42:50 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Still producing errors

Sorry about that, mclaver. About 20 people have already reported "exit code 29 (0x1d)", and searching Google turns up nothing about what it is, other than our CEP project.

I've been explicitly testing CEP on my quad for about 5 days now, and all tasks do fine with this combination:

BOINC 6.2.28, in protected install, all user control, no graphics.
Vista HP 32 bit with latest NVidia 181.22
Intel Q6600
Permitted 3GB ram use, both use and idle.
LARGE 10GB work space permission for BOINC
AV not to scan the BOINC Data directory and job slots
LARGE Swap file minimum of 4.5GB and free to grow.

AND, I only permit them to run 1 by 1, which is not the idea, but they run that way about 20% faster in any combination of jobs on the other cores.

There was a beta for CEP version 6.28, so something is in the works. Until then, do not hesitate to suspend participation in the CEP project.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Feb 5, 2009 3:46:04 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Still producing errors

I don't mind a large work unit, but I don't think that was the case with this one. I let that particular work unit run overnight after I suspected there was a problem. I also tried shutting it down and re-starting it and re-booting computer. It did the same thing some previous work units did (nothing) so I aborted that one.

There was no progress at all after that night? How long did you wait after restarting?
[Feb 5, 2009 3:48:13 PM]
mclaver
Veteran Cruncher
Joined: Dec 19, 2005
Post Count: 566
Status: Offline
Re: Still producing errors

During that same timeframe of three days, with one failure a day, I did process 28 WUs successfully. Both my machines are quads running Vista, and the i7 is running 8 tasks at once. Both machines are only doing WCG 24x7. Nothing else runs on them.
[Feb 5, 2009 8:56:17 PM]
Dark Angel
Veteran Cruncher
Australia
Joined: Nov 11, 2005
Post Count: 728
Status: Offline
Re: Still producing errors

*******

****** After your first post reporting problems with CEP, **********.

You are having legitimate problems with a project ************. You are running multi-core machines (known problem with CEP). You are overclocking (might be a problem). You are running CEP on Vista machines (known problem). By looking at your machines, I can tell you are going to have problems with CEP. *********.

**********


I for one am glad David posts. ******I understand his frustrations, both with the errors and with certain people. David is trying to better the project by reporting the errors he gets, which is what we all should be doing on every project.
Hanging work units in particular are a big problem for those members that have remote machines or very large numbers of machines. Having to go around and manually reset, in some cases literally hundreds of machines, is really quite a large problem.
As for your jibe that David should "FIX YOUR PROBLEM", his hardware is quite likely more stable and better maintained than most. David and his team mates are well aware of the accuracy requirements of the project and work to achieve 24/7/365 stability ("five nines" might ring a bell if you're in the business) from their hardware. I've seen plenty of brand new, stock machines that can't boast the hardware stability they have.
Consistent errors across machines that are running stock as well as those he has identified as over-clocked should indicate there is a different issue here. As for him having to fix Vista (which your post implies) ... that's not his problem, that's Microsoft's ... and besides, nothing but a complete reformat and install of either XP or Linux can fix that. Unfortunately, Vista will be with us for a while yet, bugs and all, so it's left up to the WCG techs to make the projects play nice with it. (Sorry guys. You have my sympathy.)
As for your comments on multi-core machines ... perhaps you haven't noticed but the majority of machines sold these days are multi-core units. Saying the project has issues with them is hardly an excuse.

**Edited for intolerance**tkh
----------------------------------------

Currently being moderated under false pretences
----------------------------------------
[Edit 3 times, last edit by TKH at Feb 6, 2009 1:41:25 PM]
[Feb 5, 2009 9:26:23 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Still producing errors

I've been running rosettaview and it does an excellent job, even monitoring remote hosts and alerting and/or auto-cancelling stalled jobs. Just make sure that its check times are offset from the BOINC disk write timing. Set the latter to e.g. 357 seconds and rv to e.g. exactly 30 minutes, so there is the least chance of a client_state.xml access conflict. This only seems to happen on remote hosts, not the local host, due to slow permissions in the networking, I suppose.
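To illustrate why an "odd" interval such as 357 seconds helps, here is a minimal sketch that treats both activities as idealised fixed-interval timers starting at the same instant (an assumption; the real BOINC client also writes its state on other events, and rosettaview's scheduling may differ). The two timers then coincide once every least common multiple of their intervals. The 360-second value is a hypothetical "round" setting used only for comparison; it does not come from the post.

```python
from math import gcd

def coincidence_period(a_seconds: int, b_seconds: int) -> int:
    """Seconds between moments when two fixed-interval timers fire together (their LCM)."""
    return a_seconds * b_seconds // gcd(a_seconds, b_seconds)

rv_check_interval = 30 * 60  # rosettaview check every 30 minutes, as suggested above

# Suggested setting: BOINC disk write every 357 seconds.
offset_period = coincidence_period(357, rv_check_interval)

# Hypothetical round setting for comparison: disk write every 360 seconds.
aligned_period = coincidence_period(360, rv_check_interval)

print(f"357 s vs 30 min: coincide every {offset_period / 3600:.1f} hours")
print(f"360 s vs 30 min: coincide every {aligned_period / 60:.1f} minutes")
```

Under these assumptions the 357-second setting lines up with the 30-minute check only about once every 59.5 hours, whereas a 360-second setting would line up with every single check.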

http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=20318#213042
----------------------------------------
[Edit 1 times, last edit by Sekerob at Feb 5, 2009 10:02:52 PM]
[Feb 5, 2009 9:48:31 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Still producing errors

Just out: version 6.28 for all 3 main platforms (Linux, Mac, Windows).

http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=24464

I suggest opening a clean thread for reporting issues with this "improved" version. Please make sure to include the relevant platform and client version information.

cheers
----------------------------------------
[Edit 1 times, last edit by Sekerob at Feb 5, 2009 10:26:20 PM]
[Feb 5, 2009 10:24:50 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Still producing errors

CEP (windows) v 6.28 is still producing errors:
E000328_ 820A_ 001x0r008_ 5-- | In Progress | 6/02/09 19:41:40 | 10/02/09 18:44:04 | 0.00 | 0.0 / 0.0
E000328_ 820A_ 001x0r008_ 4-- | Error | 6/02/09 11:05:48 | 6/02/09 19:41:03 | 5.07 | 136.6 / 0.0 <== mine
E000328_ 820A_ 001x0r008_ 3-- | Pending Validation | 6/02/09 05:10:46 | 6/02/09 22:54:27 | 9.62 | 173.5 / 0.0
E000328_ 820A_ 001x0r008_ 2-- | Error | 5/02/09 16:06:14 | 6/02/09 11:03:55 | 7.83 | 153.4 / 0.0
E000328_ 820A_ 001x0r008_ 1-- | Error | 5/02/09 05:03:35 | 5/02/09 16:02:42 | 8.06 | 146.0 / 0.0
E000328_ 820A_ 001x0r008_ 0-- | Error | 5/02/09 05:02:45 | 6/02/09 05:07:52 | 12.42 | 116.1 / 0.0
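For anyone who wants to tally a result listing like this rather than eyeball it, a rough sketch follows. It assumes the six pipe-separated columns are result name, status, sent time, due/return time, CPU hours and claimed/granted credit (the column meanings are inferred, and the function and field names are made up for illustration). The rows are the snapshot quoted above, with the "<== mine" annotation dropped.

```python
from collections import Counter

ROWS = """\
E000328_ 820A_ 001x0r008_ 5-- | In Progress | 6/02/09 19:41:40 | 10/02/09 18:44:04 | 0.00 | 0.0 / 0.0
E000328_ 820A_ 001x0r008_ 4-- | Error | 6/02/09 11:05:48 | 6/02/09 19:41:03 | 5.07 | 136.6 / 0.0
E000328_ 820A_ 001x0r008_ 3-- | Pending Validation | 6/02/09 05:10:46 | 6/02/09 22:54:27 | 9.62 | 173.5 / 0.0
E000328_ 820A_ 001x0r008_ 2-- | Error | 5/02/09 16:06:14 | 6/02/09 11:03:55 | 7.83 | 153.4 / 0.0
E000328_ 820A_ 001x0r008_ 1-- | Error | 5/02/09 05:03:35 | 5/02/09 16:02:42 | 8.06 | 146.0 / 0.0
E000328_ 820A_ 001x0r008_ 0-- | Error | 5/02/09 05:02:45 | 6/02/09 05:07:52 | 12.42 | 116.1 / 0.0
"""

def parse_row(line: str) -> dict:
    """Split one pipe-separated result row into named fields."""
    name, status, sent, due_or_returned, cpu_hours, credit = (
        field.strip() for field in line.split("|")
    )
    claimed, granted = (float(part) for part in credit.split("/"))
    return {
        "name": name,
        "status": status,
        "sent": sent,
        "due_or_returned": due_or_returned,
        "cpu_hours": float(cpu_hours),
        "claimed": claimed,
        "granted": granted,
    }

results = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
status_counts = Counter(r["status"] for r in results)

# Only copies that have been returned can count as errors or successes.
finished = [r for r in results if r["status"] != "In Progress"]
error_share = status_counts["Error"] / len(finished)
wasted_cpu_hours = sum(r["cpu_hours"] for r in results if r["status"] == "Error")

print(dict(status_counts))
print(f"{error_share:.0%} of returned copies errored, {wasted_cpu_hours:.2f} CPU hours lost")
```

For this snapshot, 4 of the 5 returned copies are errors, which adds up to roughly 33 CPU hours with no credit granted.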
Task Manager showed that wcgrid_cep1_6.28_windows_intelx86 ran this WU. Device: Intel quad, Win XP-32 SP3, probably running 1 x CEP, 3 x faah when this WU stopped.
The log file shows that the infamous Error Code 29 (0x1d) Problem occurred. Sek, I too Googled for Windows error codes, and found a free MS utility called err.exe. "The system cannot write to the specified device" seems to be the interpretation of the error code ERROR_WRITE_FAULT defined in winerror.h.
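As a small illustration of that lookup, the sketch below maps the exit code to the winerror.h name and message. The table holds only the single entry discussed in this thread, and the function name is made up for the example. Whether the application's exit code 29 really originates from that Windows error is not established here; the sketch only shows what the number translates to if read as a Win32 system error code.

```python
# Tiny excerpt of Win32 system error codes (winerror.h); only the entry
# discussed in this thread is included, so most codes will be "unknown" here.
WIN32_ERRORS = {
    29: ("ERROR_WRITE_FAULT", "The system cannot write to the specified device."),
}

def describe_exit_code(code: int) -> str:
    """Render an exit code in decimal and hex, with its Win32 name if known."""
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "no entry in this excerpt"))
    return f"exit code {code} (0x{code:x}): {name} - {text}"

print(describe_exit_code(29))
```

On a Windows machine, Python's ctypes.FormatError(29) should return the system's own message text for the same code, which is another way to check without downloading err.exe.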
The fact that the device that crunched copy _3 seems to have proceeded to the end while 4 others (update: 5) stopped could be the result of a system timing problem (or using uninitialised data, etc, etc).
The error occurs on Linux (see widdershins' post below), so it's unlikely that the problem is due to the way that CEP interacts with the operating system.
It has been many weeks since I crunched any CEP WUs, and I see that things don't seem to have improved much. 1 error out of 1 ain't good.
Update: My 2nd recent WU has been declared valid, even though it too got an error, "[ERROR] Failed to open either source or destination files while copying wcgrestart.rst to ...", after 64 min. The other quorum member crunched for similar points and time, so he probably got the same error at the same place. (Update: this error has been described in other posts.) 2 out of 2 (Update: 3) ain't good, either :(
Good post, Haris Dublas (below, next). Thanks.
----------------------------------------
[Edit 6 times, last edit by Rickjb at Feb 11, 2009 2:20:20 AM]
[Feb 7, 2009 6:52:21 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Still producing errors

We cannot always say our systems are rock solid. I have an Athlon 64 3000 (1.8GHz), overclocked to 2.2GHz, that runs COD4, NBA 2k9, and my other games without problems. It also passed a few stress-test programs at that setting, but for some reason I got computation errors on some WUs. I fiddled with some settings in the BIOS (memory timings, etc.) and it now seems to be running fine so far.

I operate an internet cafe and most of the WU errors I get are caused by my cafe management program. Even if we think we have rock solid systems, sometimes there is some flaky software that just doesn't get along with the science apps.
[Feb 7, 2009 4:08:54 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Still producing errors

Project Name: The Clean Energy Project
Created: 2/5/09
Name: E000346_294A_00260n00a
Minimum Quorum: 2
Initial Replication: 2


Result Name | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit
E000346_ 294A_ 00260n00a_ 6-- | In Progress | 2/8/09 06:47:45 | 2/12/09 05:50:09 | 0.00 | 0.0 / 0.0
E000346_ 294A_ 00260n00a_ 5-- | Error | 2/8/09 00:57:57 | 2/8/09 06:17:43 | 5.06 | 101.3 / 0.0
E000346_ 294A_ 00260n00a_ 4-- | In Progress | 2/8/09 00:52:45 | 2/11/09 23:55:09 | 0.00 | 0.0 / 0.0
E000346_ 294A_ 00260n00a_ 3-- | Error | 2/7/09 14:04:55 | 2/8/09 00:57:49 | 7.22 | 76.2 / 0.0
E000346_ 294A_ 00260n00a_ 2-- | Error | 2/7/09 09:27:45 | 2/8/09 00:36:34 | 10.96 | 88.0 / 0.0
E000346_ 294A_ 00260n00a_ 1-- | Error | 2/6/09 18:06:16 | 2/7/09 14:04:40 | 5.33 | 82.1 / 0.0
E000346_ 294A_ 00260n00a_ 0-- | Error | 2/6/09 18:04:29 | 2/7/09 09:23:35 | 7.13 | 61.8 / 0.0

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Encountered error. Exiting.
[Feb 8, 2009 8:48:35 AM]