| World Community Grid Forums
Thread Status: Active Total posts in this thread: 89
pvh513
Senior Cruncher Joined: Feb 26, 2011 Post Count: 260 Status: Offline Project Badges:
|
I too see far too many WUs error out. I currently have 21 WUs with an error status in the database. Six of those were axed because of the 18 hour deadline. These WUs simply cannot be computed in 18 hours on my AMDs. So that deadline clearly needs to be extended. When I look at the wingmen, I often see that these WUs already errored out 3/4/5 times. I estimate my failure rate to be 1 in 6 assuming all WUs still in PV jail will validate, so in reality it is likely going to be worse. This is unacceptably inefficient, so I will disable this project for now.
|
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
Quote: I too see far too many WUs error out. I currently have 21 WUs with an error status in the database. Six of those were axed because of the 18 hour deadline. These WUs simply cannot be computed in 18 hours on my AMDs. So that deadline clearly needs to be extended. When I look at the wingmen, I often see that these WUs already errored out 3/4/5 times. I estimate my failure rate to be 1 in 6 assuming all WUs still in PV jail will validate, so in reality it is likely going to be worse. This is unacceptably inefficient, so I will disable this project for now.
The way I look at it is that efficiency is hard for us crunchers to measure. If the scientists think they are getting good results (and "failure" may be a useful thing to know), then that is OK with me. I am a little concerned about unnecessary redundancy (non-zero quorum), but that is another matter that can be dealt with if they decide to do so. A lot of projects require a non-zero quorum all the time, and it is just some of the time with CEP2.
[Edit 1 times, last edit by Jim1348 at Sep 7, 2014 9:17:06 AM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I'm seeing the same thing as pvh513. I see 0x100 in step 6 counted as an error in some cases and as valid in others. Due to the excessive number of unexplained errors and the lack of response to the issues raised in multiple threads, I, too, am disabling this project.
|
||
|
|
johncmacalister2010@gmail.com
Veteran Cruncher Canada Joined: Nov 16, 2010 Post Count: 799 Status: Offline Project Badges:
|
I think this research is very important and needs to be advanced. The current issues will, I am confident, be resolved and I will continue processing. Answers to the questions raised will follow in due course, as they have up to now. We need efficient, affordable solar power, not because of any 'green' considerations, but rather because there is an almost unlimited stream of energy pouring in from the sun and it is a shame to ignore such a source of energy.
---------------------------------------- crunching, crunching, crunching. AMD Ryzen 5 2600 6-core Processor with Windows 11 64 Pro. AMD Ryzen 7 3700X 8-Core Processor with Windows 11 64 Pro (part time) |
||
|
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline Project Badges:
|
Bad and frustrating situation!
----------------------------------------
E225138_ 972_ S.364.C35H19N9O1S4.CAGKRHSPAJDXPZ-UHFFFAOYSA-N.9_ s1_ 14_ 2-- Error 9/6/14 15:54:39 9/7/14 10:51:39 9.32 / 9.52 225.4 / 0.0
Since I have not checked my results regularly for the past week, I assume that I have had more errored WUs.
Cheers, Yves |
||
|
|
pvh513
Senior Cruncher Joined: Feb 26, 2011 Post Count: 260 Status: Offline Project Badges:
|
Quote: The way I look at it is that efficiency is hard for us crunchers to measure. If the scientists think they are getting good results (and "failure" may be a useful thing to know), then that is OK with me.
I don't think that the scientists learn anything from error results. These are simply thrown away and never looked at again. In other words these error results are a complete waste of electricity. Of all people, these scientists should be more sensitive to this issue...
Yes, CEP2 is a worthy cause, but the current failure rate is simply not acceptable. My current status in the database is 15 WUs error, 17 WUs in PV, and only 10 WUs valid. So the valid WUs are now a minority... In the most optimistic case, if all WUs in PV validate (which I doubt) the failure rate would still be a whopping 1 in 3. I can tolerate some failure rate, but this is unacceptable. There are other worthy causes in BOINC that make more efficient use of my hardware and electricity. I will crunch those until this is fixed. |
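As a rough check on the arithmetic in this post, here is a minimal Python sketch that bounds the failure rate from the three counts quoted (15 error, 17 pending validation, 10 valid). The function name is hypothetical, and the sketch assumes every pending WU eventually ends up either valid or in error; it is not a tool that WCG or BOINC provides.

```python
def failure_rate_bounds(errors, pending, valid):
    """Return (best_case, worst_case) failure rates as fractions.

    Assumption: each pending (PV) work unit ends up either valid or in error.
    Best case: every pending WU validates. Worst case: every pending WU fails.
    """
    total = errors + pending + valid
    if total == 0:
        raise ValueError("no work units to evaluate")
    best_case = errors / total
    worst_case = (errors + pending) / total
    return best_case, worst_case


# Counts quoted in the post: 15 error, 17 in PV, 10 valid.
best, worst = failure_rate_bounds(errors=15, pending=17, valid=10)
print(f"best case:  {best:.1%} (about 1 in {1 / best:.1f})")   # best case:  35.7% (about 1 in 2.8)
print(f"worst case: {worst:.1%}")                              # worst case: 76.2%
```

With these counts the optimistic bound works out to 15 of 42, which matches the "1 in 3" figure quoted in the post.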
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
Quote: I don't think that the scientists learn anything from error results. These are simply thrown away and never looked at again. In other words these error results are a complete waste of electricity. Of all people, these scientists should be more sensitive to this issue...
But the "errors" here are not machine errors where the results are garbage. It seems that they are cases where the calculations blow up (i.e., do not converge to a useful result), presumably due to the initial conditions. It might be useful to know what initial conditions cause this. At any rate, they cannot at present predict what these conditions are, or they would avoid them already.
So the real question is: how useful are the results? If they are more useful than running calculations that always converge, then it makes sense to run the ones that sometimes blow up. (Rockets used to sometimes blow up, as you may remember, but you can't get to the moon with a plane.) |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi, I will reply briefly to this (I am about to go into a bunch of meetings) and reply more fully later.
The errors that you are seeing here are mainly caused when the job times out. We are discussing at the moment with IBM whether we can call these jobs 'valid' or not. I think that since you guys have put the compute in, we probably should; but there may be technical reasons why we would not. I will keep you in the loop on this.
With regard to the 'predictability' of the failed jobs: we can get some idea of how long a job will take from the number of electrons, but this is only strictly true of a single point calculation. The geometry optimization is entirely dependent upon how close the initial guess is to the final geometry (i.e. how many steps it takes to get there). I have done a lot of work improving this, but it is something that is very tough to tell just by looking at the geometry itself.
Yes, some of the molecules that we are using now are slightly more difficult, but for this project to be worthwhile, we have to be constantly pushing at the boundaries, and trying new things. Sometimes that will mean a few more errors - and I try my best to minimise them as much as possible; but would you really prefer to spend your crunching time on something that was safe but boring? I strongly believe that the new library will result in many more promising molecules, which can go a long way to achieving our goal of developing efficient organic solar cells.
Your Harvard CEP Team |
||
|
|
deltavee
Ace Cruncher Texas Hill Country Joined: Nov 17, 2004 Post Count: 4894 Status: Offline Project Badges:
|
Quote: Yes, some of the molecules that we are using now are slightly more difficult, but for this project to be worthwhile, we have to be constantly pushing at the boundaries, and trying new things. Sometimes that will mean a few more errors - and I try my best to minimise them as much as possible; but would you really prefer to spend your crunching time on something that was safe but boring?
Thanks for the explanation. From now on I will think of the errors as "pushing at the boundaries." Bring 'em on. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The errors themselves don't bother me, but the apparent inconsistency does. For example, a 0x100 error in step 6 is sometimes an error and sometimes not. What's up with that? One of the main reasons I disabled this project was the lack of communication. No explanation, no crunching...
|
||
|
|
|