| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 68
|
|
| Author |
|
|
mike047
Senior Cruncher Joined: Aug 22, 2006 Post Count: 262 Status: Offline Project Badges:
|
THANKS NELS, I have too much time on my hands
---------------------------------------- Most people have a long holiday weekend and deserve it.
mike
Crunch Hard, Crunch Often |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
mike, it is likely that it was a problem with the batch of work units. Each work unit behaves differently, but there is usually a fair amount of similarity in a batch. Still, what the techs want to understand is why it failed on one computer but not on another.
XS_Fr3ak, the nature of the error will dictate whether the work unit is returned as an error or marked invalid. When it is marked invalid, BOINC thinks that everything has gone correctly. The science application has terminated properly, and there is a result file to send back. However, if something goes seriously wrong, and the science application crashes, then BOINC returns an error. A computational error just needs one bit to go astray, and the entire result will be wrong, but the program will almost certainly be unable to detect this. While this is a more common issue on overclocked machines, it can happen to anyone. And since it is such a small error, often diagnostic tools can't see anything wrong with the hardware, since it works perfectly 99.9999% of the time. |
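To illustrate the point about a single bit going astray, here is a minimal sketch. It assumes the result is stored as a 64-bit IEEE 754 double, which the thread does not state, and the helper name `flip_bit_in_double` is hypothetical. It shows that one flipped bit yields another perfectly valid number, so the program has nothing to flag as wrong.

```python
import struct


def flip_bit_in_double(value: float, bit_index: int) -> float:
    """Return `value` with one bit (0 = least significant) of its 64-bit encoding flipped."""
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the chosen bit and reinterpret the pattern as a double again.
    return struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit_index)))[0]


original = 1.0
tiny_error = flip_bit_in_double(original, 0)    # lowest mantissa bit: almost 1.0
half = flip_bit_in_double(original, 52)         # lowest exponent bit: 0.5
negated = flip_bit_in_double(original, 63)      # sign bit: -1.0
```

All three corrupted values are ordinary floats, which is why the error only shows up when the result is compared against one computed elsewhere.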
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi Meshmesh, Mike, et al.
I'm curious about meshmesh's team result problems, and we haven't heard back since the initial post. Mike, I think that your problem will be the power outages. If BOINC is writing a checkpoint when you have a power outage it will return an invalid result. Four invalids from all of your machines during a power outage would seem typical, I should think. That doesn't answer the original question from meshmesh and team, though, unless they all live in the same power district. Cheers. ozylynx |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just so people don't get the idea that we're being unresponsive, the WCG Tech team is aware of this thread. It is the Thanksgiving Holiday today, though, and most of the team is on vacation until Monday. We'll all have a closer look at the results at that time. Thanks for providing the device IDs for the machines. Have a happy holiday for those to whom it applies.

Thank you for replying to my post. I am aware that it is the holidays, and do not expect a response before some time next week. So thank you very much for taking the time today to reply. As for device IDs, I will edit the first post as soon as I get them. Probably during the weekend.

@everybody: after reading the different posts in this thread, may I request that everyone refrain from speculating on the causes. Only participants and project admins have direct access to the result reports. If participants feel that the result reports do not show any abnormality, and they know that the WUs ran to completion, then it is up to the project admins to inspect them for further analysis when their work schedule permits. These machines are run by experienced participants, so please, no need to state the obvious. My post above was directed to the admins for exactly this reason.

Have a nice weekend everyone, and thank you very much.

[Edit 4 times, last edit by Former Member at Nov 23, 2006 8:32:03 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi meshmesh. You forgot to close the quote, that's why there's a gray bar down the side of the page. Can you edit it please?
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Sorry Didactylos. Not familiar with this software. I realise the problem and tried to edit the post three times. I think I got it this time. Thanks.
----------------------------------------[Edit 1 times, last edit by Former Member at Nov 23, 2006 8:32:48 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Sometimes you get logged out and can't edit. Log back in, and the pencil icon will be back. When you edit a post, it's the "Edit a post" button that submits your edit. Weird, but there you go.
|
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges:
|
Meshmesh,
These could be showing as invalid for a number of reasons. It could be that when uploading the result files from those machines the files didn't get transferred properly. With HDC all 8 output files need to match all 8 from another result to get proper validation. If one does not match, the entire result is considered invalid. Also, when you get more information on those machines we'll look into it more. I looked at the host associated with your member name and it has 100% validation. -Uplinger |
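The all-or-nothing rule described above can be sketched as follows. This is only an illustration, not the actual HDC validator: it assumes a byte-for-byte comparison (the real check may use tolerances or a different notion of "match"), and the names `results_match` and `EXPECTED_FILE_COUNT` are hypothetical.

```python
from typing import Sequence

EXPECTED_FILE_COUNT = 8  # per the post: an HDC result consists of 8 output files


def results_match(result_a: Sequence[bytes], result_b: Sequence[bytes]) -> bool:
    """Return True only if both results have 8 files and every pair is identical."""
    if len(result_a) != EXPECTED_FILE_COUNT or len(result_b) != EXPECTED_FILE_COUNT:
        return False
    # A single mismatching file makes the whole result fail validation.
    return all(a == b for a, b in zip(result_a, result_b))
```

Under this assumption, a result with seven good files and one file damaged during upload is treated the same as a result where every file is wrong.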
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
hi mike, just a few thoughts to add to yours
----------------------------------------
This unit had previously run QMC/Leiden for several months without issue. It is an Opteron on a good quality board.

I know nothing about Leiden but I am pretty sure QMC does not use a quorum to validate results. In other words, that computer might have been returning an occasional "bad" result but QMC never told you and just gave you the credits you claimed. By "bad" I mean a result that processed without BOINC detecting any errors on your end but which had errors nevertheless. BOINC cannot detect all errors and that's why some projects use a quorum. Some projects, and I think QMC is one, are able to use methods other than a quorum and some are simply able to use results that have minor errors. Perhaps WCG has more stringent error checking than you are used to, so now you're seeing more invalid results than you are used to. There is no doubt you're using good quality gear, but even good quality stuff can deteriorate to the point where it fails occasionally under extreme conditions. Crunching FAAH is definitely an acid test.

So, it seems the wu is actually "defective" and is not invalid due to points claim. Am I understanding the process properly??

Defective FAAH work units have been very few and very far between. By defective I mean a work unit that fails on every machine that crunches it. The work units you are asking about are not defective; there was an error during crunching, an error BOINC did not detect, an error that showed up when compared with the other results in the quorum.

Just trying to get a handle on this to better understand and maybe fix the problem.

That's the spirit. In my experience some problems just go away on their own and you never find the cause. Other problems take a long time and special tools to track down. Some can be fixed by a lengthy and sometimes expensive process of elimination where you swap out various components until you get lucky and swap out the one causing the problem and then you get to say, "Aha! It was the @$#! torque rotor. I had a hunch it was the torque rotor." The worst is where you actually have 2 bad components but you're convinced you have only 1 bad.

I have 3 boxes with invalid units, one on each and two on the example, out of 43 boxes that is fairly good

Yah, it is a pretty low invalid rate, if I'm reading your numbers correctly. And it might get even better if you wait a bit.

Would a power outage cause a wu completion to be invalid?? The power has gone out several times lately.

That is definitely a possible cause. The best quality system in the world will not crunch properly if it isn't powered properly. It's not just power outages you need to worry about; you also need to handle fluctuations in the voltage. All of your machines need to be protected by a good quality spike protector and noise filter. If they are not, then the components inside the box will deteriorate and you'll get errors. If you can't afford a UPS then at least get spike protection and noise filters, and make sure all of your mains receptacles are grounded properly, else the spike protection and noise filters will not work optimally.

--- [Edit 2 times, last edit by Former Member at Nov 23, 2006 8:59:19 PM] |
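The idea of an error that only shows up when results are compared within a quorum can be sketched like this. It is a simplified majority vote and not BOINC's or WCG's actual validator; the function name `validate_quorum`, the host names, the use of a string digest per result, and the `min_agree` threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def validate_quorum(digests: Dict[str, str], min_agree: int = 2) -> Tuple[List[str], List[str]]:
    """Split hosts into (valid, invalid) by majority agreement on a result digest."""
    if not digests:
        return [], []
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes < min_agree:
        # No consensus yet, so nothing can be judged valid or invalid.
        return [], []
    valid = sorted(host for host, d in digests.items() if d == winner)
    invalid = sorted(host for host, d in digests.items() if d != winner)
    return valid, invalid


# Hypothetical example: three hosts crunch the same work unit.
valid, invalid = validate_quorum({"host_a": "9f2c", "host_b": "9f2c", "host_c": "71d0"})
# valid == ["host_a", "host_b"], invalid == ["host_c"]
```

In this sketch the work unit itself is fine, since two hosts agree on it; only the odd result out is marked invalid, which matches the distinction drawn above between a defective work unit and an undetected crunching error.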
||
|
|
mike047
Senior Cruncher Joined: Aug 22, 2006 Post Count: 262 Status: Offline Project Badges:
|
Hi Roan,
----------------------------------------Thanks for your input. I use spike suppressors on all my units, but UPSes for all 43 would cost as much as building another 5-6 boxes or more. All my wiring is up to par, but I can't control the power company. I am sure in my case it will not be a long term thing, but I wanted to know more and be sure. I've received an abundance of very good help and information from all that posted, and appreciate it. Hopefully it will help some others, also.
mike
Crunch Hard, Crunch Often |
||
|
|
|