Thread Status: Active
Total posts in this thread: 68
This topic has been viewed 6891 times and has 67 replies
mike047
Senior Cruncher
Joined: Aug 22, 2006
Post Count: 262
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

THANKS NELS, I have too much time on my hands blushing biggrin

Most people have a long holiday weekend and deserve it.
----------------------------------------
mike
Crunch Hard, Crunch Often


[Nov 23, 2006 6:59:19 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

mike, it is likely that it was a problem with the batch of work units. Each work unit behaves differently, but there is usually a fair amount of similarity in a batch. Still, what the techs want to understand is why it failed on one computer but not on another.

XS_Fr3ak, the nature of the error will dictate whether the work unit is returned as an error or marked invalid. When it is marked invalid, BOINC thinks that everything has gone correctly. The science application has terminated properly, and there is a result file to send back. However, if something goes seriously wrong, and the science application crashes, then BOINC returns an error.

A computational error just needs one bit to go astray, and the entire result will be wrong, but the program will almost certainly be unable to detect this. While this is a more common issue on overclocked machines, it can happen to anyone. And since it is such a small error, often diagnostic tools can't see anything wrong with the hardware, since it works perfectly 99.9999% of the time.
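
To illustrate the point about a single stray bit, here is a minimal sketch. The result payload and the `flip_bit` helper are hypothetical and have nothing to do with the real science application; the sketch only shows that one inverted bit yields a result that no longer matches byte for byte, even though nothing crashed.

```python
import hashlib
import struct


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical result payload: three doubles packed as little-endian binary.
good = struct.pack("<3d", -7.25, 1.5, 300.0)
bad = flip_bit(good, 5)  # a single bit goes astray

good_digest = hashlib.sha256(good).hexdigest()
bad_digest = hashlib.sha256(bad).hexdigest()

print(struct.unpack("<3d", good))
print(struct.unpack("<3d", bad))
print(good_digest == bad_digest)  # False: the two results no longer match
```

The program that produced `bad` would have no way of knowing anything went wrong; the mismatch only shows up when the result is compared against another copy.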
[Nov 23, 2006 7:11:31 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

Hi Meshmesh, Mike, et al.

I'm curious about the result problems on meshmesh's team; we haven't heard back since the initial post.

Mike, I think that your problem will be the power outages. If BOINC is writing a checkpoint when the power goes out, it will return an invalid result. Four invalids across all of your machines during a power outage would seem typical, I should think.
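
As an illustration of why an interrupted checkpoint write matters, here is a minimal sketch of one common way to guard against it: write to a temporary file, flush it to disk, then rename it over the old checkpoint. This is an assumption for illustration only, not a description of how BOINC or the science application handles checkpoints; `write_checkpoint` and `read_checkpoint` are hypothetical names.

```python
import json
import os
import tempfile


def write_checkpoint(path: str, state: dict) -> None:
    """Write state so that path always holds either the old or the new checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path: str):
    """Return the saved state, or None if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


# Usage in a throwaway directory
with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    write_checkpoint(ckpt, {"step": 1})
    write_checkpoint(ckpt, {"step": 2})
    print(read_checkpoint(ckpt))
```

If the power fails before the rename, the previous checkpoint is still intact; a half-written file written in place is what would leave the work unit with bad state.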

That doesn't answer the original question from meshmesh and team unless they all live in the same power district though.

Cheers. ozylynx smile
[Nov 23, 2006 7:27:59 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.


Just so people don't get the idea that we're being unresponsive, the WCG Tech team is aware of this thread. It is the Thanksgiving Holiday today, though, and most of the team is on vacation until Monday. We'll all have a closer look at the results at that time.

Thanks for providing the device IDs for the machines.

Have a happy holiday for those to whom it applies.



Thank you for replying to my post. I am aware that it is the holidays, and do not expect a response before some time next week. So thank you very much for taking the time today to reply.

As for device IDs, I will edit the first post as soon as I get them. Probably during the weekend.

@everybody: after reading the different posts in this thread, may I request that everyone refrain from speculating on the causes? Only participants and project admins have direct access to the result reports. If participants feel that the result reports do not show any abnormality, and they know that the WUs ran to completion, then it is up to the project admins to inspect them for further analysis when their work schedule permits. These machines are run by experienced participants, so there is no need to state the obvious. My post above was directed to the admins for exactly this reason.

Have a nice weekend everyone, and thank you very much.
----------------------------------------
[Edit 4 times, last edit by Former Member at Nov 23, 2006 8:32:03 PM]
[Nov 23, 2006 7:37:31 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

Hi meshmesh. You forgot to close the quote, that's why there's a gray bar down the side of the page. Can you edit it please?
[Nov 23, 2006 8:16:11 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

Sorry Didactylos. I'm not familiar with this software. I realise the problem and tried to edit the post three times. I think I got it this time. Thanks.
----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 23, 2006 8:32:48 PM]
[Nov 23, 2006 8:30:06 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

Sometimes you get logged out and can't edit. Log back in, and the pencil icon will be back. When you edit a post, it's the "Edit a post" button that submits your edit. Weird, but there you go.
[Nov 23, 2006 8:32:51 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

Meshmesh,

These could be showing as invalid for a number of reasons. It could be that when the result files were uploaded from those machines, they didn't get transferred properly. With HDC, all 8 output files need to match all 8 from another result to get proper validation. If one does not match, the entire result is considered invalid.
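
The all-or-nothing rule described above can be sketched as follows. This is a simplified illustration, assuming a byte-for-byte comparison via hashes; the actual WCG validator may compare the files differently (for example, with numeric tolerances), and `results_match` is a hypothetical name.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Fingerprint one output file's contents."""
    return hashlib.sha256(blob).hexdigest()


def results_match(result_a, result_b, expected_files=8):
    """True only if both results have all expected files and every pair matches."""
    if len(result_a) != expected_files or len(result_b) != expected_files:
        return False
    return all(digest(a) == digest(b) for a, b in zip(result_a, result_b))


# Hypothetical results: eight small output files each
host_1 = [bytes([i]) * 4 for i in range(8)]
host_2 = list(host_1)
print(results_match(host_1, host_2))  # True

host_2[3] = b"\x00\x00\x00\x01"  # one file damaged in transfer
print(results_match(host_1, host_2))  # False: one mismatch fails the whole result
```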

Also, when you get more information on those machines we'll look into it more. I looked at the host associated with your member name and it has 100% validation.

-Uplinger
[Nov 23, 2006 8:38:40 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

hi mike, just a few thoughts to add to yours

This unit had previously run QMC/Leiden for several months without issue. It is an Opteron on a good quality board.


I know nothing about Leiden but I am pretty sure QMC does not use a quorum to validate results. In other words, that computer might have been returning an occasional "bad" result but QMC never told you and just gave you the credits you claimed. By "bad" I mean a result that processed without BOINC detecting any errors on your end but which had errors nevertheless. BOINC cannot detect all errors and that's why some projects use a quorum. Some projects, and I think QMC is one, are able to use methods other than a quorum and some are simply able to use results that have minor errors. Perhaps WCG has more stringent error checking than you are used to, so now you're seeing more invalid results than you are used to.

There is no doubt you're using good quality gear but even good quality stuff can deteriorate to the point where it fails occasionally under extreme conditions. Crunching FAAH is definitely an acid test.

So, it seems the wu is actually "defective" and is not invalid due to points claim. Am I understanding the process properly??


Defective FAAH work units have been very few and very far between. By defective I mean a work unit that fails on every machine that crunches it. The work units you are asking about are not defective: there was an error during crunching, an error BOINC did not detect, an error that showed up when compared with the other results in the quorum.
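
As a rough illustration of how a quorum exposes an error that no single machine can detect, here is a minimal sketch. It assumes each host's result has been reduced to a comparable fingerprint string; `validate_quorum`, the host names, and the `min_agree` threshold are hypothetical and not taken from the actual WCG validator.

```python
from collections import Counter


def validate_quorum(results: dict, min_agree: int = 2) -> dict:
    """Map each host to 'valid', 'invalid', or 'inconclusive'."""
    if not results:
        return {}
    ranked = Counter(results.values()).most_common(2)
    winner, votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == votes
    if votes < min_agree or tied:
        # Not enough agreement yet: no host can be judged either way.
        return {host: "inconclusive" for host in results}
    return {
        host: "valid" if value == winner else "invalid"
        for host, value in results.items()
    }


# Hypothetical fingerprints returned by three hosts for the same work unit
print(validate_quorum({"host_a": "abc123", "host_b": "abc123", "host_c": "abd123"}))
# {'host_a': 'valid', 'host_b': 'valid', 'host_c': 'invalid'}
```

Here `host_c` finished without any error on its own end, yet it is marked invalid because it disagrees with the other two. A work unit that failed on every host would never reach agreement at all, which is the "defective" case.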

Just trying to get a handle on this to better understand and maybe fix the problem.


That's the spirit. In my experience some problems just go away on their own and you never find the cause. Other problems take a long time and special tools to track down. Some can be fixed by a lengthy and sometimes expensive process of elimination where you swap out various components until you get lucky and swap out the one causing the problem and then you get to say, "Aha! It was the @$#! torque rotor. I had a hunch it was the torque rotor." The worst is where you actually have 2 bad components but you're convinced you have only 1 bad.

I have 3 boxes with invalid units, one on each and two on the example, out of 43 boxes that is fairly good smile


Yah, it is a pretty low invalid rate, if I'm reading your numbers correctly. And it might get even better if you wait a bit.

Would a power outage cause a wu completion to be invalid?? The power has gone out several times lately.


That is definitely a possible cause. The best quality system in the world will not crunch properly if it isn't powered properly. It's not just power outages you need to worry about, you also need to handle fluctuations in the voltage. All of your machines need to be protected by a good quality spike protector and noise filter. If they are not then the components inside the box will deteriorate and you'll get errors. If you can't afford a UPS then at least get spike protection and noise filters and make sure all of your mains receptacles are grounded properly else the spike protection and noise filters will not work optimally.

---
----------------------------------------
[Edit 2 times, last edit by Former Member at Nov 23, 2006 8:59:19 PM]
[Nov 23, 2006 8:53:56 PM]
mike047
Senior Cruncher
Joined: Aug 22, 2006
Post Count: 262
Status: Offline
Re: Problem: Invalid Working Units in large numbers. Please help.

Hi Roan,

Thanks for your input. I use spike suppressors on all my units, but the cost of UPSes for 43 boxes would build another 5-6 boxes or more biggrin All my wiring is up to par, but I can't control the power company sad

I am sure in my case it will not be a long term thing but I wanted to know more and be sure.

I've received an abundance of very good help and information from all that posted, and appreciate it.

Hopefully it will help some others, also.
----------------------------------------
mike
Crunch Hard, Crunch Often


[Nov 23, 2006 9:11:34 PM]