World Community Grid Forums
Thread Status: Active | Total posts in this thread: 68
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Quote:
DCManiak, I just looked at some of your computers. Most are doing just fine and returning lots of valid results. However, you have at least one that is returning almost entirely invalid results, such as 'dfiado'. Use the filter by device name to check each computer.
Kevin

Thanks for looking into it some, Kevin. I appreciate it. It appears I have two main "offenders": 176 total invalids as of now, not 298 as I had stated earlier (bad math on my part... :) ). I had also noticed that "dfiado" had a lot, 95 to be exact. The breakdown by machine:

"dfiado" - 95 invalids
"opty175" - 66 invalids
"24xeon" - 2
"biox23800" - 1
"ttclover" - 5
"pcdljeff" - 3
"p5bWL" - 1
"DOTHAN" - 1
"machultd" - 1
"zippy" - 1

So it looks like I have two main offenders: "dfiado" and the machine I'm on right now, "opty175". What can be done to get these two machines to get with the program? Looking at some of the quorums these two machines are in, the rigs don't seem to be overclaiming relative to the other members of the quorum. They often finish WUs faster, but not always; in some cases the times and claims are very close. They are pretty fast machines, but not extremely so.

What can I do to change things so these two machines don't get "invalid"s? I mean, what do you see in their results that would indicate invalid? The results are good or they wouldn't be accepted, yet they aren't getting full credit. What do you suggest? Should I just put them on another project, or is there a way to get them in line? Point me in the right direction, if you would.

Sincerely, Scott
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Would it be possible for you to give us a little insight into how invalid claims are being determined?
We are running a lot of PCs that are faster than average, so that might be an issue for the algorithm.
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
As a rule of thumb, your computer is only likely to be penalised if it is claiming considerably more than twice the granted credit. The actual algorithm is a lot more complicated, and I don't believe WCG want to publish full details, to discourage people from trying to game the system.

Really fast computers shouldn't claim excessively. If anything, they sometimes underclaim a little. If the benchmark is accurate, then each computer will claim exactly the same no matter how fast or slow a particular work unit was. Weaknesses in the benchmarking do cause some variation, but that doesn't normally affect the penalty system and isn't generally worth worrying about.

Computers that are producing large numbers of invalid results nearly always have a hardware issue, or sometimes a software conflict. The first thing to do is a general health check: remove any overclocking, check that the memory is matched, and look for known software conflicts.
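For anyone who finds code easier to read than prose, here is a minimal sketch of the "more than twice the granted credit" rule of thumb above. It is purely illustrative: the threshold constant and the function name are assumptions on my part, not WCG's published algorithm.

```python
# Illustrative sketch only: the real WCG validator/penalty algorithm is not
# published. This just encodes the "considerably more than twice the granted
# credit" rule of thumb, with the threshold as an assumption.

OVERCLAIM_FACTOR = 2.0  # assumed threshold from the rule of thumb

def looks_like_overclaim(claimed_credit: float, granted_credit: float) -> bool:
    """Return True if a claim is suspiciously high relative to what was granted."""
    if granted_credit <= 0:
        return False  # nothing to compare against
    return claimed_credit > OVERCLAIM_FACTOR * granted_credit

# Example: a result claiming 175 points but granted 40 would be flagged,
# while 88 claimed against 80 granted would not.
print(looks_like_overclaim(175, 40))  # True
print(looks_like_overclaim(88, 80))   # False
```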
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Quote:
I mean what do you see in their results that would indicate invalid?

Human eyes have considerable trouble seeing the thing that is causing the invalid status. It's spotted by the validator software at WCG. If you could compare your invalid results with the 3 that were declared valid, you would see that your results do not match the other 3. The valid results all match each other, so yours is considered to have errors and is declared invalid.

Quote:
The results are good or they wouldn't be being accepted.

It depends what you mean by "good" and "accepted". When WCG "accepts" our results, that simply means our end did not detect any crunch errors. Unfortunately, our end cannot detect all errors. So when WCG "accepts" a result, the result is considered to be "good" only to the extent that it might prove to be error free when compared with 2 other results. If it does match 2 others, then and only then does it receive valid status.

Quote:
Should I just put them on another project or is there a way to get them in line?

Putting them on another project that uses a quorum of 3 like WCG will likely just get you more invalid results. On the other hand, it depends how closely they compare the 3 results: some projects do a bitwise compare while others do a less stringent compare. As Didactylos said, run the diagnostics and see what comes up. It could be any one of a number of things, and it may take a lot of work to find the problem. The problem is not discernible from error reports because your results are not showing errors, yet we know there are errors because they do not match 3 other results. It could even be something as tricky as an intermittent fault in a NIC or cable which introduces an error into the result when it uploads to WCG.

[Edit 1 times, last edit by Former Member at Nov 29, 2006 3:02:39 AM]
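A rough sketch of the quorum idea described above, purely for illustration. The `matches` test and numeric results are placeholders: WCG's real validator compares result files in ways that are not fully published, and as noted above some projects compare bitwise while others allow a tolerance.

```python
# Illustrative sketch of majority-style quorum validation, not WCG's actual
# validator. 'results' is a list of numeric outputs; 'matches' stands in for
# the project's real comparison (bitwise or within a tolerance).

def matches(a: float, b: float, tol: float = 1e-6) -> bool:
    """Placeholder comparison; real projects define their own equivalence."""
    return abs(a - b) <= tol

def validate_quorum(results: list[float], min_quorum: int = 3) -> list[bool]:
    """Mark each result valid if it agrees with a majority of the quorum."""
    verdicts = []
    for i, r in enumerate(results):
        agreeing = sum(1 for j, other in enumerate(results)
                       if i != j and matches(r, other))
        # A result is valid if, counting itself, it belongs to a group at
        # least as large as a majority of the minimum quorum.
        verdicts.append(agreeing + 1 >= (min_quorum // 2) + 1)
    return verdicts

# Three results that agree plus one outlier: the outlier is flagged invalid.
print(validate_quorum([42.00000001, 42.0, 41.99999999, 57.3]))
# [True, True, True, False]
```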
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I was only wondering because the majority of invalid claims came up after the credit system was changed, and I don't think decreasing ambient temps made all the PCs unstable overnight ;)

I am trying to understand how the credits are being calculated. The BOINC benchmark seems to play a main part in this calculation, which is bad, as it doesn't make use of modern CPUs' instruction set extensions. Furthermore, you can't see what CPU someone is running from the WCG stats, which is why I am wondering whether the CPU type is somehow involved anyway, or whether it's simply based on "CPU A computed the WU in x hours and claimed y points, and CPU B also needed x hours but claimed 2y points".

If the CPU type is being looked at for the calculation, there are also some issues with overclocked CPUs, because all you get to know via BOINC is the CPU type and its stock speed. I myself am running a Sempron 2800+, which has 1.6 GHz at stock. I have been running it at 2.8 GHz for more than a year now, crunching 24/7. Together with a tweaked OS and RAM settings, that little thing is a lot faster than it was at stock speed. Now, if the claims of my PC are being compared to a 2800+ running at stock, it's not unlikely my PC could claim 100% higher points every now and then (using the standard client, of course), which would, according to your statement, result in an invalid claim, although there is no faulty calculation, nor are optimised clients being used. I hope you get the point I wanted to make.

Furthermore, it would be interesting to know if WCG makes use of at least SSE2 or any other enhancements, not for the sake of points, but for the science.
----------------------------------------
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Quote:
Furthermore, it would be interesting to know if WCG makes use of at least SSE2 or any other enhancements, not for the sake of points, but for the science.

On the present generation there is, to my knowledge, no special CPU capability recognition, which future BOINC versions will have. Since it is important that all calculations in a quorum are performed in an identical way, and absent that extra feature recognition, the project cannot risk using SSE2 on one machine in the quorum when another machine in the quorum lacks SSE2, with the distribution going, say, one copy to a P3, the second to a P4 and the third to a C2D. Rom Walton wrote an article about the blessing/curse of this, parts 1 & 2:

http://www.romwnet.org/dasblogce/PermaLink,gu...2d-ba21-5bb6b310b295.aspx
http://www.romwnet.org/dasblogce/PermaLink,gu...55-824d-057757002768.aspx

Here is a quote of a June post by Lawrencehardin:

Rom Walton has recently posted some of the upgrades under development for BOINC here: http://www.romwnet.org/dasblogce/ People who are always wondering about program optimization will be particularly happy about:

BOINC gains CPU capability detection
Starting in the next version of the BOINC client we'll be able to detect CPU capabilities. It is important to note that the capability detection is actually done by the operating system and BOINC just queries the operating system for the supported instruction sets. I bring this up because not all operating systems fully support all additional instruction sets supported by the processor. We are being conservative here to avoid illegal instruction exceptions or privileged instruction exceptions. For Windows the following instruction sets or capabilities can be detected:
* fpu
* tsc
* pae
* nx
* sse
* sse2
* sse3
* 3dnow
* mmx
On Linux we read the data out of /proc/cpuinfo. I still need to write the code for the Mac OS. The processor information will be passed to both the science applications and the scheduling server.

The problem with optimizing for a particular processor type is that it can alter the results a small amount, which is a no-no for quorum validation, unless all the results are run on the same processor type. So this gives projects some options. Of course, whether or not a particular optimization is reasonable is a different question. But at least the possibility is there. And look at the simplified GUI. Lawrence

BOINC development is going towards making quorums go, e.g., all to C2D. I don't know, but the logic and management at the server level will be a whole lot more complicated, as it already needs to manage different sets for different OSes like Windows, Linux, Mac... hence the occasional message like "No work available for your platform".
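For the curious, here is a minimal sketch of the Linux side of what the quoted article describes, i.e. reading CPU capability flags out of /proc/cpuinfo. It is not BOINC's actual code, just an illustration; the flag names checked are the ones listed above, and it only works on Linux.

```python
# Minimal illustration of CPU capability detection on Linux by parsing
# /proc/cpuinfo, as the quoted article describes. Not BOINC's implementation.

WANTED_FLAGS = {"fpu", "tsc", "pae", "nx", "sse", "sse2", "sse3", "3dnow", "mmx"}

def detect_cpu_capabilities(path: str = "/proc/cpuinfo") -> set[str]:
    """Return the subset of WANTED_FLAGS reported on the first CPU's 'flags' line."""
    with open(path) as f:
        for line in f:
            if line.lower().startswith("flags"):
                reported = set(line.split(":", 1)[1].split())
                return WANTED_FLAGS & reported
    return set()

if __name__ == "__main__":
    print(sorted(detect_cpu_capabilities()))
```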
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Nov 29, 2006 9:27:53 AM]
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I must stress that WCG do not use the CPU type or speed in the credit calculation. Claimed credit is based solely on the benchmark and CPU time. A faster computer should have a higher benchmark and a lower CPU time; on average, with some known anomalies, this is true.

Known issues with the benchmark are:
- underestimation on Linux
- a small AMD/Intel inconsistency
- no allowance for memory speed or paging

BOINC are working on improving the benchmark, and WCG are looking at ways to bypass it entirely if possible.
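To make "based solely on the benchmark and CPU time" concrete, here is a sketch of the classic BOINC-style claimed-credit formula as it is commonly described: 100 credits per day of CPU time on a reference machine benchmarking 1000 MFLOPS Whetstone and 1000 MIPS Dhrystone. The constants and function are my illustration of that convention, not something stated in this thread, and any WCG-specific adjustments are outside this sketch.

```python
# Sketch of the classic BOINC-style claimed-credit calculation, as commonly
# described: credit scales with CPU time and the host's benchmark scores.
# Constants reflect the usual "cobblestone" reference (100 credits per day on
# a 1000 MFLOPS / 1000 MIPS machine); treat them as an assumption, not a
# statement of WCG's exact server-side code.

SECONDS_PER_DAY = 86400.0
CREDITS_PER_REFERENCE_DAY = 100.0

def claimed_credit(cpu_time_s: float, whetstone_mflops: float, dhrystone_mips: float) -> float:
    """Claimed credit from CPU time and the two BOINC benchmark scores."""
    benchmark_factor = (whetstone_mflops / 1000.0 + dhrystone_mips / 1000.0) / 2.0
    return cpu_time_s / SECONDS_PER_DAY * benchmark_factor * CREDITS_PER_REFERENCE_DAY

# A host twice as fast as the reference, finishing a WU in half a day,
# claims roughly the same credit as the reference machine taking a full day.
print(round(claimed_credit(43200, 2000, 2000), 1))  # ~100.0
print(round(claimed_credit(86400, 1000, 1000), 1))  # 100.0
```

This is also why an inaccurate benchmark matters so much: if the benchmark overstates the host's speed, every claim scales up with it, regardless of the work unit.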
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hi XS_Fr3ak
There is another "issue" not covered in the known problems above: generally speaking, machines with slow processors and/or low RAM will claim much higher than fast machines. The slow machines always take longer to complete than their benchmark would indicate. Observation lately indicates that machines of similar processor speed have been placed into the same quorum in an effort to reduce outlier invalids; that's what seems to have happened with my slow machines, at any rate.

In this example the second machine clearly couldn't produce the points claimed in the time taken to complete (result / status / sent / returned / CPU hours / claimed / granted):

faah0988_d143n715_x1AJX_00  Valid    11/25/2006 05:42:05  11/26/2006 18:26:43  12.67   88 / 80
faah0988_d143n715_x1AJX_00  Invalid  11/25/2006 05:40:58  11/25/2006 14:10:28   5.88  175 / 40
faah0988_d143n715_x1AJX_00  Valid    11/25/2006 05:31:44  11/25/2006 17:04:05   8.11   71 / 80

In this example the points claimed vary considerably, but the high-claiming machine is only claiming about 5 PPH while the low-claiming machine is nearly 16 PPH. All were credited OK:

faah0993_d153n268_x1AJX_00  Valid  11/27/2006 11:17:02  11/27/2006 22:09:43   4.73   75 / 91
faah0993_d153n268_x1AJX_00  Valid  11/27/2006 11:13:17  11/29/2006 04:44:54  20.16  105 / 91
faah0993_d153n268_x1AJX_00  Valid  11/27/2006 11:02:39  11/29/2006 01:48:20  19.15   93 / 91

I have even observed machines claiming quorum-average points, but completing the work much faster than the others, being treated as outliers when in a quorum of slow machines. I haven't seen this happen for more than a week, so it may have been changed. Your overclocked Sempron should, theoretically, be low-scoring in this case unless it is starved for RAM.

The other factor, as mentioned earlier, is HDC, where if you pause the work and then restart, the accumulated points remain and the HDC WU will often start from zero, causing a wildly inflated claim. If you could supply more information on the type of work, your usage pattern (more than one project?), RAM, etc., the answers may be more evident.

Cheers.
ozylynx
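To see where the roughly 5 PPH and 16 PPH figures above come from, here is a tiny sketch computing points-per-hour as claimed credit divided by CPU hours, using the numbers from the second (faah0993) example. The reading of the columns as CPU hours and claimed credit is my interpretation of the listing.

```python
# Points-per-hour (PPH) as claimed credit divided by CPU hours, using the
# figures quoted above for the faah0993 example. The column interpretation
# (CPU hours, claimed credit) is an assumption.

results = [
    {"cpu_hours": 4.73,  "claimed": 75},   # low-claiming, fast machine
    {"cpu_hours": 20.16, "claimed": 105},  # high-claiming, slow machine
    {"cpu_hours": 19.15, "claimed": 93},
]

for r in results:
    pph = r["claimed"] / r["cpu_hours"]
    print(f"{r['claimed']} points in {r['cpu_hours']} h -> {pph:.1f} PPH")

# The slow machine claims more points in total but only ~5.2 PPH,
# while the fast machine manages ~15.9 PPH.
```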
----------------------------------------
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Ozy,
Lots, but as was already explained, the benchmark itself is flawed in a number of ways. It only exercises the CPU and very little RAM. When the real work comes along, a project barely fitting in the minimum RAM requirements will obviously go much slower due to, e.g., substantial VM swapping.

This is, though, not the original subject of this thread, which is that real 'invalid' WUs are being identified, where a 4th copy was sent out (see the first meshmesh post). If you see an 'invalid' where only 3 copies were sent out, it was identified as an 'extreme' outlier, getting only half of the median of the remaining results. If it was an outlier, but NOT extreme, it is marked as valid but gets the median of the remaining results in the quorum.

The place to check is the Duration Correction Factor in the client_state.xml file. If the value is above 1, it means the machine runs slower than the benchmark predicts. If it is lower, as on my machine where it is consistently above 0.8, the work is being done faster than the benchmark predicts. That DCF is continuously adjusted and forms a useful self-check for estimating whether the claimed value covers the goods being delivered.

As far as CPU matching goes, see the SSE2 discussion today; I don't think there is intentional matching on that front. If I repeated anything you said in different words, sorry... happens.

cheers
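For anyone who wants to see the Duration Correction Factor idea in code form, here is a simplified sketch under the assumption described above: DCF tracks the ratio of actual runtime to the benchmark-based estimate, so values above 1 mean slower than the benchmark predicts and values below 1 mean faster. The update rule and numbers are plain illustrations, not BOINC's exact client logic.

```python
# Simplified illustration of a Duration Correction Factor (DCF), assuming it
# tracks the ratio of actual runtime to the benchmark-based estimate. A DCF
# above 1 means the host runs slower than its benchmark suggests, below 1
# means faster. This is not BOINC's exact update rule, just a sketch.

def estimated_runtime_s(wu_fpops: float, host_fpops: float, dcf: float) -> float:
    """Benchmark-based runtime estimate, scaled by the current DCF."""
    return wu_fpops / host_fpops * dcf

def update_dcf(dcf: float, actual_s: float, raw_estimate_s: float, weight: float = 0.1) -> float:
    """Nudge the DCF toward the observed actual/estimate ratio."""
    observed = actual_s / raw_estimate_s
    return dcf + weight * (observed - dcf)

# Host benchmarked at 2 GFLOPS, work unit estimated at 7.2e13 fpops:
dcf = 1.0
raw_estimate = 7.2e13 / 2e9            # 36,000 s from the benchmark alone
for actual in (30000, 29000, 31000):   # the host keeps finishing early
    dcf = update_dcf(dcf, actual, raw_estimate)
print(round(dcf, 3))                   # drifts below 1, i.e. faster than benchmark
```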
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Nov 29, 2006 1:41:36 PM]
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hi Sek.
I am aware that my post was somewhat off topic. It was given in reply to some of the questions asked by XS_Fr3ak, in an attempt to clear those away from the discussion and get back to the topic, i.e. to show examples where the points system is working even under extreme situations. I believe that XS_Fr3ak's problems are likely hardware based.

Cheers.
ozylynx