Total posts in this thread: 22
This topic has been viewed 22277 times and has 21 replies
Peter Ingham
Cruncher
Joined: Jan 23, 2006
Post Count: 2
Re: 26 pages of invalids

I'm also having a lot of invalids.

In fact, most WUs are returning as Invalid, with very few errors and very few accepted.


A Sample Invalid:

Result Name: X0930100641201200806191153_1
<core_client_version>7.0.44</core_client_version>
<![CDATA[
<stderr_txt>
Commandline: projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.05_windows_intelx86__ati_hcc1 --zipfile X0930100641201200806191153.zip --imagelist images.txt --device 0
<app_init_data>
<major_version>7</major_version>
<minor_version>0</minor_version>
<release>44</release>
<app_version>705</app_version>
<app_name>hcc1</app_name>
<acct_mgr_url>http://bam.boincstats.com/</acct_mgr_url>
<project_preferences>


<color_scheme>Tahiti Sunset</color_scheme>
<max_frames_sec>7</max_frames_sec>
<max_gfx_cpu_pct>5.0</max_gfx_cpu_pct>
</project_preferences>

<project_dir>C:\ProgramData\BOINC/projects/www.worldcommunitygrid.org</project_dir>
<boinc_dir>C:\ProgramData\BOINC</boinc_dir>
<wu_name>X0930100641201200806191153</wu_name>
<result_name>X0930100641201200806191153_1</result_name>
<comm_obj_name>boinc_0</comm_obj_name>
<slot>4</slot>
<wu_cpu_time>0.000000</wu_cpu_time>
<starting_elapsed_time>0.000000</starting_elapsed_time>
<using_sandbox>0</using_sandbox>
<user_total_credit>2061187.026927</user_total_credit>
<user_expavg_credit>227.315472</user_expavg_credit>
<host_total_credit>374911.291978</host_total_credit>
<host_expavg_credit>227.315487</host_expavg_credit>
<resource_share_fraction>1.000000</resource_share_fraction>
<checkpoint_period>60.000000</checkpoint_period>
<fraction_done_start>0.000000</fraction_done_start>
<fraction_done_end>1.000000</fraction_done_end>
<gpu_type>ATI</gpu_type>
<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>
<ncpus>1.000000</ncpus>
<rsc_fpops_est>25520135981107.000000</rsc_fpops_est>
<rsc_fpops_bound>510402719622140.000000</rsc_fpops_bound>
<rsc_memory_bound>78643200.000000</rsc_memory_bound>
<rsc_disk_bound>50000000.000000</rsc_disk_bound>
<computation_deadline>1359133609.000000</computation_deadline>
<vbox_window>0</vbox_window>
</app_init_data>
INFO: gpu_type set in init_data.xml to ATI
INFO: gpu_device_num set in init_data.xml to 0
Boinc requested ATI gpu device number0
Unzipping input images ../../projects/www.worldcommunitygrid.org/X0930100641201200806191153_X0930100641201200806191153.zip
Processing jobdescription
Number of Images defined in image list is 2
Found compute platform Advanced Micro Devices, Inc.
Selecting this platform
CL_DEVICE_NAME: Cypress
CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
CL_DEVICE_VERSION: 1084.4 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS: 
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_CLOCK_FREQUENCY: 600 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS:
cl_khr_fp64
cl_amd_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes
cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_ext_atomic_counters_32
cl_amd_device_attribute_query
cl_amd_vec3
cl_amd_printf
cl_amd_media_ops
cl_amd_media_ops2
cl_amd_popcnt
cl_khr_d3d10_sharing
Estimated kernel execution time = 0.35996 [sec]
Starting analysis of X0930100641201200806191153.jp2...
Extracting GLCM features...
Total kernel time: 205.902649 (1026 kernel executions)
Total memory transfer time: 1.707382
Average kernel time: 0.200685
Min kernel time: 0.189129 (dx=11 dy=23 sample_dist=24 )
Max kernel time: 0.215030 dx=2 dy=1 sample_dist=1
INFO: GPU calculations complete.
Total time for X0930100641201200806191153.jp2: 297 seconds
Finished Image #0, pctComplete = 0.500000
Starting analysis of X0930100640389200806191206.jp2...
Extracting GLCM features...
Total kernel time: 248.168808 (1026 kernel executions)
Total memory transfer time: 3.387018
Average kernel time: 0.241880
Min kernel time: 0.217186 (dx=25 dy=3 sample_dist=24 )
Max kernel time: 0.257577 dx=2 dy=1 sample_dist=1
INFO: GPU calculations complete.
Total time for X0930100640389200806191206.jp2: 342 seconds
Finished Image #1, pctComplete = 1.000000
CPU time used = 188.402408
15:25:54 (3476): called boinc_finish

</stderr_txt>
]]>
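
For anyone who wants to sanity-check the timing figures in a log like this one, here is a minimal Python sketch. It only re-derives the average kernel time from the totals printed above and works out how much of the wall-clock time was spent in kernels. The function name kernel_stats and the regular expressions are assumptions based on this one sample; they are not part of BOINC or the HCC application.

```python
import re

# Lines copied from the stderr output above (first image only).
STDERR_SAMPLE = """\
Total kernel time: 205.902649 (1026 kernel executions)
Total memory transfer time: 1.707382
Average kernel time: 0.200685
Total time for X0930100641201200806191153.jp2: 297 seconds
"""

def kernel_stats(text):
    """Extract kernel timing figures from an HCC stderr fragment (hypothetical helper)."""
    total = re.search(r"Total kernel time: ([\d.]+) \((\d+) kernel executions\)", text)
    avg = re.search(r"Average kernel time: ([\d.]+)", text)
    wall = re.search(r"Total time for \S+: (\d+) seconds", text)
    return {
        "total_kernel_s": float(total.group(1)),
        "executions": int(total.group(2)),
        "reported_avg_s": float(avg.group(1)),
        "wall_s": int(wall.group(1)),
    }

stats = kernel_stats(STDERR_SAMPLE)
# Average kernel time recomputed from the total and the execution count.
computed_avg = stats["total_kernel_s"] / stats["executions"]
# Fraction of the wall-clock time for this image that was spent in GPU kernels.
gpu_share = stats["total_kernel_s"] / stats["wall_s"]

print(f"computed average kernel time: {computed_avg:.6f} s")
print(f"reported average kernel time: {stats['reported_avg_s']:.6f} s")
print(f"kernel time as share of wall time: {gpu_share:.1%}")
```

For this sample the recomputed average agrees with the reported one to the printed precision, so the timing lines are at least internally consistent. Nothing in them indicates why the result was later marked Invalid.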


System is an i7-920 with an ATI 5830.

Windows 7 Ultimate 64-bit with Catalyst 13.1.

Nothing is overclocked (in fact, based on suggestions in similar threads, I have reduced the GPU clocks to the lowest values Catalyst supports, 600/900, to no avail).

Any suggestions?
----------------------------------------
[Edit 1 times, last edit by Peter Ingham at Jan 20, 2013 10:03:16 AM]
[Jan 20, 2013 9:55:44 AM]
OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978

Not sure about YOUR card, but for me, adding GPU core volts made the invalids go away. I still get a few errors every day, which happen within 15 seconds of starting a work unit. It does not seem to matter whether I run one work unit or more per card, overclocked and overvolted or at stock; it just seems to happen with my 5870.
----------------------------------------

[Jan 20, 2013 10:18:43 AM]
BladeD
Ace Cruncher
USA
Joined: Nov 17, 2004
Post Count: 28976

Is this why my pages of PVs are going up?
----------------------------------------
[Jan 20, 2013 7:12:32 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0

Hello OldChap.

Perhaps the hardware, and/or how that hardware is operated, is not the cause of the many Invalids we are seeing. What if it is the reference set that 'judges' doneWUs that is at fault? We are doing research, that is, probing the yet-unknown. There is thus a chance that the judge may not know enough about that unknown to make a firm and truthful determination of what is, and what is not, an Invalid.
;
; andzgridPost#813
;
[Jan 20, 2013 10:58:17 PM]
OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978

Hi andzgrid

I am not sure if that is a critique of my post or whether you wish to explore other avenues. ;)

I would say that in general I tend to post my experiences when running my rigs. I don't intend these posts to do anything more than indicate to others that I found a solution to a similar problem that may, just may, help someone with issues.

I understand that there may be another reason, but your theory of the reference set? In this particular instance I find it hard to believe that the simple comparison of work done by different computers (well, of the results anyway) could be prone to error... unless, of course, the validation system only matches the first pair of similar results and then rejects all others. Is this the case? Are we seeing invalids in sufficiently large numbers to warrant requiring 3 matching results? Were this the case, then surely those who run the system would make the appropriate adjustments.

But I am an open-minded sort who acknowledges that others with greater skills and knowledge may find other, better solutions.

In line with the ethic of most who contribute here, I just proffer what I can, when I can, and hope that I am not seen as doing something other than this. Just in case you feel otherwise... feel free to criticise. 8)
----------------------------------------
[Edit 1 times, last edit by OldChap at Jan 20, 2013 11:37:05 PM]
[Jan 20, 2013 11:32:21 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0

"I am not sure if that is a critique of my post or you wish to explore other avenues. ;)"

I definitely didn't intend it as a critique of your post; it was more a matter of my wishing to explore other avenues. I'm sorry if it came across to you in any other way or gave some hint in that direction.

In research, there is no such thing as fixed truth. That is the premise from which I launched my assertion that the reference set may be in error. Conversely, it also means that what was first 'judged' a valid doneWU may not be 'truly' valid. But who, or what process, is to say? I can't see any hardware connection to the invalids that you described in your post; ergo, something else must account for the invalids you are getting. I addressed my response to you, rather than leaving it unaddressed, because your description of your case turned out to be perfect material for launching my assertion that something else, and probably not the hardware (not specifically your hardware), is the cause of invalids (not necessarily your invalids).

P.S.
I don't see discussions as a critique or non-critique of persons. I see discussions as a battle waged by ideas against each other.
;
; andzgridPost#814
;
----------------------------------------
[Edit 1 times, last edit by Former Member at Jan 21, 2013 12:36:04 AM]
[Jan 21, 2013 12:06:23 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844

From what has been said in the past, if I remember correctly, the validity of a work unit is determined by parameters set by the researchers. For some sciences with a quorum of more than one, the reference set, as I understand it, is how closely the two (or more) results match, provided they both conform to the originally set parameters. From personal experience, my invalids have come from either malformed work units or a hardware glitch on my end. I do not overclock, but overclocking too much will cause invalids (as many have attested in the forums). Another cause is a failing power supply, which I have personally experienced. I recall seeing only a few references to faulty memory, but it too can cause this condition, as can overheating. There are undoubtedly other causes. Perhaps one of the techs or researchers could chime in to further illuminate this topic.
Cheers
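
To make the "how closely the two results match" idea concrete, here is a minimal Python sketch of a pairwise comparison with a tolerance. The function name results_match, the tolerance value, and the sample numbers are all made up for illustration; the actual checks and thresholds WCG uses are not stated anywhere in this thread.

```python
import math

def results_match(a, b, rel_tol=1e-5):
    """Return True when two result vectors agree element-wise within rel_tol.

    rel_tol is an assumed value chosen for illustration only.
    """
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))

# Hypothetical feature values returned by three hosts for the same work unit.
host_a = [0.412345, 1.250001, 3.999998]
host_b = [0.412346, 1.250000, 3.999999]   # tiny rounding differences only
host_c = [0.412345, 1.310000, 3.999998]   # one value clearly off

print(results_match(host_a, host_b))
print(results_match(host_a, host_c))
```

The point of a tolerance rather than an exact comparison is that different hardware can round slightly differently without either result being wrong.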
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jan 21, 2013 1:21:19 AM]
OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978

Ok. The bit that made me wonder was directing the post at me.

My theory on the possible hardware connection goes like this:

When overclocking a CPU there are a number of tests one can run afterwards to confirm that the new frequency is stable. Perhaps the most widely used is Prime95. Proponents of this test for undesirable results from your CPU recommend rather long test times to find errors. Often, if an error is found, it is possible to resolve the problem by increasing the core voltage.

I view the GPU as a similar animal, but one which by design does not have to be so precise. An otherwise good GPU may produce simple errors that, in the normal scheme of things, would result in perhaps a single pixel having the wrong colour. One would not think of this as particularly bothersome when gaming; in fact, I doubt a normal user would notice.

Even at stock speeds and voltages, the same GPU cannot be allowed a single error when used for HCC GPU, because one error can cause a wrong calculation and thus produce invalid work.

The speed and voltage that worked perfectly well for gaming may have to be adjusted for our purposes and, given that the hardware is sound, a small increase in voltage may resolve the issue in much the same way that a similar action can make a CPU pass Prime95 with no errors.

This kind of marginal instability is, I feel, enough to cause invalids but nowhere near enough to cause errors.
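
As a rough illustration of why a single wrong bit can matter for a calculation even though it would go unnoticed on screen, here is a small Python sketch that flips one bit in a 32-bit float. The helper flip_bit is made up for this example, and a real marginal GPU will not necessarily fail in exactly this way.

```python
import struct

def flip_bit(value, bit):
    """Flip one bit (0 = least significant) in the float32 encoding of value."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

original = 0.75
low = flip_bit(original, 0)    # lowest mantissa bit: a change far below 1e-6
high = flip_bit(original, 22)  # highest mantissa bit: a clearly different number

print(f"original:            {original!r}")
print(f"lowest bit flipped:  {low!r}")
print(f"highest bit flipped: {high!r}")
```

Depending on which bit is hit and how the value feeds into later calculations, the final result can drift a little or a lot, while the task itself still runs to completion without reporting any error.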
----------------------------------------

[Jan 21, 2013 1:46:29 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0

Validation, even for invalids on zero-redundancy sciences, often involves 3 copies, where the agreement of 2 singles out the 3rd. For quorum 2, which HCC1 *of course* is, there's a 3-way distribution at minimum to determine which of the 3 is invalid. This cycle can actually go up to 5 or 7 copies before a task is considered an out-take [put aside on the review list].

Validation rules have a few basics:

1) Must meet a minimum set of output conditions and pass an included mini-test at the start of the task [zero-redundancy sciences, and for GPU a performance test]; let's call that first-level quality control.
2) Must closely match the wingman's result, again by another set of checks.

Errors are outright failures during execution; rarely will they reach a Pending Validation state. Invalids always pass through PVal and PVer. Pending Verification is a second/subsequent phase, waiting for revalidation.

In a nutshell.
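
The "agreement of 2 singles out the 3rd" step can be sketched in a few lines of Python. This is only an illustration under assumptions: the names agree and validate, the host names, the tolerance, and the sample values are invented here, and the real validator certainly checks more than this.

```python
import math
from itertools import combinations

def agree(a, b, rel_tol=1e-5):
    """Two result vectors agree if every element matches within rel_tol (assumed value)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )

def validate(results, quorum=2):
    """Find the largest group of mutually agreeing results.

    results maps a (hypothetical) host name to its result vector.
    Returns (valid_hosts, invalid_hosts), or None when fewer than
    `quorum` results agree and another copy would have to be sent out.
    """
    hosts = list(results)
    for size in range(len(hosts), quorum - 1, -1):
        for group in combinations(hosts, size):
            if all(agree(results[a], results[b]) for a, b in combinations(group, 2)):
                valid = set(group)
                return valid, set(hosts) - valid
    return None

# Two copies that disagree: no quorum yet, so a third copy is needed.
first_pair = {"host_a": [1.0, 2.0], "host_b": [1.0, 2.5]}
print(validate(first_pair))

# The third copy agrees with host_a, which singles out host_b as the invalid one.
with_third = dict(first_pair, host_c=[1.0, 2.0000001])
print(validate(with_third))
```

With only two disagreeing copies there is no way to tell which one is wrong, which is why a third copy is needed before anything can be marked Invalid.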
[Jan 21, 2013 8:45:23 AM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974

"When overclocking a CPU there are a number of tests one can run afterwards to confirm that the new frequency is stable. Perhaps the most widely used is Prime95. Proponents of this test for undesirable results from your CPU recommend rather long test times to find errors. Often, if an error is found, it is possible to resolve the problem by increasing the core voltage."

The problem with this is that while getting an error in Prime95 is conclusive proof that the computer has a problem, being "Prime95-stable" doesn't mean the computer is really "stable", since other DC applications use a computer differently than Prime95 does. For example, HCC can use the "wrong" part of the CPU, leading to a 100% error rate even though the machine is "Prime95-stable". Or HCC runs error-free, but CEP2 has a 100% error rate.

Many have been running a single project for many months and think their computer is "stable", but the moment they try another project they error out, and some of these users start complaining about a "buggy application" and so on...
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Jan 21, 2013 11:51:15 AM]