| World Community Grid Forums
Thread Status: Active Total posts in this thread: 22
Peter Ingham
Cruncher Joined: Jan 23, 2006 Post Count: 2 Status: Offline Project Badges:
|
I'm also having a lot of invalids.
In fact, most WUs are returning as Invalid, with very few errors and very few accepted. A sample Invalid:

Result Name: X0930100641201200806191153_ 1--
<core_client_version>7.0.44</core_client_version>
<![CDATA[
<stderr_txt>
Commandline: projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.05_windows_intelx86__ati_hcc1 --zipfile X0930100641201200806191153.zip --imagelist images.txt --device 0
<app_init_data>
<major_version>7</major_version>
<minor_version>0</minor_version>
<release>44</release>
<app_version>705</app_version>
<app_name>hcc1</app_name>
<acct_mgr_url>http://bam.boincstats.com/</acct_mgr_url>
<project_preferences>
<color_scheme>Tahiti Sunset</color_scheme>
<max_frames_sec>7</max_frames_sec>
<max_gfx_cpu_pct>5.0</max_gfx_cpu_pct>
</project_preferences>
<project_dir>C:\ProgramData\BOINC/projects/www.worldcommunitygrid.org</project_dir>
<boinc_dir>C:\ProgramData\BOINC</boinc_dir>
<wu_name>X0930100641201200806191153</wu_name>
<result_name>X0930100641201200806191153_1</result_name>
<comm_obj_name>boinc_0</comm_obj_name>
<slot>4</slot>
<wu_cpu_time>0.000000</wu_cpu_time>
<starting_elapsed_time>0.000000</starting_elapsed_time>
<using_sandbox>0</using_sandbox>
<user_total_credit>2061187.026927</user_total_credit>
<user_expavg_credit>227.315472</user_expavg_credit>
<host_total_credit>374911.291978</host_total_credit>
<host_expavg_credit>227.315487</host_expavg_credit>
<resource_share_fraction>1.000000</resource_share_fraction>
<checkpoint_period>60.000000</checkpoint_period>
<fraction_done_start>0.000000</fraction_done_start>
<fraction_done_end>1.000000</fraction_done_end>
<gpu_type>ATI</gpu_type>
<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>
<ncpus>1.000000</ncpus>
<rsc_fpops_est>25520135981107.000000</rsc_fpops_est>
<rsc_fpops_bound>510402719622140.000000</rsc_fpops_bound>
<rsc_memory_bound>78643200.000000</rsc_memory_bound>
<rsc_disk_bound>50000000.000000</rsc_disk_bound>
<computation_deadline>1359133609.000000</computation_deadline>
<vbox_window>0</vbox_window>
</app_init_data>
INFO: gpu_type set in init_data.xml to ATI
INFO: gpu_device_num set in init_data.xml to 0
Boinc requested ATI gpu device number0
Unzipping input images ../../projects/www.worldcommunitygrid.org/X0930100641201200806191153_X0930100641201200806191153.zip
Processing jobdescription
Number of Images defined in image list is 2
Found compute platform Advanced Micro Devices, Inc.
Selecting this platform
CL_DEVICE_NAME: Cypress
CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
CL_DEVICE_VERSION: 1084.4 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS:
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_CLOCK_FREQUENCY: 600 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing
Estimated kernel execution time = 0.35996 [sec]
Starting analysis of X0930100641201200806191153.jp2...
Extracting GLCM features...
Total kernel time: 205.902649 (1026 kernel executions)
Total memory transfer time: 1.707382
Average kernel time: 0.200685
Min kernel time: 0.189129 (dx=11 dy=23 sample_dist=24 )
Max kernel time: 0.215030 dx=2 dy=1 sample_dist=1
INFO: GPU calculations complete.
Total time for X0930100641201200806191153.jp2: 297 seconds
Finished Image #0, pctComplete = 0.500000
Starting analysis of X0930100640389200806191206.jp2...
Extracting GLCM features...
Total kernel time: 248.168808 (1026 kernel executions)
Total memory transfer time: 3.387018
Average kernel time: 0.241880
Min kernel time: 0.217186 (dx=25 dy=3 sample_dist=24 )
Max kernel time: 0.257577 dx=2 dy=1 sample_dist=1
INFO: GPU calculations complete.
Total time for X0930100640389200806191206.jp2: 342 seconds
Finished Image #1, pctComplete = 1.000000
CPU time used = 188.402408
15:25:54 (3476): called boinc_finish
</stderr_txt>
]]>

System is an i7-920 with an ATI 5830, Win 7/64 Ultimate with Catalyst 13.1. Nothing is OC'd (in fact, based on suggestions in similar threads, I have reduced the GPU clocks to the lowest values Catalyst supports, 600/900, to no avail).
Any suggestions?
[Edit 1 times, last edit by Peter Ingham at Jan 20, 2013 10:03:16 AM] |
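For anyone wanting to compare logs across several work units, the timing lines in the stderr output quoted above can be pulled out with a short script. The following is only a minimal sketch: the function names, the regular expressions, and the tolerance are illustrative assumptions, not part of BOINC or the HCC application, and the excerpt contains only the four timing lines copied from the post. It checks that each reported average equals the total kernel time divided by the number of kernel executions.

```python
import re

# Timing lines copied from the stderr output quoted in the post above.
STDERR_EXCERPT = """
Total kernel time: 205.902649 (1026 kernel executions)
Average kernel time: 0.200685
Total kernel time: 248.168808 (1026 kernel executions)
Average kernel time: 0.241880
"""

# Hypothetical patterns written for this log layout only.
TOTAL_RE = re.compile(r"Total kernel time:\s*([\d.]+)\s*\((\d+) kernel executions\)")
AVERAGE_RE = re.compile(r"Average kernel time:\s*([\d.]+)")


def kernel_stats(text):
    """Return one (total, executions, reported_average) tuple per image in the log."""
    totals = [(float(t), int(n)) for t, n in TOTAL_RE.findall(text)]
    averages = [float(a) for a in AVERAGE_RE.findall(text)]
    return [(t, n, a) for (t, n), a in zip(totals, averages)]


def is_consistent(total, executions, reported_average, tolerance=1e-5):
    """True if total / executions matches the reported average within tolerance."""
    return abs(total / executions - reported_average) <= tolerance


stats = kernel_stats(STDERR_EXCERPT)
for total, executions, average in stats:
    status = "consistent" if is_consistent(total, executions, average) else "MISMATCH"
    print(f"{executions} kernels, reported average {average:.6f} s: {status}")
```

Both images in this log pass the check, and the task reaches boinc_finish, which fits the observation that the result is marked Invalid at validation rather than failing as an Error during execution. The check says nothing about whether the computed values themselves are correct.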
||
|
|
OldChap
Veteran Cruncher UK Joined: Jun 5, 2009 Post Count: 978 Status: Offline Project Badges:
|
Not sure about YOUR card, but for me adding GPU core volts made the invalids go away. I still have a few errors every day that happen within 15 seconds of starting a work unit. It does not seem to matter whether I run one or more per card, overclocked and overvolted or at stock; it just seems to happen with my 5870. |
||
|
|
BladeD
Ace Cruncher USA Joined: Nov 17, 2004 Post Count: 28976 Status: Offline Project Badges:
|
Is this why my pages of PVs are going up? |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello OldChap.
Perhaps the hardware, or how that hardware is operated, is not the cause of the many invalids we are seeing. What if the reference set that 'judges' done WUs is the one at fault? We are doing research, that is, probing the yet-unknown. There is thus a chance that the judge does not know enough about that unknown to make a firm and truthful determination of what is, and what is not, an invalid.
andzgridPost#813 |
||
|
|
OldChap
Veteran Cruncher UK Joined: Jun 5, 2009 Post Count: 978 Status: Offline Project Badges:
|
Hi andzgrid
I am not sure if that is a critique of my post or you wish to explore other avenues. I would say that in general I tend to post my experiences from running my rigs. I don't intend these posts to do anything more than indicate to others that I found a solution to a similar problem that may, just may, help someone with issues.
I understand that there may be another reason, but your theory of the reference set? Well, in this particular instance I find it hard to believe that the simple comparison of work done by different computers (well, the results anyway) could be prone to error... Unless, of course, the validation system only matches the first pair of similar results and then rejects all others. Is this the case? Are we seeing invalids in sufficiently large numbers to warrant having 3 matching results? Were this the case, then surely those who run the system would make the appropriate adjustments.
But I am an open-minded sort who acknowledges that others with greater skills and knowledge may find other, better solutions. In line with the ethic of most who contribute here, I just proffer what I can, when I can, and hope that I am not seen as doing something other than this.
Just in case you feel otherwise... feel free to criticise.
[Edit 1 times, last edit by OldChap at Jan 20, 2013 11:37:05 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"I am not sure if that is a critique of my post or you wish to explore other avenues."
I definitely didn't intend it as a critique of your post; it was more a wish to explore other avenues. I'm sorry if it came across any other way or gave any hint in that direction.
In research, there is no such thing as fixed truth. That is the premise from which I launched my assertion that the reference set may be in error. Conversely, it also means that what was first 'judged' a valid done WU may not be 'truly' valid. But who, or what process, is to say?
I can't see any hardware connection to the invalids that you described in your post; ergo, something else must account for the invalids you are getting.
I addressed my response to you, rather than leaving it anonymous, because your description of your case turned out to be perfect material for launching my assertion that something else, and probably not the hardware (not specifically your hardware), is the cause of invalids (not necessarily your invalids).
P.S. I don't see discussions as a critique or non-critique of persons. I see discussions as a battle waged by ideas against each other.
andzgridPost#814
[Edit 1 times, last edit by Former Member at Jan 21, 2013 12:36:04 AM] |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7844 Status: Offline Project Badges:
|
From what has been said in the past, if I remember correctly, the validity of a work unit is determined by parameters set by the researchers. For sciences with a quorum of more than one, the reference set, as I understand it, is how closely the two (or more) results match, provided they both conform to the originally set parameters. From personal experience, my invalids have come from either malformed work units or a hardware glitch on my end. I do not overclock, but overclocking too much will cause invalids (as many have attested in the forums). Another cause is a failing power supply, which I have personally experienced. I recall seeing only a few references to faulty memory, but it too can cause this condition, as can overheating. There are undoubtedly other causes. Perhaps one of the techs or researchers could chime in to further illuminate this topic.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
OldChap
Veteran Cruncher UK Joined: Jun 5, 2009 Post Count: 978 Status: Offline Project Badges:
|
Ok. The bit that made me wonder was directing the post at me.
My theory on the possible hardware connection goes like this:
When overclocking a CPU there are a number of tests one can run afterwards to confirm that the new frequency is stable. Perhaps the most widely used is Prime 95. Proponents of this test for undesirable results from your cpu recommend rather long test times to find errors. Often, if an error is found, it is possible to resolve the problem by increasing the core voltage.
I view the GPU as a similar animal, but one which by design does not have to be so precise. An otherwise good GPU may produce simple errors that, in the normal scheme of things, would result in perhaps a single pixel having the wrong colour. One would not think of this as particularly bothersome when gaming; in fact, I doubt a normal user would notice.
Even at stock speeds and voltages, the same GPU cannot make a single error when used for HCC GPU work, because one wrong calculation produces invalid work. The speed and voltage that worked perfectly well for gaming may have to be adjusted for our purposes and, given that the hardware is sound, a small increase in voltage may resolve the issue in much the same way that a similar action can make a CPU pass Prime 95 with no errors.
Such small miscalculations are enough to cause invalids but, I feel, nowhere near enough to cause errors. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Validation of a suspect result, even on zero-redundancy sciences, often involves 3 copies, where the agreement of 2 singles out the 3rd. For quorum 2, which HCC1 of course is, it takes a 3-way distribution at minimum to determine which of the 3 is invalid. This cycle can actually go up to 5 or 7 before a task is considered an out-take [put aside to the review list].
Validation rules have a few basics:
1) Must meet a minimum set of output conditions and pass an included mini-test at the start of the task [for zero-redundancy sciences, and for GPU a performance test]; let's call that first-level quality control.
2) Must closely match the wingman's result, by yet another set of checks.
Errors occur outright during execution; rarely will they get to a Pending Validation state. Invalids always pass through PVal and PVer... Pending Verification is a second/subsequent phase, waiting for revalidation. In a nutshell. |
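The "agreement of 2 singles out the 3rd" idea described above can be illustrated with a short sketch. This is not the actual World Community Grid or BOINC validator: the function names, the representation of a result as a list of numbers, and the tolerance are all assumptions made purely for illustration.

```python
from itertools import combinations


def results_match(a, b, tolerance=1e-6):
    """Hypothetical comparison: same length and every value within tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def validate(results, quorum=2, tolerance=1e-6):
    """Classify each result as 'valid', 'invalid', or 'pending'.

    results maps a result name to its list of output values. The largest
    group of at least `quorum` mutually matching results is marked valid and
    everything else invalid. If no such group exists, every result stays
    pending so that another copy can be sent out.
    """
    names = list(results)
    for size in range(len(names), quorum - 1, -1):
        for group in combinations(names, size):
            if all(results_match(results[a], results[b], tolerance)
                   for a, b in combinations(group, 2)):
                return {n: ("valid" if n in group else "invalid") for n in names}
    return {n: "pending" for n in names}


# Two copies that disagree: neither can be judged yet.
two_copies = {"wu_0": [1.0, 2.0], "wu_1": [1.0, 2.5]}
print(validate(two_copies))

# A third copy agrees with wu_0, which singles out wu_1 as the invalid one.
three_copies = dict(two_copies, wu_2=[1.0, 2.0])
print(validate(three_copies))
```

With only two disagreeing copies, both stay pending, which corresponds to the Pending Verification phase mentioned above; the third copy is what allows one of them to be marked invalid.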
||
|
|
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline Project Badges:
|
"When overclocking a CPU there are a number of tests one can run afterwards to confirm that the new frequency is stable. Perhaps the most widely used is Prime 95. Proponents of this test for undesirable results from your cpu recommend rather long test times to find errors. Often, if an error is found, it is possible to resolve the problem by increasing the core voltage."
The problem with this is that, while an error in Prime95 conclusively shows the computer has a problem, running "Prime95-stable" doesn't mean the computer is really "stable", since other DC applications use a computer differently than Prime95 does. For example, HCC can use the "wrong" part of the CPU, leading to 100% errors, even though the machine is "Prime95-stable". Or HCC runs error-free but CEP2 has a 100% error rate.
Many have been running a single project for many months and think their computer is "stable", but the moment they try another project they error out, and some of these users start complaining about a "buggy application" and so on...
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
||
|
|
|