| World Community Grid Forums
Thread Status: Active Total posts in this thread: 21
David Autumns
Ace Cruncher UK Joined: Nov 16, 2004 Post Count: 11062 Status: Offline Project Badges:
|
My room might be too warm, but just a heads up.

Regards
Dave

[Edit 1 times, last edit by David Autumns at Nov 18, 2012 10:05:31 AM]
||
|
|
BladeD
Ace Cruncher USA Joined: Nov 17, 2004 Post Count: 28976 Status: Offline Project Badges:
|
Fail how? Any messages?
||
|
|
David Autumns
Ace Cruncher UK Joined: Nov 16, 2004 Post Count: 11062 Status: Offline Project Badges:
|
No failure messages, but looking at the timings I now suspect they are failing at the end of the first image in the current two-image batch.

About 10% of work units don't make it past this point. They look to be uploading normally, just too early (after the first image), hence they are recorded as errors because they haven't been completed.

I have 6 HCCs running concurrently on the same card (no GPU memory issues), as I have a 6-core Phenom II. They are all failing around the 20-minute mark, which with my current GPU is where the clock sits paused at 49.707% while the CPU core zips up the data for return before progressing to the second image.

Dave
||
|
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline Project Badges:
|
I've just found six of these too (runtime of over 3 hours and going nowhere).

I've got them to finish by SUSPENDING then RESUMING. Suggest you do the same if you get more.

Good luck!

ATI 12.11b4
BOINC 7.0.36

----------------------------------------
Crunching in memory of my Mum PEGGY, cousin ROPPA and Aunt AUDREY.
||
|
|
David Autumns
Ace Cruncher UK Joined: Nov 16, 2004 Post Count: 11062 Status: Offline Project Badges:
|
Hi

They don't get stuck; they get returned to WCG halfway through with a computation error.
||
|
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline Project Badges:
|
That's interesting and obviously a different situation to what I have found. Is there anything in the logs that says why they are being returned? How long after they reach 50% do they get sent back? Are all tasks behaving this way, or like mine, is it only a few?
||
|
|
David Autumns
Ace Cruncher UK Joined: Nov 16, 2004 Post Count: 11062 Status: Offline Project Badges:
|
I estimate around 10% don't make it past the halfway point of the work unit, so the majority are successful.

I backed off the graphics card and it still happens, so it may be some checksum error in the new dual-image WUs. Maybe some are sliced up thinking they are only one image long.

A bit like the Ariane 5 bug: it just wasn't expecting to find itself that far down range already (re-using the Ariane 4 code), so it figured it must be faulty. Here's the result: http://www.youtube.com/watch?v=gp_D8r-2hwk

In my case I lose about 20 minutes of 1/6th of a GTX 560 Ti plus a 3 GHz Phenom II core. Not quite so dramatic, but wasteful nonetheless if this is not just happening to me.

The trajectory is normal

Dave

[Edit 1 times, last edit by David Autumns at Nov 19, 2012 10:26:13 PM]
||
|
|
keithhenry
Ace Cruncher Senile old farts of the world ....uh.....uh..... nevermind Joined: Nov 18, 2004 Post Count: 18667 Status: Offline Project Badges:
|
Is the computation error at the end of the first image or at the beginning of the second? If it's the first image, would it try to work on the second image? If it's the second, then it finished the first but failed on the second so the whole WU fails? I wonder if this situation got considered when WCG decided to try two images in one WU? If either image fails/errors, the whole WU fails? Then again, trying to give credit for one image but not both may be too much trouble for the effort.
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"I wonder if this situation got considered when WCG decided to try two images in one WU?"

Interesting... Why oh why can't I find a free lunch; there are apparently things that need to be worked out from an otherwise 'group two in one package and get twice the bang for the price of one' deal. Oh well, I expect WCG to have worked out the logic for the 2-in-1 beforehand. May we have the WCG CAs/Techs chime in on this one?

[Edit 1 times, last edit by Former Member at Nov 20, 2012 2:35:24 AM]
||
|
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline Project Badges:
|
I have found a few with Computation error. I'll go and check the logs, but in the meantime, checking properties shows that they completed 100%.

Application: Help Conquer Cancer 7.05 (ati_hcc1)
Workunit name: X0900077930512200611091607
State: Computation error
Received: 20/11/2012 01:28:00
Report deadline: 27/11/2012 01:26:48
Estimated app speed: 14.03 GFLOPs/sec
Estimated task size: 25'551 GFLOPs
Resources: 1 CPUs + 0.333 ATI GPUs
CPU time at last checkpoint: 00:00:00
CPU time: 10:00:49
Elapsed time: 10:06:58
Estimated time remaining: 00:00:00
Fraction done: 100%
Virtual memory size: 0.00 MB
Working set size: 0.00 MB

EDIT: More Info

I anticipated that this might be the case, and have found this in the event log:

20/11/2012 12:36:55 | World Community Grid | Aborting task X0900077930512200611091607_0: exceeded elapsed time limit 36417.69 (511028.87G/14.03G)

stderr:

Result Log
Result Name: X0900077930512200611091607_0--

<core_client_version>7.0.36</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Commandline: projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.05_windows_intelx86__ati_hcc1 --zipfile X0900077930512200611091607.zip --imagelist images.txt --device 2
<app_init_data>
  <major_version>7</major_version>
  <minor_version>0</minor_version>
  <release>36</release>
  <app_version>705</app_version>
  <app_name>hcc1</app_name>
  <project_preferences>
    <color_scheme>Tahiti Sunset</color_scheme>
    <max_frames_sec>7</max_frames_sec>
    <max_gfx_cpu_pct>5.0</max_gfx_cpu_pct>
  </project_preferences>
  <project_dir>C:\ProgramData\BOINC/projects/www.worldcommunitygrid.org</project_dir>
  <boinc_dir>C:\ProgramData\BOINC</boinc_dir>
  <wu_name>X0900077930512200611091607</wu_name>
  <result_name>X0900077930512200611091607_0</result_name>
  <comm_obj_name>boinc_5</comm_obj_name>
  <slot>5</slot>
  <wu_cpu_time>0.000000</wu_cpu_time>
  <starting_elapsed_time>0.000000</starting_elapsed_time>
  <using_sandbox>0</using_sandbox>
  <user_total_credit>25470163.422069</user_total_credit>
  <user_expavg_credit>105263.986433</user_expavg_credit>
  <host_total_credit>1658415.662183</host_total_credit>
  <host_expavg_credit>43744.378137</host_expavg_credit>
  <resource_share_fraction>1.000000</resource_share_fraction>
  <checkpoint_period>60.000000</checkpoint_period>
  <fraction_done_start>0.000000</fraction_done_start>
  <fraction_done_end>1.000000</fraction_done_end>
  <gpu_type>ATI</gpu_type>
  <gpu_device_num>2</gpu_device_num>
  <gpu_opencl_dev_index>2</gpu_opencl_dev_index>
  <ncpus>1.000000</ncpus>
  <rsc_fpops_est>25551443449394.000000</rsc_fpops_est>
  <rsc_fpops_bound>511028868987880.000000</rsc_fpops_bound>
  <rsc_memory_bound>78643200.000000</rsc_memory_bound>
  <rsc_disk_bound>50000000.000000</rsc_disk_bound>
  <computation_deadline>1353970968.000000</computation_deadline>
  <vbox_window>0</vbox_window>
</app_init_data>
INFO: gpu_type set in init_data.xml to ATI
INFO: gpu_device_num set in init_data.xml to 2
Boinc requested ATI gpu device number2
Unzipping input images ../../projects/www.worldcommunitygrid.org/X0900077930512200611091607_X0900077930512200611091607.zip
Processing jobdescription
Number of Images defined in image list is 2
Found compute platform Advanced Micro Devices, Inc.
Selecting this platform
CL_DEVICE_NAME: Cypress
CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
CL_DEVICE_VERSION: 1084.2 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS:
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_CLOCK_FREQUENCY: 725 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_dx9_media_sharing
Estimated kernel execution time = 0.30528 [sec]
Starting analysis of X0900077930512200611091607.jp2...
Extracting GLCM features...
Total kernel time: 148.454742 (1026 kernel executions)
Total memory transfer time: 65.668259
Average kernel time: 0.144693
Min kernel time: 0.135556 (dx=23 dy=11 sample_dist=24 )
Max kernel time: 0.155186 dx=1 dy=2 sample_dist=1
INFO: GPU calculations complete.
Total time for X0900077930512200611091607.jp2: 545 seconds
Finished Image #0, pctComplete = 0.500000
Starting analysis of X0900077930792200611091602.jp2...
Extracting GLCM features...
</stderr_txt>
]]>

So it looks to me as if Image 1 completed but the WU was aborted due to taking too long. My question is: why is the time limit so long (10hrs+)? Wouldn't it make more sense for 'Maximum elapsed time' to be more realistic, so that less time is wasted by GPUs and CPUs sitting idle?

[Edit 1 times, last edit by coolstream at Nov 20, 2012 2:15:49 PM]
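On the question of why the limit is 10 hours or more: the numbers in the log above are at least consistent with the limit being derived from the task's own estimates rather than set by hand. The Python sketch below assumes the client computes the limit as `rsc_fpops_bound` divided by the estimated app speed, which is what the "(511028.87G/14.03G)" in the event log line suggests. That formula is an inference from the log, not something confirmed in this thread, and the function names are made up for illustration.

```python
# Values copied from the <app_init_data> block and task properties in the post above.
RSC_FPOPS_EST = 25551443449394.0      # <rsc_fpops_est>
RSC_FPOPS_BOUND = 511028868987880.0   # <rsc_fpops_bound>
EST_APP_SPEED = 14.03e9               # "Estimated app speed 14.03 GFLOPs/sec" (rounded)


def seconds_to_hms(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS, rounding to the nearest second."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def elapsed_time_limit(fpops_bound: float, app_speed: float) -> float:
    """Hypothetical helper. ASSUMPTION: limit = fpops bound / estimated speed,
    inferred from the '(511028.87G/14.03G)' shown in the event log line."""
    return fpops_bound / app_speed


# How long the task was expected to take, and how long it was allowed to take.
expected_runtime = RSC_FPOPS_EST / EST_APP_SPEED
limit = elapsed_time_limit(RSC_FPOPS_BOUND, EST_APP_SPEED)

# Ratio between the bound and the estimate for this work unit.
safety_factor = RSC_FPOPS_BOUND / RSC_FPOPS_EST

print("expected runtime:", seconds_to_hms(expected_runtime))
print("elapsed limit   :", seconds_to_hms(limit))
print("bound / estimate:", safety_factor)
```

With these inputs the bound works out to 20 times the estimate, so a task expected to take roughly half an hour is allowed to run for about 10 hours before being aborted. The small gap from the logged 36417.69 seconds comes from using the rounded 14.03 GFLOPs/sec figure.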
||
|
|
|