World Community Grid Forums
Category: Beta Testing Forum: Beta Test Support Forum Thread: Beta Test for Help Conquer Cancer - GPU v6.51 |
Thread Status: Active Total posts in this thread: 63
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
This is the discussion thread for: https://secure.worldcommunitygrid.org/forums/...33869_lastpage,yes#393038
|
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges: |
One of the things we are looking for in this beta is more info from users who are experiencing hung workunits at 99%. It would be helpful if users who see this could provide a couple of things:
----------------------------------------
Thanks, armstrdj [Edit 1 times, last edit by armstrdj at Sep 21, 2012 7:26:45 PM] |
||
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline Project Badges: |
I'd be willing to bet many of those people with hung work units are running their CPUs at less than 100%. JMHO
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
----------------------------------------[Edit 1 times, last edit by nanoprobe at Sep 21, 2012 8:00:40 PM] |
||
|
grumpygrampy
Senior Cruncher USA Joined: Oct 8, 2011 Post Count: 223 Status: Offline Project Badges: |
Yep, I'm one of those running BOINC defaults (60% CPU).
Here's the data from the running (failing) beta:
cl_amd_popcnt cl_khr_d3d10_sharing
Estimated kernel execution time = 0.44531 [sec]
Starting analysis of ../../projects/www.worldcommunitygrid.org/478b560a2bcdd854e7f68e7fc2cb6720.jp2...
Extracting GLCM features...
Total kernel time: 214.246628 (1026 kernel executions)
Total memory transfer time: 2.057441
Average kernel time: 0.208817
Min kernel time: 0.194183 (dx=3 dy=25 sample_dist=24 )
Max kernel time: 0.230991 dx=4 dy=6 sample_dist=6
INFO: Flushing queue...Waiting for all commands to finish...Releasing kernel...Releasing program...Freeing strings...Releasing queue...Releasing context...Releasing Memory Objects
INFO: GPU calculations complete.
Radon blur Level 0
Currently passing 6:00 minutes - good ones finish in 3:30
Intel i7-860, HD Radeon 5770, BOINC 7.0.28 |
||
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline Project Badges: |
Try pausing, set CPU to 100% then resume and see what happens.
|
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges: |
I think it is related to any settings that result in suspend/resume of the workunit. At this point I would be interested to hear from anyone who is seeing this issue without one of those settings, and also from anyone who doesn't see "INFO: GPU calculations complete." in stderr but still has a hung workunit. Now off to track down this bug. As always, thanks to our wonderful beta testers.
Thanks, armstrdj |
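A minimal sketch of the stderr check asked for above, assuming the stderr text of a hung workunit has already been read into a string. The function names `gpu_phase_completed` and `describe_hang` are hypothetical; only the marker line itself comes from the logs posted in this thread.

```python
# Marker line as it appears in the stderr output quoted in this thread.
GPU_DONE_MARKER = "INFO: GPU calculations complete."


def gpu_phase_completed(stderr_text: str) -> bool:
    """Return True if the GPU-complete marker appears in the stderr text."""
    return GPU_DONE_MARKER in stderr_text


def describe_hang(stderr_text: str) -> str:
    """Label a hung workunit by whether the GPU phase reported completion.

    The two labels are illustrative only, not official categories.
    """
    if gpu_phase_completed(stderr_text):
        return "hung after GPU phase completed"
    return "hung without GPU-complete marker"


# Hypothetical stderr fragments for illustration.
with_marker = "Releasing Memory Objects INFO: GPU calculations complete. Radon blur Level 0"
without_marker = "Extracting GLCM features..."

print(describe_hang(with_marker))
print(describe_hang(without_marker))
```

A hang reported without the marker would be the case of most interest here, since every log posted so far in the thread contains it.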
||
|
Filip Falta
Cruncher Joined: Sep 2, 2010 Post Count: 12 Status: Offline Project Badges: |
How many work units will be sent in total this beta?
|
||
|
slakin
Advanced Cruncher Joined: Jul 4, 2008 Post Count: 79 Status: Offline Project Badges: |
I can reproduce the problem by limiting processor usage to 90%; they run fine at 100%. I shut down Throttle, which was simply monitoring CPU temp and not active, but I wanted to remove it from the equation.
INFO: gpu_type not found in init_data.xml.
INFO: GPU device not specified in init_data.xml. Checking Commandline.
Boinc requested ATI gpu device number0
Found compute platform Advanced Micro Devices, Inc.
Selecting this platform
CL_DEVICE_NAME: Tahiti
CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
CL_DEVICE_VERSION: CAL 1.4.1720 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS:
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_CLOCK_FREQUENCY: 880 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Estimated kernel execution time = 0.21945 [sec]
Starting analysis of ../../projects/www.worldcommunitygrid.org/76002f305091cb70eb369f712b561b54.jp2...
Extracting GLCM features...
Total kernel time: 27.619987 (1026 kernel executions)
Total memory transfer time: 1.052173
Average kernel time: 0.026920
Min kernel time: 0.025538 (dx=5 dy=25 sample_dist=24 )
Max kernel time: 0.028436 dx=1 dy=0 sample_dist=0
INFO: Flushing queue...Waiting for all commands to finish...Releasing kernel...Releasing program...Freeing strings...Releasing queue...Releasing context...Releasing Memory Objects
INFO: GPU calculations complete.
Radon blur Level 0 |
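For anyone comparing logs across machines, here is a rough sketch of pulling the kernel timing figures out of a stderr dump and checking that the reported average matches total time divided by the number of kernel executions. The function name `parse_kernel_stats` and the regular expressions are my own assumptions based on the log format shown in this thread; they are not part of the application or of BOINC.

```python
import re


def parse_kernel_stats(stderr_text):
    """Extract total kernel time, execution count and reported average.

    Returns None if the 'Total kernel time' line is not present.
    The patterns assume the log wording seen in this thread.
    """
    total_match = re.search(
        r"Total kernel time:\s*([\d.]+)\s*\((\d+) kernel executions\)",
        stderr_text,
    )
    if total_match is None:
        return None
    total = float(total_match.group(1))
    runs = int(total_match.group(2))
    avg_match = re.search(r"Average kernel time:\s*([\d.]+)", stderr_text)
    reported = float(avg_match.group(1)) if avg_match else None
    return {
        "total": total,
        "runs": runs,
        "computed_avg": total / runs,
        "reported_avg": reported,
    }


# Figures copied from the Tahiti log above.
sample = (
    "Total kernel time: 27.619987 (1026 kernel executions) "
    "Total memory transfer time: 1.052173 "
    "Average kernel time: 0.026920"
)
stats = parse_kernel_stats(sample)
print(stats["runs"], stats["computed_avg"], stats["reported_avg"])
```

The same check applied to the other logs in the thread (214.246628 and 199.893204 seconds over 1026 executions) gives averages that agree with the reported values to within rounding, so the GPU phase itself looks consistent in the failing runs.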
||
|
captainjack
Advanced Cruncher Joined: Apr 14, 2008 Post Count: 144 Status: Offline Project Badges: |
Mine normally run fine. I changed the CPU utilization on one machine to 60% and now one of the jobs is stuck. It just passed 11 minutes and is still sitting there. Network activity is suspended and all other jobs are suspended. Here is the output from stderr.txt. Let me know if you want me to send you anything else. I will leave the machine sitting suspended for a while.
----------------------------------------
Estimated kernel execution time = 0.37663 [sec]
Starting analysis of ../../projects/www.worldcommunitygrid.org/2c9ddd54882cc7ae5b3133d295e9480e.jp2...
Extracting GLCM features...
Total kernel time: 199.893204 (1026 kernel executions)
Total memory transfer time: 6.877513
Average kernel time: 0.194828
Min kernel time: 0.182141 (dx=5 dy=25 sample_dist=24 )
Max kernel time: 0.208647 dx=1 dy=1 sample_dist=0
INFO: Flushing queue...Waiting for all commands to finish...Releasing kernel...Releasing program...Freeing strings...Releasing queue...Releasing context...Releasing Memory Objects
INFO: GPU calculations complete.
Radon blur Level 0
[Edit] One other thing I noticed: the GPU was running at a steady 95% while the CPU was starting and stopping, as it normally would when set to 60% utilization. When the GPU work finished, CPU utilization went to 1-2% and the job is still stuck. [Edit 1 times, last edit by captainjack at Sep 21, 2012 8:57:31 PM] |
||
|
grumpygrampy
Senior Cruncher USA Joined: Oct 8, 2011 Post Count: 223 Status: Offline Project Badges: |
....Try pausing, set CPU to 100% then resume and see what happens.
Thanks, nanoprobe... this worked (but it also worked occasionally at 60% with suspend/resume). You can tell a failure immediately by watching the process: the CPU runs at ~10% for 20 secs and hands off to the GPU, which runs at 100% for a few minutes and then hands back to the CPU. The CPU then either ramps up to ~10% to finish and upload, or it doesn't. If the CPU doesn't ramp up, the unit failed. |
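The rule of thumb above can be sketched as a small check on CPU readings taken after the GPU phase ends. This is only an illustration: the function name `looks_hung`, the idea of collecting a list of percentage samples, and the 5% threshold are all assumptions of mine, chosen to sit between the ~10% seen on good units and the 1-2% reported for stuck ones in this thread.

```python
def looks_hung(cpu_samples_after_gpu, ramp_threshold=5.0):
    """Return True if the CPU never ramped back up after the GPU phase.

    cpu_samples_after_gpu: CPU usage readings (percent) for the task,
    taken after the GPU drops back to idle. How they are collected is
    left open; any process monitor would do.
    ramp_threshold: hypothetical cutoff, not a value from the application.
    """
    if not cpu_samples_after_gpu:
        raise ValueError("need at least one CPU sample")
    return max(cpu_samples_after_gpu) < ramp_threshold


# Illustrative readings, not measured data.
stuck_unit = [1.0, 2.0, 1.5, 1.0]
good_unit = [2.0, 9.8, 10.4, 8.9]

print(looks_hung(stuck_unit))
print(looks_hung(good_unit))
```

A few samples spread over 20 to 30 seconds should be enough, since a healthy unit picks the CPU back up almost immediately after the GPU finishes.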
||
|
|