Thread Status: Active
Total posts in this thread: 63
This topic has been viewed 19239 times and has 62 replies
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Beta Test for Help Conquer Cancer - GPU v6.51

[Sep 21, 2012 7:03:16 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

One of the things we are looking for in this beta is more information from users who are experiencing workunits hung at 99%. If you see this, a few things would be helpful:

  • Are you using the BOINC CPU throttle? (Use at most N% of CPU)
  • Do you have a value set for "While processor usage is less than <blank>" that is not 0?
  • Can you paste the last 3-4 lines from stderr.txt? This can be found in the slot directory where the GPU workunit is running, which is in the BOINC data dir, typically in ProgramData\BOINC\slots\. There are numbered directories in there, and the slot where the GPU workunit is running should have the OpenCL binary.
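
For anyone who would rather not hunt through the numbered slot directories by hand, here is a minimal sketch that collects the last few lines of every stderr.txt it finds. The function name `tail_stderr` is made up for this example, and the `C:\ProgramData\BOINC\slots` path is only the typical Windows location mentioned above, so adjust it to your own data directory.

```python
from collections import deque
from pathlib import Path


def tail_stderr(slots_dir, n=4):
    """Return {slot_name: last n lines of stderr.txt} for each slot directory."""
    tails = {}
    for slot in sorted(Path(slots_dir).iterdir()):
        log = slot / "stderr.txt"
        if not log.is_file():
            continue  # this slot has no stderr.txt, so there is nothing to report
        with log.open(errors="replace") as fh:
            # A deque with maxlen=n keeps only the final n lines of the file.
            tails[slot.name] = [line.rstrip("\n") for line in deque(fh, maxlen=n)]
    return tails


if __name__ == "__main__":
    # Assumed default location on Windows; change this for your installation.
    slots = Path(r"C:\ProgramData\BOINC\slots")
    if slots.is_dir():
        for name, lines in tail_stderr(slots).items():
            print(f"--- slot {name} ---")
            print("\n".join(lines))
```

The output is grouped by slot number, so the block for the slot running the GPU workunit can be pasted straight into a reply.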


Thanks,
armstrdj
----------------------------------------
[Edit 1 times, last edit by armstrdj at Sep 21, 2012 7:26:45 PM]
[Sep 21, 2012 7:14:41 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

I'd be willing to bet many of those people with hung work units are running their CPUs at less than 100%. JMHO
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


----------------------------------------
[Edit 1 times, last edit by nanoprobe at Sep 21, 2012 8:00:40 PM]
[Sep 21, 2012 7:32:23 PM]
grumpygrampy
Senior Cruncher
USA
Joined: Oct 8, 2011
Post Count: 223
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

Yep, I'm one of those running the BOINC defaults (60% CPU).
Here's the data from the running (failing) beta:

cl_amd_popcnt
cl_khr_d3d10_sharing
Estimated kernel execution time = 0.44531 [sec]
Starting analysis of ../../projects/www.worldcommunitygrid.org/478b560a2bcdd854e7f68e7fc2cb6720.jp2...
Extracting GLCM features...
Total kernel time: 214.246628 (1026 kernel executions)
Total memory transfer time: 2.057441
Average kernel time: 0.208817
Min kernel time: 0.194183 (dx=3 dy=25 sample_dist=24 )
Max kernel time: 0.230991 dx=4 dy=6 sample_dist=6
INFO: Flushing queue...Waiting for all commands to finish...Releasing kernel...Releasing program...Freeing strings...Releasing queue...Releasing context...Releasing Memory Objects
INFO: GPU calculations complete.
Radon blur Level 0

Currently passing 6:00 minutes; good ones finish in 3:30.

Intel i7-860, Radeon HD 5770, BOINC 7.0.28
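
The timing lines in a log like the one above can be cross-checked with a few lines of Python. This is only a sketch: `parse_kernel_stats` is a name invented for the example, the regular expressions assume the exact wording shown in this post, and the times are assumed to be in seconds (only the "Estimated kernel execution time" line states a unit).

```python
import re

# Three timing lines copied from the stderr.txt excerpt in this post.
SAMPLE = """\
Total kernel time: 214.246628 (1026 kernel executions)
Total memory transfer time: 2.057441
Average kernel time: 0.208817
"""


def parse_kernel_stats(text):
    """Extract the timing figures from a stderr.txt excerpt, or None if absent."""
    total = re.search(
        r"Total kernel time:\s*([\d.]+)\s*\((\d+) kernel executions\)", text
    )
    avg = re.search(r"Average kernel time:\s*([\d.]+)", text)
    if not (total and avg):
        return None
    return {
        "total_s": float(total.group(1)),      # assumed to be seconds
        "executions": int(total.group(2)),
        "average_s": float(avg.group(1)),
    }


stats = parse_kernel_stats(SAMPLE)
# Recompute the average from the total and compare it with the logged value.
recomputed = stats["total_s"] / stats["executions"]
print(f"logged {stats['average_s']:.6f}, recomputed {recomputed:.6f}")
```

If the recomputed and logged averages agree, the GPU timing summary is at least internally consistent, which fits the log reaching "INFO: GPU calculations complete." before the workunit stalls.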
[Sep 21, 2012 7:44:34 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

Try pausing, setting CPU to 100%, then resuming, and see what happens.
[Sep 21, 2012 8:02:07 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

I think it is related to any setting that results in suspend/resume of the workunit. At this point I would be interested to hear from anyone who is seeing this issue without one of those settings, and from anyone who has a hung workunit but doesn't see "INFO: GPU calculations complete." in stderr. Now off to track down this bug. As always, thanks to our wonderful beta testers.
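
To sort reports into the cases described above, something like the following could be used. It is a rough sketch only: the function names are invented for this example, the marker string is the one quoted above, and whether a host is "throttled" is something the user has to supply from their own BOINC settings.

```python
GPU_DONE_MARKER = "INFO: GPU calculations complete."


def gpu_phase_completed(stderr_text):
    """True if the log contains the GPU-completion marker quoted in this thread."""
    return any(line.strip() == GPU_DONE_MARKER for line in stderr_text.splitlines())


def describe_report(stderr_text, throttled):
    """Summarise a hung-workunit report; `throttled` is supplied by the user."""
    done = gpu_phase_completed(stderr_text)
    if throttled and done:
        return "matches the suspected suspend/resume pattern"
    notes = []
    if not throttled:
        notes.append("no throttle or suspend setting")
    if not done:
        notes.append("GPU completion marker missing")
    return "worth reporting: " + ", ".join(notes)


log_tail = "INFO: GPU calculations complete.\nRadon blur Level 0\n"
print(describe_report(log_tail, throttled=True))
print(describe_report(log_tail, throttled=False))
```

Anything that comes back as "worth reporting" is one of the two situations asked about here: a hang without a throttle setting, or a hang where the GPU phase never logged completion.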

Thanks,
armstrdj
[Sep 21, 2012 8:32:32 PM]
Filip Falta
Cruncher
Joined: Sep 2, 2010
Post Count: 12
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

How many work units will be sent in total this beta?
[Sep 21, 2012 8:44:18 PM]
slakin
Advanced Cruncher
Joined: Jul 4, 2008
Post Count: 79
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

I can reproduce the problem by limiting processor usage to 90%; they run fine at 100%. I shut down Throttle, which was simply monitoring CPU temp and not actively throttling, because I wanted to remove it from the equation.

INFO: gpu_type not found in init_data.xml.
INFO: GPU device not specified in init_data.xml. Checking Commandline.
Boinc requested ATI gpu device number0
Found compute platform Advanced Micro Devices, Inc.
Selecting this platform
CL_DEVICE_NAME: Tahiti
CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
CL_DEVICE_VERSION: CAL 1.4.1720 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS:
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_CLOCK_FREQUENCY: 880 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS:
cl_khr_fp64
cl_amd_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes
cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_ext_atomic_counters_32
cl_amd_device_attribute_query
cl_amd_vec3
cl_amd_printf
cl_amd_media_ops
cl_amd_popcnt
cl_khr_d3d10_sharing
Estimated kernel execution time = 0.21945 [sec]
Starting analysis of ../../projects/www.worldcommunitygrid.org/76002f305091cb70eb369f712b561b54.jp2...
Extracting GLCM features...
Total kernel time: 27.619987 (1026 kernel executions)
Total memory transfer time: 1.052173
Average kernel time: 0.026920
Min kernel time: 0.025538 (dx=5 dy=25 sample_dist=24 )
Max kernel time: 0.028436 dx=1 dy=0 sample_dist=0
INFO: Flushing queue...Waiting for all commands to finish...Releasing kernel...Releasing program...Freeing strings...Releasing queue...Releasing context...Releasing Memory Objects
INFO: GPU calculations complete.
Radon blur Level 0
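
A quick way to check whether a host has one of the settings that triggers this is to read the local preferences file. The sketch below makes several assumptions: that the preferences are stored in BOINC's `global_prefs_override.xml`, that "Use at most N% of CPU" is the `cpu_usage_limit` element, and that the "While processor usage is less than" value is `suspend_cpu_usage`. The sample XML and the `throttle_settings` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative preferences fragment; the 90% limit mirrors the test in this post.
SAMPLE_PREFS = """\
<global_preferences>
  <cpu_usage_limit>90</cpu_usage_limit>
  <suspend_cpu_usage>0</suspend_cpu_usage>
</global_preferences>
"""


def throttle_settings(prefs_xml):
    """Report the two settings that can cause BOINC to suspend and resume tasks."""
    root = ET.fromstring(prefs_xml)

    def value(tag, default):
        node = root.find(tag)
        return float(node.text) if node is not None and node.text else default

    limit = value("cpu_usage_limit", 100.0)          # missing is treated as 100%
    suspend_above = value("suspend_cpu_usage", 0.0)  # 0 is treated as "no limit"
    return {
        "cpu_usage_limit": limit,
        "suspend_cpu_usage": suspend_above,
        "may_suspend": limit < 100.0 or suspend_above > 0.0,
    }


print(throttle_settings(SAMPLE_PREFS))
```

A `may_suspend` value of True means the host is in the group that has been able to reproduce the hang so far.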
[Sep 21, 2012 8:47:40 PM]
captainjack
Advanced Cruncher
Joined: Apr 14, 2008
Post Count: 144
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

Mine normally run fine. I changed the CPU utilization on one machine to 60% and now one of the jobs is stuck. It just passed 11 minutes and is still sitting there. Network activity is suspended and all other jobs are suspended. Here is the output from stderr.txt. Let me know if you want me to send you anything else. I will leave the machine sitting suspended for a while.

Estimated kernel execution time = 0.37663 [sec]
Starting analysis of ../../projects/www.worldcommunitygrid.org/2c9ddd54882cc7ae5b3133d295e9480e.jp2...
Extracting GLCM features...
Total kernel time: 199.893204 (1026 kernel executions)
Total memory transfer time: 6.877513
Average kernel time: 0.194828
Min kernel time: 0.182141 (dx=5 dy=25 sample_dist=24 )
Max kernel time: 0.208647 dx=1 dy=1 sample_dist=0
INFO: Flushing queue...Waiting for all commands to finish...Releasing kernel...Releasing program...Freeing strings...Releasing queue...Releasing context...Releasing Memory Objects
INFO: GPU calculations complete.
Radon blur Level 0

[Edit] One other thing I noticed: the GPU was running at a steady 95% while the CPU was starting and stopping as it normally would when set to 60% utilization. When the GPU work finished, CPU utilization went to 1-2% and the job is still stuck.
----------------------------------------
[Edit 1 times, last edit by captainjack at Sep 21, 2012 8:57:31 PM]
[Sep 21, 2012 8:53:00 PM]
grumpygrampy
Senior Cruncher
USA
Joined: Oct 8, 2011
Post Count: 223
Status: Offline
Re: Beta Test for Help Conquer Cancer - GPU v6.51

....Try pausing, set CPU to 100% then resume and see what happens.

Thanks, nanoprobe... this worked (but it also worked occasionally at 60% with suspend/resume)

You can tell a failure immediately by watching the process: the CPU runs at ~10% for 20 seconds, then hands off to the GPU, which runs at 100% for a few minutes and then hands back to the CPU. At that point the CPU either ramps up to ~10% to finish and upload, or it doesn't.

If the CPU doesn't ramp up, the unit has failed.
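
The "does the CPU ramp back up" check can be written down as a small heuristic. This is only a sketch: the sample readings are invented, and the 5% threshold and three-sample run are assumptions chosen to sit between the roughly 1-2% seen on stuck units and the ~10% seen on healthy ones in this thread.

```python
def cpu_ramped_up(samples, threshold=5.0, needed=3):
    """True if `needed` consecutive CPU readings reach `threshold` percent."""
    run = 0
    for pct in samples:
        # Extend the current run on a busy reading, reset it on an idle one.
        run = run + 1 if pct >= threshold else 0
        if run >= needed:
            return True
    return False


# Hypothetical per-second CPU readings taken after the GPU phase ends.
healthy = [1.5, 2.0, 9.8, 10.4, 10.1, 9.9]
stuck = [1.2, 1.8, 1.5, 2.0, 1.1, 1.6]
print(cpu_ramped_up(healthy), cpu_ramped_up(stuck))
```

Requiring a few consecutive busy readings avoids calling a unit healthy because of a single spike.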
[Sep 21, 2012 8:54:29 PM]