SJC_Steve
Cruncher
Joined: Nov 10, 2012
Post Count: 6
Status: Offline
Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

I'm getting a few errors every day, although most WUs finish successfully. I'm running 1 WU/GPU, and most of the errors are happening on a non-OC'd GT 430. Any suggestions on how to eliminate these errors? Here's a copy of the log output:


Result Log

Result Name: X0930101090860200807031725_ 0--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1: /usr/lib/nvidia-experimental-310/libOpenCL.so.1: no version information available (required by ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1)
Commandline: ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1 --zipfile X0930101090860200807031725.zip --imagelist images.txt --device 1
<app_init_data>
<major_version>7</major_version>
<minor_version>0</minor_version>
<release>27</release>
<app_version>708</app_version>
<app_name>hcc1</app_name>
<project_preferences>


<color_scheme>Tahiti Sunset</color_scheme>
<max_frames_sec>7</max_frames_sec>
<max_gfx_cpu_pct>5.0</max_gfx_cpu_pct>
</project_preferences>

<project_dir>/var/lib/boinc-client/projects/www.worldcommunitygrid.org</project_dir>
<boinc_dir>/var/lib/boinc-client</boinc_dir>
<wu_name>X0930101090860200807031725</wu_name>
<result_name>X0930101090860200807031725_0</result_name>
<shm_key>-1</shm_key>
<slot>3</slot>
<wu_cpu_time>0.000000</wu_cpu_time>
<starting_elapsed_time>0.000000</starting_elapsed_time>
<using_sandbox>0</using_sandbox>
<user_total_credit>216926.602781</user_total_credit>
<user_expavg_credit>5145.617331</user_expavg_credit>
<host_total_credit>65522.050866</host_total_credit>
<host_expavg_credit>4045.726912</host_expavg_credit>
<resource_share_fraction>0.500000</resource_share_fraction>
<checkpoint_period>300.000000</checkpoint_period>
<fraction_done_start>0.000000</fraction_done_start>
<fraction_done_end>1.000000</fraction_done_end>
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
<ncpus>1.000000</ncpus>
<rsc_fpops_est>25495009587324.000000</rsc_fpops_est>
<rsc_fpops_bound>509900191746480.000000</rsc_fpops_bound>
<rsc_memory_bound>78643200.000000</rsc_memory_bound>
<rsc_disk_bound>50000000.000000</rsc_disk_bound>
<computation_deadline>1359090173.000000</computation_deadline>
</app_init_data>
INFO: gpu_type set in init_data.xml to NVIDIA
INFO: gpu_device_num set in init_data.xml to 1
Boinc requested NVIDIA gpu device number 1
Unzipping input images ../../projects/www.worldcommunitygrid.org/X0930101090860200807031725_X0930101090860200807031725.zip
Processing jobdescription
Number of Images defined in image list is 2
Found compute platform NVIDIA Corporation
Selecting this platform
CL_DEVICE_NAME: GeForce GT 430
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DEVICE_VERSION: 310.14
CL_DEVICE_MAX_COMPUTE_UNITS: 
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1400 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 255 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1023 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS:
cl_khr_byte_addressable_store
cl_khr_icd
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_fp64
CL_DEVICE_COMPUTE_CAPABILITY_NV: 2.1
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 32768
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_TRUE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE
Estimated kernel execution time = 1.67292 [sec]
Starting analysis of X0930101090860200807031725.jp2...
Extracting GLCM features...
Total kernel time: 1055.643188 (1026 kernel executions)
Total memory transfer time: 4.714205
Average kernel time: 1.028892
Min kernel time: 0.891188 (dx=17 dy=19 sample_dist=24 )
Max kernel time: 1.259797 dx=1 dy=1 sample_dist=0
INFO: GPU calculations complete.
Total time for X0930101090860200807031725.jp2: 1140 seconds
Finished Image #0, pctComplete = 0.500000
Starting analysis of X0930101090284200807031733.jp2...
Extracting GLCM features...

</stderr_txt>
]]>
[Jan 19, 2013 4:39:50 PM]
dskagcommunity
Senior Cruncher
Austria
Joined: May 10, 2011
Post Count: 219
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

I think you're running without app_info, aren't you?
----------------------------------------
http://www.research.dskag.at
Crunching for my Dog who had "good" Braincancer.


[Jan 19, 2013 6:06:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

Maximum elapsed time is the limit on wall-clock time, measured as the time a GPU task is actively running. If the errors are sporadic, tell us whether any conditions might explain them; for example, is GPU computing allowed while the computer is in use?

The log shows 1140 seconds, or 19 minutes, at the point where the maximum elapsed time abort was called. How long is the average runtime of successful GPU tasks on this host? If there's only a small difference between the successful and the failed tasks, the device is probably borderline [the sum of CPU speed + GPU counts] and is likely getting into trouble when the somewhat harder-to-analyze images arrive.

Let us know.
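
To make that runtime comparison, the per-image totals can be pulled out of a result log. A minimal Python sketch, assuming the "Total time for <image>: N seconds" line format seen in the posted stderr output; the function names are made up for illustration:

```python
import re

# Matches lines such as:
#   Total time for X0930101090860200807031725.jp2: 1140 seconds
TOTAL_TIME = re.compile(r"Total time for (\S+): (\d+) seconds")

def image_times(stderr_text):
    """Return a list of (image_name, seconds) pairs found in the stderr text."""
    return [(name, int(secs)) for name, secs in TOTAL_TIME.findall(stderr_text)]

def summarize(stderr_text):
    """Return (images finished, total seconds, total minutes)."""
    times = image_times(stderr_text)
    total = sum(secs for _, secs in times)
    return len(times), total, total / 60.0

# Excerpt from the failed result posted above: only one image finished.
sample = """\
Total time for X0930101090860200807031725.jp2: 1140 seconds
Finished Image #0, pctComplete = 0.500000
"""
print(summarize(sample))  # (1, 1140, 19.0)
```

Running the same function over the log of a successful task gives the number to compare against.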
[Jan 19, 2013 6:24:28 PM]
SJC_Steve
Cruncher
Joined: Nov 10, 2012
Post Count: 6
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

@dskagcommunity
I'm just running the app as it comes to me, so I'm not sure what your question is regarding "without app_info". If there is something I'm supposed to add or change, let me know.

@SekeRob
This computer is only used for BOINC projects, and the GPUs (2) are only used for Help Conquer Cancer, so there are no conflicts on GPU time. Here's a readout from a successful WU on this same GPU. It looks to me like it ran even longer than the one that ended with an error, which doesn't make sense.

Thanks for all your help.
Steve


Result Log

Result Name: X0930103170809200808291509_ 0--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<stderr_txt>
../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1: /usr/lib/nvidia-experimental-310/libOpenCL.so.1: no version information available (required by ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1)
Commandline: ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1 --zipfile X0930103170809200808291509.zip --imagelist images.txt --device 1
<app_init_data>
<major_version>7</major_version>
<minor_version>0</minor_version>
<release>27</release>
<app_version>708</app_version>
<app_name>hcc1</app_name>
<project_preferences>


<color_scheme>Tahiti Sunset</color_scheme>
<max_frames_sec>7</max_frames_sec>
<max_gfx_cpu_pct>5.0</max_gfx_cpu_pct>
</project_preferences>

<project_dir>/var/lib/boinc-client/projects/www.worldcommunitygrid.org</project_dir>
<boinc_dir>/var/lib/boinc-client</boinc_dir>
<wu_name>X0930103170809200808291509</wu_name>
<result_name>X0930103170809200808291509_0</result_name>
<shm_key>-1</shm_key>
<slot>1</slot>
<wu_cpu_time>0.000000</wu_cpu_time>
<starting_elapsed_time>0.000000</starting_elapsed_time>
<using_sandbox>0</using_sandbox>
<user_total_credit>230741.978078</user_total_credit>
<user_expavg_credit>5502.802115</user_expavg_credit>
<host_total_credit>79337.426163</host_total_credit>
<host_expavg_credit>4583.953582</host_expavg_credit>
<resource_share_fraction>0.500000</resource_share_fraction>
<checkpoint_period>300.000000</checkpoint_period>
<fraction_done_start>0.000000</fraction_done_start>
<fraction_done_end>1.000000</fraction_done_end>
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
<ncpus>1.000000</ncpus>
<rsc_fpops_est>25520135981107.000000</rsc_fpops_est>
<rsc_fpops_bound>510402719622140.000000</rsc_fpops_bound>
<rsc_memory_bound>78643200.000000</rsc_memory_bound>
<rsc_disk_bound>50000000.000000</rsc_disk_bound>
<computation_deadline>1359231083.000000</computation_deadline>
</app_init_data>
INFO: gpu_type set in init_data.xml to NVIDIA
INFO: gpu_device_num set in init_data.xml to 1
Boinc requested NVIDIA gpu device number 1
Unzipping input images ../../projects/www.worldcommunitygrid.org/X0930103170809200808291509_X0930103170809200808291509.zip
Processing jobdescription
Number of Images defined in image list is 2
Found compute platform NVIDIA Corporation
Selecting this platform
CL_DEVICE_NAME: GeForce GT 430
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DEVICE_VERSION: 310.14
CL_DEVICE_MAX_COMPUTE_UNITS: 
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1400 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 255 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1023 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS:
cl_khr_byte_addressable_store
cl_khr_icd
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_fp64
CL_DEVICE_COMPUTE_CAPABILITY_NV: 2.1
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 32768
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_TRUE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE
Estimated kernel execution time = 1.52864 [sec]
Starting analysis of X0930103170809200808291509.jp2...
Extracting GLCM features...
Total kernel time: 1004.705811 (1026 kernel executions)
Total memory transfer time: 4.453445
Average kernel time: 0.979245
Min kernel time: 0.854081 (dx=25 dy=3 sample_dist=24 )
Max kernel time: 1.178735 dx=1 dy=1 sample_dist=0
INFO: GPU calculations complete.
Total time for X0930103170809200808291509.jp2: 1086 seconds
Finished Image #0, pctComplete = 0.500000
Starting analysis of X0930103171440200808291501.jp2...
Extracting GLCM features...
Total kernel time: 1106.255127 (1026 kernel executions)
Total memory transfer time: 9.057861
Average kernel time: 1.078221
Min kernel time: 0.949605 (dx=25 dy=5 sample_dist=24 )
Max kernel time: 1.260979 dx=2 dy=1 sample_dist=1
INFO: GPU calculations complete.
Total time for X0930103171440200808291501.jp2: 1190 seconds
Finished Image #1, pctComplete = 1.000000
CPU time used = 1668.424269
18:26:20 (7295): called boinc_finish

</stderr_txt>
]]>
[Jan 20, 2013 10:06:12 PM]
SJC_Steve
Cruncher
Joined: Nov 10, 2012
Post Count: 6
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

I have another question. Why is the time limit set so low for these jobs? If my possibly marginal GPU can successfully complete the computation in a slightly longer time, why would the program terminate it early? Why not just give it a bit more time?

Thanks,
Steve
[Jan 21, 2013 3:49:01 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

Actually, the first log only showed part of the logged times... There's CPU time, GPU kernel time (the time I was referring to), and overall elapsed time [that is what is credited in the stats for a successful job]. The jobs are allowed to run something like 5-10 times the original estimated runtime, say 1.5 hours or so. If your task gets to that, something is [temporarily] slowing down your system, or the task is stuck in a loop, but the 1140 seconds did not suggest that. Your stdoutdae.txt file would tell you when a task was started and when the timeout hit, on the wall clock.
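
The multiplier for a particular work unit can be read from the <app_init_data> block in its result log, as the ratio of rsc_fpops_bound to rsc_fpops_est. A minimal Python sketch, assuming the tag layout shown in the logs in this thread; the function name is hypothetical, and how the client turns the bound into seconds is not covered here:

```python
import re

def fpops_ratio(init_data_xml):
    """Return rsc_fpops_bound / rsc_fpops_est from an <app_init_data> dump."""
    def value(tag):
        # Pull the numeric text between <tag> and </tag>.
        match = re.search(r"<%s>([^<]+)</%s>" % (tag, tag), init_data_xml)
        if match is None:
            raise ValueError("missing <%s>" % tag)
        return float(match.group(1))
    return value("rsc_fpops_bound") / value("rsc_fpops_est")

# Values copied from the failed result log at the top of the thread.
sample = """\
<rsc_fpops_est>25495009587324.000000</rsc_fpops_est>
<rsc_fpops_bound>509900191746480.000000</rsc_fpops_bound>
"""
print(fpops_ratio(sample))  # 20.0
```

For the two results posted in this thread the ratio works out to 20. Whether the same factor applies to every work unit is not something these logs can confirm.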
[Jan 21, 2013 4:06:39 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

Some users have reported seeing GPU tasks getting hung if CPU benchmarks are started while a GPU process is running. We are looking into this issue. This could be what is causing the maximum elapsed time to be exceeded. You should be able to check stdoutdae.txt and see if a benchmark is started shortly after one of the tasks that exceeded the time is started. It will be labeled "Running CPU benchmarks".
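
That check can be scripted. A minimal Python sketch, assuming stdoutdae.txt lines begin with a "DD-Mon-YYYY HH:MM:SS" timestamp and that a benchmark run contains the text "Running CPU benchmarks"; the function names and the 10-minute window are hypothetical choices, and the month abbreviation parsing assumes an English locale:

```python
from datetime import datetime, timedelta

STAMP = "%d-%b-%Y %H:%M:%S"  # e.g. 20-Jan-2013 13:05:48

def parse_stamp(line):
    """Parse the leading timestamp of a log line, or return None."""
    try:
        return datetime.strptime(line[:20], STAMP)
    except ValueError:
        return None

def benchmarks_after_start(lines, task_name, window_minutes=10):
    """Return timestamps of benchmark runs logged within the window after the task started."""
    start = None
    hits = []
    for line in lines:
        when = parse_stamp(line)
        if when is None:
            continue  # skip lines without a timestamp
        if ("Starting task " + task_name + " ") in line:
            start = when
        elif start is not None and "Running CPU benchmarks" in line:
            if when - start <= timedelta(minutes=window_minutes):
                hits.append(when)
    return hits

sample_lines = [
    "20-Jan-2013 13:05:48 [World Community Grid] Starting task X0960100200635200806271642_0 using hcc1 version 708 (nvidia_hcc1) in slot 1",
    # Hypothetical line, made up to illustrate a hit:
    "20-Jan-2013 13:07:00 [---] Running CPU benchmarks",
]
print(benchmarks_after_start(sample_lines, "X0960100200635200806271642_0"))
```

An empty list means no benchmark was logged inside the window for that task.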

Thanks,
armstrdj
[Jan 22, 2013 5:26:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

Oddly, it has disappeared from the manual as a cc_config.xml <options> tag, though it still shows as a boinc.exe command-line parameter, but adding

<skip_cpu_benchmarks>1</skip_cpu_benchmarks>

will indefinitely postpone benchmarking... It still works for me without incurring a tag error warning [just in case there's no fix]. Server 700 of WCG ignores CPU benchmark information anyhow, which is what the BOINC benchmark measures every 5 days in old clients, and on upgrade and boot in new clients.
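
For anyone unsure where the tag goes: it sits inside <options>, which sits inside the <cc_config> root element of cc_config.xml. A minimal Python sketch that adds the tag to existing config text, or builds a fresh config if there is none; the function name is hypothetical, and writing the result back to your BOINC data directory is left out on purpose:

```python
import xml.etree.ElementTree as ET

def with_skip_benchmarks(cc_config_xml=None):
    """Return cc_config.xml text with skip_cpu_benchmarks set to 1 under <options>."""
    if cc_config_xml:
        root = ET.fromstring(cc_config_xml)
    else:
        root = ET.Element("cc_config")  # no existing file: start from an empty config
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("skip_cpu_benchmarks")
    if flag is None:
        flag = ET.SubElement(options, "skip_cpu_benchmarks")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")

print(with_skip_benchmarks())
# <cc_config><options><skip_cpu_benchmarks>1</skip_cpu_benchmarks></options></cc_config>
```

Passing in the text of an existing cc_config.xml keeps whatever other settings it already has.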
[Jan 22, 2013 5:38:45 PM]
SJC_Steve
Cruncher
Joined: Nov 10, 2012
Post Count: 6
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

Here's a readout of the stdoutdae.txt for a failed work unit. No CPU benchmarks present. Top of readout is the start and bottom is the error message.


20-Jan-2013 13:05:48 [World Community Grid] Starting task X0960100200635200806271642_0 using hcc1 version 708 (nvidia_hcc1) in slot 1
20-Jan-2013 13:05:50 [World Community Grid] Started upload of X0960100200686200806271642_1_0
20-Jan-2013 13:05:50 [World Community Grid] Started upload of X0960100200686200806271642_1_1
20-Jan-2013 13:05:52 [World Community Grid] Finished upload of X0960100200686200806271642_1_1
20-Jan-2013 13:05:52 [World Community Grid] Started upload of X0960100200686200806271642_1_2
20-Jan-2013 13:05:54 [World Community Grid] Finished upload of X0960100200686200806271642_1_0
20-Jan-2013 13:05:54 [World Community Grid] Finished upload of X0960100200686200806271642_1_2
20-Jan-2013 13:16:47 [World Community Grid] Computation for task X0960100200703200806271642_1 finished
20-Jan-2013 13:16:47 [World Community Grid] Starting task X0960100200699200806271642_1 using hcc1 version 708 (nvidia_hcc1) in slot 0
20-Jan-2013 13:16:49 [World Community Grid] Started upload of X0960100200703200806271642_1_0
20-Jan-2013 13:16:49 [World Community Grid] Started upload of X0960100200703200806271642_1_1
20-Jan-2013 13:16:50 [World Community Grid] Finished upload of X0960100200703200806271642_1_1
20-Jan-2013 13:16:50 [World Community Grid] Started upload of X0960100200703200806271642_1_2
20-Jan-2013 13:16:53 [World Community Grid] Finished upload of X0960100200703200806271642_1_0
20-Jan-2013 13:16:53 [World Community Grid] Finished upload of X0960100200703200806271642_1_2
20-Jan-2013 13:41:26 [World Community Grid] Computation for task faah37727_ZINC32771097_xh2_xtal_02_0 finished
20-Jan-2013 13:41:26 [World Community Grid] Resuming task faah37739_ZINC13097718_xh2_xtal_01_0 using faah version 640 in slot 4
20-Jan-2013 13:41:30 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_0
20-Jan-2013 13:41:30 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_1
20-Jan-2013 13:41:31 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_0
20-Jan-2013 13:41:31 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_2
20-Jan-2013 13:41:33 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_1
20-Jan-2013 13:41:33 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_2
20-Jan-2013 13:41:33 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_3
20-Jan-2013 13:41:34 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_3
20-Jan-2013 13:48:22 [World Community Grid] Aborting task X0960100200635200806271642_0: exceeded elapsed time limit 2552.36 (510402.72G/199.97G)
20-Jan-2013 13:48:24 [World Community Grid] Computation for task X0960100200635200806271642_0 finished
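
The numbers in the abort line can be cross-checked against the start line. A minimal Python sketch, assuming the parenthesized pair in the abort message is a work bound divided by a speed estimate (both in units of 1e9), which is an interpretation based on the fact that the quotient matches the printed limit; the function names are hypothetical:

```python
import re
from datetime import datetime

STAMP = "%d-%b-%Y %H:%M:%S"  # timestamp format used in the readout above
ABORT = re.compile(r"exceeded elapsed time limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)")

def parse_abort(line):
    """Return (limit_seconds, bound_g, speed_g) from an abort line."""
    match = ABORT.search(line)
    if match is None:
        raise ValueError("not an elapsed-time abort line")
    return tuple(float(x) for x in match.groups())

def seconds_between(start_line, end_line):
    """Wall-clock seconds between the timestamps of two log lines."""
    start = datetime.strptime(start_line[:20], STAMP)
    end = datetime.strptime(end_line[:20], STAMP)
    return (end - start).total_seconds()

start_line = "20-Jan-2013 13:05:48 [World Community Grid] Starting task X0960100200635200806271642_0 using hcc1 version 708 (nvidia_hcc1) in slot 1"
abort_line = "20-Jan-2013 13:48:22 [World Community Grid] Aborting task X0960100200635200806271642_0: exceeded elapsed time limit 2552.36 (510402.72G/199.97G)"

limit, bound, speed = parse_abort(abort_line)
print("limit from log:", limit)
print("bound / speed: ", round(bound / speed, 1))
print("wall clock:    ", seconds_between(start_line, abort_line))
```

The task started at 13:05:48 and was aborted at 13:48:22, which is 2554 seconds and lines up with the 2552.36-second limit. For comparison, the successful result earlier in the thread reports 1086 + 1190 = 2276 seconds of per-image time, so on this card a normal run seems to sit within roughly 12% of the limit.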
[Jan 23, 2013 8:32:13 PM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Help Conquer Cancer 7.08 (nvidia_hcc1) "Maximum elapsed time exceeded" errors

We are increasing the limit for GPU workunits. The new workunits should start showing up in a couple of days.
Thanks,
armstrdj
[Jan 31, 2013 2:14:37 PM]