World Community Grid Forums
Thread Status: Active Total posts in this thread: 10
SJC_Steve
Cruncher Joined: Nov 10, 2012 Post Count: 6 Status: Offline |
I'm getting a few errors every day, although most WUs finish successfully. I'm running 1 WU per GPU, and most of the errors happen on a non-overclocked GT 430. Any suggestions on how to eliminate these errors? Here's a copy of the log output:
Result Log Result Name: X0930101090860200807031725_ 0-- <core_client_version>7.0.27</core_client_version> <![CDATA[ <message> Maximum elapsed time exceeded </message> <stderr_txt> ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1: /usr/lib/nvidia-experimental-310/libOpenCL.so.1: no version information available (required by ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1) Commandline: ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1 --zipfile X0930101090860200807031725.zip --imagelist images.txt --device 1 <app_init_data> <major_version>7</major_version> <minor_version>0</minor_version> <release>27</release> <app_version>708</app_version> <app_name>hcc1</app_name> <project_preferences> <color_scheme>Tahiti Sunset</color_scheme> <max_frames_sec>7</max_frames_sec> <max_gfx_cpu_pct>5.0</max_gfx_cpu_pct> </project_preferences> <project_dir>/var/lib/boinc-client/projects/www.worldcommunitygrid.org</project_dir> <boinc_dir>/var/lib/boinc-client</boinc_dir> <wu_name>X0930101090860200807031725</wu_name> <result_name>X0930101090860200807031725_0</result_name> <shm_key>-1</shm_key> <slot>3</slot> <wu_cpu_time>0.000000</wu_cpu_time> <starting_elapsed_time>0.000000</starting_elapsed_time> <using_sandbox>0</using_sandbox> <user_total_credit>216926.602781</user_total_credit> <user_expavg_credit>5145.617331</user_expavg_credit> <host_total_credit>65522.050866</host_total_credit> <host_expavg_credit>4045.726912</host_expavg_credit> <resource_share_fraction>0.500000</resource_share_fraction> <checkpoint_period>300.000000</checkpoint_period> <fraction_done_start>0.000000</fraction_done_start> <fraction_done_end>1.000000</fraction_done_end> <gpu_type>NVIDIA</gpu_type> <gpu_device_num>1</gpu_device_num> <gpu_opencl_dev_index>1</gpu_opencl_dev_index> <ncpus>1.000000</ncpus> <rsc_fpops_est>25495009587324.000000</rsc_fpops_est> 
<rsc_fpops_bound>509900191746480.000000</rsc_fpops_bound> <rsc_memory_bound>78643200.000000</rsc_memory_bound> <rsc_disk_bound>50000000.000000</rsc_disk_bound> <computation_deadline>1359090173.000000</computation_deadline> </app_init_data> INFO: gpu_type set in init_data.xml to NVIDIA INFO: gpu_device_num set in init_data.xml to 1 Boinc requested NVIDIA gpu device number 1 Unzipping input images ../../projects/www.worldcommunitygrid.org/X0930101090860200807031725_X0930101090860200807031725.zip Processing jobdescription Number of Images defined in image list is 2 Found compute platform NVIDIA Corporation Selecting this platform CL_DEVICE_NAME: GeForce GT 430 CL_DEVICE_VENDOR: NVIDIA Corporation CL_DEVICE_VERSION: 310.14 CL_DEVICE_MAX_COMPUTE_UNITS: CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64 CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024 CL_DEVICE_MAX_CLOCK_FREQUENCY: 1400 MHz CL_DEVICE_ADDRESS_BITS: 32 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 255 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 1023 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: local CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 CL_DEVICE_COMPUTE_CAPABILITY_NV: 2.1 CL_DEVICE_REGISTERS_PER_BLOCK_NV: 32768 CL_DEVICE_WARP_SIZE_NV: 32 CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_TRUE CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE Estimated kernel execution time = 1.67292 [sec] Starting analysis of X0930101090860200807031725.jp2... Extracting GLCM features... 
Total kernel time: 1055.643188 (1026 kernel executions) Total memory transfer time: 4.714205 Average kernel time: 1.028892 Min kernel time: 0.891188 (dx=17 dy=19 sample_dist=24 ) Max kernel time: 1.259797 dx=1 dy=1 sample_dist=0 INFO: GPU calculations complete. Total time for X0930101090860200807031725.jp2: 1140 seconds Finished Image #0, pctComplete = 0.500000 Starting analysis of X0930101090284200807031733.jp2... Extracting GLCM features... </stderr_txt> ]]> |
||
|
dskagcommunity
Senior Cruncher Austria Joined: May 10, 2011 Post Count: 219 Status: Offline |
I think you're running without an app_info file, right? |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Maximum elapsed time" is the limit on wall-clock time, which is the measured active time for a GPU task. If the errors are sporadic, tell us whether any conditions might explain them, for example whether GPU computing continues while the computer is in use.
The log shows 1140 seconds, or 19 minutes, for the first image before the maximum-elapsed-time error was raised. How long is the average runtime of successful GPU tasks on this host? If there's only a small difference between the successful and the failed tasks, the device is probably borderline [the sum of CPU speed + GPU count] and is likely getting into trouble when the somewhat harder-to-analyze images arrive. Let us know. |
||
|
SJC_Steve
Cruncher Joined: Nov 10, 2012 Post Count: 6 Status: Offline |
@dskagcommunity
I'm just running the app as it comes to me, and I'm not sure what you mean by "without app_info". If there is something I'm supposed to add or change, let me know.
@SekeRob
This computer is only used for BOINC projects, and the two GPUs are only used for Help Conquer Cancer, so there are no conflicts over GPU time. Here's the log from a successful WU on the same GPU. It looks to me like it ran even longer than the one that ended with an error, which doesn't make sense.
Thanks for all your help.
Steve
Result Log Result Name: X0930103170809200808291509_ 0-- <core_client_version>7.0.27</core_client_version> <![CDATA[ <stderr_txt> ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1: /usr/lib/nvidia-experimental-310/libOpenCL.so.1: no version information available (required by ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1) Commandline: ../../projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.08_x86_64-pc-linux-gnu__nvidia_hcc1 --zipfile X0930103170809200808291509.zip --imagelist images.txt --device 1 <app_init_data> <major_version>7</major_version> <minor_version>0</minor_version> <release>27</release> <app_version>708</app_version> <app_name>hcc1</app_name> <project_preferences> <color_scheme>Tahiti Sunset</color_scheme> <max_frames_sec>7</max_frames_sec> <max_gfx_cpu_pct>5.0</max_gfx_cpu_pct> </project_preferences> <project_dir>/var/lib/boinc-client/projects/www.worldcommunitygrid.org</project_dir> <boinc_dir>/var/lib/boinc-client</boinc_dir> <wu_name>X0930103170809200808291509</wu_name> <result_name>X0930103170809200808291509_0</result_name> <shm_key>-1</shm_key> <slot>1</slot> <wu_cpu_time>0.000000</wu_cpu_time> <starting_elapsed_time>0.000000</starting_elapsed_time> <using_sandbox>0</using_sandbox> <user_total_credit>230741.978078</user_total_credit> <user_expavg_credit>5502.802115</user_expavg_credit> <host_total_credit>79337.426163</host_total_credit>
<host_expavg_credit>4583.953582</host_expavg_credit> <resource_share_fraction>0.500000</resource_share_fraction> <checkpoint_period>300.000000</checkpoint_period> <fraction_done_start>0.000000</fraction_done_start> <fraction_done_end>1.000000</fraction_done_end> <gpu_type>NVIDIA</gpu_type> <gpu_device_num>1</gpu_device_num> <gpu_opencl_dev_index>1</gpu_opencl_dev_index> <ncpus>1.000000</ncpus> <rsc_fpops_est>25520135981107.000000</rsc_fpops_est> <rsc_fpops_bound>510402719622140.000000</rsc_fpops_bound> <rsc_memory_bound>78643200.000000</rsc_memory_bound> <rsc_disk_bound>50000000.000000</rsc_disk_bound> <computation_deadline>1359231083.000000</computation_deadline> </app_init_data> INFO: gpu_type set in init_data.xml to NVIDIA INFO: gpu_device_num set in init_data.xml to 1 Boinc requested NVIDIA gpu device number 1 Unzipping input images ../../projects/www.worldcommunitygrid.org/X0930103170809200808291509_X0930103170809200808291509.zip Processing jobdescription Number of Images defined in image list is 2 Found compute platform NVIDIA Corporation Selecting this platform CL_DEVICE_NAME: GeForce GT 430 CL_DEVICE_VENDOR: NVIDIA Corporation CL_DEVICE_VERSION: 310.14 CL_DEVICE_MAX_COMPUTE_UNITS: CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3 CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64 CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024 CL_DEVICE_MAX_CLOCK_FREQUENCY: 1400 MHz CL_DEVICE_ADDRESS_BITS: 32 CL_DEVICE_MAX_MEM_ALLOC_SIZE: 255 MByte CL_DEVICE_GLOBAL_MEM_SIZE: 1023 MByte CL_DEVICE_ERROR_CORRECTION_SUPPORT: no CL_DEVICE_LOCAL_MEM_TYPE: local CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics 
cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 CL_DEVICE_COMPUTE_CAPABILITY_NV: 2.1 CL_DEVICE_REGISTERS_PER_BLOCK_NV: 32768 CL_DEVICE_WARP_SIZE_NV: 32 CL_DEVICE_GPU_OVERLAP_NV: CL_TRUE CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_TRUE CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE Estimated kernel execution time = 1.52864 [sec] Starting analysis of X0930103170809200808291509.jp2... Extracting GLCM features... Total kernel time: 1004.705811 (1026 kernel executions) Total memory transfer time: 4.453445 Average kernel time: 0.979245 Min kernel time: 0.854081 (dx=25 dy=3 sample_dist=24 ) Max kernel time: 1.178735 dx=1 dy=1 sample_dist=0 INFO: GPU calculations complete. Total time for X0930103170809200808291509.jp2: 1086 seconds Finished Image #0, pctComplete = 0.500000 Starting analysis of X0930103171440200808291501.jp2... Extracting GLCM features... Total kernel time: 1106.255127 (1026 kernel executions) Total memory transfer time: 9.057861 Average kernel time: 1.078221 Min kernel time: 0.949605 (dx=25 dy=5 sample_dist=24 ) Max kernel time: 1.260979 dx=2 dy=1 sample_dist=1 INFO: GPU calculations complete. Total time for X0930103171440200808291501.jp2: 1190 seconds Finished Image #1, pctComplete = 1.000000 CPU time used = 1668.424269 18:26:20 (7295): called boinc_finish </stderr_txt> ]]> |
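The per-image totals in the two logs can be compared directly. The following is a minimal sketch, not part of the original posts: the numbers are copied from the stderr output above, and the projection for the failed task assumes (without evidence from the log) that its second image would have taken about as long as its first.

```python
# Per-image wall-clock totals copied from the two stderr logs in this thread.
successful_images = [1086, 1190]  # seconds; both images finished
failed_first_image = 1140         # seconds; only image #0 finished before the abort

successful_total = sum(successful_images)

# Assumption: the failed task's second image would have taken about as long
# as its first one. The log does not tell us this.
projected_failed_total = 2 * failed_first_image


def as_minutes(seconds):
    """Convert seconds to minutes, rounded to one decimal place."""
    return round(seconds / 60, 1)


print(successful_total, as_minutes(successful_total))
print(projected_failed_total, as_minutes(projected_failed_total))
```

Under that assumption the two tasks would need nearly the same GPU time, which fits the observation that the failed log does not look any slower than the successful one.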
||
|
SJC_Steve
Cruncher Joined: Nov 10, 2012 Post Count: 6 Status: Offline |
I have another question. Why is the time limit set so low for these jobs? If my possibly marginal GPU can successfully complete the computation in a slightly longer time, why would the program terminate it early? Why not just give it a bit more time?
Thanks, Steve |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Actually, the first log only showed part of the logged times... there's CPU time, GPU kernel time, the time I was referring to, and overall elapsed time [which is what gets credited in the stats for a successful job]. The jobs are allowed to run something like 5-10 times the original estimated runtime, say 1.5 hours or so. If your task gets to that point, something is [temporarily] bogging down your system, or the task is stuck in a loop, but the 1140 seconds did not suggest that. Your stdoutdae.txt file will show, in wall-clock time, when a task was started and when the timeout hit.
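The allowed multiple can be checked against the `<app_init_data>` block of the failed task's log, which carries both `rsc_fpops_est` and `rsc_fpops_bound`. This is a minimal sketch, assuming the bound is what caps the run; the client message quoted later in the thread, "exceeded elapsed time limit 2552.36 (510402.72G/199.97G)", appears consistent with that reading. The variable names are illustrative.

```python
# Values copied from the <app_init_data> block of the failed task's log.
rsc_fpops_est = 25495009587324.0     # estimated floating-point operations
rsc_fpops_bound = 509900191746480.0  # upper bound before the task is aborted

# Assumption: the elapsed-time limit scales with rsc_fpops_bound, so this
# ratio is the multiple of the estimated work the task is allowed to consume.
allowed_multiple = rsc_fpops_bound / rsc_fpops_est

print(allowed_multiple)
```

For this work unit the ratio is 20, so the bound itself is generous; how many seconds it turns into depends on the speed the client assumes for the device.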
|
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline |
Some users have reported seeing GPU tasks getting hung if CPU benchmarks are started while a GPU process is running. We are looking into this issue. This could be what is causing the maximum elapsed time to be exceeded. You should be able to check stdoutdae.txt and see if a benchmark is started shortly after one of the tasks that exceeded the time is started. It will be labeled "Running CPU benchmarks".
Thanks, armstrdj |
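The check described above can be scripted. This is a rough sketch, assuming stdoutdae.txt has one entry per line and that each entry starts with a timestamp such as `20-Jan-2013 13:05:48`, as in the excerpt posted in this thread. The function names and the 10-minute window are illustrative choices, and the benchmark line in the sample is made up for demonstration.

```python
from datetime import datetime, timedelta


def parse_time(line):
    """Parse the leading 'DD-Mon-YYYY HH:MM:SS' timestamp (English month names)."""
    return datetime.strptime(line[:20], "%d-%b-%Y %H:%M:%S")


def benchmarks_after_start(lines, task_name, window_minutes=10):
    """Return benchmark lines logged within window_minutes after task_name started."""
    start = None
    hits = []
    for line in lines:
        if "Starting task " + task_name in line:
            start = parse_time(line)
        elif start is not None and "Running CPU benchmarks" in line:
            delay = parse_time(line) - start
            if timedelta(0) <= delay <= timedelta(minutes=window_minutes):
                hits.append(line)
    return hits


sample = [
    "20-Jan-2013 13:05:48 [World Community Grid] Starting task "
    "X0960100200635200806271642_0 using hcc1 version 708 (nvidia_hcc1) in slot 1",
    # Made-up line for illustration only; it does not come from a real log.
    "20-Jan-2013 13:07:00 [---] Running CPU benchmarks",
]
hits = benchmarks_after_start(sample, "X0960100200635200806271642_0")
print(hits)
```

To run it against a real log, read the file with `open(path).read().splitlines()` and pass the name of a task that exceeded the time limit.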
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Oddly, it has disappeared from the manual as a cc_config.xml <options> tag, though it still shows up as a boinc.exe command-line parameter, but adding
<skip_cpu_benchmarks>1</skip_cpu_benchmarks> will postpone benchmarking indefinitely. It still works for me without triggering a tag error warning. [Just in case there's no fix.] Server 700 of WCG ignores CPU benchmark information anyhow; the BOINC benchmark runs every 5 days in old clients, and on upgrade and at boot in new clients. |
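To show where the tag sits, here is a small sketch that builds the text of a cc_config.xml containing only this option. It assumes the usual layout of a `<cc_config>` root with an `<options>` child; the helper name `build_cc_config` is made up, and the script only prints the text rather than writing to the BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_cc_config(skip_benchmarks=True):
    """Return cc_config.xml text with skip_cpu_benchmarks set inside <options>."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    flag = ET.SubElement(options, "skip_cpu_benchmarks")
    flag.text = "1" if skip_benchmarks else "0"
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config()
print(config_text)
```

If a cc_config.xml already exists, the tag should go inside its existing `<options>` element instead of replacing the file. The client normally needs to re-read its config files or be restarted before the change takes effect.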
||
|
SJC_Steve
Cruncher Joined: Nov 10, 2012 Post Count: 6 Status: Offline |
Here's a readout of the stdoutdae.txt for a failed work unit. No CPU benchmarks present. Top of readout is the start and bottom is the error message.
20-Jan-2013 13:05:48 [World Community Grid] Starting task X0960100200635200806271642_0 using hcc1 version 708 (nvidia_hcc1) in slot 1
20-Jan-2013 13:05:50 [World Community Grid] Started upload of X0960100200686200806271642_1_0
20-Jan-2013 13:05:50 [World Community Grid] Started upload of X0960100200686200806271642_1_1
20-Jan-2013 13:05:52 [World Community Grid] Finished upload of X0960100200686200806271642_1_1
20-Jan-2013 13:05:52 [World Community Grid] Started upload of X0960100200686200806271642_1_2
20-Jan-2013 13:05:54 [World Community Grid] Finished upload of X0960100200686200806271642_1_0
20-Jan-2013 13:05:54 [World Community Grid] Finished upload of X0960100200686200806271642_1_2
20-Jan-2013 13:16:47 [World Community Grid] Computation for task X0960100200703200806271642_1 finished
20-Jan-2013 13:16:47 [World Community Grid] Starting task X0960100200699200806271642_1 using hcc1 version 708 (nvidia_hcc1) in slot 0
20-Jan-2013 13:16:49 [World Community Grid] Started upload of X0960100200703200806271642_1_0
20-Jan-2013 13:16:49 [World Community Grid] Started upload of X0960100200703200806271642_1_1
20-Jan-2013 13:16:50 [World Community Grid] Finished upload of X0960100200703200806271642_1_1
20-Jan-2013 13:16:50 [World Community Grid] Started upload of X0960100200703200806271642_1_2
20-Jan-2013 13:16:53 [World Community Grid] Finished upload of X0960100200703200806271642_1_0
20-Jan-2013 13:16:53 [World Community Grid] Finished upload of X0960100200703200806271642_1_2
20-Jan-2013 13:41:26 [World Community Grid] Computation for task faah37727_ZINC32771097_xh2_xtal_02_0 finished
20-Jan-2013 13:41:26 [World Community Grid] Resuming task faah37739_ZINC13097718_xh2_xtal_01_0 using faah version 640 in slot 4
20-Jan-2013 13:41:30 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_0
20-Jan-2013 13:41:30 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_1
20-Jan-2013 13:41:31 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_0
20-Jan-2013 13:41:31 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_2
20-Jan-2013 13:41:33 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_1
20-Jan-2013 13:41:33 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_2
20-Jan-2013 13:41:33 [World Community Grid] Started upload of faah37727_ZINC32771097_xh2_xtal_02_0_3
20-Jan-2013 13:41:34 [World Community Grid] Finished upload of faah37727_ZINC32771097_xh2_xtal_02_0_3
20-Jan-2013 13:48:22 [World Community Grid] Aborting task X0960100200635200806271642_0: exceeded elapsed time limit 2552.36 (510402.72G/199.97G)
20-Jan-2013 13:48:24 [World Community Grid] Computation for task X0960100200635200806271642_0 finished |
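The abort message carries its own arithmetic, and the timestamps give the actual wall-clock run. The sketch below parses the two relevant entries from the readout above. It assumes the figures in parentheses are an operations bound divided by an assumed device speed, both in units of 10^9, which the numbers seem to bear out; the regular expression and variable names are illustrative.

```python
import re
from datetime import datetime

# Both entries are copied from the stdoutdae.txt excerpt in this post.
START = ("20-Jan-2013 13:05:48 [World Community Grid] Starting task "
         "X0960100200635200806271642_0 using hcc1 version 708 (nvidia_hcc1) in slot 1")
ABORT = ("20-Jan-2013 13:48:22 [World Community Grid] Aborting task "
         "X0960100200635200806271642_0: exceeded elapsed time limit 2552.36 "
         "(510402.72G/199.97G)")


def timestamp(line):
    """Parse the leading 'DD-Mon-YYYY HH:MM:SS' timestamp (English month names)."""
    return datetime.strptime(line[:20], "%d-%b-%Y %H:%M:%S")


match = re.search(r"limit ([\d.]+) \(([\d.]+)G/([\d.]+)G\)", ABORT)
limit_seconds, bound_g, speed_g = (float(group) for group in match.groups())

# Assumption: limit = operations bound / assumed device speed.
recomputed_limit = bound_g / speed_g

# Wall-clock time between the start of the task and the abort.
wall_seconds = (timestamp(ABORT) - timestamp(START)).total_seconds()

print(limit_seconds, round(recomputed_limit, 1), wall_seconds)
```

The task ran for 2554 seconds of wall-clock time against a limit of about 2552 seconds, roughly 42.5 minutes. Since the successful task posted earlier needed 2276 seconds for its two images alone, this GPU appears to be working very close to the limit even when nothing goes wrong.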
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline |
We are increasing the limit for GPU workunits. The new workunits should start showing up in a couple of days.
Thanks, armstrdj |
||