World Community Grid - View Thread - The latest batch occasionally fail half way through

World Community Grid Forums

Category: Completed Research

Forum: Help Conquer Cancer

Thread: The latest batch occasionally fail half way through

Thread Status: Active
Total posts in this thread: 21

This topic has been viewed 6248 times and has 20 replies

David Autumns
Ace Cruncher
UK
Joined: Nov 16, 2004
Post Count: 11062
Status: Offline

The latest batch occasionally fail half way through

My room might be too warm, but just a heads-up.

Regards

Dave

----------------------------------------

----------------------------------------
[Edit 1 times, last edit by David Autumns at Nov 18, 2012 10:05:31 AM]

[Nov 15, 2012 9:28:38 PM]

BladeD
Ace Cruncher
USA
Joined: Nov 17, 2004
Post Count: 28976
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

Fail how? Any messages?

----------------------------------------

MyCity

[Nov 16, 2012 6:30:59 AM]

David Autumns
Ace Cruncher
UK
Joined: Nov 16, 2004
Post Count: 11062
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

No failure messages, but looking at the timings I now suspect they are failing at the end of the first image in the current two-image batch.

About 10% of work units don't make it past this point.

They appear to upload normally, just too early (after the first image), so they are recorded as errors because they haven't been completed.

I have 6 HCC tasks running concurrently on the same card (no GPU memory issues), as I have a 6-core Phenom II. They are all failing around the 20-minute mark, which with my current GPU is where the task sits paused at 49.707% on the clock while the CPU core zips up the data for return before progressing to the second image.

Dave

----------------------------------------

[Nov 17, 2012 7:31:06 PM]

coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

I've just found six of these too (runtime of over 3 hours and going nowhere).

I've got them to finish by SUSPENDING then RESUMING. Suggest you do the same if you get more.

Good luck!

ATI 12.11b4
BOINC 7.0.36

----------------------------------------

Crunching in memory of my Mum PEGGY, cousin ROPPA and Aunt AUDREY.

[Nov 17, 2012 10:27:18 PM]

David Autumns
Ace Cruncher
UK
Joined: Nov 16, 2004
Post Count: 11062
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

Hi

They don't get stuck; they get returned to WCG halfway through with a computation error.

----------------------------------------

[Nov 18, 2012 10:04:46 AM]

coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

That's interesting and obviously a different situation to what I have found. Is there anything in the logs that says why they are being returned? How long after they reach 50% do they get sent back? Are all tasks behaving this way, or like mine, is it only a few?

----------------------------------------

Crunching in memory of my Mum PEGGY, cousin ROPPA and Aunt AUDREY.

[Nov 19, 2012 9:19:36 AM]

David Autumns
Ace Cruncher
UK
Joined: Nov 16, 2004
Post Count: 11062
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

I estimate around 10% don't make it past the halfway point of the work unit, so the majority are successful.

I backed off the graphics card and it still happens, so it may be some checksum error in the new dual-image WUs.

Maybe some are sliced up thinking they are only 1 image long.

A bit like the Ariane 5 bug: it just wasn't expecting to find itself that far down range already (re-using the Ariane 4 code), so it figured it must be faulty.

Here's the result: http://www.youtube.com/watch?v=gp_D8r-2hwk

In my case I lose about 20 minutes of 1/6th of a GTX 560 Ti plus a 3 GHz Phenom II core. Not quite so dramatic, but wasteful nonetheless if this is not just happening to me.

The trajectory is normal

Dave

----------------------------------------

----------------------------------------
[Edit 1 times, last edit by David Autumns at Nov 19, 2012 10:26:13 PM]

[Nov 19, 2012 10:23:18 PM]

keithhenry
Ace Cruncher
Senile old farts of the world ....uh.....uh..... nevermind
Joined: Nov 18, 2004
Post Count: 18667
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

Is the computation error at the end of the first image or at the beginning of the second? If it's the first image, would it try to work on the second image? If it's the second, then it finished the first but failed on the second so the whole WU fails? I wonder if this situation got considered when WCG decided to try two images in one WU? If either image fails/errors, the whole WU fails? Then again, trying to give credit for one image but not both may be too much trouble for the effort.

----------------------------------------

Join/Website/IMODB

[Nov 20, 2012 12:10:59 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: The latest batch occasionally fail around 2/3rds the way through

"I wonder if this situation got considered when WCG decided to try two images in one WU?"

Interesting... Why oh why can't I find a free lunch? Apparently there are things that need to be worked out in an otherwise 'group two in one package and get twice the bang for the price of one' deal. Oh well, I expect WCG to have worked out the logic for the 2-in-1 beforehand. May we have the WCG CAs/Techs chime in on this one?

----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 20, 2012 2:35:24 AM]

[Nov 20, 2012 2:21:32 AM]

coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Status: Offline

Re: The latest batch occasionally fail around 2/3rds the way through

I have found a few with Computation error. I'll go and check the logs, but in the meantime, checking properties shows that they completed 100%.

Application Help Conquer Cancer 7.05 (ati_hcc1)
Workunit name X0900077930512200611091607
State Computation error
Received 20/11/2012 01:28:00
Report deadline 27/11/2012 01:26:48
Estimated app speed 14.03 GFLOPs/sec
Estimated task size 25'551 GFLOPs
Resources 1 CPUs + 0.333 ATI GPUs
CPU time at last checkpoint 00:00:00
CPU time 10:00:49
Elapsed time 10:06:58
Estimated time remaining 00:00:00
Fraction done 100%
Virtual memory size 0.00 MB
Working set size 0.00 MB

EDIT: More Info

I anticipated that this might be the case, and have found this in the event log:

20/11/2012 12:36:55 | World Community Grid | Aborting task X0900077930512200611091607_0: exceeded elapsed time limit 36417.69 (511028.87G/14.03G)

stderr:

Result Log

Result Name: X0900077930512200611091607_ 0--

<core_client_version>7.0.36</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Commandline: projects/www.worldcommunitygrid.org/wcg_hcc1_img_7.05_windows_intelx86__ati_hcc1 --zipfile X0900077930512200611091607.zip --imagelist images.txt --device 2
<app_init_data>
<major_version>7</major_version>
<minor_version>0</minor_version>
<release>36</release>
<app_version>705</app_version>
<app_name>hcc1</app_name>
<project_preferences>

<color_scheme>Tahiti Sunset</color_scheme>
<max_frames_sec>7</max_frames_sec>
<max_gfx_cpu_pct>5.0</max_gfx_cpu_pct>
</project_preferences>

<project_dir>C:\ProgramData\BOINC/projects/www.worldcommunitygrid.org</project_dir>
<boinc_dir>C:\ProgramData\BOINC</boinc_dir>
<wu_name>X0900077930512200611091607</wu_name>
<result_name>X0900077930512200611091607_0</result_name>
<comm_obj_name>boinc_5</comm_obj_name>
<slot>5</slot>
<wu_cpu_time>0.000000</wu_cpu_time>
<starting_elapsed_time>0.000000</starting_elapsed_time>
<using_sandbox>0</using_sandbox>
<user_total_credit>25470163.422069</user_total_credit>
<user_expavg_credit>105263.986433</user_expavg_credit>
<host_total_credit>1658415.662183</host_total_credit>
<host_expavg_credit>43744.378137</host_expavg_credit>
<resource_share_fraction>1.000000</resource_share_fraction>
<checkpoint_period>60.000000</checkpoint_period>
<fraction_done_start>0.000000</fraction_done_start>
<fraction_done_end>1.000000</fraction_done_end>
<gpu_type>ATI</gpu_type>
<gpu_device_num>2</gpu_device_num>
<gpu_opencl_dev_index>2</gpu_opencl_dev_index>
<ncpus>1.000000</ncpus>
<rsc_fpops_est>25551443449394.000000</rsc_fpops_est>
<rsc_fpops_bound>511028868987880.000000</rsc_fpops_bound>
<rsc_memory_bound>78643200.000000</rsc_memory_bound>
<rsc_disk_bound>50000000.000000</rsc_disk_bound>
<computation_deadline>1353970968.000000</computation_deadline>
<vbox_window>0</vbox_window>
</app_init_data>
INFO: gpu_type set in init_data.xml to ATI
INFO: gpu_device_num set in init_data.xml to 2
Boinc requested ATI gpu device number2
Unzipping input images ../../projects/www.worldcommunitygrid.org/X0900077930512200611091607_X0900077930512200611091607.zip
Processing jobdescription
Number of Images defined in image list is 2
Found compute platform Advanced Micro Devices, Inc.
Selecting this platform
CL_DEVICE_NAME: Cypress
CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
CL_DEVICE_VERSION: 1084.2 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS:
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_CLOCK_FREQUENCY: 725 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_EXTENSIONS:
cl_khr_fp64
cl_amd_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes
cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_ext_atomic_counters_32
cl_amd_device_attribute_query
cl_amd_vec3
cl_amd_printf
cl_amd_media_ops
cl_amd_popcnt
cl_khr_d3d10_sharing
cl_khr_dx9_media_sharing
Estimated kernel execution time = 0.30528 [sec]
Starting analysis of X0900077930512200611091607.jp2...
Extracting GLCM features...
Total kernel time: 148.454742 (1026 kernel executions)
Total memory transfer time: 65.668259
Average kernel time: 0.144693
Min kernel time: 0.135556 (dx=23 dy=11 sample_dist=24 )
Max kernel time: 0.155186 dx=1 dy=2 sample_dist=1
INFO: GPU calculations complete.
Total time for X0900077930512200611091607.jp2: 545 seconds
Finished Image #0, pctComplete = 0.500000
Starting analysis of X0900077930792200611091602.jp2...
Extracting GLCM features...

</stderr_txt>
]]>

So it looks to me as if Image 1 completed but WU was aborted due to taking too long.
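
The same conclusion can be read mechanically from the stderr text. The sketch below is only an illustration: the function name `summarize_stderr` and the dictionary keys are made up here, and the regular expressions assume the exact message wording seen in this one log (`Number of Images defined in image list is`, `Starting analysis of`, `Finished Image #`), which other application versions may not share.

```python
import re

# A few lines copied from the stderr output quoted in this post.
STDERR_SAMPLE = """\
Number of Images defined in image list is 2
Starting analysis of X0900077930512200611091607.jp2...
Total time for X0900077930512200611091607.jp2: 545 seconds
Finished Image #0, pctComplete = 0.500000
Starting analysis of X0900077930792200611091602.jp2...
Extracting GLCM features...
"""


def summarize_stderr(text: str) -> dict:
    """Report how far a task got, based on the progress lines in its stderr."""
    expected = re.search(r"Number of Images defined in image list is (\d+)", text)
    started = re.findall(r"Starting analysis of (\S+?)\.\.\.", text)
    finished = [int(n) for n in re.findall(r"Finished Image #(\d+)", text)]

    # An image that was started but has no matching "Finished" line
    # is the one the task was working on when it stopped.
    in_progress = started[len(finished)] if len(started) > len(finished) else None

    return {
        "expected_images": int(expected.group(1)) if expected else None,
        "finished_images": finished,
        "in_progress": in_progress,
    }


summary = summarize_stderr(STDERR_SAMPLE)
print(summary)
```

Run against the lines above, this reports one finished image (index 0) and the second `.jp2` file as started but never finished, which matches the reading that the task stopped during the second image rather than at the end of the first.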

My question is why is the time limit so long (10hrs+)? Wouldn't it make more sense for 'Maximum elapsed time' to be more realistic so that less time is wasted by GPUs and CPUs sitting idle?
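
For what it's worth, the numbers in the log hint at where the 10-hour figure comes from. This is a back-of-the-envelope check, assuming the limit is simply `rsc_fpops_bound` divided by the estimated app speed, as the `(511028.87G/14.03G)` in the event log message implies. The variable and function names are illustrative only.

```python
# Values copied from the event log line and <app_init_data> in the post above.
RSC_FPOPS_EST = 25551443449394.0       # <rsc_fpops_est>
RSC_FPOPS_BOUND = 511028868987880.0    # <rsc_fpops_bound>
APP_SPEED_FLOPS = 14.03e9              # "Estimated app speed", as rounded in the task properties
LOGGED_LIMIT_SEC = 36417.69            # from "exceeded elapsed time limit"


def seconds_for(fpops: float, flops_per_sec: float) -> float:
    """Time needed to perform `fpops` operations at a given speed."""
    return fpops / flops_per_sec


def hms(seconds: float) -> str:
    """Format a duration as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


estimated_sec = seconds_for(RSC_FPOPS_EST, APP_SPEED_FLOPS)
limit_sec = seconds_for(RSC_FPOPS_BOUND, APP_SPEED_FLOPS)
bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST

print("estimated runtime:", hms(estimated_sec))
print("computed limit:   ", hms(limit_sec))
print("logged limit:     ", hms(LOGGED_LIMIT_SEC))
print("bound / estimate: ", bound_ratio)
```

With these values the bound is 20 times the estimate, so a task estimated at roughly half an hour is allowed just over 10 hours before the client aborts it. The computed limit differs from the logged one by a few seconds, presumably because the 14.03 GFLOPs figure shown in the properties is rounded.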

----------------------------------------

Crunching in memory of my Mum PEGGY, cousin ROPPA and Aunt AUDREY.

----------------------------------------
[Edit 1 times, last edit by coolstream at Nov 20, 2012 2:15:49 PM]

[Nov 20, 2012 1:11:48 PM]
