robertmiles
Senior Cruncher
US
Joined: Apr 16, 2008
Post Count: 445
A reason for so many third and fourth copies of HCC GPU workunits

I've noticed that one of my computers always downloads HCC GPU workunits with an initial runtime estimate of 00:01:46 or 00:01:47, but each of those workunits actually takes closer to 12 minutes. Such an underestimate is likely to produce a large percentage of workunits that either run past their deadlines or have to be aborted by the user to prevent that, unless the user keeps the workunit queue very short.

Could the next version of this application do more to adjust its time estimates to agree with past actual run times on the same computer? I suspect it will need to maintain separate estimates of the CPU speed and the GPU speed to do this.
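
Something along these lines is what I have in mind, as a minimal sketch in Python (hypothetical, not actual BOINC code): keep an exponential moving average of actual-to-estimated runtime per resource, and scale future estimates by it.

```python
# Hypothetical sketch, not actual BOINC code: keep a per-resource
# correction factor as an exponential moving average of
# (actual runtime / estimated runtime) and scale future estimates by it.

class RuntimeCorrector:
    def __init__(self, smoothing=0.1):
        self.smoothing = smoothing
        # separate factors so CPU and GPU speeds are tracked independently
        self.factor = {"cpu": 1.0, "gpu": 1.0}

    def record(self, resource, estimated_s, actual_s):
        ratio = actual_s / estimated_s
        f = self.factor[resource]
        self.factor[resource] = (1 - self.smoothing) * f + self.smoothing * ratio

    def corrected_estimate(self, resource, estimated_s):
        return estimated_s * self.factor[resource]

# With the numbers from this post: 107 s estimated, ~12 min (720 s) actual.
c = RuntimeCorrector()
for _ in range(20):
    c.record("gpu", 107, 720)
print(round(c.corrected_estimate("gpu", 107)))  # ~645 s, converging on 720
```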
[Feb 26, 2013 3:22:43 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: A reason for so many third and fourth copies of HCC GPU workunits

On the title: AFAIU, the techs have set a very high fault allowance, meaning devices are permitted to return many errors **, so as a consequence more copies need to be circulated before a quorum is reached.

The time estimate [FLOPS-based] is built on the last few days' historical data, so it is in principle pretty accurate from that perspective.

The WCG server has sent an instruction to clients not to use the DCF (duration correction factor), so projections should in principle be pretty stable for Ready to Start work.
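
As I understand it (an assumption about the internals on my part, not verified against the BOINC source), the projection is essentially the workunit's claimed FLOP count divided by a historical throughput figure:

```python
# Assumed mechanics, not verified against the BOINC source: the estimate
# is the workunit's claimed FLOP count divided by the device throughput
# derived from recent history. Both numbers below are made up.

rsc_fpops_est = 1.2e13       # hypothetical FLOPs claimed for one HCC workunit
projected_flops = 1.12e11    # hypothetical GPU throughput from recent history

estimate_s = rsc_fpops_est / projected_flops
print(f"{estimate_s:.0f} s")  # ~107 s, i.e. the 1:46-1:47 the OP reports

# With the DCF disabled by the server, the client does not rescale this
# locally; the projection only moves when the server-side throughput
# figure is updated.
```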

Question: is this 1:46 from running a workunit exclusively by itself, or from running many concurrently, e.g. a setting of 12 concurrent on one GPU? Also, do you run a mix of CPU and GPU versions of HCC? Finally, which client version exactly?

edit: ** To clarify, the high fault allowance is something like 80%+ [or was it 85-90%?] before cards are blacklisted, which is understandable in a way given that a successful task runs about 18 times faster [when it has the GPU to itself]. I don't know for sure, though, whether that is any part of the motivation for being so tolerant of low-success-rate devices.
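
A back-of-the-envelope sketch of why a high fault allowance translates into extra copies, assuming a quorum of 2 and treating each copy's failure as independent (numbers purely illustrative):

```python
# Illustrative only: with a quorum of 2 valid results and per-copy
# failure probability p (independent failures assumed), the expected
# number of copies sent is 2 / (1 - p), the mean of a negative binomial.

def expected_copies(quorum=2, failure_prob=0.0):
    return quorum / (1.0 - failure_prob)

for p in (0.05, 0.25, 0.50):
    print(f"p={p:.2f}: ~{expected_copies(failure_prob=p):.1f} copies")
# p=0.05: ~2.1    p=0.25: ~2.7    p=0.50: ~4.0
```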
----------------------------------------
[Edited 1 time, last edit by Former Member at Feb 26, 2013 5:30:35 PM]
[Feb 26, 2013 4:37:51 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: A reason for so many third and fourth copies of HCC GPU workunits

I see this occasionally, running multiple concurrent GPU WUs, HCC1 only.

*Sometimes* the time estimates get all goofy in the BOINC Manager and, as robert posted, show very short estimates .... but .... within a minute or three the estimates return to normal. It looks very odd when it happens, and the PC asks for lots and lots of work. Only once did this cause the machine to ask for more work than it could possibly execute (the slow POS deserves to be replaced :O), but that's it ... once in about 6 months ... not too bad, I think.
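
For illustration only, with a made-up buffer size, the over-ask follows straight from the arithmetic:

```python
# Hypothetical numbers, just to show the arithmetic of the over-ask:
# work fetch is roughly (buffer seconds) / (estimated seconds per WU),
# so a transiently short estimate inflates the request proportionally.

buffer_s = 0.5 * 24 * 3600   # assumed 0.5-day work buffer
actual_s = 720               # ~12 min real runtime per WU

for est_s in (720, 107):     # normal estimate vs. the goofy short one
    fetched = buffer_s / est_s
    real_hours = fetched * actual_s / 3600
    print(f"estimate {est_s:>3} s -> ~{fetched:.0f} WUs (~{real_hours:.0f} h of real work)")
# estimate 720 s -> ~60 WUs (~12 h); estimate 107 s -> ~404 WUs (~81 h)
```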
[Feb 26, 2013 5:25:39 PM]