World Community Grid Forums
Thread Status: Active. Total posts in this thread: 8
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
HGW = HCCv7.08-GPU-WUs
CTG = CPU-thread-count to HGW-task-count (ratio)

On my machine*, I have observed that if I manually dovetail** two HGW*** as tightly as possible, each HGW in the pair runs in under 5 minutes nominal elapsed time. When the HGW are not ideally dovetailed, that is, when the CPU phases or the GPU phases of the pair are both active at the same time, the elapsed time of each HGW in the pair rises above 5 minutes nominal.

It seems reasonable that performance improves when the workload is distributed between the processing units (PUs) spatially and temporally, rather than each PU getting hit by twice the load at the same time. It's akin to working 12 hours and then resting for 12 hours, as opposed to working 24 hours and then resting for 24 hours.

The dovetail condition is lost after some time, though, and the HGW pair enters a 'lock' from which regaining the dovetail is difficult under a 1:2 CTG, but less so under a 2:2 (mathematically reduced to 1:1) CTG.

It would be nice if an effort were planned to add code so that a dovetail of two HGW is orchestrated programmatically.

Notes:
* Ubu12.10, HD7770, AMD 1090T (6 cores), running BOINC v7.0.42.
** When the CPU phase of one GPU WU is on, the GPU phase of the other GPU WU is also on -- that is the first of the two syncs in a dovetail; the second sync is when the CPU phase of one GPU WU switches off as close as possible to the moment the GPU phase of the other WU switches off. In short, an ideal GPU-WU dovetail is one where there is as little time as possible during which two CPU phases or two GPU phases are simultaneously on or off.
*** (1 CPU + 0.5 ATI GPUs) effected via app_config: two cores for two HGW.

; ; andzgridPost#771 ;
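As a rough sketch, the app_config.xml behind note *** might look something like this (the hcc1 app name and the file location are assumptions from memory, so confirm the exact short name in client_state.xml before relying on it):

<!-- projects/www.worldcommunitygrid.org/app_config.xml (location assumed) -->
<app_config>
  <app>
    <name>hcc1</name>                     <!-- assumed short name of the HCC GPU app -->
    <max_concurrent>2</max_concurrent>    <!-- run two HGW at once -->
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>          <!-- each task claims half the GPU, so two tasks share the card -->
      <cpu_usage>1.0</cpu_usage>          <!-- each task budgets a full CPU core: two cores for two HGW -->
    </gpu_versions>
  </app>
</app_config>

Telling the BOINC client to re-read its config files (or restarting it) should apply the new values without aborting tasks already in progress.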
OldChap
Veteran Cruncher | UK | Joined: Jun 5, 2009 | Post Count: 978 | Status: Offline
You might consider a few more experiments in the meantime, andzgrid...

I have Q6600, 2600K, and 3770K rigs running where, on the lesser cards like the 5870, some will allow up to 9 tasks to run concurrently. Unfortunately, not all cards will do this without errors at the start. The result of running more WUs concurrently is that at any one time there will be some in the CPU phase and others in the GPU phase. It is not perfect: because each WU carries a different load there will still be a tendency toward grouping, but I have found that a once-a-day re-alignment is enough. In doing this it is not necessary to give each WU a full CPU core.

I think that until there are no mid-WU pauses it will be difficult to implement any fix to keep the timing the way you want it, so I offer these ideas as a possible way to go forward.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hello OldChap, and thanks for your response.
Two seems to be the magic number for dovetailing HCC 7.08 GPU WUs:
1] The CPU-phase to GPU-phase time ratio of the HCC 7.08 GPU WUs is around 1:1.
2] An even number of GPU WUs divided by two gives the number of dovetailed pairs.
3] To accommodate an odd number of GPU WUs in a dovetail, the CPU-phase to GPU-phase time ratio of a GPU WU needs to fall around 2:3. That ratio demands a counterpart of 2 CPU threads to 3 GPU tasks.

If a pair of WUs can be made to sync programmatically in a dovetail, the 0.5 CPU-thread setting is a perfect match. The pauses (mid-WU and between WUs within an HCC 7.08 GPU WU) are needed as slack to:
1] re-align sync timings to account for the variability within a WU;
2] adjust the hunt for the optimum timings, given that processing-unit performance varies with loading.

Because we currently don't have such programmatic fine-tuning, and while it may not be necessary to give each WU a full CPU core, I found that a 1:1 CPU-thread to GPU-WU ratio is better able to absorb variances, much like what I imagine programmatic control would provide. The dovetailed state remains stable for longer under a 1:1 ratio than under a 1:2 ratio (or 0.5 CPU).

In any case... Happy New Year 2013, everyone!

; ; andzgridPost#772 ;
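To try the odd-count arrangement in point 3], the 3-concurrent setup (roughly 2 CPU threads shared across 3 GPU tasks) could be sketched in app_config.xml along these lines (again, the hcc1 app name is an assumption; check client_state.xml for the exact short name):

<app_config>
  <app>
    <name>hcc1</name>                     <!-- assumed HCC GPU app short name -->
    <max_concurrent>3</max_concurrent>    <!-- three GPU tasks at once -->
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>         <!-- 3 tasks share one GPU -->
      <cpu_usage>0.66</cpu_usage>         <!-- 3 x 0.66 is roughly 2 CPU threads in total -->
    </gpu_versions>
  </app>
</app_config>

Note that <cpu_usage> only tells the BOINC scheduler how much CPU time to budget per task; it does not pin or throttle the science application itself.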
kateiacy
Veteran Cruncher | USA | Joined: Jan 23, 2010 | Post Count: 1027 | Status: Offline
I was glad to see this thread, as I, too, have been trying to figure out how to keep 2 HCC1 GPU WUs from both entering the CPU phase at the same time, since that slows things down enormously on my power-efficient but not terribly fast hardware (AMD HD 7750 with Phenom II X4 910e). Even if I manually "dovetail" them, to borrow andzgrid's term, they seem to slip back into the slow, overlapped timing within a few hours.

I have tried 2 GPU WUs, each getting its own CPU core, as well as 2 GPU WUs sharing a CPU core. Maybe I'll give the 2:3 ratio a try.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
"I have tried 2 GPU WUs, each getting its own CPU core, as well as 2 GPU WUs sharing a CPU core. Maybe I'll give the 2:3 ratio a try."

Yep, I can confirm the tendency for the phases to lock up together, which makes getting into a dovetail difficult. If so, that leaves a machine needing the capacity to adequately handle the chosen number of concurrent GPU WUs. My experience is that three concurrent HCC 7.08 GPU WUs made my HD7770 struggle to provide a responsive UI. Load up some more UI interaction and the UI risks freezing, or I may have to provide more cooling to the GPU or the CPU or both after overclocking them. While this may be doable and workable, it goes against the 'theme' of the Radeon HD77xx series: performance is what the HD79xx targets, efficiency is right at home with the HD77xx series, and that leaves the HD78xx as the middle ground.

The only way I see to take the baby-sitting and/or the guesswork out of making dovetailed OpenCL GPU WUs work is to introduce programmatic control. Or, like a couple, to have the dovetail seamlessly integrated right from the start, so that there would be no need for programmatic control.

; ; andzgridPost#788 ;
kateiacy
Veteran Cruncher | USA | Joined: Jan 23, 2010 | Post Count: 1027 | Status: Offline
Right now I'm running 4 HCC GPU WUs on my 7750, giving each half a CPU core. AMD Overdrive shows 96-97% GPU usage. At the moment I'm not doing anything with that machine except crunching, and to my surprise the UI isn't unreasonably slow when running BOINC Manager.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
"Right now I'm running 4 HCC GPU WUs on my 7750, giving each half a CPU core. AMD Overdrive shows 96-97% GPU usage. At the moment I'm not doing anything with that machine except crunching, and to my surprise the UI isn't unreasonably slow when running BOINC Manager."

Four concurrent on an HD7750 left alone to crunch? Hmm... OK, that's good. Maybe I got a defective HD7770, or I inadvertently damaged it in some way, I don't know. I once tried four concurrent on that card and the UI froze. Also, I do some other stuff while crunching, so I can't afford to load my GPU to the max.

I could have used the onboard GPU in my Ubu machine, but with Ubu12.10 the UI became mangled, and that left me needing to use the HD7770. I'll see if I can perhaps downgrade to Ubu12.04 (where the onboard GPU used to work) or do some research on how to enable the onboard GPU under Ubu12.10 -- then I'll take a shot at loading four concurrent HCC 7.08 GPU WUs on my HD7770.

; ; andzgridPost#789 ;
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Update to andzgridPost#789
I first tried 3 concurrent (0.666 CPU + 0.333 ATI) and stabilized that, and then I tried 4 concurrent (0.5 CPU + 0.25 ATI), which has held so far, unlike before when my Ubu12.10 machine crashed on 4 concurrent. I guess it was the GPU cooling, or the lack of it to be exact. So I nudged up the GPU fan RPM and set all GPU hardware clocks to default for my HD7770 -- and that seemed to stabilize the machine.

However, the runtimes for 4 concurrent are only slightly faster than those for 3 concurrent. It's that diminishing returns I read about. In all cases of 2, 3, or 4 concurrent on my HD7770, the better the dovetailing of WUs, the faster the performance, with 3 or 4 concurrent doing a good job of breaking the lock-step of phases. I guess I need not use another GPU to render the Ubu12.10 UI after all!

; ; andzgridPost#792 ;
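For completeness, the two settings above map onto app_config.xml roughly as follows (a sketch only; the hcc1 app name is assumed, so verify it against client_state.xml):

<app_config>
  <app>
    <name>hcc1</name>                     <!-- assumed HCC GPU app short name -->
    <!-- 4 concurrent: 0.5 CPU + 0.25 ATI per task -->
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
    <!-- for 3 concurrent instead, use:
         <gpu_usage>0.333</gpu_usage>
         <cpu_usage>0.666</cpu_usage>
    -->
  </app>
</app_config>

Switching between the two only needs an edit plus a config re-read (or a client restart), which makes it easy to compare runtimes for the 3- and 4-concurrent cases.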