World Community Grid Forums
Thread Status: Active. Total posts in this thread: 83
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
With reference to the issues highlighted in the thread '[RENAMED] Some concerns regarding the HC... hosts; #cores>2)', this thread has been started so that the Techs can post updates on their investigation of the issues, and so that members can post their own observations, which may assist in finding the cause and a remedy.
Posted by knreed:

Sorry for the silence. We are investigating this problem. We are currently running some pair-wise tests to determine what is causing the slowdown. What I mean by pair-wise is that we manually send a workunit for FightAIDS@Home and a workunit for Help Conquer Cancer to the same two computers that have the characteristics we want to check. For example, here is the outcome of one of these tests:

Computer #1 is my laptop: an Intel Pentium M running at 2.0GHz with 2GB of DDR RAM.
Computer #2 is my home desktop: an AMD Athlon 64 X2 5200+ running at 2.7GHz with 2GB of DDR2 RAM.

FightAIDS@Home workunit 'faah2961_ZINC00000480_xMut_md02740_00' had the following results:
Computer #1 ran the workunit in 7.6 hours and claimed 83.238 credits.
Computer #2 ran the workunit in 6.1 hours and claimed 94.623 credits (20% faster than the Pentium M).

Help Conquer Cancer workunit 'X0000046720001200502241630' had the following results:
Computer #1 ran the workunit in 7.8 hours and claimed 85.014 credits.
Computer #2 ran the workunit in 8.2 hours and claimed 128.109 credits (5% slower than the Pentium M).

The next step is to repeat this test but force the AMD dual core to run only one workunit at a time, to see if that eliminates the drop in performance for the dual-core machine.
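For anyone checking the arithmetic, here is a minimal C sketch that reproduces those two percentages from the run times quoted above, on the assumption that they are measured against the Pentium M's time (the helper function is invented for the illustration; it is not WCG code):

```c
#include <stdio.h>

/* Percentage difference of a second machine relative to a baseline,
 * computed from wall-clock run times for the same workunit.
 * Positive means the second machine finished faster than the baseline. */
static double pct_vs_baseline(double t_baseline, double t_other)
{
    return 100.0 * (t_baseline - t_other) / t_baseline;
}

int main(void)
{
    /* Figures from knreed's post; Computer #1 (the Pentium M) is the baseline. */
    printf("FAAH: Athlon X2 vs Pentium M: %+.1f%%\n", pct_vs_baseline(7.6, 6.1));
    printf("HCC : Athlon X2 vs Pentium M: %+.1f%%\n", pct_vs_baseline(7.8, 8.2));
    return 0;
}
```

This prints roughly +19.7% and -5.1%, which round to the "20% faster" and "5% slower" figures in the post.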
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Good to see some testing is taking place, Kevin - it does seem to bear out what others are seeing, with the units taking longer to crunch on multi-core machines, and I believe the effect is exaggerated the more cores there are, so some comparisons with 4/8-core machines may be worth looking into.
Comparisons have also been made over at XtremeSystems whilst this has been going on, so I'll throw this into the mix for you to consider whilst investigating:

I've said this in a bunch of other places. Running HCC on 64-bit Vista with 64-bit BOINC version 5.10.28 has provided excellent results. Run times fall very close to an average of 2.7 hours apiece, with page faults in the thousands. This page fault count is lower than I see on DDT or FAAH. I'm running it on ten quads...
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Hey Adywebb
Keep up the work. This is really all people want to see!
courine
Master Cruncher Capt., Team In2My.Net Cmd. HQ: San Francisco Joined: Apr 26, 2007 Post Count: 1794 Status: Offline
BTW, two excellent quotes from the earlier thread:
----------------------------------------
knreed is looking into this. For the moment the best workaround is to put badly affected systems onto other projects. Lawrence

And for some background:

I suppose I'll dip my toe into the water again. Brrr... it's chilly!

A page fault occurs when the core wants to access memory that is not loaded into cache. This will slow things down because the kernel will have to load a new page from memory into the cache while the core waits. Any application with a lot of page faults will run more slowly than one with only a few.

But there is a second possible problem with performance. Multiple cores can 'queue up' a series of page faults, so that each core has to wait until its own page fault gets serviced. This is called memory contention. If a number of cores are running applications with a high number of page faults, then performance will drop even more because of this memory contention.

How can this performance inefficiency be cured? The normal way is to run a preprocessing step over the data arrays and produce a new array that clusters data together the way that the program will access it. Sometimes this is possible; unfortunately, sometimes it is not. It all depends on the algorithm. Even when it is possible, it produces unreadable data structures. This need not be a problem, but when developing a new program that has to be rapidly changed to match research needs, it is almost always a problem.

[A personal reminiscence. A generation ago I spotted a neat 15-25 line section in an image processing assembler routine that I could optimize to speed up the program by 10%-15%. Even with paperwork, this change only cost me 2 or 3 days, and we were running the program constantly on a number of computers, so I considered it time well spent. I actually congratulated myself about this. (sob..) A little more than 2 years later the new computers changed the cache organization, and I suddenly realized that my change was bound to cause problems down the road if the cache changed even more. After thinking it over for an hour, I eliminated the change. Programming to meet specific cache designs is a very dangerous practice that has to be considered very suspiciously.]

So what is my estimate of the situation? I don't think that it makes sense to reprogram the application for this. The project scientists should be concentrating on the results and overworking the programmers to change the application to produce better results. Faster should be ignored at this stage.

But how should individual members of the World Community Grid feel about this? The high page-fault count is simply an artifact of the algorithm. It will slow down the flops per second, but that will not matter as such. The CPU time spent running the kernel to load in new pages will show up as reduced credit, but for a single core the points impact should not be substantial. Memory contention will be much more substantial, so 4- and 8-core machines would show a much greater drop in points if running more than one HCC work unit. The WCG scheduler is sending out the HCC work units, so a member can eliminate HCC from these multi-core machines without slowing down progress on HCC, and could then run other projects such as FAAH and DDT that would otherwise have to run on the single-core computers that can handle HCC with the greatest efficiency.

An unrelated note: some days ago someone posted in this thread a work unit that was awarded 8.3 points. This was immediately reported to the WCG staff. I don't know what went wrong, and we have a number of more urgent issues, but it is an error unrelated to the main problem being addressed in this thread. Lawrence

Not bad, I like the cut of your jib.
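Lawrence's "preprocessing step over the data arrays" is easiest to see in code. The sketch below is not HCC code, and the array size and visiting order are invented for the illustration; it only shows the shape of the fix: a hot loop first reads a large array in a scattered order, then the same work is done after a one-off pass that packs the data in the exact order it will be read, so the loop walks memory sequentially instead of faulting all over the heap.

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)   /* 4M doubles (~32 MB), far larger than any CPU cache */

/* The hot loop as the algorithm naturally expresses it: records are visited
 * in the order given by 'order[]', so consecutive iterations touch memory
 * scattered all over the array (poor cache and TLB locality).             */
static double sum_scattered(const double *data, const int *order, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += data[order[i]];
    return s;
}

/* One-off preprocessing pass: copy the records into a new array laid out in
 * the exact order the hot loop will read them.                             */
static double *cluster_by_access_order(const double *data, const int *order, int n)
{
    double *packed = malloc((size_t)n * sizeof *packed);
    if (packed)
        for (int i = 0; i < n; i++)
            packed[i] = data[order[i]];
    return packed;
}

/* The same hot loop over the packed copy now walks memory sequentially. */
static double sum_sequential(const double *packed, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += packed[i];
    return s;
}

int main(void)
{
    double *data  = malloc((size_t)N * sizeof *data);
    int    *order = malloc((size_t)N * sizeof *order);
    if (!data || !order) return 1;

    for (int i = 0; i < N; i++) {
        data[i]  = (double)i;
        order[i] = (int)((i * 2654435761u) % N);  /* a pseudo-random permutation */
    }

    double slow = sum_scattered(data, order, N);
    double *packed = cluster_by_access_order(data, order, N);
    double fast = packed ? sum_sequential(packed, N) : 0.0;

    /* Both sums are identical; only the memory access pattern differs. */
    printf("scattered: %.0f  clustered: %.0f\n", slow, fast);

    free(packed);
    free(order);
    free(data);
    return 0;
}
```

The packed copy is also exactly the "unreadable data structure" he warns about: the layout no longer matches the natural structure of the data, which is why the cure is not always worth it for a program that changes as quickly as research code does.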
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
That description of a page fault is completely wrong.

Page faults have nothing to do with the CPU cache, nor does an OS kernel load the CPU cache. A page fault occurs when a requested piece of code or data is not in physical memory.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Questar, thank you for your input.
Now, please go and read about the difference between hard and soft page faults. You will find the subject is less simple than you thought. |
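For anyone who wants to see that distinction on their own machine, here is a minimal POSIX sketch (Linux/Unix, not WCG code) that reads the two counters the kernel keeps per process: ru_minflt for soft (minor) faults, which are resolved without touching the disk, and ru_majflt for hard (major) faults, which require a read from disk or swap.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    /* Touch 256 MB of freshly allocated memory. Every new page the kernel
     * wires up for us is a soft (minor) fault: no disk I/O is involved.
     * Hard (major) faults would only appear under memory pressure, when
     * pages have to be read back from swap or from a mapped file.        */
    size_t bytes = 256u * 1024 * 1024;
    char *buf = malloc(bytes);
    if (!buf) return 1;
    memset(buf, 1, bytes);

    getrusage(RUSAGE_SELF, &after);
    printf("soft (minor) page faults: %ld\n", after.ru_minflt - before.ru_minflt);
    printf("hard (major) page faults: %ld\n", after.ru_majflt - before.ru_majflt);

    free(buf);
    return 0;
}
```

On a machine with plenty of free RAM this reports thousands of soft faults and no hard ones, which is the case the earlier posts describe for HCC when a single work unit is running.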
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Bravo!
courine
Master Cruncher Capt., Team In2My.Net Cmd. HQ: San Francisco Joined: Apr 26, 2007 Post Count: 1794 Status: Offline
I am running an exchange program in the suggestion box. If you have a single core machine and are not currently running HCC, I will exchange for some nice dual core time for your project.
----------------------------------------
The Exchange
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline
Here's an interesting stat: there were 829 W98 'All time' registrations and they average 5.64 RAC... that's 39.48 WCG points per day. Seeing that some here are still actively running this OS, the word should be spread: switch it off. The electricity consumption is not helping any, and Berkeley is *NOT* testing for backward compatibility, so upgrading BOINC is not advisable.
----------------------------------------
I propose that WCG consider removing W98 and ME from the System Requirements list, at least for HCC & AC@H. http://boincstats.com/stats/host_os_stats.php?pr=wcg&st=0
WCG
Please help to make the Forums an enjoyable experience for All!
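The arithmetic behind those figures, as a small sketch. It assumes the usual 7:1 ratio between WCG points and BOINC credits, which is consistent with the numbers quoted (5.64 x 7 = 39.48), and it reads the 5.64 as an average RAC per registered W98 host:

```c
#include <stdio.h>

int main(void)
{
    const double hosts         = 829;   /* 'All time' W98 registrations        */
    const double rac_per_host  = 5.64;  /* average BOINC credits/day per host  */
    const double wcg_per_boinc = 7.0;   /* assumed WCG-points-per-credit ratio */

    printf("WCG points per day, per host : %.2f\n", rac_per_host * wcg_per_boinc);
    printf("BOINC credits per day, fleet : %.2f\n", rac_per_host * hosts);
    printf("WCG points per day, fleet    : %.2f\n", rac_per_host * hosts * wcg_per_boinc);
    return 0;
}
```

On that reading, the 39.48 is per host per day, and the whole W98 fleet would amount to roughly 4,676 BOINC credits a day, however many of the 829 registrations are actually still active.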
jal2
Senior Cruncher USA Joined: Apr 28, 2007 Post Count: 422 Status: Offline
I'm not sure I understand this statistic, Sek. Is this 829 active W98 machines, or one active W98 machine and 828 inactive machines, producing a total of 4,673 RAC? I suspect the number is somewhere in between.
----------------------------------------
As for W98, it's still a good gaming platform, and it would run on my AMD64 3000+ if I wanted it to. I think the focus should be on the CPU, not the OS.

Average credit per CPU:
2,301.88 Dual-Core AMD Opteron(tm) Processor 8216 HE
227.88 AMD Athlon(tm) 64 FX-74 Processor

Looks like I need to upgrade.
----------------------------------------
[Edit 1 times, last edit by jal2 at Feb 1, 2008 12:19:12 PM]