World Community Grid Forums
Thread Status: Active · Total posts in this thread: 9
WCGAdmin
World Community Grid Admin · Joined: Jun 9, 2020 · Post Count: 171 · Status: Offline
The research team and the World Community Grid tech team continue to collaborate on a new type of work unit for the project.
https://www.worldcommunitygrid.org/about_us/viewNewsArticle.do?articleId=663
William Albert
Cruncher · Joined: Apr 5, 2020 · Post Count: 39 · Status: Offline
> Still have not heard a word about whether they're going to fix the L3 Cache congestion problem.

L3 cache is managed by the processor itself (with the exception of some recent post-Spectre instructions that programs can call to flush the cache). There's nothing that they can really "fix," because the program has no influence over how the processor manages its L3 cache (or if L3 cache is even present).
William Albert
Cruncher · Joined: Apr 5, 2020 · Post Count: 39 · Status: Offline
> Then how did the Baker Lab fix it in their current version of Rosetta???

Fix what? Again, L3 cache is managed by the processor itself. Rosetta (or any other application) has no control over how L3 cache is managed. Also, since Rosetta has to run on many different microarchitectures, it can't make any assumptions about how much (if any) L3 cache is present.

The best that the Rosetta developers can realistically do is to design the program's in-memory data structures to be small and relatively static, so that they have a higher chance of staying cached. But designing high-performance data structures isn't trivial, and making them smaller isn't necessarily going to improve performance (the whole space-time tradeoff thing).

So I don't really see anything that's "broken" that needs to be "fixed." MIP's workload might benefit from larger amounts of L3 cache, but it's not exactly a surprise that programs run faster on processors with more resources.
katoda
Senior Cruncher · Poland · Joined: Apr 28, 2007 · Post Count: 170 · Status: Offline
Dear William Albert, have you read https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,40374_offset,0 ?

I guess that Aurum420 is referring to the issue reported and analysed there: clogging the CPU by running too many MIP1 units simultaneously, resulting in long computing times for MIP1 WUs — a behaviour observed only for this project. The fact that this Rosetta version is especially "hungry" with regard to the processor's cache was later confirmed by the project scientist: https://www.worldcommunitygrid.org/forums/wcg...ad,40374_offset,60#569786 For the moment the only solution is to limit the number of running MIP1 WUs to 1 unit per 4 MB of L3 cache.

@Aurum420, could you please elaborate (or provide a link) on the Baker Lab fix?
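For anyone wanting to apply that workaround locally, BOINC's standard `app_config.xml` mechanism supports a per-application `max_concurrent` cap. A sketch, assuming the WCG application name is `mip1` (verify the actual name in your client's event log or `client_state.xml` before relying on it), placed in the World Community Grid project directory:

```xml
<!-- app_config.xml: cap concurrently running MIP1 work units.
     Example: a CPU with 16 MB of L3 -> 16 / 4 = 4 units. -->
<app_config>
    <app>
        <name>mip1</name>  <!-- assumed application name; check it locally -->
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>
```

After saving the file, the cap can be applied without restarting via the BOINC Manager's Options → "Read config files" (Advanced view).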
William Albert
Cruncher · Joined: Apr 5, 2020 · Post Count: 39 · Status: Offline
Here's the comment in that thread from the MIP scientist:
https://www.worldcommunitygrid.org/forums/wcg...ad,40374_offset,70#569786

> The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well into a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses, and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel. Nothing seemed slower for us because we are always running in that regime.
>
> I don't know all of the details about how the points are assigned, and I don't know if/how the credit assignment will be modified. But I believe that issue stems from the fact that a single instance of Rosetta is well behaved (very few cache misses) on most consumer chips, but on machines with smaller caches and few memory channels a second (or third or fourth) instance cannot fit into the caches, and you see the run-time scaling issues which result in fewer points/hour (i.e. if a single instance of Rosetta had these cache issues, the scaling from one to multiple instances would not be as dramatic, nor would the change in points/hour).**
>
> We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++, and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises.

That explanation is pretty much in line with what I said above. Given that it's been several years since that comment, and this issue still exists, it's likely that optimizing it was infeasible.

Also, keep in mind the project's needs from a science standpoint. Results need to be reproducible, and papers analyzing those results will cite the simulation tool and version used. Even if a newer build of Rosetta with optimized caching behavior exists, it's very possible that the MIP team isn't in a position to change versions at this point (or at any point) without spoiling their existing results.

If MIP's behavior is causing problems with common consumer hardware, it may be prudent for the WCG admins to add a notice about it and set a default limit on the number of running MIP WUs per device in the project selection menu, similar to how they handle Africa Rainfall Project's space requirements. However, as long as MIP WUs are running to completion and producing useful output, they aren't "broken." Implying that cache usage is a problem to be fixed (rather than simply a performance characteristic of the project's WUs), and that the MIP team is being negligent for not having "fixed it," is inaccurate and unfair.
Jim1348
Veteran Cruncher · USA · Joined: Jul 13, 2009 · Post Count: 1066 · Status: Offline
That is a good summary.
To make a long story short, after considerable time running MIP on Linux (with less experience on Windows), I have found that I can run two MIP WUs at a time on Intel (Haswell, Coffee Lake) machines, two at a time on Ryzen 2000-series, and four at a time on Ryzen 3000-series. But that still depends on what other projects you are running: Rosetta itself will reduce those numbers, as will ARP.