World Community Grid - View Thread - Project Status (old)

World Community Grid Forums

Category: Community

Forum: Chat Room

Thread: Project Status (old)


Thread Status: Active
Total posts in this thread: 387


This topic has been viewed 58133 times and has 386 replies

Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2555
Status: Offline
Project Badges:

10 year badge for Mapping Cancer Markers

14 day badge for FightAIDS@Home - Phase 2

1 year badge for Microbiome Immunity Project

90 day badge for Africa Rainfall Project

2 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

MCM is in "tasks are committed to other platforms" mode.

ARP has either stopped or slowed way down in sending fresh WUs. I haven't gotten one in a while, and the reference number is decreasing.

I'll be on my backup project in 5 hours unless things change. Is it a Sahara weekend?

Yes, it's totally Sahara now. I can't even get the single MCM I need for tomorrow.

[May 25, 2025 12:33:57 AM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7854
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

14 day badge for Help Cure Muscular Dystrophy

2 year badge for Discovering Dengue Drugs - Together

2 year badge for Nutritious Rice for the World

14 day badge for The Clean Energy Project

10 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

45 day badge for Discovering Dengue Drugs - Together - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

5 year badge for Drug Search for Leishmaniasis

5 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

200 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

100 year badge for Smash Childhood Cancer

10 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

100 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

Yes, it's totally Sahara now. I can't even get the single MCM I need for tomorrow.

Dry, dry, dry.

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[May 25, 2025 7:02:03 PM]

Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2555
Status: Offline
Project Badges:


Re: Project Status (First Post Updated)

Sahara event is over. New work is coming in now.

[May 25, 2025 8:09:26 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7854
Status: Offline
Project Badges:


Re: Project Status (First Post Updated)

Sahara event is over. New work is coming in now.

Windows is getting units, but Linux remains dry.

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[May 25, 2025 10:28:28 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1341
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Discovering Dengue Drugs - Together

14 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

90 day badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for The Clean Energy Project - Phase 2

180 day badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

180 day badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

50 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

10 year badge for Africa Rainfall Project

10 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

One of my Linux systems started getting [new] MCM1 work again at about 20:00 UTC on 2025-05-25 but it took until after 22:00 UTC for all of my systems to get back to normal cache levels.

The ARP1 drought seems to continue, however -- 2 tasks in the last 48 hours, and those were both missed deadline retries!

It would be nice to know why there are these long intervals without MCM1 work; is it an issue with work generation or is it an attempt to cope with backed up retries or are things dropping out of service?

Roll on the full commissioning of the new data centre. Hopefully things will improve then!

Cheers - Al.

[May 26, 2025 6:07:26 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1341
Status: Offline
Project Badges:


Re: Project Status (First Post Updated)

chiara.p of WCG posted a copy of the latest WCG Operational Status report in the MCM1 forum (as it is [mostly] about the MCM1 workflow issues and a possible resolution.) I'll leave it to Unixchick to decide whether to copy it in here as well...

The rest of the report was about the MAM1 beta, where the learning logic is being moved to LibTorch (the core of the popular PyTorch Python Deep Learning environment); one effect of this is that it will allow [some] NVIDIA GPUs to participate!

If they manage to deliver on both of these in a reasonable time frame, that's going to be a massive step forwards (especially if/when they do a LibTorch MCM1 version!)

Cheers - Al.

[May 26, 2025 8:17:29 PM]

Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1316
Status: Recently Active
Project Badges:

180 day badge for Smash Childhood Cancer

45 day badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

1 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

It is nice to have an update on the operational status. I was checking earlier for one, as it has been a month since the last one.

It looks like the team is working on making the system more stable. It is a good update.

I'll copy it below for those who don't want to click to find it. The link to the operational status updates is in the first post.

May 26, 2025

MCM1 Workunit Availability: A recurring problem, in which older servers/VMs in our private cloud lose their DHCP lease on all interfaces and effectively go down, has been causing the quorum of coordinator nodes that accept jobs and assign them to workers ("the scheduler") to stop accepting job submissions. This happens when the crash coincides with a second issue that renders the coordinators unable to recover their quorum. Building on the work we did to generate ARP1 and MAM1 workunits locally on WCG servers, we are migrating the MCM1 workunit delivery to Kubernetes, which should permanently resolve the issue and increase workunit supply overall. Initially we had planned to complete this work after the release of MAM1 7.05, which has been a significant refactor, but given the frequency of failures we are moving it up and will complete the migration this week.

MAM1 7.05 - Why is it Taking so Long? MAM1 is being refactored to use LibTorch and run also on NVIDIA GPUs. LibTorch vastly simplifies the checkpointing logic, and should resolve the unsafe memory access and the resume-from-checkpoint and checkpointing crashes seen in previous beta releases, as well as significantly improve performance.

----------------------------------------
[Edit 1 times, last edit by Unixchick at May 26, 2025 10:49:21 PM]

[May 26, 2025 10:48:53 PM]

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 448
Status: Offline
Project Badges:

14 day badge for Human Proteome Folding - Phase 2

14 day badge for Help Fight Childhood Cancer

14 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Computing for Clean Water

180 day badge for FightAIDS@Home - Phase 2

10 year badge for Smash Childhood Cancer

180 day badge for Microbiome Immunity Project

20 year badge for Africa Rainfall Project

20 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

Unixchick:

Thanks for the update!
It is GREATLY appreciated!

[May 26, 2025 11:40:45 PM]

Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 386
Status: Offline
Project Badges:

2 year badge for Microbiome Immunity Project

45 day badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

Given that MAM is based on MCM, does this imply that all of the work will be moving to GPU in the near term?

I will not run a big GPU and my GT710s are not exactly powerful :-)

[May 27, 2025 1:01:00 AM]

Boca Raton Community HS
Senior Cruncher
Joined: Aug 27, 2021
Post Count: 217
Status: Offline
Project Badges:


Re: Project Status (First Post Updated)

MAM1 is being refactored to use LibTorch and run also on NVIDIA GPUs.

This is an interesting development! More research completed faster is good news. Based on how it was written, it sounds like there will be both GPU and CPU versions of this work, but I am not 100% sure. Exciting changes! I wonder what the timeline is for these beta tasks?

[May 27, 2025 1:43:55 AM]
