World Community Grid Forums

Thread Status: Active | Total posts in this thread: 387
Grumpy Swede
Master Cruncher | Svíþjóð | Joined: Apr 10, 2020 | Post Count: 2494 | Status: Offline
MCM is in "tasks are committed to other platforms" mode

Yes, it's totally Sahara now. I can't even get the single MCM I need for tomorrow. ARP has either stopped or slowed way down in sending fresh WUs. I haven't gotten one in a while, and the reference number is decreasing. I'll be on my backup project in 5 hours unless things change. Is it a Sahara weekend?
----------------------------------------
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7844 | Status: Offline
MCM is in "tasks are committed to other platforms" mode

Grumpy Swede wrote:
> Yes, it's totally Sahara now. I can't even get the single MCM I need for tomorrow. ARP has either stopped or slowed way down in sending fresh WUs. I haven't gotten one in a while, and the reference number is decreasing. I'll be on my backup project in 5 hours unless things change. Is it a Sahara weekend?

Dry, Dry Dry.

Cheers
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
Grumpy Swede
Master Cruncher | Svíþjóð | Joined: Apr 10, 2020 | Post Count: 2494 | Status: Offline
Sahara event is over. New work is coming in now.
----------------------------------------
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7844 | Status: Offline
Grumpy Swede wrote:
> Sahara event is over. New work is coming in now.

Windows is getting units, but Linux remains dry.

Cheers
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
One of my Linux systems started getting [new] MCM1 work again at about 20:00 UTC on 2025-05-25, but it took until after 22:00 UTC for all of my systems to get back to normal cache levels.

The ARP1 drought seems to continue, however -- 2 tasks in the last 48 hours, and those were both missed-deadline retries! It would be nice to know why there are these long intervals without MCM1 work; is it an issue with work generation, is it an attempt to cope with backed-up retries, or are things dropping out of service? Roll on the full commissioning of the new data centre. Hopefully things will improve then!

Cheers - Al.
----------------------------------------
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
chiara.p of WCG posted a copy of the latest WCG Operational Status report in the MCM1 forum (as it is [mostly] about the MCM1 workflow issues and a possible resolution). I'll leave it to Unixchick to decide whether to copy it in here as well...

The rest of the report was about the MAM1 beta, where the learning logic is being moved to LibTorch (the core of the popular PyTorch Python deep learning environment); one effect of this is that it will allow [some] NVIDIA GPUs to participate! If they manage to deliver on both of these in a reasonable time frame, that's going to be a massive step forwards (especially if/when they do a LibTorch MCM1 version!)

Cheers - Al.
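For anyone curious what "allow NVIDIA GPUs to participate" looks like at the code level: LibTorch programs typically probe for CUDA at runtime and fall back to the CPU when no usable GPU is found, so one binary can serve both kinds of host. A minimal sketch, using a hypothetical stand-in layer and shapes rather than WCG's actual MAM1 code:

```cpp
// Minimal LibTorch (PyTorch C++ frontend) sketch of GPU/CPU fallback.
// The layer and tensor shapes are hypothetical stand-ins, not MAM1 code.
#include <torch/torch.h>
#include <iostream>

int main() {
    // Use an NVIDIA GPU when a CUDA build of LibTorch finds one,
    // otherwise fall back to the CPU.
    torch::Device device = torch::cuda::is_available()
                               ? torch::Device(torch::kCUDA)
                               : torch::Device(torch::kCPU);
    std::cout << "Running on: " << device << "\n";

    // A stand-in for the learning logic: a single linear layer.
    torch::nn::Linear layer(8, 4);
    layer->to(device);                           // move parameters to the device

    auto input = torch::randn({1, 8}, device);   // tensor on the same device
    auto output = layer->forward(input);
    std::cout << output.cpu() << "\n";           // bring result back to print
}
```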
----------------------------------------
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1293 | Status: Offline
It is nice to have an update on the operational status. I was checking earlier for one, as it has been a month since the last one. It looks like the team is working on making the system more stable. It is a good update. I'll copy it below for those who don't want to click to find it. The link to operational status updates is in the first post.

----------------------------------------
May 26, 2025

MCM1 Workunit Availability:
The recurring problem where older servers/VMs in our private cloud lose their DHCP lease on all interfaces and effectively go down has been causing the quorum of coordinator nodes that accept jobs and assign them to workers ("the scheduler") to stop accepting job submissions when this crash coincides with a second issue that renders the coordinators unable to recover their quorum. Building on the work we did to generate ARP1 and MAM1 workunits locally on WCG servers, we are migrating the MCM1 workunit delivery to Kubernetes, which should permanently resolve the issue and increase workunit supply overall. Initially we had planned to complete this work after the release of MAM1 7.05, which has been a significant refactor, but given the frequency of failures we are moving it up and will complete the migration this week.

MAM1 7.05 - Why is it Taking so Long?
MAM1 is being refactored to use LibTorch and to run also on NVIDIA GPUs. LibTorch vastly simplifies the checkpointing logic, and should resolve the unsafe memory access and checkpoint/resume crashes in previous beta releases as well as significantly improve performance.

[Edit 1 times, last edit by Unixchick at May 26, 2025 10:49:21 PM]
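On the "LibTorch vastly simplifies the checkpointing logic" point: in LibTorch a model and its optimizer can each be serialized and restored with a single call, which replaces most hand-written checkpoint code. A minimal sketch, with a hypothetical stand-in model and file names (not WCG's actual MAM1 code):

```cpp
// Minimal LibTorch checkpointing sketch: save/restore a model and its
// optimizer with one call each. Names below are hypothetical stand-ins.
#include <torch/torch.h>

int main() {
    torch::nn::Sequential model(
        torch::nn::Linear(16, 32),
        torch::nn::ReLU(),
        torch::nn::Linear(32, 1));
    torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.01);

    // ... training steps would run here ...

    // Checkpoint: serialize parameters and optimizer state to disk.
    torch::save(model, "checkpoint_model.pt");
    torch::save(optimizer, "checkpoint_optim.pt");

    // On resume (e.g. after the BOINC client restarts the task),
    // load restores both in place and training can continue.
    torch::load(model, "checkpoint_model.pt");
    torch::load(optimizer, "checkpoint_optim.pt");
}
```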
----------------------------------------
bfmorse
Senior Cruncher | US | Joined: Jul 26, 2009 | Post Count: 442 | Status: Offline
Unixchick:

Thanks for the update! It is GREATLY appreciated!
----------------------------------------
Bryn Mawr
Senior Cruncher | Joined: Dec 26, 2018 | Post Count: 384 | Status: Offline
Given that MAM is based on MCM, does this imply that all of the work will be moving to GPU in the near term?

I will not run a big GPU, and my GT710s are not exactly powerful :-)
----------------------------------------
Boca Raton Community HS
Senior Cruncher | Joined: Aug 27, 2021 | Post Count: 209 | Status: Offline
Quote:
> MAM1 is being refactored to use LibTorch and run also on NVIDIA GPUs.

This is an interesting development! More research completed faster is good news. Based on how it was written, it sounds like there will be both GPU and CPU versions of this work, but I am not 100% sure. Exciting changes! I wonder what the timeline is for these beta tasks?