Thread Status: Active
Total posts in this thread: 387
This topic has been viewed 48389 times and has 386 replies
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2494
Status: Offline
Re: Project Status (First Post Updated)

MCM is in "tasks are committed to other platforms" mode

ARP has either stopped or slowed way down in sending fresh WUs. I haven't gotten one in a while, and the reference number is decreasing.

I'll be on my backup project in 5 hours unless things change. Is it a Sahara weekend?
Yes, it's totally Sahara now. I can't even get the single MCM I need for tomorrow.
[May 25, 2025 12:33:57 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline
Re: Project Status (First Post Updated)

MCM is in "tasks are committed to other platforms" mode

ARP has either stopped or slowed way down in sending fresh WUs. I haven't gotten one in a while, and the reference number is decreasing.

I'll be on my backup project in 5 hours unless things change. Is it a Sahara weekend?
Yes, it's totally Sahara now. I can't even get the single MCM I need for tomorrow.

Dry, dry, dry.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[May 25, 2025 7:02:03 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2494
Status: Offline
Re: Project Status (First Post Updated)

Sahara event is over. New work is coming in now.
[May 25, 2025 8:09:26 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline
Re: Project Status (First Post Updated)

Sahara event is over. New work is coming in now.

Windows is getting units, but Linux remains dry.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[May 25, 2025 10:28:28 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

One of my Linux systems started getting [new] MCM1 work again at about 20:00 UTC on 2025-05-25 but it took until after 22:00 UTC for all of my systems to get back to normal cache levels.

The ARP1 drought seems to continue, however -- 2 tasks in the last 48 hours, and those were both missed deadline retries!

It would be nice to know why there are these long intervals without MCM1 work; is it an issue with work generation or is it an attempt to cope with backed up retries or are things dropping out of service?

Roll on the full commissioning of the new data centre. Hopefully things will improve then!

Cheers - Al.
[May 26, 2025 6:07:26 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

chiara.p of WCG posted a copy of the latest WCG Operational Status report in the MCM1 forum (as it is [mostly] about the MCM1 workflow issues and a possible resolution.) I'll leave it to Unixchick to decide whether to copy it in here as well...

The rest of the report was about the MAM1 beta, where the learning logic is being moved to LibTorch (the core of the popular PyTorch Python Deep Learning environment); one effect of this is that it will allow [some] NVIDIA GPUs to participate!

If they manage to deliver on both of these in a reasonable time frame, that's going to be a massive step forwards (especially if/when they do a LibTorch MCM1 version!)

Cheers - Al.
[May 26, 2025 8:17:29 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1293
Status: Offline
Re: Project Status (First Post Updated)

It is nice to have an update on the operational status. I was checking earlier for one, as it has been a month since the last one.

It looks like the team is working on making the system more stable. It is a good update.

I'll copy it below for those who don't want to click to find it. The link to the operational status updates is in the first post.

May 26, 2025

MCM1 Workunit Availability: A recurring problem causes older servers/VMs in our private cloud to lose their DHCP lease on all interfaces and effectively go down. When such a crash coincides with a second issue that leaves the coordinators unable to recover their quorum, the quorum of coordinator nodes that accept jobs and assign them to workers ("the scheduler") stops accepting job submissions. Building on the work we did to generate ARP1 and MAM1 workunits locally on WCG servers, we are migrating the MCM1 workunit delivery to Kubernetes, which should permanently resolve the issue and increase workunit supply overall. Initially we had planned to complete this work after the release of MAM1 7.05, which has been a significant refactor, but given the frequency of failures we are moving it up and will complete the migration this week.
MAM1 7.05 - Why is it Taking so Long? MAM1 is being refactored to use LibTorch and run also on NVIDIA GPUs. LibTorch vastly simplifies the checkpointing logic, and should resolve the unsafe memory access and the checkpointing and resume-from-checkpoint crashes seen in previous beta releases, as well as significantly improve performance.
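
As a rough illustration of the quorum behaviour described in the MCM1 item above, here is a minimal Python sketch. It is not WCG's scheduler: the `Scheduler` class, the `has_quorum` helper, the node names, and the "strict majority" rule are all assumptions made for the example, since the report does not say how the coordinators define their quorum.

```python
from __future__ import annotations


def has_quorum(alive: int, total: int) -> bool:
    """Assumed rule: a quorum is a strict majority of coordinator nodes."""
    return alive > total // 2


class Scheduler:
    """Toy scheduler (hypothetical): accepts jobs only while a quorum holds."""

    def __init__(self, coordinators: list[str]) -> None:
        # Every coordinator starts out reachable.
        self.alive = {name: True for name in coordinators}
        self.accepted: list[str] = []

    def mark_down(self, name: str) -> None:
        # Stand-in for a node dropping off the network, e.g. after
        # losing its DHCP lease on all interfaces.
        self.alive[name] = False

    def mark_up(self, name: str) -> None:
        # Stand-in for a node rejoining the coordinator group.
        self.alive[name] = True

    def submit(self, job: str) -> bool:
        # Refuse new work whenever fewer than a majority are reachable.
        if not has_quorum(sum(self.alive.values()), len(self.alive)):
            return False
        self.accepted.append(job)
        return True
```

With three coordinators, losing one still leaves a majority, so `submit` keeps returning `True`; losing a second makes `submit` return `False` until a node is marked up again. Under these assumptions, that is the "stops accepting job submissions" state the report describes.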
----------------------------------------
[Edit 1 times, last edit by Unixchick at May 26, 2025 10:49:21 PM]
[May 26, 2025 10:48:53 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 442
Status: Offline
Re: Project Status (First Post Updated)

Unixchick:

Thanks for the update!
It is GREATLY appreciated!
[May 26, 2025 11:40:45 PM]
Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 384
Status: Offline
Re: Project Status (First Post Updated)

Given that MAM is based on MCM, does this imply that all of the work will be moving to GPU in the near term?

I will not run a big GPU and my GT710s are not exactly powerful :-)
[May 27, 2025 1:01:00 AM]
Boca Raton Community HS
Senior Cruncher
Joined: Aug 27, 2021
Post Count: 209
Status: Offline
Re: Project Status (First Post Updated)

MAM1 is being refactored to use LibTorch and run also on NVIDIA GPUs.

This is an interesting development! More research completed faster is good news. Based on how it was written, it sounds like there will be both GPU and CPU versions of this work, but I am not 100% sure. Exciting changes! I wonder what the timeline is for these beta tasks?
[May 27, 2025 1:43:55 AM]