Thread Status: Active
Total posts in this thread: 102
Posts: 102   Pages: 11   [ Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page ]
This topic has been viewed 4438 times and has 101 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 937
Status: Offline
Re: Project Status (First Post Updated)

Adri,

That's a repeat of what happened with the [MCM1] assimilators back in early 2024 -- you and I had something to say about it in the "Are all assimilators running?" thread back then!
That's right, Al, I decided to take another look at that thread and it looks like 25% of the assimilators are down at the moment.

Adri
Saw (and replied to) your post in the "Assimilators" thread in the MCM1 forum; this time it looks as if the validator for WUs whose IDs are divisible by 4 has gone AWOL; no validation means no assimilation (but the assimilator might be dead as well!)

Further to my earlier remark about not seeing new work for WUs divisible by 4, I'm still only seeing new work for WUs with IDs not divisible by 4! I'd love to know how they spread the various daemons out across their VMs as that might explain why everything to do with a particular subset of WUs can seem to hang up at the same time...

I just hope it isn't a case of there being some database entries that are breaking one or more of the daemons...
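To illustrate the "divisible by 4" pattern, here is a minimal sketch of how work could be split across daemon instances by WU ID. It assumes four instances, each handling the WUs whose ID modulo 4 equals its index; that layout is only a guess at how WCG runs things, and the function names and WU IDs below are made up for illustration.

```python
from collections import Counter

# Assumption: one daemon instance per residue of (WU ID mod 4).
NUM_SHARDS = 4


def shard_of(wu_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the (hypothetical) daemon instance for this WU."""
    return wu_id % num_shards


def stalled_shards(pending_ids, processed_ids, num_shards: int = NUM_SHARDS):
    """Return shards that have pending WUs but no processed WUs at all."""
    pending = Counter(shard_of(i, num_shards) for i in pending_ids)
    processed = Counter(shard_of(i, num_shards) for i in processed_ids)
    return sorted(
        s for s in range(num_shards)
        if pending[s] > 0 and processed[s] == 0
    )


# Made-up IDs: everything divisible by 4 is waiting, nothing from it is done.
pending = [100, 104, 108, 101, 102]
validated = [101, 102, 103, 105, 106, 107]
print(stalled_shards(pending, validated))  # [0]
```

If one instance in such a setup dies, exactly one residue class (a quarter of the WUs) would stop moving while the rest carry on, which would match what we are seeing.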

Cheers - Al.
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Feb 25, 2025 12:08:05 PM]
[Feb 25, 2025 12:05:44 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 924
Status: Recently Active
Re: Project Status (First Post Updated)

Now would be a good time for an official update.

The best guess we have without an official update is that the server room (not under WCG control) is in the process of being upgraded. They got some of it done during the Dec-Jan shutdown, and are still working on it.
Since WCG doesn't have access to all of its computing power, it at first limited the number of ARP WUs. I think few to no ARP WUs are going out right now. I want to thank the tech team for prioritizing the extremes when they knew they couldn't send us much.

MCM has been issuing WUs from time to time, but not often enough. MCM also hasn't been handling the returned results in a timely manner, so your results list is probably longer than it normally is.

WCG is limping along right now, so it is time for some backup projects (set to 0%) until this is fixed.

Here is the link to the server room info (thanks, AgrFan). I'm going to put this in my first post too: https://docs.alliancecan.ca/wiki/Infrastructure_renewal
----------------------------------------
[Edit 1 times, last edit by Unixchick at Feb 26, 2025 3:41:19 PM]
[Feb 26, 2025 3:37:48 PM]
TLD
Veteran Cruncher
USA
Joined: Jul 22, 2005
Post Count: 801
Status: Offline
Re: Project Status (First Post Updated)

I received 3 ARP WUs yesterday 02/25/25, though they have been slowly released since the restart.
----------------------------------------

[Feb 26, 2025 5:46:01 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2139
Status: Offline
Re: Project Status (First Post Updated)

Distribution of MCM tasks seems to have stopped totally. I haven't received any new MCM tasks for hours.
[Feb 26, 2025 7:45:34 PM]
hunter1978
Advanced Cruncher
United States
Joined: Apr 24, 2010
Post Count: 110
Status: Offline
Re: Project Status (First Post Updated)

No tasks left for either project on my computers (sad).
----------------------------------------
[Edit 1 times, last edit by hunter1978 at Feb 26, 2025 11:22:41 PM]
[Feb 26, 2025 11:14:45 PM]
Boca Raton Community HS
Advanced Cruncher
Joined: Aug 27, 2021
Post Count: 123
Status: Offline
Re: Project Status (First Post Updated)

Distribution of MCM tasks seems to have stopped totally. I haven't received any new MCM tasks for hours.


Same. Maybe prepping for MAM? Well, that is what I am going to tell myself...
[Feb 27, 2025 12:21:37 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 924
Status: Recently Active
Re: Project Status (First Post Updated)

The server room posted this update today. It looks like it is running at 25% capacity, with other restrictions.

Feb 26, 2025 UPDATE: The Graham compute cluster is now running with reduced capacity, but some services are still unavailable:

GPU nodes are not yet available.
gra-vdi is not yet available.
/nearline is not yet available.
Job runtime is limited to a maximum of 3 days.

The site is actively working to restore these services. Users experiencing specific issues, such as job scheduling constraints, are encouraged to report them to technical support.

Graham is available for login, and user storage is accessible. However, project storage remains read-only while data migration is being completed.

Until the new Nibi system is available, the reduced Graham cluster will have a simplified scheduling configuration:

Jobs can be either CPU or GPU-based.
Available GPU types: V100, T4, A100, A5000.
Long jobs (over 3 days) will not be allowed.

Auxiliary services like Globus and gra-vdi will return as time permits. Graham Cloud remains operational during this period.
[Feb 27, 2025 1:22:27 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12324
Status: Offline
Re: Project Status (First Post Updated)

Totally out of work - shutting down when no personal work.

Mike
[Feb 27, 2025 3:07:53 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7642
Status: Recently Active
Re: Project Status (First Post Updated)

Totally out of work - shutting down when no personal work.

Mike

Likewise.
Totally dry

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Feb 27, 2025 3:55:24 AM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2139
Status: Offline
Re: Project Status (First Post Updated)

I'm getting a trickle of new (_0, _1) tasks now. A few here and there, but at least it's something, and enough to keep my computer crunching.
[Feb 27, 2025 4:54:51 PM]