Total posts in this thread: 102
This topic has been viewed 4439 times and has 101 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 937
Status: Offline
Re: Project Status (First Post Updated)

Adri,

That's a repeat of what happened with the [MCM1] assimilators back in early 2024 -- you and I had something to say about it in the "Are all assimilators running?" thread back then!

So it looks as if they might be running 4 feeders (or 4 work generators for MCM1?) and the "0 mod 4" one isn't working. I haven't seen a task for a "0 mod 4" MCM1 WU since about 22:20 on 2025-02-23 and I haven't seen any ARP1 work since about the same time...

I have also seen a couple of tasks where my wingman missed the deadline and returned the NSD error (that WCG doesn't name as such), but no retry was queued. That's more extreme than having retries Waiting to be sent!!! The most recent one was WU 67096596 [MCM1] - initial wingman went NSD (2025-02-24 02:57 UTC) but no retry has been generated at the time of this posting. Please note that it, too, is a "0 mod 4" case, although that may be a coincidence!
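For anyone who wants to check their own task lists for the same pattern, here is a minimal sketch. It assumes that "0 mod 4" refers to the work unit ID modulo 4 (which fits WU 67096596 above); the other IDs and the helper names `residue_counts` and `missing_residues` are made up for illustration.

```python
from collections import Counter

def residue_counts(wu_ids, modulus=4):
    """Count how many work unit IDs fall into each residue class."""
    counts = Counter(wu_id % modulus for wu_id in wu_ids)
    return {r: counts.get(r, 0) for r in range(modulus)}

def missing_residues(wu_ids, modulus=4):
    """Return the residue classes that never appear in the given IDs."""
    return [r for r, n in residue_counts(wu_ids, modulus).items() if n == 0]

# The WU quoted in the post is a "0 mod 4" case.
stalled_wu = 67096596
print(stalled_wu % 4)  # 0

# Hypothetical IDs of recently received work; none of them is 0 mod 4.
recent_ids = [67100001, 67100002, 67100003, 67100005, 67100006, 67100007]
print(missing_residues(recent_ids))  # [0]
```

If one residue class stays empty over a large sample of recent work, that is consistent with (though not proof of) one of four parallel processes being stalled.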

I'm not sure, but that failure to generate a retry suggests there may be a transitioner backlog; if that happens, all sorts of odd side-effects can show up. The most common one is that properly coded work unit generators[*1] wait before iterating until they see that their most recent new work has been "transitioned" (which could take some time!); another may be a failure to mark returned task/WU state transitions in a timely fashion (so the failed task doesn't get flagged for a retry, and transitions through the stages of assimilation and purging might be held up...)
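The "wait until transitioned" idea can be sketched roughly as follows. This is an illustration only, not BOINC's actual code: `wait_for_transitioner` and its parameters are hypothetical names, and it assumes the generator can ask for the earliest pending transition time and compare it with the moment it last created work.

```python
import time
from typing import Callable, Optional

def wait_for_transitioner(
    min_transition_time: Callable[[], float],
    created_at: float,
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until the earliest pending transition is later than created_at.

    Assumption: new work is stamped with a transition time no later than its
    creation time, and the transitioner pushes that stamp into the future once
    it has processed the work. Returns the number of polls that were needed.
    """
    polls = 0
    while True:
        polls += 1
        if min_transition_time() > created_at:
            return polls  # everything created so far has been transitioned
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("transitioner still backlogged")
        sleep(poll_seconds)

# Demo with fake readings: backlog clears on the third poll.
readings = iter([90.0, 95.0, 150.0])
polls = wait_for_transitioner(
    lambda: next(readings), created_at=100.0, sleep=lambda s: None
)
print(polls)  # 3
```

A generator without a check of this kind keeps creating work while the transitioner is behind, which is how a queue can balloon.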

All in all, not nice :-(

Cheers - Al.

*1 -- a while ago, MilkyWay had a situation where one of their work generators created millions of WUs, and it took ages to get back to normal. I had a look at what was [allegedly] their generator (which was very customized) and noted that it didn't have the "safety net" code that made it wait if the transitioner was backlogged (which it had been!), so I contacted the person then responsible for the server to let him know. (That was my first code-dive into BOINC server stuff, and I'm glad I don't have to maintain that stuff!)
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Feb 24, 2025 4:41:12 PM]
[Feb 24, 2025 3:51:02 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 937
Status: Offline

Unixchick -- your post appeared while I was creating mine :-)

If we had a subset of the basic BOINC server status page, one of the things it would tell us is the transitioner backlog in hours (which should ideally always be zero!), hence confirming or denying my suspicions; even Einstein's heavily modified server status page gives that information as it is easy to work out. (It's a really trivial database query to get the minimum pending transitioner request time; it is also queried in that generator safety net code!)
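As a rough sketch of that query, here is a self-contained version using an in-memory SQLite database. The table name `workunit` and column `transition_time` follow my recollection of the BOINC schema and should be treated as assumptions; a real server would run the equivalent query against its own database, and `transitioner_backlog_hours` is a hypothetical helper name.

```python
import sqlite3

def transitioner_backlog_hours(conn, now):
    """Hours by which the oldest pending transition is overdue (0 if none)."""
    row = conn.execute("SELECT MIN(transition_time) FROM workunit").fetchone()
    oldest = row[0]
    if oldest is None:  # empty table: nothing is waiting
        return 0.0
    return max(0.0, (now - oldest) / 3600.0)

# Toy data: one WU overdue by two hours, one due tomorrow, one 10 minutes late.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workunit (id INTEGER PRIMARY KEY, transition_time INTEGER)"
)
now = 1_740_000_000  # arbitrary Unix timestamp used as "now"
conn.executemany(
    "INSERT INTO workunit VALUES (?, ?)",
    [(1, now - 7200), (2, now + 86400), (3, now - 600)],
)
print(transitioner_backlog_hours(conn, now))  # 2.0
```

A healthy server would report 0 here, since every pending transition time would lie in the future.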

Much as I enjoy looking at evidence and trying to come up with explanations, I would much prefer to have proper "certified" information :-)

Cheers - Al.
[Feb 24, 2025 4:57:40 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 924
Status: Recently Active

I so agree, Al. I would love to make this thread obsolete. I would love a real status page. They can throttle it to update hourly if DB access is the issue. I'm guessing the small tech team is focused on fixing things and prepping new projects.

I complain about how things could be better, but I also try to remember to be grateful for what we have: a system that functions and gives us WUs from two projects, even if it isn't at the rate we would like. I think it must have been a BIG task to get WCG to run on different hardware, and then to take on the ARP task. I think it is amazing that they have a new project to give us too. It just reminds me that I need to make another monetary donation to the group.
[Feb 24, 2025 5:49:06 PM]
AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 376
Status: Recently Active

WCG capacity may be reduced due to the Graham infrastructure upgrade.
https://docs.alliancecan.ca/wiki/Infrastructure_renewal

Estimated By Feb 23, 2025 Upcoming Upcoming Graham (25%) Reduction
The start date for Graham's return to service, previously changed from January 16 to February 17, has been delayed.

Feb 18, 2025 UPDATE: The Graham compute cluster is now scheduled to reopen with reduced capacity. As of the latest update, the site is working to return Graham to service by the end of the week due to delays in receiving storage equipment. No action is required.

Graham is available for login, and user storage is accessible. However, project storage remains read-only while data migration is being completed. Storage migration is nearly complete, but additional capacity has been ordered and will be installed the week of February 3.

Until the new Nibi system is available, the reduced Graham cluster will have a simplified scheduling configuration: Jobs can be either CPU or GPU-based. Available GPU types: V100, T4, A100, A5000. Long jobs will not be allowed to run.

Auxiliary services like Globus and gra-vdi will return as time permits.

Graham Cloud remains operational during this period.

For more details, please check the status page and the Graham wiki page.
----------------------------------------
[Edit 2 times, last edit by AgrFan at Feb 24, 2025 6:00:33 PM]
[Feb 24, 2025 5:54:32 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 924
Status: Recently Active

Thanks for the link AgrFan. I had forgotten about the upgrades. This is likely the issue.
Thus WCG is just dealing with the issues they have no control over.

I'm out of WUs at the moment. No ARP, and no MCM. I'm not sure whether others are getting MCM; the message says "committed to other platforms", so maybe MCMs are still going out?
[Feb 24, 2025 9:19:56 PM]
MJH333
Senior Cruncher
England
Joined: Apr 3, 2021
Post Count: 265
Status: Offline

Unixchick said:
I'm out of WUs at the moment. No ARP, and no MCM.

Me too. I'm just getting "No tasks are available".
Back to backup project.
Cheers,
Mark
[Feb 24, 2025 9:27:37 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7642
Status: Recently Active

Dry on all systems here.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Feb 24, 2025 9:50:25 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12324
Status: Offline

I received 3 MCM1 tasks 20 minutes ago and some more 40 minutes ago, but not a full cache.

No ARP1.

Mike
[Feb 24, 2025 11:25:15 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2148
Status: Offline

alanb1951 said:
Adri,

That's a repeat of what happened with the [MCM1] assimilators back in early 2024 -- you and I had something to say about it in the "Are all assimilators running?" thread back then!

That's right, Al. I decided to take another look at that thread, and it looks like 25% of the assimilators are down at the moment.

Adri
[Feb 25, 2025 10:25:11 AM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 792
Status: Offline

alanb1951 said:
So it looks as if they might be running 4 feeders (or 4 work generators for MCM1?) and the "0 mod 4" one isn't working. I haven't seen a task for a "0 mod 4" MCM1 WU since about 22:20 on 2025-02-23 and I haven't seen any ARP1 work since about the same time...


Four feeders or work servers rings a bell, actually. I remember someone explaining this some time ago: each feeder is responsible for one of the 0, 1, 2, 3 (mod 4) groups of work units, either for generation or for feeding (I'm not sure which).
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Feb 25, 2025 11:51:21 AM]