World Community Grid Forums
Thread Status: Active Total posts in this thread: 102
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 937 Status: Offline |
Adri,

That's a repeat of what happened with the [MCM1] assimilators back in early 2024 -- you and I had something to say about it in the "Are all assimilators running?" thread back then!

So it looks as if they might be running 4 feeders (or 4 work generators for MCM1?) and the "0 mod 4" one isn't working. I haven't seen a task for a "0 mod 4" MCM1 WU since about 22:20 on 2025-02-23, and I haven't seen any ARP1 work since about the same time...

I have also seen a couple of tasks where my wingman missed the deadline and returned the NSD error (that WCG doesn't name as such), but no retry was queued. That's more extreme than having retries Waiting to be sent!!! The most recent one was WU 67096596 [MCM1] - initial wingman went NSD (2025-02-24 02:57 UTC) but no retry has been generated at the time of this posting. Please note that it, too, is a "0 mod 4" case, although that may be a coincidence!

I'm not sure, but that failure to generate a retry suggests that perhaps a transitioner backlog has built up; if that happens, all sorts of odd side-effects might show up. The most common one is that properly coded work unit generators[*1] wait to iterate until they see that their most recent new work has been "transitioned" (which could take some time!); another may be failure to mark returned task/WU state transitions in a timely fashion (so the failed task doesn't get flagged for a retry, and transitions through the stages of assimilation and purging might be held up...)

All in all, not nice :-(

Cheers - Al.

*1 -- a while ago, MilkyWay had a situation where one of their work generators created millions of WUs, and it took ages to get back to normal. I had a look at what was [allegedly] their generator (which was very customized) and noted that it didn't have the "safety net" code that made it wait if the transitioner was backlogged (which it had been!), so I contacted the person then responsible for the server to let him know. (That was my first code-dive into BOINC server stuff, and I'm glad I don't have to maintain that stuff!)

[Edit 1 times, last edit by alanb1951 at Feb 24, 2025 4:41:12 PM] |
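For anyone curious what that generator "safety net" amounts to, here is a minimal Python sketch of the idea described in the post above: after submitting a batch, the generator polls until the oldest pending transition request is newer than the moment the batch was submitted. The function and parameter names (`wait_for_transitioner`, `min_transition_time`, `submitted_at`) are hypothetical and chosen for illustration only; this is not the actual BOINC or WCG code, and the database lookup is passed in as a callable so the sketch stays self-contained.

```python
import time
from typing import Callable


def wait_for_transitioner(
    min_transition_time: Callable[[], float],
    submitted_at: float,
    sleep: Callable[[float], None] = time.sleep,
    poll_seconds: float = 5.0,
    max_polls: int = 1000,
) -> int:
    """Block until the transitioner has caught up with a submitted batch.

    min_transition_time: hypothetical callable returning the earliest pending
        transition request time (epoch seconds) across all work units.
    submitted_at: epoch time at which the generator created its last batch.
    Returns the number of times we had to sleep before the backlog cleared.
    """
    polls = 0
    while polls < max_polls:
        # If the oldest pending request is later than our submission time,
        # every work unit we created has already been transitioned.
        if min_transition_time() > submitted_at:
            return polls
        sleep(poll_seconds)
        polls += 1
    # Assumption: a real generator might keep waiting or alert an operator;
    # here we simply give up so the sketch cannot loop forever.
    raise TimeoutError("transitioner still backlogged; not generating more work")
```

A generator loop would call this after each batch and only create more work once it returns. Without such a check, a generator facing a backlogged transitioner sees no new tasks appear and may keep creating work units, which matches the runaway scenario described in the footnote.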
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 937 Status: Offline |
Unixchick -- your post appeared while I was creating mine :-)
If we had a subset of the basic BOINC server status page, one of the things it would tell us is the transitioner backlog in hours (which should ideally always be zero!), hence confirming or denying my suspicions; even Einstein's heavily modified server status page gives that information, as it is easy to work out. (It's a really trivial database query to get the minimum pending transitioner request time; it is also queried in that generator safety net code!)

Much as I enjoy looking at evidence and trying to come up with explanations, I would much prefer to have proper "certified" information :-)

Cheers - Al. |
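As a rough illustration of how a backlog figure in hours could be derived from that single minimum value, here is a small Python sketch. The query shown in the comment and the function name `transitioner_backlog_hours` are assumptions made for illustration, not a statement of what the WCG or Einstein status pages actually run.

```python
def transitioner_backlog_hours(min_transition_time: float, now: float) -> float:
    """Convert the earliest pending transition time into a backlog in hours.

    min_transition_time: assumed to come from a query along the lines of
        SELECT MIN(transition_time) FROM workunit
    (epoch seconds). now: current epoch time in seconds.
    """
    # A pending time in the future means nothing is overdue: backlog is zero.
    return max(0.0, (now - min_transition_time) / 3600.0)


# Example: the oldest pending request was due two hours ago.
print(transitioner_backlog_hours(1_000_000.0, 1_000_000.0 + 7200.0))  # 2.0
```

A value that stays at zero means the transitioner is keeping up; a value that keeps growing would support the backlog theory.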
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 924 Status: Recently Active |
I so agree, Al. I would love to make this thread obsolete. I would love a real status page. They can throttle it to update hourly if the db access is the issue. I'm guessing the small tech team is focused on fixing things and prepping new projects.
I complain about how things could be better, but I also try to remember to be grateful for what we have: a system that functions and gives us WUs from two projects, even if it isn't at the rate we would like. I think it must have been a BIG task to get WCG to run on different hardware, and then to take on the ARP task. I think it is amazing that they have a new project to give us too. It just reminds me that I need to make another monetary donation to the group. |
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Recently Active |
WCG capacity may be reduced due to the Graham infrastructure upgrade.
https://docs.alliancecan.ca/wiki/Infrastructure_renewal

Estimated By Feb 23, 2025 Upcoming Upcoming Graham (25%) Reduction

The start date for the Graham's return to service, previously changed from January 16 to February 17, has been delayed

Feb 18, 2025 UPDATE: The Graham compute cluster is now scheduled to reopen with reduced capacity. As of the latest update, the site is working to return Graham to service by the end of the week due to delays in receiving storage equipment. No action is required.

Graham is available for login, and user storage is accessible. However, project storage remains read-only while data migration is being completed. Storage migration is nearly complete, but additional capacity has been ordered and will be installed the week of February 3.

Until the new Nibi system is available, the reduced Graham cluster will have a simplified scheduling configuration: Jobs can be either CPU or GPU-based. Available GPU types: V100, T4, A100, A5000. Long jobs will not be allowed to run. Auxiliary services like Globus and gra-vdi will return as time permits. Graham Cloud remains operational during this period.

For more details, please check the status page and the Graham wiki page.

[Edit 2 times, last edit by AgrFan at Feb 24, 2025 6:00:33 PM] |
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 924 Status: Recently Active |
Thanks for the link AgrFan. I had forgotten about the upgrades. This is likely the issue.
Thus WCG is just dealing with the issues they have no control over. I'm out of WUs at the moment. No ARP, and no MCM. I'm not sure if others are getting MCM as it says "committed to other platforms" so maybe MCMs are going out still ?? |
||
|
MJH333
Senior Cruncher England Joined: Apr 3, 2021 Post Count: 265 Status: Offline |
"I'm out of WUs at the moment. No ARP, and no MCM."

Me too. I'm just getting "No tasks are available". Back to backup project.

Cheers, Mark |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7642 Status: Recently Active |
Dry on all systems here.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12324 Status: Offline |
I received 3 MCM1 tasks 20 minutes ago and some 40 minutes ago, but not a full cache.
No ARP1. Mike |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2148 Status: Offline |
alanb1951 said:
Adri, That's a repeat of what happened with the [MCM1] assimilators back in early 2024 -- you and I had something to say about it in the "Are all assimilators running?" thread back then!

That's right, Al, I decided to take another look at that thread and it looks like 25% of the assimilators are down at the moment.

Adri |
||
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 792 Status: Offline |
alanb1951 said:
So it looks as if they might be running 4 feeders (or 4 work generators for MCM1?) and the "0 mod 4" one isn't working. I haven't seen a task for a "0 mod 4" MCM1 WU since about 22:20 on 2025-02-23 and I haven't seen any ARP1 work since about the same time...

Four feeders or work servers rings a bell, actually. I remember someone explaining this some time in the past: each feeder is responsible for one of the 0, 1, 2, 3 classes of work unit generation (or feeding, not sure).
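The "0 mod 4" observation can be checked against one's own task list with a few lines of Python. This is only a sketch of the arithmetic, assuming (as speculated in this thread, not confirmed by WCG) that work units are split across four processes by work unit ID modulo 4. Apart from WU 67096596, which is mentioned earlier in the thread, the IDs below are made up for illustration, and `silent_residues` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List


def silent_residues(wu_ids: Iterable[int], modulus: int = 4) -> List[int]:
    """Return the residue classes (ID mod modulus) with no work units seen."""
    seen = Counter(wu_id % modulus for wu_id in wu_ids)
    return [r for r in range(modulus) if seen[r] == 0]


# WU 67096596 from the thread falls in the "0 mod 4" class.
print(67096596 % 4)  # 0

# Hypothetical IDs of recently received tasks: classes 1, 2 and 3 only.
recent = [67096597, 67096598, 67096599, 67096601]
print(silent_residues(recent))  # [0]
```

A class that stays empty over a large enough sample of recent tasks would fit the idea that one of four feeders or generators has stopped; a small sample could miss a class by chance.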
|
||