World Community Grid - View Thread - Project Status (old)

World Community Grid Forums

Category: Community

Forum: Chat Room

Thread: Project Status (old)


Thread Status: Active
Total posts in this thread: 387


This topic has been viewed 57731 times and has 386 replies

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1337
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Discovering Dengue Drugs - Together

14 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

90 day badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for The Clean Energy Project - Phase 2

180 day badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

180 day badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

50 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

10 year badge for Microbiome Immunity Project

10 year badge for Africa Rainfall Project

10 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

Adri,

That's a repeat of what happened with the [MCM1] assimilators back in early 2024 -- you and I had something to say about it in the "Are all assimilators running?" thread back then!

So it looks as if they might be running 4 feeders (or 4 work generators for MCM1?) and the "0 mod 4" one isn't working. I haven't seen a task for a "0 mod 4" MCM1 WU since about 22:20 on 2025-02-23 and I haven't seen any ARP1 work since about the same time...

I have also seen a couple of tasks where my wingman missed the deadline and returned the NSD error (that WCG doesn't name as such), but no retry was queued. That's more extreme than having retries Waiting to be sent!!! The most recent one was WU 67096596 [MCM1] - initial wingman went NSD (2025-02-24 02:57 UTC) but no retry has been generated at the time of this posting. Please note that it, too, is a "0 mod 4" case, although that may be a coincidence!
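The "0 mod 4" pattern described above can be checked mechanically. This is a minimal sketch, assuming you have collected the work unit IDs of recently received tasks yourself; the helper name `missing_residues` and every sample ID except 67096596 are hypothetical.

```python
from collections import Counter

def missing_residues(wu_ids, modulus=4):
    """Return the residues (WU id mod modulus) that never occur in wu_ids."""
    seen = Counter(wu_id % modulus for wu_id in wu_ids)
    return [r for r in range(modulus) if seen[r] == 0]

# Hypothetical IDs of recently received tasks; none is divisible by 4.
recent = [67096597, 67096598, 67096599, 67096601, 67096602]
print(missing_residues(recent))   # [0]
print(67096596 % 4)               # 0, the stalled WU mentioned in the post
```

An empty result would mean all four residues are still being served. A persistent `[0]` over a large sample would be consistent with one of four partitions having stopped, though it cannot say which server component is responsible.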

I'm not sure, but that failure to generate a retry suggests that a transitioner backlog may have built up; when that happens, all sorts of odd side-effects can show up. The most common one is that properly coded work unit generators[*1] should hold off on their next iteration until they see that their most recent batch of new work has been "transitioned" (which could take some time!); another may be a failure to process returned task/WU state transitions in a timely fashion (so the failed task doesn't get flagged for a retry, and transitions through the stages of assimilation and purging might be held up...)
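The generator "safety net" mentioned above can be sketched roughly as follows. This is not BOINC's actual code: `wait_for_transitioner` and the `min_transition_time` callable are hypothetical stand-ins for whatever database lookup a real generator would use, and the sketch assumes a work unit's pending transition time moves past its creation time once the transitioner has processed it.

```python
import time

def wait_for_transitioner(min_transition_time, batch_created_at,
                          poll_seconds=5.0, sleep=time.sleep):
    """Block until the oldest pending transition is newer than our batch.

    min_transition_time: hypothetical callable returning the earliest
    pending transition time (epoch seconds) across all work units.
    Returns the number of polls performed.
    """
    polls = 0
    while True:
        polls += 1
        if min_transition_time() > batch_created_at:
            return polls          # transitioner has caught up past our batch
        sleep(poll_seconds)       # still backlogged: do not create more work

# Simulated transitioner that catches up on the third poll.
readings = iter([90.0, 95.0, 101.0])
polls = wait_for_transitioner(lambda: next(readings), batch_created_at=100.0,
                              sleep=lambda s: None)
print(polls)  # 3
```

A generator without such a check keeps creating work while the transitioner is behind, which is the kind of runaway the footnote below describes.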

All in all, not nice :-(

Cheers - Al.

*1 -- a while ago, MilkyWay had a situation where one of their work generators created millions of WUs, and it took ages to get back to normal. I had a look at what was [allegedly] their generator (which was very customized) and noted that it didn't have the "safety net" code that made it wait if the transitioner was backlogged (which it had been!), so I contacted the person then responsible for the server to let him know. (That was my first code-dive into BOINC server stuff, and I'm glad I don't have to maintain that stuff!)

----------------------------------------
[Edit 1 times, last edit by alanb1951 at Feb 24, 2025 4:41:12 PM]

[Feb 24, 2025 3:51:02 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1337
Status: Offline
Project Badges:


Re: Project Status (First Post Updated)

Unixchick -- your post appeared while I was creating mine :-)

If we had a subset of the basic BOINC server status page, one of the things it would tell us is the transitioner backlog in hours (which should ideally always be zero!), hence confirming or denying my suspicions; even Einstein's heavily modified server status page gives that information as it is easy to work out. (It's a really trivial database query to get the minimum pending transitioner request time; it is also queried in that generator safety net code!)
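As a rough illustration of how simple that query is, here is a sketch using an in-memory SQLite table. The table name `workunit` and column `transition_time` are modeled on my understanding of the BOINC schema and should be treated as assumptions, as should the function name and the sample data.

```python
import sqlite3

def transitioner_backlog_hours(conn, now):
    """Hours by which the oldest pending transition is overdue (0 if none)."""
    (oldest,) = conn.execute(
        "SELECT MIN(transition_time) FROM workunit").fetchone()
    if oldest is None or oldest >= now:
        return 0.0                # nothing pending, or nothing overdue yet
    return (now - oldest) / 3600.0

# In-memory stand-in for the project database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workunit (id INTEGER PRIMARY KEY, transition_time INTEGER)")
now = 1_740_400_000
conn.executemany("INSERT INTO workunit VALUES (?, ?)",
                 [(1, now - 7200), (2, now + 3600), (3, now - 1800)])
print(transitioner_backlog_hours(conn, now))  # 2.0
```

The oldest pending transition here is two hours overdue, so the backlog is reported as 2.0 hours; a healthy server would report 0.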

Much as I enjoy looking at evidence and trying to come up with explanations, I would much prefer to have proper "certified" information :-)

Cheers - Al.

[Feb 24, 2025 4:57:40 PM]

Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1315
Status: Offline
Project Badges:

180 day badge for Smash Childhood Cancer

45 day badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

1 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

I so agree, Al. I would love to make this thread obsolete. I would love a real status page. They can throttle it to update hourly if the db access is the issue. I'm guessing the small tech team is focused on fixing things and prepping new projects.

I complain about how things could be better, but I also try to remember to be grateful for what we have: a system that functions and gives us WUs from two projects, even if it isn't at the rate we would like. I think it must have been a BIG task to get WCG to run on different hardware, and then to take on the ARP task. I think it is amazing that they have a new project to give us too. It just reminds me that I need to make another monetary donation to the group.

[Feb 24, 2025 5:49:06 PM]

AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 397
Status: Offline
Project Badges:

5 year badge for Human Proteome Folding - Phase 2

90 day badge for Discovering Dengue Drugs - Together

90 day badge for Nutritious Rice for the World

90 day badge for The Clean Energy Project

2 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for Discovering Dengue Drugs - Together - Phase 2

10 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

5 year badge for Drug Search for Leishmaniasis

5 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

20 year badge for Mapping Cancer Markers

10 year badge for Uncovering Genome Mysteries

10 year badge for Outsmart Ebola Together

10 year badge for Smash Childhood Cancer

20 year badge for Microbiome Immunity Project


Re: Project Status (First Post Updated)

WCG capacity may be reduced due to the Graham infrastructure upgrade.
https://docs.alliancecan.ca/wiki/Infrastructure_renewal

Estimated By Feb 23, 2025 Upcoming Upcoming Graham (25%) Reduction
The start date for Graham's return to service, previously changed from January 16 to February 17, has been delayed.

Feb 18, 2025 UPDATE: The Graham compute cluster is now scheduled to reopen with reduced capacity. As of the latest update, the site is working to return Graham to service by the end of the week due to delays in receiving storage equipment. No action is required.

Graham is available for login, and user storage is accessible. However, project storage remains read-only while data migration is being completed. Storage migration is nearly complete, but additional capacity has been ordered and will be installed the week of February 3.

Until the new Nibi system is available, the reduced Graham cluster will have a simplified scheduling configuration: Jobs can be either CPU or GPU-based. Available GPU types: V100, T4, A100, A5000. Long jobs will not be allowed to run.

Auxiliary services like Globus and gra-vdi will return as time permits.

Graham Cloud remains operational during this period.

For more details, please check the status page and the Graham wiki page.

----------------------------------------

i5-10400 (Comet Lake, 6C/12T) @ 2.9 GHz
i5-7400 (Kaby Lake, 4C/4T) @ 3.0 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3330 (Ivy Bridge, 4C/4T) @ 3.0 GHz

----------------------------------------
[Edit 2 times, last edit by AgrFan at Feb 24, 2025 6:00:33 PM]

[Feb 24, 2025 5:54:32 PM]

Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1315
Status: Offline
Project Badges:


Re: Project Status (First Post Updated)

Thanks for the link AgrFan. I had forgotten about the upgrades. This is likely the issue.
Thus WCG is just dealing with the issues they have no control over.

I'm out of WUs at the moment. No ARP, and no MCM. I'm not sure whether others are getting MCM, as it says "committed to other platforms", so maybe MCMs are still going out?

[Feb 24, 2025 9:19:56 PM]

MJH333
Senior Cruncher
England
Joined: Apr 3, 2021
Post Count: 301
Status: Offline
Project Badges:

5 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

Unixchick said:

I'm out of WUs at the moment. No ARP, and no MCM.

Me too. I'm just getting "No tasks are available".
Back to backup project.
Cheers,
Mark

[Feb 24, 2025 9:27:37 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7854
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

14 day badge for Help Cure Muscular Dystrophy

2 year badge for Discovering Dengue Drugs - Together

2 year badge for Nutritious Rice for the World

14 day badge for The Clean Energy Project

10 year badge for Help Fight Childhood Cancer

45 day badge for Discovering Dengue Drugs - Together - Phase 2

2 year badge for The Clean Energy Project - Phase 2

200 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

100 year badge for Smash Childhood Cancer

2 year badge for Africa Rainfall Project

100 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

Dry on all systems here.

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Feb 24, 2025 9:50:25 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Project Badges:

45 day badge for Discovering Dengue Drugs - Together

14 day badge for Discovering Dengue Drugs - Together - Phase 2

5 year badge for The Clean Energy Project - Phase 2

90 day badge for Computing for Clean Water

45 day badge for Computing for Sustainable Water

5 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project


Re: Project Status (First Post Updated)

I received 3 MCM1 tasks 20 minutes ago and some 40 minutes ago, but not a full cache.

No ARP1.

Mike

[Feb 24, 2025 11:25:15 PM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2360
Status: Recently Active
Project Badges:

180 day badge for The Clean Energy Project - Phase 2

1 year badge for Computing for Clean Water

1 year badge for GO Fight Against Malaria

100 year badge for Mapping Cancer Markers

1 year badge for Uncovering Genome Mysteries

2 year badge for FightAIDS@Home - Phase 2

20 year badge for Smash Childhood Cancer

5 year badge for Microbiome Immunity Project

50 year badge for OpenPandemics - COVID-19


Re: Project Status (First Post Updated)

alanb1951 said:

Adri,

That's a repeat of what happened with the [MCM1] assimilators back in early 2024 -- you and I had something to say about it in the "Are all assimilators running?" thread back then!

That's right, Al, I decided to take another look at that thread and it looks like 25% of the assimilators are down at the moment.

Adri

[Feb 25, 2025 10:25:11 AM]

hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline
Project Badges:

45 day badge for Help Cure Muscular Dystrophy

1 year badge for Outsmart Ebola Together

90 day badge for FightAIDS@Home - Phase 2


Re: Project Status (First Post Updated)

alanb1951 said:

So it looks as if they might be running 4 feeders (or 4 work generators for MCM1?) and the "0 mod 4" one isn't working. I haven't seen a task for a "0 mod 4" MCM1 WU since about 22:20 on 2025-02-23 and I haven't seen any ARP1 work since about the same time...

Four feeders or work servers rings a bell actually. I remember some time in the past someone explaining this, and each feeder is responsible for the 0, 1, 2, 3 work unit generation (or feeding, not sure).

----------------------------------------

i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Feb 25, 2025 11:51:21 AM]
