World Community Grid Forums
Thread Status: Active | Total posts in this thread: 352
Link64
Senior Cruncher | Joined: Feb 19, 2021 | Post Count: 206 | Status: Offline
> Oh, I also review the RESENDS "waiting to start" on my systems 3 or 4 times a day and set them to running to TRY and avoid or at least minimize SERVER ABORTED status.

Why would you do that? To force your computer to waste cycles on something that has already been crunched? Server aborts are there to minimize the waste of computing resources and thereby speed up the project. Sorry, but minimizing them is simply stupid and the opposite of helping the project in the best way you can.

[Edit 5 times, last edit by Link64 at Jul 10, 2025 9:06:31 AM]
Mike.Gibson
Ace Cruncher, England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
bfmorse
I looked at the first one, and it seems that you received your copy as soon as the other copy had passed its due date, and that copy was then returned soon after. Any crunching by you would have been wasted, as they already had 2 valid results, so yours was aborted to spare your machine the effort.

Mike
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
As far as I am aware (if my memories of a past code dive are correct), the various "couldn't use" statuses such as "Server Aborted" and "Too Late" don't count as host errors so should not affect "reliable host" status!
As has already been said, there's no point worrying about Server Aborted tasks, especially if the above is true; some unneeded retries will run anyway (which is why the lack of a grace period for ARP1 is annoying, given the potentially unnecessary download traffic), but unless a client system fills up on retries (unlikely, I'd have thought) it shouldn't unduly hinder the [eventual] collection of new work and those retries that aren't for "No Reply" tasks.

Cheers - Al.
bfmorse
Senior Cruncher, US | Joined: Jul 26, 2009 | Post Count: 442 | Status: Recently Active
Well, I suppose I should put this in perspective...
I had 18 server-aborted work units (WUs) on July 9th, which, to me, is frustrating. When I started processing them, I had no idea that the original volunteers run an apparent queue of at least six days. (A reasonable queue with UNLIMITED as the cap on WUs could do that too.) That six-day delay in returning their WUs causes an automatic WU release to, in these cases, me.

But I should put my aborted work units in perspective: looking at "My Activity over time" on my "OVERVIEW" page, the Mapping Cancer Markers graph shows that I typically return successfully, and get credit for, at least 1,500 results each day. Other projects' WUs I successfully return are not included in those numbers.

All in all, my wish is that queues be monitored and not run so long as to cause a new WU to be released for someone else to process, thereby creating unnecessary work whether it is aborted or not.
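The arithmetic behind that complaint can be sketched in a few lines. This is an illustration only, not WCG server code; the 6-day deadline is the one discussed in this thread, and the crunch time is an assumed figure:

```python
# Illustrative sketch: a task sits behind everything already in the
# client's queue before it runs, so a queue as long as the deadline
# guarantees a late return, and the server releases a resend to
# another volunteer.

DEADLINE_DAYS = 6.0  # deadline discussed in this thread

def returns_late(queue_days: float, crunch_days: float = 0.1) -> bool:
    """True if queue wait plus crunch time exceeds the deadline."""
    return queue_days + crunch_days > DEADLINE_DAYS

print(returns_late(6.0))  # a six-day queue: late, a resend goes out
print(returns_late(1.2))  # a small cache: comfortably inside the deadline
```

Under these assumptions, any steady-state queue at or beyond the deadline length makes every task a late return, regardless of how fast the host actually crunches.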
Link64
Senior Cruncher | Joined: Feb 19, 2021 | Post Count: 206 | Status: Offline
> All in all, my wish is that queues be monitored and not run so long as to cause a new WU to be released for someone else to process, thereby creating unnecessary work whether it is aborted or not.

Well, we all wish that WCG had a steady WU supply so we could run with nearly no cache at all, but the reality is as it is: some people get frustrated and run caches that are too large, either because of that or because that's what worked OK for them on other projects in the past. Or they simply had their computer off for a couple of days; that can be another cause of late tasks. I've seen it a couple of times, when suddenly a whole batch of resends was aborted by the server.

Anyway, regardless of what we wish, we simply need to adapt to reality. The solution that works best for me is a cache size of 1.2 days. That's the largest cache with which resends do not go instantly into EDF mode (earliest deadline first), so I give the computers that had the original tasks the longest possible time to complete them, and the server the longest possible time to abort the resends before my computer starts processing them.

BTW, if you process 1,500 tasks/day, 18 aborted by the server isn't that much IMHO.

[Edit 2 times, last edit by Link64 at Jul 11, 2025 8:52:10 AM]
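For readers unfamiliar with EDF mode, here is a hedged sketch of the idea. It is a simplification of the BOINC client's actual scheduler (which simulates the whole workload, not just one task), and the resend deadline used below is a hypothetical number, not a measured WCG value:

```python
# Simplified model of the BOINC client's "earliest deadline first"
# switch: if the work already cached cannot drain before a newly
# assigned task's deadline, that task jumps the FIFO queue and runs
# immediately instead of waiting its turn.

def goes_edf_immediately(cache_days: float, deadline_days: float) -> bool:
    """True if a fresh task would preempt the FIFO queue on arrival."""
    # Assume the host crunches its cache front-to-back at a steady rate,
    # so the drain time equals the cache size in days.
    return cache_days > deadline_days

# With a hypothetical ~1.3-day resend deadline:
print(goes_edf_immediately(2.0, 1.3))  # larger cache: resend starts at once
print(goes_edf_immediately(1.2, 1.3))  # 1.2-day cache: resend waits its turn
```

This is why a cache just under the shortest resend deadline is the sweet spot Link64 describes: the resend sits in the queue long enough for the server to abort it if it turns out to be unneeded.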
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
> All in all, my wish is that queues be monitored and not run so long as to cause a new WU to be released for someone else to process, thereby creating unnecessary work whether it is aborted or not.

I agree with Link64 as to the reasons late returns happen, though I can think of other "innocent" reasons as well...

As for cutting down the number of retries that end up being unnecessary: in olden days (WCG/IBM), projects had a 24-hour grace period associated with them. The client didn't know about this (seeing only the deadline time), but the server would allow that grace period for No Reply cases before sending a retry. This made it far more likely that overdue results would end up aborted by the client (known as "Not started by deadline"[*1]), so the proportion of Server Aborted retries for tasks from applications with short run times (most WCG projects!) was a lot lower then...

However, the grace period was removed as part of clearing up the outstanding work before the IBM->Krembil migration, and it doesn't seem to have been restored. The ratio of delayed tasks that end up No Reply is a lot higher now. Some convert to Error status fairly quickly (as the client aborts them!) and some don't return before the WU is assimilated, but some of the rest will return late and validate; some of those will cause Server Aborts, but others will result in a WU having three (or more!) validated results!

Personally, I'd like to see them bring back the grace period and reduce the deadlines by a day at the same time. Nowadays I can't see any reason for a 6-day deadline on something like ARP1[*2], let alone MCM1 or [presumably] MAM1 when it finally launches, and shortening the deadline might help with the occasional problems with either backed-up validations or the issuing of [needed?] retries.

Cheers - Al.
*1 -- WCG don't give that as a status, but it is fairly easy to recognise which Error returns are actually "Not started by deadline".
*2 -- I would suggest that anyone who needs more than 72 to 96 wall-clock hours (from collection to completion), let alone 144, to run an ARP1 task probably shouldn't be running ARP1... :-)

[Edit 2 times, last edit by alanb1951 at Jul 11, 2025 1:57:32 PM]
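The grace-period behaviour Al describes can be sketched as follows. This is a hedged illustration of the described server-side rule, not actual BOINC or WCG server code; the dates are made up for the example:

```python
# Sketch of a server-side grace period: a "No Reply" retry is only
# generated once the deadline PLUS the grace period has passed, giving
# the original host time to return a late result or to abort the task
# client-side ("Not started by deadline") before a copy is resent.

from datetime import datetime, timedelta

GRACE = timedelta(hours=24)  # the old WCG/IBM grace period

def should_send_retry(deadline: datetime, now: datetime,
                      grace: timedelta = GRACE) -> bool:
    """True once an overdue task has also exhausted the grace period."""
    return now > deadline + grace

deadline = datetime(2025, 7, 10, 12, 0)
print(should_send_retry(deadline, datetime(2025, 7, 10, 18, 0)))  # overdue, still in grace
print(should_send_retry(deadline, datetime(2025, 7, 11, 13, 0)))  # grace expired: retry goes out
```

Without the grace period the condition collapses to `now > deadline`, which is why retries (and the Server Aborts they later attract) are issued so much sooner today.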
Link64
Senior Cruncher | Joined: Feb 19, 2021 | Post Count: 206 | Status: Offline
> Personally, I'd like to see them bring back the grace period and reduce the deadlines by a day at the same time -- nowadays I can't see any reason for a 6-day deadline on something like ARP1, let alone MCM1 or [presumably] MAM1 when it finally launches, and shortening the deadline might help with the occasional problems with either backed-up validations or the issuing of [needed?] retries.

Short deadlines are a major PITA for people running more than one BOINC project, in particular if they don't crunch 24/7, or maybe not even every day, for example not on weekends. When thinking about what can be improved, we must not forget that BOINC isn't just for those who build crunching farms, which can sometimes cost more than a car. But even for those power-crunchers, if they are running just WCG, short deadlines make it harder to keep their systems busy, since they can't build up a cache that lasts long enough to ride out at least most of the server hiccups.

The 6 days here are actually quite short already: Einstein has 14 days for their CPU WUs, Milkyway has 12 days, and both projects have nearly perfectly stable servers. As long as the results are not needed, for example, to create new WUs, there's no need for very short deadlines. AFAIK, only ARP WUs are created based on previous results, so there's not even a need for the shorter deadlines on MCM resends.
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
Again, I mostly agree with that, but as this is supposed to be a WCG Project Status thread I will back out now, rather than discussing issues for multi-project users and the lack of generally available, easy-to-use flow management for many BOINC projects.
However, I must note that I see a far higher percentage of WUs with missed-deadline wingmen at Einstein (which also has long deadlines for GPU work, by the way!) than at WCG. And when I had Milkyway as a main project rather than a reserve, I used to see higher percentages of WUs with missed deadlines there as well...

While this has been an interesting discussion, I don't actually see WCG changing anything about task deadlines (or re-introducing grace periods). So ARP1 will continue to clear work more slowly than it could, and some users will continue to be irritated by Server Aborted tasks...

Cheers - Al.

P.S. I run five "standard" systems, all of them deliberately power-limited and configured for custom workloads with fast turnaround; I turn in between 650 and 730 MCM1 tasks a day depending on how many ARP1 tasks my systems actually get! I do realize that not everyone can/will put in the effort to set systems up like that... :-)
Sgt.Joe
Ace Cruncher, USA | Joined: Jul 4, 2006 | Post Count: 7846 | Status: Offline
I run 3 systems at the moment with 52 threads available. If I have a steady supply, I return between 350 and 400 work units a day. Of that total, I return about 4 to 7 ARP units a day, once again based on the steadiness of supply. I have set the systems with fairly short caches, but big enough so I rarely run out unless there is an extended outage. I rarely see any server aborted units from either project, but they do crop up from time to time.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1294 | Status: Offline
I haven't seen any ARP WUs in a while. MCM has been flowing well.
We are getting close to the end of generation 148. I hope they will start 149 over the weekend.