World Community Grid - View Thread - Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

World Community Grid Forums

Category: Official Messages

Forum: News

Thread: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Quick Go »

No member browsing this thread

Thread Status: Active
Total posts in this thread: 159

[ ]

Author

This topic has been viewed 17399 times and has 158 replies

savas
Cruncher
Joined: Sep 21, 2021
Post Count: 34
Status: Offline


Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

We have been working with hosting at SHARCNET today to identify potential bottlenecks and solutions to the issue of stalled downloads in the BOINC client for ARP1 and MCM1 workunit inputs. A failing drive on one of the download servers appears to have been contributory, and SHARCNET have migrated this VM to a healthy host. However, there are additional measures we have taken and plan to take.

We have decreased the app weight of ARP1 relative to MCM1 in the feeder, and increased the number of automatic retries per HTTP connection to backend download servers in the load balancer config (HAProxy). We have also decreased the number of concurrent connections allowed per IP recorded in the stick-table for the download server group only in HAProxy. We will decrease the upper limit on workunits to produce and index in BOINC if necessary in the coming days, currently set at 10,000.

Obvious high leverage solutions to the problem as suggested on the forums are to scale out the download server group and reduce the number of downloads for ARP1 workunits to a single file. Also, we have taken steps and will continue to take steps to aggressively pursue complete file transfers after the first request from the BOINC client. Manual intervention from the user clicking "Retry Now" and running "auto-clickers" should be essentially useless by design and provide negligible benefit - that is our goal, we apologize that it is not already met.

In general, these HTTP errors are due to unavailable/busy backend servers that HAProxy cannot establish a connection with - thus a 503 service unavailable is returned. With the help of SHARCNET today, we have the hardware to scale downloads both out and up, and we are provisioning these additional servers now.

HAProxy will be upgraded to a more recent version, we specifically look forward to the potential impact of the retry-on 503 directive and option redispatch directive. Until we can handle transient HTTP errors on our end, we will also look to adjust the project backoff cadence for ARP1 and MCM1 to be less conservative.

With regard to deadlines, though it may not be reflected in your BOINC client and we apologize to users whose deadlines we did not extend in time, we have extended deadlines by 5 days for every single ARP1 workunit in flight this week, on two occasions. Once yesterday Nov 4th, and once today Nov 6th, for all workunits with server_state=IN_PROGRESS in the BOINC result table (https://github.com/BOINC/boinc/wiki/BackendState). The earliest deadline for any ARP1 workunit in the result table of the BOINC db at the moment is Nov 8th, and we will extend proximate deadlines again tomorrow Nov 7th if necessary.

We also agree with feedback on the forums throughout this debacle that the deadlines are too short in general, and we will be extending the deadlines for ARP1 going forward once we get through this. We are currently thinking ~24h additional deadline would work out okay, and we can adjust from there. Please provide feedback if you believe this is the wrong direction or the wrong duration, we appreciate all volunteer feedback on the forums around the issues since launch even those who are rightly upset by our insufficiency. Thank you.

[Nov 6, 2024 9:38:17 PM]

Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1014
Status: Offline
Project Badges:

180 day badge for Smash Childhood Cancer

45 day badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

1 year badge for OpenPandemics - COVID-19


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Great post full of information. Thank you for working on a solution!

[Nov 6, 2024 9:56:12 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12455
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

45 day badge for Discovering Dengue Drugs - Together

14 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

90 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Discovering Dengue Drugs - Together - Phase 2

5 year badge for The Clean Energy Project - Phase 2

90 day badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

180 day badge for GO Fight Against Malaria

45 day badge for Computing for Sustainable Water

20 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project

10 year badge for Africa Rainfall Project

10 year badge for OpenPandemics - COVID-19


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

savas

Good post but why was this not anticipated and solved before restarting? We have had this problem every time we restart.

Mike

[Nov 6, 2024 10:09:42 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 995
Status: Offline
Project Badges:

14 day badge for Discovering Dengue Drugs - Together

1 year badge for The Clean Energy Project - Phase 2

180 day badge for Computing for Clean Water

14 day badge for Computing for Sustainable Water

50 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries

10 year badge for FightAIDS@Home - Phase 2

10 year badge for Microbiome Immunity Project


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

savas - thanks for this post -- the details are much appreciated (I used to be a "tech team" person before I retired [15+ years ago]...)

I'm not seeing any real change yet -- I still have substantial backlogs and it seems I can only slowly clear it (though I doubt I'd hit the IP address based limit unless the limit is draconian!)

We also agree with feedback on the forums throughout this debacle that the deadlines are too short in general, and we will be extending the deadlines for ARP1 going forward once we get through this. We are currently thinking ~24h additional deadline would work out okay, and we can adjust from there. Please provide feedback if you believe this is the wrong direction or the wrong duration, we appreciate all volunteer feedback on the forums around the issues since launch even those who are rightly upset by our insufficiency. Thank you.

Regarding the part I've quoted, I'd be delighted to see the return of the "grace day" (both for ARP1 and MCM1!) as it might slightly cut down on the numbers of unnecessary retries being served up. However, for MCM1 I'd actually like to see it combined with 1 day being taken off the underlying default so that the overdue task client-side kill time comes earlier whilst some No Reply retries might be avoided...

For both ARP1 and MCM1, it's not clear how many users collect more work than they have any chance of processing -- that's why I'd like to see the MCM1 default deadline cut back, in the hope that the unwitting might find out why they are having problems whilst those who deliberately maintain large caches might be encouraged to reduce the size a little :-)

The situation is not so simple for ARP1, as any client-side deadline reduction might end up with totally unsuitable deadlines for retries unless the retry time limit reduction factor is reduced. Personally I'd like to see a 5-day plus 1 grace day deadline for normal ARP1 tasks, a 3-day plus 0.5 or 1 grace day deadline for retries and "Accelerated" tasks, and the retention of the 1.5 day deadline on "Extreme" tasks (with a possible 0.5 days grace?...) but I suspect that having different grace-day counts for different categories might be a non-starter (in which case I'd suggest 0.5 days for each!)

I have a feeling that some sort of enforced ceiling on ARP1 tasks may also be needed (as it might help spread out the requests once things stabilize), although ideas like "1 per core" or similar that had been suggested in the past may not be easy to graft into the scheduler task supply mechanism :-) so a more brute-force approach such as "no more than 16 or 32 at a time" might be needed. It wouldn't help with users who want a bigger cache in case there's a system outage (or because there's a BOINC challenge) but so be it!

Cheers - Al.

[Nov 6, 2024 11:07:53 PM]

TLD
Veteran Cruncher
USA
Joined: Jul 22, 2005
Post Count: 810
Status: Offline
Project Badges:

90 day badge for Discovering Dengue Drugs - Together

90 day badge for Nutritious Rice for the World

45 day badge for The Clean Energy Project

1 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

1 year badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for Computing for Clean Water

180 day badge for Drug Search for Leishmaniasis

100 year badge for Mapping Cancer Markers

14 day badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

2 year badge for FightAIDS@Home - Phase 2

5 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Thanks for the update, looks like the WCG team has a good plan.

----------------------------------------

[Nov 6, 2024 11:10:12 PM]

maeax
Advanced Cruncher
Joined: May 2, 2007
Post Count: 142
Status: Offline
Project Badges:

90 day badge for Human Proteome Folding - Phase 2

45 day badge for Nutritious Rice for the World

90 day badge for Help Fight Childhood Cancer

180 day badge for The Clean Energy Project - Phase 2

90 day badge for Drug Search for Leishmaniasis

90 day badge for GO Fight Against Malaria

200 year badge for Mapping Cancer Markers

180 day badge for Uncovering Genome Mysteries

50 year badge for OpenPandemics - COVID-19


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Have two 64 Core Threadripper and up to this Monday, no problems to finish WCG Tasks in the limit of your defined deadline. They have both identical Boinc Vers. 8.0.2. We hoping this new Hard- and software on your WCG Servers is the solution to continue this great project WCG. Thank you.

----------------------------------------

AMD Ryzen Threadripper PRO 3995WX 64-Cores/ AMD Radeon (TM) Pro W6600. OS Win11pro

----------------------------------------
[Edit 1 times, last edit by maeax at Nov 7, 2024 7:46:12 AM]

[Nov 6, 2024 11:37:33 PM]

danwat1234
Cruncher
Joined: Apr 18, 2020
Post Count: 39
Status: Offline
Project Badges:

50 year badge for Smash Childhood Cancer


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Could you reference the forum posts? TY for the updates I did see some connection difficulties. 10K MCM work units compiled per day can't be right, around 700K are computed per day.
10K ARP1 limit per day as it imposes more load on the servers due to size and # of files per WU?
Could you add to project statistics pages the total number of Work Units compiled ready to be sent to clients and approximate total number remaining and in the case of MCM project, per stage of MCM if feasible?

----------------------------------------
[Edit 2 times, last edit by danwat1234 at Nov 7, 2024 12:24:14 AM]

[Nov 7, 2024 12:15:13 AM]

Boca Raton Community HS
Senior Cruncher
Joined: Aug 27, 2021
Post Count: 153
Status: Offline
Project Badges:

10 year badge for Smash Childhood Cancer

20 year badge for OpenPandemics - COVID-19


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Thank you for this update- this was great information that addressed many of the concerns. Extended deadlines would be great!

[Nov 7, 2024 2:39:20 AM]

Link64
Advanced Cruncher
Joined: Feb 19, 2021
Post Count: 137
Status: Offline
Project Badges:

14 day badge for Microbiome Immunity Project

14 day badge for Africa Rainfall Project

14 day badge for OpenPandemics - COVID-19


Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

10K MCM work units compiled per day can't be right, around 700K are computed per day.

10k seems to be their ready to send buffer, something you see on other BOINC projects on the server status page, which is unfortunately missing here. It has nothing to do with the amount of completed WUs per day.

----------------------------------------