Thread Status: Active
Total posts in this thread: 159
This topic has been viewed 11089 times and has 158 replies.
savas
Cruncher
Joined: Sep 21, 2021
Post Count: 30
Status: Offline
Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

We have been working with hosting at SHARCNET today to identify potential bottlenecks and solutions to the issue of stalled downloads in the BOINC client for ARP1 and MCM1 workunit inputs. A failing drive on one of the download servers appears to have been contributory, and SHARCNET have migrated this VM to a healthy host. However, there are additional measures we have taken and plan to take.

We have decreased the app weight of ARP1 relative to MCM1 in the feeder, and increased the number of automatic retries per HTTP connection to backend download servers in the load balancer config (HAProxy). We have also decreased the number of concurrent connections allowed per IP recorded in the stick-table for the download server group only in HAProxy. We will decrease the upper limit on workunits to produce and index in BOINC if necessary in the coming days, currently set at 10,000.
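For readers unfamiliar with per-IP connection caps, the idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class and method names are hypothetical, and it does not reflect how HAProxy stick-tables are actually implemented or configured.

```python
class PerIpConnectionLimiter:
    """Toy model of a cap on concurrent connections per source IP (hypothetical)."""

    def __init__(self, max_conns_per_ip: int) -> None:
        self.max_conns_per_ip = max_conns_per_ip
        self._open: dict[str, int] = {}  # ip -> number of currently open connections

    def try_open(self, ip: str) -> bool:
        """Admit a new connection if this IP is under its cap, otherwise reject it."""
        current = self._open.get(ip, 0)
        if current >= self.max_conns_per_ip:
            return False
        self._open[ip] = current + 1
        return True

    def close(self, ip: str) -> None:
        """Release one slot when a connection from this IP finishes."""
        current = self._open.get(ip, 0)
        if current <= 1:
            self._open.pop(ip, None)  # forget IPs with no open connections
        else:
            self._open[ip] = current - 1
```

Lowering `max_conns_per_ip` means a single host with many queued transfers holds fewer backend slots at once, leaving more room for other volunteers.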

Obvious high-leverage solutions, as suggested on the forums, are to scale out the download server group and to reduce the number of downloads for ARP1 workunits to a single file. We have also taken steps, and will continue to take steps, to aggressively pursue complete file transfers after the first request from the BOINC client. Manual intervention, such as clicking "Retry Now" or running "auto-clickers", should be essentially useless by design and provide negligible benefit. That is our goal, and we apologize that it is not already met.

In general, these HTTP errors are due to unavailable/busy backend servers that HAProxy cannot establish a connection with - thus a 503 service unavailable is returned. With the help of SHARCNET today, we have the hardware to scale downloads both out and up, and we are provisioning these additional servers now.

HAProxy will be upgraded to a more recent version; we specifically look forward to the potential impact of the retry-on 503 directive and the option redispatch directive. Until we can handle transient HTTP errors on our end, we will also look to adjust the project backoff cadence for ARP1 and MCM1 to be less conservative.
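As a rough illustration of why retrying on 503 and redispatching to another server should help, here is a minimal Python sketch. It is a conceptual model only: the function name, the callable "backends", and the round-robin choice of the next server are all assumptions made for the example, not a description of HAProxy's behaviour.

```python
from typing import Callable, Sequence


def fetch_with_redispatch(backends: Sequence[Callable[[], int]], retries: int) -> int:
    """Send a request and, on a 503, retry it against the next backend in turn.

    Each backend is modelled as a callable returning an HTTP status code.
    Returns the first non-503 status, or 503 if every attempt failed.
    """
    status = 503
    for attempt in range(retries + 1):
        # Redispatch: each retry goes to a different backend (round-robin here).
        backend = backends[attempt % len(backends)]
        status = backend()
        if status != 503:
            return status
    return status
```

With one busy server and one healthy server, a single retry is enough for the client to see a 200 instead of a 503; with no retries, the 503 is passed straight back to the BOINC client.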

With regard to deadlines, though it may not be reflected in your BOINC client and we apologize to users whose deadlines we did not extend in time, we have extended deadlines by 5 days for every single ARP1 workunit in flight this week, on two occasions. Once yesterday Nov 4th, and once today Nov 6th, for all workunits with server_state=IN_PROGRESS in the BOINC result table (https://github.com/BOINC/boinc/wiki/BackendState). The earliest deadline for any ARP1 workunit in the result table of the BOINC db at the moment is Nov 8th, and we will extend proximate deadlines again tomorrow Nov 7th if necessary.
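The kind of bulk extension described above could look roughly like the following Python sketch. It rests on several assumptions that are not confirmed by this post: a `result` table with `appid`, `server_state` and `report_deadline` columns, deadlines stored as Unix timestamps in seconds, and 4 as the numeric code for IN_PROGRESS. It uses SQLite purely so the example is easy to try locally and is not the actual procedure used on the WCG database.

```python
import sqlite3

IN_PROGRESS = 4           # assumed numeric code for server_state=IN_PROGRESS
SECONDS_PER_DAY = 86_400  # deadlines are assumed to be Unix timestamps in seconds


def extend_deadlines(conn: sqlite3.Connection, appid: int, days: int) -> int:
    """Push back the deadline of every in-progress result of one app.

    Returns the number of rows that were updated.
    """
    cur = conn.execute(
        "UPDATE result SET report_deadline = report_deadline + ? "
        "WHERE appid = ? AND server_state = ?",
        (days * SECONDS_PER_DAY, appid, IN_PROGRESS),
    )
    conn.commit()
    return cur.rowcount
```

Only rows still in progress for the chosen app are touched; results that are already reported or not yet sent keep their original deadlines.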

We also agree with feedback on the forums throughout this debacle that the deadlines are too short in general, and we will be extending the deadlines for ARP1 going forward once we get through this. We are currently thinking ~24h additional deadline would work out okay, and we can adjust from there. Please provide feedback if you believe this is the wrong direction or the wrong duration, we appreciate all volunteer feedback on the forums around the issues since launch even those who are rightly upset by our insufficiency. Thank you.
[Nov 6, 2024 9:38:17 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 858
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Great post full of information. Thank you for working on a solution!
[Nov 6, 2024 9:56:12 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12142
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

savas

Good post but why was this not anticipated and solved before restarting? We have had this problem every time we restart.

Mike
[Nov 6, 2024 10:09:42 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

savas - thanks for this post -- the details are much appreciated (I used to be a "tech team" person before I retired [15+ years ago]...)

I'm not seeing any real change yet -- I still have substantial backlogs and it seems I can only clear them slowly (though I doubt I'd hit the IP-address-based limit unless the limit is draconian!)
"We also agree with feedback on the forums throughout this debacle that the deadlines are too short in general, and we will be extending the deadlines for ARP1 going forward once we get through this. We are currently thinking ~24h additional deadline would work out okay, and we can adjust from there. Please provide feedback if you believe this is the wrong direction or the wrong duration, we appreciate all volunteer feedback on the forums around the issues since launch even those who are rightly upset by our insufficiency. Thank you."
Regarding the part I've quoted, I'd be delighted to see the return of the "grace day" (both for ARP1 and MCM1!) as it might slightly cut down on the numbers of unnecessary retries being served up. However, for MCM1 I'd actually like to see it combined with 1 day being taken off the underlying default so that the overdue task client-side kill time comes earlier whilst some No Reply retries might be avoided...

For both ARP1 and MCM1, it's not clear how many users collect more work than they have any chance of processing -- that's why I'd like to see the MCM1 default deadline cut back, in the hope that the unwitting might find out why they are having problems whilst those who deliberately maintain large caches might be encouraged to reduce the size a little :-)

The situation is not so simple for ARP1, as any client-side deadline reduction might end up with totally unsuitable deadlines for retries unless the retry time limit reduction factor is reduced. Personally I'd like to see a 5-day plus 1 grace day deadline for normal ARP1 tasks, a 3-day plus 0.5 or 1 grace day deadline for retries and "Accelerated" tasks, and the retention of the 1.5 day deadline on "Extreme" tasks (with a possible 0.5 days grace?...) but I suspect that having different grace-day counts for different categories might be a non-starter (in which case I'd suggest 0.5 days for each!)

I have a feeling that some sort of enforced ceiling on ARP1 tasks may also be needed (as it might help spread out the requests once things stabilize), although ideas like "1 per core" or similar that had been suggested in the past may not be easy to graft into the scheduler task supply mechanism :-) so a more brute-force approach such as "no more than 16 or 32 at a time" might be needed. It wouldn't help with users who want a bigger cache in case there's a system outage (or because there's a BOINC challenge) but so be it!

Cheers - Al.
[Nov 6, 2024 11:07:53 PM]
TLD
Veteran Cruncher
USA
Joined: Jul 22, 2005
Post Count: 793
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Thanks for the update, looks like the WCG team has a good plan.
[Nov 6, 2024 11:10:12 PM]
maeax
Advanced Cruncher
Joined: May 2, 2007
Post Count: 142
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

I have two 64-core Threadrippers and, up to this Monday, had no problems finishing WCG tasks within your defined deadline. Both run the identical BOINC version, 8.0.2. We are hoping that the new hardware and software on your WCG servers is the solution that lets this great WCG project continue. Thank you.
----------------------------------------
AMD Ryzen Threadripper PRO 3995WX 64-Cores/ AMD Radeon (TM) Pro W6600. OS Win11pro
----------------------------------------
[Edit 1 times, last edit by maeax at Nov 7, 2024 7:46:12 AM]
[Nov 6, 2024 11:37:33 PM]
danwat1234
Cruncher
Joined: Apr 18, 2020
Post Count: 35
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Could you reference the forum posts? TY for the updates; I did see some connection difficulties. 10K MCM work units compiled per day can't be right, around 700K are computed per day.
Is the 10K limit per day for ARP1, as it imposes more load on the servers due to the size and # of files per WU?
Could you add to the project statistics pages the total number of work units compiled and ready to be sent to clients, the approximate total number remaining and, in the case of the MCM project, the numbers per stage of MCM if feasible?
----------------------------------------
[Edit 2 times, last edit by danwat1234 at Nov 7, 2024 12:24:14 AM]
[Nov 7, 2024 12:15:13 AM]
Boca Raton Community HS
Advanced Cruncher
Joined: Aug 27, 2021
Post Count: 113
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

Thank you for this update; this was great information that addressed many of the concerns. Extended deadlines would be great!
[Nov 7, 2024 2:39:20 AM]
Link64
Advanced Cruncher
Joined: Feb 19, 2021
Post Count: 118
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

"10K MCM work units compiled per day can't be right, around 700K are computed per day."
10k seems to be their ready-to-send buffer, something you see on the server status page of other BOINC projects, which is unfortunately missing here. It has nothing to do with the number of completed WUs per day.
[Nov 7, 2024 9:37:39 AM]
andgra
Senior Cruncher
Sweden
Joined: Mar 15, 2014
Post Count: 183
Status: Offline
Re: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024

What a refreshing post savas!!
This is exactly the type of info we like to get.
Good luck in pursuing the problems further.
----------------------------------------
/andgra
[Nov 7, 2024 10:42:26 AM]