World Community Grid Forums
Category: Official Messages | Forum: News | Thread: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024
Thread Status: Active | Total posts in this thread: 159
savas
Cruncher | Joined: Sep 21, 2021 | Post Count: 30 | Status: Offline
We have been working with the hosting team at SHARCNET today to identify potential bottlenecks and solutions to the issue of stalled downloads in the BOINC client for ARP1 and MCM1 workunit inputs. A failing drive on one of the download servers appears to have been a contributing factor, and SHARCNET have migrated that VM to a healthy host. However, there are additional measures we have taken and plan to take.
We have decreased the app weight of ARP1 relative to MCM1 in the feeder, and increased the number of automatic retries per HTTP connection to the backend download servers in the load balancer (HAProxy) configuration. We have also decreased the number of concurrent connections allowed per IP, as recorded in the stick-table, for the download server group only in HAProxy. If necessary, in the coming days we will also decrease the upper limit on workunits to produce and index in BOINC, which is currently set at 10,000.

The obvious high-leverage solutions to the problem, as suggested on the forums, are to scale out the download server group and to reduce the downloads for each ARP1 workunit to a single file. We have also taken steps, and will continue to take steps, to aggressively pursue complete file transfers after the first request from the BOINC client. Manual intervention from the user, such as clicking "Retry Now" or running "auto-clickers", should by design be essentially useless and provide negligible benefit. That is our goal, and we apologize that it is not already met.

In general, these HTTP errors occur when HAProxy cannot establish a connection with an unavailable or busy backend server, and thus a 503 Service Unavailable is returned. With the help of SHARCNET today, we have the hardware to scale downloads both out and up, and we are provisioning these additional servers now. HAProxy will be upgraded to a more recent version; we are specifically looking forward to the potential impact of the retry-on 503 and option redispatch directives (there is a sketch in the P.S. below). Until we can handle transient HTTP errors on our end, we will also look at adjusting the project backoff cadence for ARP1 and MCM1 to be less conservative.

With regard to deadlines: though it may not be reflected in your BOINC client, and we apologize to users whose deadlines we did not extend in time, we have extended deadlines by 5 days for every single ARP1 workunit in flight this week, on two occasions - once on Nov 4th and once today, Nov 6th - for all workunits with server_state=IN_PROGRESS in the BOINC result table (https://github.com/BOINC/boinc/wiki/BackendState). The earliest deadline for any ARP1 workunit in the result table of the BOINC db at the moment is Nov 8th, and we will extend proximate deadlines again tomorrow, Nov 7th, if necessary.

We also agree with feedback on the forums throughout this debacle that the deadlines are too short in general, and we will be extending the deadlines for ARP1 going forward once we get through this. We are currently thinking ~24h of additional deadline would work out okay, and we can adjust from there. Please provide feedback if you believe this is the wrong direction or the wrong duration. We appreciate all volunteer feedback on the forums around the issues since launch, even from those who are rightly upset by our shortcomings. Thank you.
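P.S. For those curious about the HAProxy specifics, the sketch below shows the general shape of the per-IP connection cap and the retry/redispatch behaviour described above. Treat it as illustrative only: the frontend/backend names, addresses, and limits are placeholders rather than our production configuration, and the retry-on directive needs HAProxy 2.0 or newer, which is part of why we are upgrading.

    # Frontend: track concurrent connections per source IP and cap them.
    frontend wcg_downloads
        bind *:80
        stick-table type ip size 1m expire 10m store conn_cur
        tcp-request connection track-sc0 src
        # Reject a new connection if this IP already has too many open.
        tcp-request connection reject if { sc0_conn_cur gt 20 }
        default_backend download_servers

    # Backend: retry failed attempts and allow redispatch to another server.
    backend download_servers
        balance roundrobin
        retries 3
        option redispatch
        # HAProxy 2.0+: also retry when a backend returns 503.
        retry-on conn-failure 503
        server dl1 192.0.2.11:80 check maxconn 500
        server dl2 192.0.2.12:80 check maxconn 500

The point of option redispatch is that a retried request may be sent to a different backend server rather than hammering the one that just failed, which is the behaviour we want for these stalled downloads.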
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 858 | Status: Offline
Great post full of information. Thank you for working on a solution!
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12142 | Status: Offline
savas
Good post, but why was this not anticipated and solved before restarting? We have had this problem every time we restart.

Mike
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 873 | Status: Offline
savas - thanks for this post -- the details are much appreciated (I used to be a "tech team" person before I retired [15+ years ago]...)
I'm not seeing any real change yet -- I still have substantial backlogs and it seems I can only clear them slowly (though I doubt I'd hit the IP-address-based limit unless the limit is draconian!)

Quoting savas:
"We also agree with feedback on the forums throughout this debacle that the deadlines are too short in general, and we will be extending the deadlines for ARP1 going forward once we get through this. We are currently thinking ~24h of additional deadline would work out okay, and we can adjust from there. Please provide feedback if you believe this is the wrong direction or the wrong duration. We appreciate all volunteer feedback on the forums around the issues since launch, even from those who are rightly upset by our shortcomings. Thank you."

Regarding the part I've quoted, I'd be delighted to see the return of the "grace day" (both for ARP1 and MCM1!) as it might slightly cut down on the number of unnecessary retries being served up. However, for MCM1 I'd actually like to see it combined with 1 day being taken off the underlying default, so that the client-side kill time for overdue tasks comes earlier whilst some No Reply retries might still be avoided.

For both ARP1 and MCM1, it's not clear how many users collect more work than they have any chance of processing -- that's why I'd like to see the MCM1 default deadline cut back, in the hope that the unwitting might find out why they are having problems whilst those who deliberately maintain large caches might be encouraged to reduce the size a little :-)

The situation is not so simple for ARP1, as any client-side deadline reduction might end up with totally unsuitable deadlines for retries unless the retry time-limit reduction factor is also reduced. Personally I'd like to see a 5-day plus 1 grace-day deadline for normal ARP1 tasks, a 3-day plus 0.5 or 1 grace-day deadline for retries and "Accelerated" tasks, and the retention of the 1.5-day deadline on "Extreme" tasks (with a possible 0.5 days grace?), but I suspect that having different grace-day counts for different categories might be a non-starter (in which case I'd suggest 0.5 days for each!)

I have a feeling that some sort of enforced ceiling on ARP1 tasks may also be needed (as it might help spread out the requests once things stabilize), although ideas like "1 per core" or similar that have been suggested in the past may not be easy to graft into the scheduler's task supply mechanism :-) so a more brute-force approach such as "no more than 16 or 32 at a time" might be needed (see the sketch in the P.S. below). It wouldn't help with users who want a bigger cache in case there's a system outage (or because there's a BOINC challenge), but so be it!

Cheers - Al.
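P.S. For illustration only: the sort of brute-force ceiling I have in mind could, I think, be expressed with the scheduler's per-app job limits in the project's config.xml, assuming WCG's scheduler build supports that block. The app name and the number below are guesses on my part, not anything the tech team has proposed.

    <max_jobs_in_progress>
        <app>
            <!-- hypothetical internal app name; WCG's actual name may differ -->
            <app_name>arp1</app_name>
            <!-- at most 32 ARP1 tasks in progress on any one host -->
            <total_limit>
                <jobs>32</jobs>
            </total_limit>
        </app>
    </max_jobs_in_progress>

I believe there is also a per-processor variant of that limit, which would be closer to the "1 per core" idea, but either way it would have to fit in with how WCG actually dispatches work.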
TLD
Veteran Cruncher | USA | Joined: Jul 22, 2005 | Post Count: 793 | Status: Offline
Thanks for the update; it looks like the WCG team has a good plan.
maeax
Advanced Cruncher | Joined: May 2, 2007 | Post Count: 142 | Status: Offline
I have two 64-core Threadrippers, and up to this Monday they had no problems finishing WCG tasks within your defined deadlines. They both run the identical BOINC version 8.0.2. We hope this new hardware and software on your WCG servers is the solution that lets this great project, WCG, continue. Thank you.
----------------------------------------
AMD Ryzen Threadripper PRO 3995WX 64-Cores/ AMD Radeon (TM) Pro W6600. OS Win11pro
danwat1234
Cruncher | Joined: Apr 18, 2020 | Post Count: 35 | Status: Offline
Could you reference the forum posts? TY for the updates. I did see some connection difficulties. 10K MCM work units compiled per day can't be right; around 700K are computed per day.

Is the 10K limit per day on ARP1 because it imposes more load on the servers due to the size and # of files per WU? Could you add to the project statistics pages the total number of work units compiled and ready to be sent to clients, and the approximate total number remaining (in the case of MCM, per stage if feasible)?
Boca Raton Community HS
Advanced Cruncher | Joined: Aug 27, 2021 | Post Count: 113 | Status: Offline
Thank you for this update - it was great information that addressed many of the concerns. Extended deadlines would be great!
Link64
Advanced Cruncher | Joined: Feb 19, 2021 | Post Count: 118 | Status: Offline
Quoting danwat1234:
"10K MCM work units compiled per day can't be right; around 700K are computed per day."

10K seems to be their ready-to-send buffer, something you see on other BOINC projects' server status pages, which is unfortunately missing here. It has nothing to do with the number of completed WUs per day.
andgra
Senior Cruncher | Sweden | Joined: Mar 15, 2014 | Post Count: 183 | Status: Offline
What a refreshing post, savas!!

This is exactly the type of info we like to get. Good luck in pursuing the problems further.
/andgra