World Community Grid Forums
Category: Official Messages | Forum: News
Thread: Regarding ARP1 and MCM1 download issues since ARP1's launch on Monday Nov 4th, 2024
Thread Status: Active | Total posts in this thread: 159
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 874 | Status: Offline
> > I suspect that if the current WCG had access to more [and more reliable!] data centre resources (bandwidth and hardware) things would be a lot better
>
> They don't necessarily need more resources; it is possible to run WCG on what they have, or even less. It is just a matter of properly configuring the feeder. The frequency at which the feeder queue is refilled, and the mix of WUs, must simply match the capabilities of the entire system. Savas already said in the first post that they did something about it, but obviously it didn't help, and I don't think they have tried anything more since then.

That is why I mentioned better resources -- I don't think they'll be able to keep up the past higher rates of MCM1 work in tandem with ARP1 (or the return of OPNG, or the introduction of the new project TigerLily mentioned before leaving) given the obvious constraints at the data centre.

The problem of missed deadlines also needs to be addressed one way or another (using grace days being preferable [in my view] to just giving longer deadlines at the client end) -- even before the current debacle, far too many retries for MCM1 were going out only to end up Server Aborted or to become a third valid result when a late responder replied. And now there are download errors creating extra retries, which has caused quite a few ARP1 WUs to fail unnecessarily!

Doing something about deadlines (shorter at the client, grace days at the server?) might also help regulate the mass downloading of work into large buffers. (The percentage of missed deadlines amongst my wingmen was noticeable even when only MCM1 work was available [when download/upload times couldn't be blamed!])

Cheers - Al.

P.S. A cursory analysis of the daily changes (and associated discrepancies) in the ARP1 generations.txt statistics file Kevin Reed set up suggests that we may have had several hundred units "lost"; an accurate figure would require access to the BOINC database [no chance! :-)]

[Edit 2 times, last edit by alanb1951 at Nov 18, 2024 11:40:40 PM]
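For concreteness: in a stock BOINC server, the feeder pacing and "grace days" ideas discussed above map onto config.xml and the feeder command line. A minimal sketch, assuming stock option names (values are purely illustrative; WCG's actual configuration is not public):

    <boinc>
        <config>
            <!-- size of the shared-memory job array the scheduler serves work from -->
            <shmem_work_items>2000</shmem_work_items>
            <!-- server-side deadline slack ("grace days"): late results arriving
                 within this window can still be used before retries go out -->
            <grace_period_hours>48</grace_period_hours>
        </config>
        <daemons>
            <daemon>
                <!-- --allapps keeps a weighted per-application share of the job
                     array (the "mix of WUs"); --sleep_interval paces the refills -->
                <cmd>feeder --allapps --sleep_interval 5</cmd>
            </daemon>
        </daemons>
    </boinc>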
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7581 | Status: Offline
> I agree about throttling the feeder more effectively (given the infrastructure they have got) -- however, that would most likely result in significantly less work being available per day. Cue "lack of work" complaints (some justified, some selfish) from various different parts of the user community...

Only the ARP units need to be throttled. Apparently the infrastructure supporting MCM is adequate even with high numbers of work units being released. I would hazard a guess that complaints about a lack of ARP work would be considerably fewer than complaints about the current bandwidth problems.

Cheers
Sgt. Joe
*Minnesota Crunchers*
Greg_BE
Advanced Cruncher | Joined: May 9, 2016 | Post Count: 80 | Status: Offline
> > I agree about throttling the feeder more effectively (given the infrastructure they have got) -- however, that would most likely result in significantly less work being available per day. Cue "lack of work" complaints (some justified, some selfish) from various different parts of the user community...
>
> Only the ARP units need to be throttled. Apparently the infrastructure supporting MCM is adequate even with high numbers of work units being released. I would hazard a guess that complaints about a lack of ARP work would be considerably fewer than complaints about the current bandwidth problems. Cheers

MCM is hopeless as well. I have 12 tasks plus a handful more running, and the 12 are locked up trying to upload. WCG as a whole has fallen apart yet again. They had a good run where everything worked fine, and now all of a sudden they can't handle the traffic. Are they really that incapable of future-proofing? I think I am going to take a break from WCG until January. Having tasks trying to upload for 3 days (ARP), and now MCM too, is just annoying the he-double-toothpicks out of me.

Throttling is a band-aid. More bandwidth capability is what is needed, and with ARP producing 6 files for one task, as I asked in another area: can't they be gzipped, then unpacked and repacked on the local system? I have seen this, or something like it, on some of the other projects I run that have large files. And how is it that WCG has so much trouble if they are part of Krembil? Does the parent organization not want to fund WCG? I mean, a new project that I was part of for a time could handle being bombarded with users uploading and downloading tasks, no problem.
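On the compression question: one stock way to get compressed transfers without changing the science apps is on-the-fly compression via Apache's mod_deflate on the download servers. A hedged sketch (directory path illustrative; this only helps if the ARP1 files are actually compressible and the client advertises gzip support):

    <Directory "/var/boinc/download">
        SetOutputFilter DEFLATE
        # do not recompress formats that are already compressed
        SetEnvIfNoCase Request_URI "\.(gz|zip|jpg|png)$" no-gzip
    </Directory>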
bfmorse
Senior Cruncher | US | Joined: Jul 26, 2009 | Post Count: 294 | Status: Offline
I no longer have the time or inclination to manually nurse WUs to and from my system, especially since the success rate has dropped with the recent modifications made to the software.

Therefore, I have revised all my profiles to limit MCM and ARP to only one or two work units each. At least this way I do not expect to get overrun with errored-out WUs stuck in either upload or download. My queue is still set to zero and will remain there until the system proves it will enforce the due dates - after the data flow has been stabilized. Eventually I expect things to change with regard to data flow to and from the volunteers; "how soon" is wishful thinking. At that point I will re-evaluate the profile and queue values. In the meantime, I continue to crunch, but at a much slower rate, and remain committed to the projects.
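For anyone wanting a client-side cap to complement such a profile change: a stock app_config.xml in the WCG project directory can limit how many tasks of each application run at once (recent clients also factor this into work fetch). A sketch; the short app names below are assumptions, so verify them against your event log:

    <app_config>
        <app>
            <name>arp1</name>
            <max_concurrent>1</max_concurrent>
        </app>
        <app>
            <name>mcm1</name>
            <max_concurrent>2</max_concurrent>
        </app>
    </app_config>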
homer__simpsons
Cruncher | Joined: Nov 14, 2015 | Post Count: 2 | Status: Offline
We can clearly see the issues at https://www.worldcommunitygrid.org/stat/viewGlobal.do - World Community Grid is completing half as many workunits as before:

[screenshot of the global statistics graph]

I can imagine that if they are lacking server resources, it is harder to intervene. On my side I have set "no new work", to hopefully give them some time to fix this. World Community Grid could maybe pause their scheduler, or at least give it fewer workunits to distribute.

As I am mostly interested in crunching for medical topics, I contribute to the following projects:
- SiDock@Home: always has work units available
- Rosetta@Home: a lot of devices for few tasks, so it often has no workunits available
- GPUGRID: GPU tasks only; often has work available
- DENIS@Home: has 0 workunits for now, but I believe it will restart soon (before the end of the year)

[Edit 1 times, last edit by homer__simpsons at Nov 19, 2024 1:58:19 PM]
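The "no new work" setting mentioned above can also be toggled from the command line with the stock boinccmd tool, which is handy across several machines (the URL shown is WCG's usual master URL):

    boinccmd --project https://www.worldcommunitygrid.org/ nomorework
    # and later, to resume fetching work:
    boinccmd --project https://www.worldcommunitygrid.org/ allowmorework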
scleranthus
Cruncher | FRANCE | Joined: Feb 8, 2005 | Post Count: 13 | Status: Offline
Hi all,

In my previous post I said I had turned off any new tasks. Yet all my completed work units were lost to missed deadlines - a big waste of days of time and energy. ARP1 creates up to 6 files averaging 14 MB each: not even a single unit could be sent completely. That is so much more than the 1 kB results of MCM1. While I understand the pragmatic and positive messages, I can't forget that I would lose my job with such efficiency and no communication after so many days. There is no choice but to stop everything and wait.
Seoulpowergrid
Veteran Cruncher | Joined: Apr 12, 2013 | Post Count: 815 | Status: Offline
Unfortunately I am in the same spot. I spent most of today babying this machine, trying to upload the 80+ WUs that have finished while it continues to complete more, mainly MCM. Tomorrow I need to focus on my job. In the meantime I'm swapping to Milkyway@home and will check back on WCG in a week or so.
SHOPXXL-COM
Cruncher | Joined: Oct 24, 2006 | Post Count: 2 | Status: Offline
My MCM1 units can still somewhat get uploaded on time... they're slow, but it still works. My ARP units don't: they get stuck and will be turned in LATE. I hope I'll still get credit for them; they take a lot of power/time to finish per unit. If SHARCNET can't handle replacing the one failed disk, then it's time to move on. Honestly, if thousands of crunchers can't upload because of ONE DISK failure, then... sorry, but my home NAS is doing better than that!

By the way, I looked up the official SHARCNET website, and I was like... "GUYS, WTF?! ARE YOU SERIOUS?? Is that 'back to the early 2000s' or what?!" Very disappointing website from these Canadian SHARCNET guys!! (And I still wonder who reset my forum post count!)
savas
Cruncher | Joined: Sep 21, 2021 | Post Count: 30 | Status: Offline
Dear Volunteers,
We have completed the migration of the WCG to a fax machine. -- WCG Tech Team

Joking, but we do have an update to share about the issues since ARP1 restarted on November 4th, 2024:

ARP1 was soft-paused last week; only 361 workunits remain to be downloaded/claimed as of this writing. We will soft-resume when all outstanding work has been uploaded, and we will rate-limit those uploads while extending the deadlines of ARP1 workunits until things have settled. Approximately 29,000 ARP1 workunits must now be uploaded before we can slowly titrate ARP1 to the right level, after implementing traffic shaping and rate limits in the workunit creation, download, and upload phases.

Deadlines for ARP1 workunits were extended again on Saturday Nov 16th, 2024, by 5 days. We will continue to extend the deadlines of ARP1 workunits.

We fixed bugs in the HAProxy and Apache configuration files that were causing connection resets on large file transfers; we hope users have noticed that, despite the atrocious speeds, large file transfers are now more likely to succeed.

ARP1 should download a single file and upload a single file. We will work on this.

Traffic shaping and rate limits will be applied at the load balancer and backend webservers so that fewer connection slots are available per client IP address, specifically for ARP1 file transfers. The overall bandwidth of all ARP1 transfers connecting through to the download and upload backends will be limited to a reasonable fraction of total available bandwidth. This should allow the downloads that do connect to use as much bandwidth as is available and rarely be throttled, while reserving most of our bandwidth for the much smaller MCM1 workunits, which free their resources quickly.

We provisioned 3 new servers, with more to come, with the help of SHARCNET. They have much more CPU and memory, and even a large local disk.

--

We are pursuing tiered caching of files for download. If, during creation of the workunit entry in the BOINC database (at which point all files should be accounted for), we can bulk-transfer or write directly to the local disk of the download servers, or if we can implement a distributed in-memory cache across download servers (especially for smaller files), we believe we can greatly reduce the load on NFS, the SAN, and the storage server, thereby increasing available bandwidth in the cloud environment and reducing latency for those requests that must still access shared storage.

We migrated all provisioning scripts, code and configuration, as well as build and deploy scripts that previously supported only CentOS 7, to also support Ubuntu 22, as that was the only guest OS we could run on this new server group provisioned by SHARCNET.

We migrated the production load balancer to one of the new servers, upgrading HAProxy from v1.8 to v2.8. We deployed two new download servers into production the same way, migrating scripting, code and configuration to support Ubuntu 22; pending some final fussing with build scripts that work on CentOS 7 but not yet on Ubuntu 22, we will be able to deploy the binaries required for these two new download servers to also accept file uploads.

We tuned kernel parameters and application configuration files, especially for HAProxy and HTTPD/Apache2, on these new servers, with some napkin math. We did the same for the older CentOS 7 upload/download servers. Even with the massive increase in CPU and memory, and a better understanding of what does what in each configuration file, we are still only pushing ~85-100 Mbps at any given time through the production load balancer.
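For readers curious what the traffic shaping described above might look like: HAProxy 2.8 can limit concurrent connection slots per client IP with a stick table, and cap ARP1 transfer bandwidth with its bwlim filter. A minimal sketch, assuming stock HAProxy 2.7+/2.8 directives (the path prefix, names, and numbers are illustrative, not WCG's actual configuration):

    frontend wcg_transfers
        bind :80
        acl is_arp1 path_beg /arp1/

        # track concurrent connections per source IP, for ARP1 transfers only
        stick-table type ip size 100k expire 10m store conn_cur
        http-request track-sc0 src if is_arp1
        http-request deny deny_status 503 if is_arp1 { sc0_conn_cur gt 2 }

        # per-stream bandwidth cap on ARP1 responses; an aggregate cap across
        # all ARP1 streams would use the limit/key/table form instead
        filter bwlim-out arp1_cap default-limit 1m default-period 1s
        http-request set-bandwidth-limit arp1_cap if is_arp1

        use_backend arp1_servers if is_arp1
        default_backend mcm1_servers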
Therefore, it seems we are ultimately bandwidth-limited: somewhat low available bandwidth, combined with our inefficient use of that bandwidth at the load balancer, where the current configuration prevents efficient connection pooling and reuse. We will change this.

--

Increasing the maxconn and the queue timeout of the new download servers so that they throw fewer 503 errors still resulted in timeouts, and it also made every aspect of the network in our environment unusable due to latency, presumably from congestion. Users may have noticed in their BOINC clients that, at some points over the last two weeks, high queue-timeout and aggressive keep-alive settings would cause retries in the BOINC client to appear stuck, not backing off for minutes in some cases, only to be eventually rejected; only in rare cases did a late start finally push through the queue and connect to the backend.

We were experimenting to see if we could serve more requests by having them queue for longer before being hit with a 503 Service Unavailable. But in those cases, as in many other deployments, we found that throughput barely increases, if at all, while the congestion and latency render the entire cloud environment unusable if connections are allowed to stay open awaiting such a low chance of being serviced. The website, my terminal, CI/CD jobs, git repos: same or different subnet, everything was affected.

We have been aggressively testing in prod, and that is bad practice, but we only pushed and evaluated configurations that we believed held promise to improve the situation in some way. We apologize for the chaos and confusion; we will bring results from this experience.

In addition, we will be providing further updates, as frequently as we can manage, at https://www.cs.toronto.edu/~juris/jlab/wcg.html under "Operational Status".
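The maxconn/queue-timeout trade-off described above, in stock HAProxy terms: each backend server gets a maxconn ceiling, and requests beyond it wait in the backend queue for up to "timeout queue" before receiving a 503. A sketch with illustrative values:

    backend download_servers
        timeout queue 5s         # short queue: fail fast with 503s, low latency
        # timeout queue 60s      # long queue: fewer 503s, but connections pile
        #                        # up and congest everything, as observed
        server dl1 10.0.0.11:80 maxconn 150
        server dl2 10.0.0.12:80 maxconn 150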
scleranthus
Cruncher | FRANCE | Joined: Feb 8, 2005 | Post Count: 13 | Status: Offline
Thanks for this comprehensive status and this new link.