| World Community Grid Forums
Thread Status: Active Total posts in this thread: 30
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline
|
@Cyclops,
Based on how many WUs I currently show “in progress” and how many I TYPICALLY show “in progress,” I would speculate (since I am at work, not at home, and unable to visually verify) that I have about 40 - 60 WUs stalled with potential http errors. That comment was based on information I emailed to you earlier today.
Bruce
ETA: Downloads were in worse shape than I suspected. Virtually all indicated WUs were incomplete downloads with http errors. They have been expedited, but I have seen additional http errors on newer downloads. What I thought were stalled WUs were, in fact, WUs not yet scheduled to be sent. Will continue to monitor.
[Edit 1 times, last edit by bfmorse at Nov 5, 2022 7:58:39 AM] |
||
|
|
Paul Schlaffer
Senior Cruncher USA Joined: Jun 12, 2005 Post Count: 278 Status: Offline
|
After running smoothly for a week, the download issues are back. The system was running perfectly with the mix of MCM, OPN, and a steady flow of OPNG. Reintroduce ARP, and the issue returns. I'd call that a correlation.
Given the larger file size of ARP, if the download speed isn't fast enough, it may be resulting in too many connections, or there could be some other factor involved. Definitely worth looking into.
----------------------------------------
“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
|
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline
|
"After running smoothly for a week, the download issues are back. The system was running perfectly with the mix of MCM, OPN, and a steady flow of OPNG. Reintroduce ARP, and the issue returns. I'd call that a correlation."
Amen to this! I'd add that the flow of MCM needs to be steady enough that folks don't run out and need to re-download that 100+MB file.
"Given the larger file size of ARP, if the download speed isn't fast enough, it may be resulting in too many connections, or there could be some other factor involved. Definitely worth looking into."
By the way, adriverhoef speculated on this in a post in another News thread a couple of weeks ago - don't know whether the tech team saw that or not :-)
Cheers - Al. |
||
|
|
Pete Broad
Senior Cruncher Wales Joined: Jan 3, 2007 Post Count: 169 Status: Offline
|
I'm one of the people with device manager issues. New machines are getting work but are not shown in the device manager. Also, name changes that I made on some machines are not showing up.
Pete |
||
|
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
|
Yep, lots of failed downloads.
In the next BOINC programmers' gathering, I'd suggest adding a user-tunable option to the client to let it be more resilient when the download servers are overloaded and are giving lots of these http errors: do more retries, and not do "project backoff"s so readily.
Oh, and apparently the "How to run WCG 1.01" manual that IBM handed to Krembil had a section about how you should only try new adventurous things with the project, like vastly increasing the flow of outgoing files, when the weekend is imminent. That way, these new things will create such chaos that you won't so easily forget that they don't work too well. OTOH, please delete that section of the manual. |
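To make the trade-off behind "more retries, fewer backoffs" concrete, here is a minimal Python sketch of capped exponential backoff with random jitter, one common way for a client to retry without hammering a busy server. It is not taken from the BOINC client; the function name `backoff_delays` and the parameters `max_retries`, `base` and `cap` are hypothetical stand-ins for the kind of user-tunable settings suggested above.

```python
import random


def backoff_delays(max_retries=5, base=60.0, cap=3600.0, rng=None):
    """Yield one wait time in seconds per retry.

    Hypothetical illustration only, not BOINC's algorithm.
    The ceiling doubles on every attempt until it reaches `cap`;
    the actual wait is drawn uniformly below that ceiling ("full jitter")
    so that many clients do not all retry at the same moment.
    """
    rng = rng or random.Random()
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(max_retries=6, base=60, cap=600), 1):
        print(f"retry {i}: wait {delay:.0f} s")
```

Raising `max_retries` or lowering `cap` would make a client more persistent, at the cost of more requests landing on a server that is already struggling.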
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline
|
"In the next BOINC programmers' gathering, I'd suggest adding a user-tunable option to the client to let it be more resilient when the download servers are overloaded"
"when the download servers are overloaded"
So, how are you gonna tell that the server is overloaded(*)? And why is the server overloaded? Because too many clients are overloading the server? So let's be more resilient and overload the server even more?
(*) The HyperText Transfer Protocol (HTTP) 503 Service Unavailable server error response code indicates that the server is not ready to handle the request. Common causes are a server that is down for maintenance or that is overloaded.
In other words, there is no distinction between 'down for maintenance' and 'overloaded'. |
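One related detail: HTTP lets a server attach an optional `Retry-After` header to a 503 response, giving either a number of seconds or a date after which the client may try again. Whether the WCG download servers send it is not stated anywhere in this thread, so the Python sketch below is only an illustration of how a client could honour that hint. The function name `retry_after_seconds`, the plain-dict `headers` argument and the 300-second `default` are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(status, headers, now=None, default=300.0):
    """Return how long to wait before retrying, or None if not a 503.

    Hypothetical helper: `headers` is assumed to be a plain dict keyed
    by the exact header name "Retry-After".
    """
    if status != 503:
        return None
    value = headers.get("Retry-After")
    if value is None:
        return default  # the server gave no hint at all
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: fall back
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Even with this header, the status code alone still says nothing about why the server is unavailable; the header only tells the client when it is worth asking again.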
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1295 Status: Offline
|
Thank you for the ARP update, and thank you for the ARP WUs. If I have a choice between download errors or no ARPs, I'll take the ARPs, errors and all. I get them eventually.
It will be interesting to see how long this batch of ARPs takes to send out, and then there will be no more errors. The resends are usually a smaller group and more spread out, so they don't cause problems. |
||
|
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
|
So adriverhoef implies that the current client re-tries and project backoffs algorithms are already optimal.
I came to my desktop this morning and checked the "farm". The 2 machines with GPUs are processing no GPU work, but have OPNG work "Downloading". There are only about 20 files in Project Backoff, some of which would wait more than another 5 hours until their next try, blocking all work.
I do some manual "Retry now" clicks. About 1 in 3 files download rapidly on each try, which is a big improvement over the previous day. There must be plenty of little windows of time during which the servers can accept download requests, but also lots of little windows where the servers are busy. There seems to be plenty of bandwidth to transfer the downloads that actually start.
I re-try downloading the files that missed out on the previous try. Some devices go into Project Backoff on only the second try, which freezes all re-tries from the device. It seems to me to be a far from optimal use of the available server capacity. |
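As a rough check on the "about 1 in 3" observation, the Python sketch below computes the chance that a file is still not downloaded after a given number of tries. It assumes every try succeeds independently with the same probability, which a congested server may well violate, and the function name `still_failing` is hypothetical.

```python
def still_failing(p_success, tries):
    """Probability that every one of `tries` independent attempts fails.

    Assumption: each attempt succeeds with the same probability
    `p_success`, independently of the others.
    """
    return (1.0 - p_success) ** tries


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 8):
        print(f"{n} tries: {still_failing(1 / 3, n):.1%} of files still waiting")
```

Under that assumption, stopping after two tries leaves roughly 44% of files undelivered, which is consistent with the complaint that backing off on the second try wastes the windows in which the servers could accept a request.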
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline
|
"So adriverhoef implies that the current client re-tries and project backoffs algorithms are already optimal."
In a normal situation, yes, that's what I'm suggesting. This situation, where you seem to have to 'fight' for a successful connection with slow speed, is not normal.
So, "adding a user-tunable option" is surely constructive thinking and I like that, but I think your suggestion is - as seen in the light of my arguments why it wouldn't work - not doable, or rather not advisable.
Sorry to hear about your computer farm.
[Edit 1 times, last edit by adriverhoef at Nov 6, 2022 11:27:33 AM] |
||
|
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline
|
@Cyclops
Http errors and dismal transfer rates seem to be the norm since ARP was released to join the other active research WUs. Although I look forward to processing those WUs as well, I cringe at the trending performance of WCG’s web site when WUs for additional, current (but on hold) research are released.
Current file transfer throughput hovers around 33KBps to 41KBps for an 18.28 MB file. Download speed to my gateway was just tested and is over 800 Mbps. Is this low transfer rate normal and expected at my end?
[ETA: download data rate unit value corrected to read 800 Mbps, i.e., 800,000 Kbps. UPLOADS of ARP data files were around 1,000KBps]
I REALLY HOPE that the appropriate troubleshooting, resolution and implementation steps will be taken to eliminate these errors! Please advise.
[Edit 1 times, last edit by bfmorse at Nov 7, 2022 10:21:44 PM] |
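To put those figures side by side, here is a small Python sketch of the transfer-time arithmetic. It assumes "KBps" means kilobytes per second, "Mbps" means megabits per second, and decimal units (1 MB = 1000 KB); the helper names are hypothetical.

```python
def mbps_to_kbytes_per_s(mbps):
    """Convert megabits per second to kilobytes per second (decimal units)."""
    return mbps * 1000.0 / 8.0


def transfer_seconds(size_mb, rate_kbytes_per_s):
    """Time to move `size_mb` megabytes at `rate_kbytes_per_s` kilobytes/s."""
    return size_mb * 1000.0 / rate_kbytes_per_s


if __name__ == "__main__":
    size = 18.28  # MB, the file size quoted in the post
    for rate in (33.0, 41.0, mbps_to_kbytes_per_s(800)):
        print(f"{rate:>9.0f} KB/s: {transfer_seconds(size, rate):8.1f} s")
```

At 33 to 41 KB/s the 18.28 MB file needs roughly seven and a half to nine minutes, whereas a link that really delivered 800 Mbps (100,000 KB/s) would move it in well under a second. That gap points to the bottleneck lying somewhere other than the local connection.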
||
|
|
|