World Community Grid Forums
Thread Status: Active Total posts in this thread: 159
Falconet
Master Cruncher Portugal Joined: Mar 9, 2009 Post Count: 3295 Status: Offline |
Agreed. A very informative post.
----------------------------------------AMD Ryzen 5 1600AF 6C/12T 3.2 GHz - 85W AMD Ryzen 5 2500U 4C/8T 2.0 GHz - 28W AMD Ryzen 7 7730U 8C/16T 3.0 GHz |
||
|
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 274 Status: Offline |
Still seeing retries as of 6:30 AM CST, but I run fairly short queues, so the backlogs are short and clear quickly.
ETA: I don't always get retries, and a fair number go through without problems. [Edit 1 times, last edit by spRocket at Nov 7, 2024 12:35:02 PM] |
||
|
gj82854
Advanced Cruncher Joined: Sep 26, 2022 Post Count: 102 Status: Offline |
Not getting any work now due to the "Tasks are committed to other platforms" message. That will definitely fix the download problem.
|
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2152 Status: Offline |
Looking at the latest 22 results for one of my devices that has a maximum of 1 ARP1-task in its queue, one of the biggest problems was (quoting savas) "a failing drive on one of the download servers", oldest first:
<22> * ARP1_0014490_126_0 Fedora Linux Error 2024-11-04T07:23:47 2024-11-05T03:10:13 14.39/15.69 Adri |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1948 Status: Offline |
For both ARP1 and MCM1, it's not clear how many users collect more work than they have any chance of processing -- that's why I'd like to see the MCM1 default deadline cut back, in the hope that the unwitting might find out why they are having problems whilst those who deliberately maintain large caches might be encouraged to reduce the size a little :-)

This is an issue that I have mentioned several times in the past, but it has always been brushed aside and I have been made the scapegoat.

If you go through some of the threads about the ARP1 download issues in recent days, you will find several posts from people who clearly state that they have loaded up chock-full of ARP1 WUs, even though the FAQ clearly states that ARP1 requires MUCH more resources than any other project in every respect: download size, upload size, drive space and RAM. But it seems too many folks just ignore this and selfishly load up with huge numbers, willfully removing the system's default restrictions on active WUs per host, and thus only exacerbate the whole issue.

It was already established two years ago that the bottleneck here is the number of concurrent connections to the back end database servers. And yet another disk failing doesn't help either, but that's life.

As folks don't seem to be willing to restrict themselves in this situation to the default restrictions per host, the WCG team should look into introducing a way for a hard limit of concurrent WUs, that can't be modified by the "volunteer", at least until a better solution on the back end is found and implemented.

Ralf |
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
"As folks don't seem to be willing to restrict themselves in this situation to the default restrictions per host, the WCG team should look into introducing a way for a hard limit of concurrent WUs, that can't be modified by the "volunteer", at least until a better solution on the back end is found and implemented. Ralf"

I seem to remember one WU per thread was the hard limit for the Clean Energy - Phase 2 project. CEP2 had one large file per WU and the application unzipped the files before it started to run. There were bandwidth restrictions also. Lots of technical information can be found in the CEP2 forum.

[Edit 7 times, last edit by AgrFan at Nov 7, 2024 6:54:37 PM] |
||
|
ericinboston
Senior Cruncher Joined: Jan 12, 2010 Post Count: 258 Status: Offline |
Can I ask for a little clarity here, please?

1) What is ARP1 and why is it affecting downloads for Mapping Cancer Markers (I assume this is what MCM1 means)?

2) Although I am quite technical on many levels, I don't understand a lot of this WCG-specific technical post from savas and thus I don't have any expectations of when things will be back to normal for MCM WUs. Can someone please take it up a notch and maybe give a short, technical answer regarding this problem? For example, someone might say "Our systems run in the Cloud at SHARCNET. A few days ago the hard drive failed on a particular box which caused ______. We replaced the drive and WUs are being sent out as of 4:05PM ET Nov 6, 2024 in normal fashion. You may need to wait up to 48 hours to receive MCM WUs due to high demand." This level of detail/verbiage would be much appreciated for typical outages.

3) With all due respect, it's been 24 hours since savas posted about the problem and implied that the fix has been implemented (as far as I can tell). So why, 24 hours later, are my 10 machines not receiving WUs? I ask this again for clarity on expectations of when things will be back to normal for us volunteers. If a fix has not been implemented, can you please set our expectations of when it will be?

Thanks! |
||
|
Freewill
Cruncher United States Joined: Mar 28, 2006 Post Count: 39 Status: Offline |
ARP jamming up itself and MCM is the same issue, as far as I can tell, that we had the last time ARP was active, more than a year ago. Was this anticipated? Did the IT team try to do something before the restart?
This recent update on the efforts to address it is greatly appreciated. However, I cannot see that it has improved the situation, at least for my PCs. |
||
|
imakuni
Advanced Cruncher Joined: Jun 11, 2009 Post Count: 103 Status: Offline |
"1) What is ARP1 and why is it affecting downloads for Mapping Cancer Markers (I assume this is what MCM1 means)"

Look through the list of projects and truncate each word of the name to its first letter. Yes, MCM stands for "Mapping Cancer Markers (Phase 1)", and ARP stands for "Africa Rainfall Project (Phase 1)".

"2) Although I am quite technical on many levels, I don't understand a lot of this WCG-specific technical post from savas and thus I don't have any expectations of when things will be back to normal for MCM WUs."

MCM will get back on track when ARP stops being sent out. In theory they could make it work, but in practice they have proven time and again to be incapable of doing so.

"Can someone please take it up a notch and maybe give a short, technical answer regarding this problem? For example, someone might say "Our systems run in the Cloud at SHARCNET. A few days ago the hard drive failed on a particular box which caused ______. We replaced the drive and WUs are being sent out as of 4:05PM ET Nov 6, 2024 in normal fashion. You may need to wait up to 48 hours to receive MCM WUs due to high demand." This level of detail/verbiage would be much appreciated for typical outages"

Here's a breakdown of what I gather.
- A drive was bad; the server is now on new hardware.
- If a computer requests work and has ARP and MCM selected, the server "randomly" sends either one. Now the odds of sending ARP are lower.
- When a computer requests a connection, the server tries to establish it for longer before giving up.
- Each computer can't transfer as many files at once. Say I could be transferring up to 10; now I can only do 5 at a time, and need to wait for those to finish before I can start transferring the next one.
- The server must stock some units to send to people when they request work. In the future, they MIGHT have fewer of them available for delivery at a given time; think of it like a supermarket having 10 crates of milk for sale rather than 100.
- More hardware is coming.
- Better software to handle communications is coming.
- If you fail to transfer a file, you can retry sooner (say, wait 10 min rather than 1 h).
- The deadline to complete any given piece of work has been extended. You now have about a week rather than a couple of days.

"3) With all due respect, it's been 24 hours since savas posted the problem and implying the fix has been implemented (as far as I can tell). So why 24 hours later are my 10 machines not receiving WUs? I ask this again for clarity on expectations of when things will be back to normal for us volunteers. If a fix has not been implemented, can you please set our expectations of when it will be?"

Yeah, see, here's the thing: they definitely implemented changes, and we can clearly see that... but the effect is that people who don't micromanage their tasks still can't get any work at a reasonable pace (so no change there), whereas people who do are now stuck in the same boat as people who don't (which is to say, they can't get any work either).

TLDR, they're gaslighting everyone, including themselves.

Want to have an image of yourself like this on? Check this thread: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,29840 |
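For anyone who wants to see what "the odds of sending ARP are lower" means in practice, here is a minimal sketch of a weighted random pick between two applications. It is only an illustration: the weight values, the function names `pick_app` and `simulate`, and the request count are all made up for this example, and it is not a description of how the WCG feeder is actually implemented.

```python
import random
from collections import Counter


def pick_app(weights, rng):
    """Pick one application name with probability proportional to its weight."""
    apps = list(weights)
    return rng.choices(apps, weights=[weights[a] for a in apps], k=1)[0]


def simulate(weights, requests=10_000, seed=0):
    """Count how often each application is handed out over many work requests."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return Counter(pick_app(weights, rng) for _ in range(requests))


# Hypothetical weights: equal odds before, ARP1 de-prioritised afterwards.
before = simulate({"ARP1": 1.0, "MCM1": 1.0})
after = simulate({"ARP1": 1.0, "MCM1": 4.0})

print("ARP1 share before:", before["ARP1"] / sum(before.values()))
print("ARP1 share after: ", after["ARP1"] / sum(after.values()))
```

With weights of 1 and 4, ARP1 should come out at roughly one request in five instead of one in two. Lowering the relative weight further shrinks that share, but never to zero, which is why the weight alone cannot clear an existing download backlog.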
||
|
Link64
Advanced Cruncher Joined: Feb 19, 2021 Post Count: 129 Status: Offline |
"We have decreased the app weight of ARP1 relative to MCM1 in the feeder"

You need to decrease it even further; so far not the slightest improvement is noticeable on our end. Eventually, stop the feeder for a while until most downloads complete, then start it again with a lower weight for ARP. |
||
|