World Community Grid Forums
Category: Official Messages Forum: News Thread: Comprehensive Issue List & Report Thread (Feb. 24, 2023) |
Thread Status: Active Thread Type: Sticky Thread Total posts in this thread: 423
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 254 Status: Offline Project Badges: |
Larger files, like the above-mentioned dataset file, downloaded without an apparent hitch at up to 2MB/sec. But over and over again, those <1K files always get stuck at 107 bytes and may (or may not) download if you torture the [Retry Now] button. Also, a lot of people posted about 500-series HTTP errors, like 503 Service Unavailable. Those are not directly "bandwidth" related issues, but server hardware/OS related issues, where the server (cluster) just can't keep up with the requests.

Exactly. Even in the IBM days, we ran into problems with OPNG and its "lots of little files" work units. I'm not sure it's practical to zip these work units up first and unzip them on the client, though. It would certainly cut down on the number of requests, which seems to be what's snowing the server under. |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
Exactly. Even in the IBM days, we ran into problems with OPNG and its "lots of little files" work units. I'm not sure it's practical to zip these work units up first and unzip them on the client, though. It would certainly cut down on the number of requests - which seems to be what's snowing the server under.

To be clear, I doubt that this is a problem specific to OPNG WUs. I see it a lot with MCM1 WUs as well, so this is more of a generic issue. Not sure if zipping those tiny files would make any difference, depending on how they are related to the WUs. But the more I think about this, the more I suspect that rather than a networking/bandwidth issue, this is an issue with the server-side file system, which can't serve up those little files fast enough to the TCP stack. Possibly it is a caching issue, which could explain that weird stop at 107 bytes for those <1K files each and every time...

Ralf |
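
For illustration only, the zip-then-unzip idea discussed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how WCG or BOINC packages work units: the file names and sizes are hypothetical, and the "request count" is simply the number of separate downloads a client would need if each file were fetched individually.

```python
import io
import os
import zipfile


def bundle_small_files(files: dict[str, bytes]) -> bytes:
    """Pack many small payloads into one in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def unbundle(archive: bytes) -> dict[str, bytes]:
    """Unpack the archive on the client side back into name -> payload."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Hypothetical work unit: ten small input files of a few hundred bytes each.
small_files = {f"wu_part_{i:02d}.dat": os.urandom(300 + 40 * i) for i in range(10)}

archive = bundle_small_files(small_files)
requests_before = len(small_files)  # one download per small file
requests_after = 1                  # one download for the whole bundle
print(f"downloads per work unit: {requests_before} -> {requests_after}")
```

Whether this would help in practice depends on how the small files are shared between work units, which is exactly the open question raised in the post.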
||
|
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 254 Status: Offline Project Badges: |
Looks like we've used up this test batch of units, at least for x86. I'm still getting new OPN1 units for ARM, but the x86 machines that have downloaded all their work aren't getting any more.

Seems like our last major issue is serving the work out to clients. The work units themselves have run smoothly, and I haven't seen any problems with uploads.

Edit: Famous last words. Some OPNG units just came my way.
----------------------------------------
[Edit 1 times, last edit by spRocket at Aug 19, 2022 5:13:36 PM] |
||
|
rgarvey
Cruncher Joined: Nov 22, 2004 Post Count: 23 Status: Offline Project Badges: |
Is it possible for Krembil to focus on these items:
1. Reliable WU download bandwidth the first time they try to download!
2. Reliable WU overview credit
3. Reliable WU external reporting
None of this has occurred since WUs were first released and we are now months into this with no visible progress

I don't think that 3. is an issue, it seems to work for all I can see. But all the internal stats are not properly "visualized", which to some degree makes it hard to judge the accuracy of the external stats.
And I am not sure if "bandwidth" is the issue with the download problems. For example when downloading MCM1 WUs, there are a bunch of small files, between 296 and 1021 bytes in size. That is certainly smaller than the maximum payload of an Ethernet packet of 1500 bytes. Yet, all of those files always get stuck at exactly 107 bytes. And one large .txt file for the project, with a whopping 121MBytes, gets downloaded on the same host with a speed of 800KB/sec to 1MB/sec.
Ralf

Well, by external reporting, I note that your WCG signature shows that you're still inactive, while I expect you have been processing some WUs since it started, same as me. This seems to be inconsistent, as WCG's own Overview results show none of these WUs in its own Results returned totals. Oddly, BoincStats does show some daily totals, yet the world positions changed from 200s to 7300s after this mess....????

And yes, the WU download of pieces is horrible and does not resolve after timed retries, manual project updates or windows spanning days |
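
As a rough check of the packet-size point in the quoted post, the arithmetic can be written out. This is a back-of-the-envelope sketch that assumes a standard 1500-byte Ethernet MTU, IPv4 and TCP headers at their minimum 20 bytes each (no options), and it ignores the HTTP response headers that travel with each file; the function names are made up for the example.

```python
ETHERNET_MTU = 1500  # bytes of IP packet carried by one standard Ethernet frame
IPV4_HEADER = 20     # minimum IPv4 header, assuming no options
TCP_HEADER = 20      # minimum TCP header, assuming no options


def max_tcp_payload(mtu: int = ETHERNET_MTU,
                    ip_header: int = IPV4_HEADER,
                    tcp_header: int = TCP_HEADER) -> int:
    """Largest TCP payload that fits in one frame under the assumptions above."""
    return mtu - ip_header - tcp_header


def segments_needed(payload_bytes: int, mss: int) -> int:
    """Number of full-size segments needed for a payload (ceiling division)."""
    return -(-payload_bytes // mss)


mss = max_tcp_payload()
# The smallest and largest small-file sizes mentioned in the thread.
small = [segments_needed(n, mss) for n in (296, 1021)]
print(mss, small)
```

Under these assumptions each small file body fits in a single TCP segment, which supports the view that a stall at 107 bytes is unlikely to be a raw bandwidth limit.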
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
Well, by external reporting, I note that your WCG signature shows that your still inactive, while I expect you have been processing some WU's since it started, same as me. This seems to be inconsistent as WCG's own Overview results shows none of these WU's in its own Results returned totals.

Eggsacktly! This is something that I (and at least a couple of others) have reported since they sent out the first test WUs.

External BOINC stats are created and can successfully be pulled by the external stats sites, like BOINCstats (I guess FreeDC does as well, I just haven't checked that in a long time). However, the Contributions/Overview page still says that the last WU by me (and likely everyone else) was returned back in February. And the aggregated stats for the various projects don't show any change since then either.

The only thing that does work (as wicked as it does since the web site was changed last year, after the move announcement!) is that it shows under "Results" the received WUs, and the status of the returned WUs, until they are (as usual) purged after a few days.

What is not known at this point, and there hasn't been any official response to that, is whether all those stats internal to WCG, like WU count, accumulated run time and WCG credit, are being retained "for later consideration" or whether those stats are gone into the big blue yonder forever...

And AFAIK, SNURK is pulling the same data from WCG and thus, that info hasn't been updated since February either...

Ralf |
||
|
BrianFR
Cruncher Joined: Aug 15, 2014 Post Count: 4 Status: Offline Project Badges: |
Hello,
I don't know if this issue is already known, but I am having difficulties with the suspend/resume commands via boinccmd --project. I guess it is linked to the user preferences issue. SYS |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
Hello,
I don't know if the issue is known but I have difficulties with the suspend/resume command via boinccmd --project. I guess there is a link with the user preferences issue. SYS

That sounds more like a BOINC client problem than anything directly related to WCG...

Ralf |
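
For anyone trying to narrow this down, the commands in question take the form `boinccmd --project URL suspend` and `boinccmd --project URL resume`. The small Python wrapper below is only a sketch for building and running them: the helper names are hypothetical, and the project URL is an assumption that should be replaced with the exact master URL reported by `boinccmd --get_project_status` on the affected host.

```python
import subprocess

# Assumed project URL; use the exact master URL your own client reports.
WCG_URL = "https://www.worldcommunitygrid.org/"

# Subset of the operations that `boinccmd --project URL <op>` accepts.
ALLOWED_OPS = {"suspend", "resume", "update", "nomorework", "allowmorework"}


def project_command(project_url: str, op: str) -> list[str]:
    """Build the argv for a per-project boinccmd operation."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported project operation: {op!r}")
    return ["boinccmd", "--project", project_url, op]


def run_project_command(project_url: str, op: str) -> subprocess.CompletedProcess:
    """Run the command and capture its output for inspection (needs boinccmd on PATH)."""
    return subprocess.run(project_command(project_url, op),
                          capture_output=True, text=True, check=False)


print(" ".join(project_command(WCG_URL, "suspend")))
```

If the command fails the same way for other attached projects, that would point toward the BOINC client rather than WCG.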
||
|
rgarvey
Cruncher Joined: Nov 22, 2004 Post Count: 23 Status: Offline Project Badges: |
https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,44260

OK, so we have a data center networking issue, a staffing problem, and a poor resource planning problem, with a solution that only targets a 10-20% improvement for a system running on a 'lower capacity' for bring-up testing????

Don't get me wrong, I appreciate the effort toward 'increasing our capacity', but it certainly seems like their goal falls FAR short of what is required to bring this back online at FULL capacity, equivalent to what existed previously!!!

And, I look forward to an answer to all the results tracking/reporting issues....
----------------------------------------
[Edit 1 times, last edit by rgarvey at Aug 19, 2022 8:52:48 PM] |
||
|
Kirel2
Advanced Cruncher United States Joined: Sep 24, 2014 Post Count: 99 Status: Offline Project Badges: |
They may not have the capability or resources to bring the system completely back to the way IBM had it. We have no idea of the resources available, unfortunately.
---------------------------------------- |
||
|
dcs1955
Veteran Cruncher USA Joined: May 24, 2016 Post Count: 668 Status: Offline Project Badges: |
Does anyone know if we are getting credit for crunching these work units? A couple of weeks ago I got about 15 MCM jobs but I see no change to my stats. TIA
---------------------------------------- |