World Community Grid Forums
Category: Official Messages
Forum: News
Thread: 2022-10-27 Update (Workunits & storage update)
Thread Status: Active
Total posts in this thread: 46
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline |
Hi everyone, we’re happy to see that volunteers are receiving more OPN1 workunits than last week. We recently increased our DB2 storage pool and switched to a more coarse-grained scheduling method for creating and packaging new workunits for each project. This change may have temporarily disrupted WU scheduling, but we will need to monitor further and likely explore additional possible causes before we can consider the issue resolved.
Another (less optimistic) theory is that other tasks, specifically OPNG, were the cause of our recent storage issues and database-wide system errors. We have no solid evidence yet, only an observation that there is typically a decline in available OPNG work around the same time the download issues are less prevalent. A high load on the storage server and scheduler coincides with the database crashes and with a phenomenon whereby the download/upload server groups intermittently register as down from the perspective of our load balancer. We continue to monitor the system to determine the best course of action to stabilize our internal network.
Thank you for your support, patience and understanding.
WCG team at Krembil Research Institute
||
|
Blount
Senior Cruncher Joined: Aug 19, 2005 Post Count: 397 Status: Offline Project Badges: |
How many workunits have been released in the last 8 hours? I have 3 AMD 16-core CPUs, each with fancy graphics cards, that have received zero workunits in over 8 hours.
|
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 858 Status: Offline Project Badges: |
Thank you for the update Cyclops. I hope the team is able to find the issue soon and put in a fix.
|
||
|
TLD
Veteran Cruncher USA Joined: Jul 22, 2005 Post Count: 793 Status: Offline Project Badges: |
Thank you for the update Cyclops
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline Project Badges: |
"Another (less optimistic) theory is that other tasks, specifically OPNG, were the cause of our recent storage issues and database-wide system errors. We have no solid evidence yet, only an observation that there is typically a decline in available OPNG work around the same time the download issues are less prevalent. A high load on the storage server and scheduler coincides with the database crashes and with a phenomenon whereby the download/upload server groups intermittently register as down from the perspective of our load balancer."
Well, to be honest, this doesn't make that much sense to me. There have been times when you sent out a very large number of OPNG WUs, next to regular OPN1, without there being any download problems (it is hard to keep track of when exactly, as the system has been acting up so much over several months now).
With all the OPNx files being rather small, there shouldn't be THAT much strain on the storage servers, besides the number of connections to the database, which I had mentioned before (rather than the number of connections and "bandwidth" of the external/Internet connection).
But one difference between OPNx and the other projects on WCG is the way the filenames are constructed, and possibly the way the server side of the project needs to keep track of all those files. It seems the filenames are all randomized (like a UUID number) to create unique filenames, which would require more effort to keep track of them (associating multiple download files with each WU/result ID) than the more "organized" way filenames seem to be constructed on other projects.
Also, the download errors this morning, after the feeding of new WUs had been restarted, apparently happened without any OPNG being involved...
Ralf
||
|
ramnet
Cruncher Joined: Feb 25, 2013 Post Count: 2 Status: Offline Project Badges: |
Thanks for the update Cyclops.
I had noticed the download errors becoming less frequent in the last few days, so while things still aren't perfect at least progress is being made. Still far too many download errors though.
[Edit 1 times, last edit by ramnet at Oct 27, 2022 5:33:12 PM]
||
|
Paul Schlaffer
Senior Cruncher USA Joined: Jun 12, 2005 Post Count: 242 Status: Offline Project Badges: |
Thank you for the update. I'd suggest testing the theory by stress testing it. Limit the projects to OPN (which is mostly what we have now), and gradually add ever increasing availability of OPNG to monitor the system response, and to see if, or at what level the system "breaks".
----------------------------------------
“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
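The stepped test suggested above could be sketched roughly as follows. This is only an illustration: `find_breaking_level`, `fake_health_check`, the load levels, and the threshold of 3000 are all made up for the example, and nothing here reflects how the WCG servers are actually driven or monitored.

```python
from typing import Callable, Iterable, Optional


def find_breaking_level(levels: Iterable[int],
                        is_healthy: Callable[[int], bool]) -> Optional[int]:
    """Step through increasing load levels; return the first unhealthy one.

    Returns None if every level passes the health check.
    """
    for level in levels:
        if not is_healthy(level):
            return level
    return None


def fake_health_check(opng_per_hour: int) -> bool:
    # Hypothetical stand-in for a real probe (e.g. watching download errors).
    # The 3000 threshold is invented purely so the example has a failure point.
    return opng_per_hour < 3000


# Ramp the (hypothetical) OPNG release rate from 0 to 5000 in steps of 500.
levels = range(0, 5001, 500)
print(find_breaking_level(levels, fake_health_check))
```

In practice each step would need to be held long enough for the effect on downloads to show up before moving to the next level.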
||
|
SD Surfer
Advanced Cruncher Joined: Nov 22, 2005 Post Count: 56 Status: Offline Project Badges: |
Thank you for the update Cyclops. WCG is my top choice for distributed computing and I am still hopeful for a stable continuation. Please carry on with the updates; they are helpful.
----------------------------------------
1 x AMD Ryzen 3950x 16c/32t
Various Androids |
||
|
phillipspencer
Advanced Cruncher France Joined: Apr 9, 2015 Post Count: 71 Status: Offline Project Badges: |
Appreciate the update.
I had completely run out of work units, with none being downloaded today before lunch (French time). Finally received some OPN1 WUs early afternoon (I guess Canada had woken up by then!)
While you are referencing database and storage issues, I noticed for the first time that I got the transient HTTP error on upload when some completed just now. Example:
27/10/2022 23:21:28 | World Community Grid | Temporarily failed upload of OPN1_0120390_00986_0_r569887486_0: transient HTTP error
I guess that, between the network issues and these other challenges you updated us on, the WCG team must be really struggling to isolate specific causes of individual bugs, and testing fixes must be a nightmare given the complexity and concurrent issues.
Good luck!
Phillip
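For anyone who wants to tally these errors from their own event log, here is a minimal sketch. It assumes every line follows the `date time | project | message` layout of the single example quoted above; the `download` variant of the message and the second sample line are assumptions added for illustration, and `count_transient_errors` is a hypothetical helper, not part of any BOINC tool.

```python
import re
from collections import Counter

# Assumed line layout, taken from the one example in the post:
#   DD/MM/YYYY HH:MM:SS | project | message
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) \| "
    r"(?P<project>[^|]+?) \| (?P<message>.*)$"
)
# Message shape from the same example; other client versions may word it differently.
FAIL_RE = re.compile(
    r"^Temporarily failed (?P<direction>upload|download) of (?P<file>\S+): "
    r"transient HTTP error$"
)


def count_transient_errors(lines):
    """Count transient HTTP errors per (project, direction) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not fit the assumed layout
        f = FAIL_RE.match(m.group("message").strip())
        if f:
            counts[(m.group("project"), f.group("direction"))] += 1
    return counts


sample = [
    "27/10/2022 23:21:28 | World Community Grid | Temporarily failed upload of "
    "OPN1_0120390_00986_0_r569887486_0: transient HTTP error",
    # Invented non-error line, included only to show that it is ignored.
    "27/10/2022 23:21:30 | World Community Grid | Started upload of example_file",
]
print(count_transient_errors(sample))
```

Counting per hour or per day instead would only need the `date` and `time` groups added to the key.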
||
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 294 Status: Offline Project Badges: |
Cyclops, thanks for the update, feedback is always appreciated!
|
||
|
|