Total posts in this thread: 46
This topic has been viewed 20331 times and has 45 replies
Cyclops
Senior Cruncher
Joined: Jun 13, 2022
Post Count: 295
Status: Offline
2022-10-27 Update (Workunits & storage update)

Hi everyone, we’re happy to see that volunteers are receiving more OPN1 workunits than last week. We recently increased our DB2 storage pool and switched to a more coarse-grained scheduling method for creating and packaging new workunits for each project. This change may have temporarily disrupted WU scheduling, but we will need to monitor further and likely explore additional possible causes before we can consider the issue resolved.

Another (less optimistic) theory is that other tasks, specifically OPNG, were the cause of our recent storage issues and database-wide system errors. We have no solid evidence yet, only an observation that there is typically a decline in available OPNG work around the same time the download issues are less prevalent. A high load on the storage server and scheduler coincides with the database crashes and with a phenomenon whereby the download/upload server groups intermittently register as down from the perspective of our load balancer.

We continue to monitor the system to determine what the best course of action is to stabilize our internal network.

Thank you for your support, patience and understanding.

WCG team at Krembil Research Institute
[Oct 27, 2022 2:39:27 PM]
Blount
Senior Cruncher
Joined: Aug 19, 2005
Post Count: 397
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

How many workunits have been released in the last 8 hours? I have three AMD 16-core CPUs, each with fancy graphics cards, that have received zero workunits in over 8 hours.
[Oct 27, 2022 2:58:46 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 858
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

Thank you for the update, Cyclops. I hope the team is able to find the issue soon and put in a fix.
[Oct 27, 2022 3:38:19 PM]
TLD
Veteran Cruncher
USA
Joined: Jul 22, 2005
Post Count: 793
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

Thank you for the update, Cyclops.
[Oct 27, 2022 4:50:38 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1932
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

"Another (less optimistic) theory is that other tasks, specifically OPNG, were the cause of our recent storage issues and database-wide system errors. We have no solid evidence yet, only an observation that there is typically a decline in available OPNG work around the same time the download issues are less prevalent. A high load on the storage server and scheduler coincides with the database crashes and with a phenomenon whereby the download/upload server groups intermittently register as down from the perspective of our load balancer."
Well, to be honest, this doesn't make that much sense to me.
There have been times when you sent out a very large number of OPNG WUs, alongside regular OPN1, without there being any download problems (it is hard to keep track of exactly when, as the system has been acting up so much over several months now). With all the OPNx files being rather small, there shouldn't be THAT much strain on the storage servers, besides the number of connections to the database, which I had mentioned before (rather than the number of connections and "bandwidth" of the external/Internet connection).
But one difference between OPNx and the other projects on WCG is how the filenames are constructed, and possibly how the server side of the project needs to keep track of all those files. It seems the filenames are all randomized (like a UUID) to make them unique, which would require more effort to keep track of (associating multiple download files with each WU/result ID) than the more "organized" way filenames seem to be constructed on other projects.

Also, the download errors this morning, after the feeding of new WUs had been restarted, apparently happened without any OPNG being involved...

Ralf
[Oct 27, 2022 5:26:34 PM]
ramnet
Cruncher
Joined: Feb 25, 2013
Post Count: 2
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

Thanks for the update Cyclops.

I had noticed the download errors becoming less frequent in the last few days, so while things still aren't perfect, at least progress is being made. Still far too many download errors, though.
----------------------------------------
[Edit 1 times, last edit by ramnet at Oct 27, 2022 5:33:12 PM]
[Oct 27, 2022 5:30:33 PM]
Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 242
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

Thank you for the update. I'd suggest checking the theory with a stress test. Limit the projects to OPN (which is mostly what we have now), then gradually increase the availability of OPNG while monitoring the system response, to see whether, or at what level, the system "breaks".
----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
[Oct 27, 2022 6:07:17 PM]
SD Surfer
Advanced Cruncher
Joined: Nov 22, 2005
Post Count: 56
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

Thank you for the update, Cyclops. WCG is my top choice for distributed computing, and I am still hopeful for a stable continuation. Please carry on with the updates; they are helpful.
----------------------------------------
1 x AMD Ryzen 3950x 16c/32t
Various Androids
[Oct 27, 2022 7:01:08 PM]
phillipspencer
Advanced Cruncher
France
Joined: Apr 9, 2015
Post Count: 71
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

Appreciate the update.
I had completely run out of work units, with none being downloaded today before lunch (French time). I finally received some OPN1 WUs in the early afternoon (I guess Canada had woken up by then!)
Since you mention database and storage issues: I noticed for the first time that I got the transient HTTP error on upload when some tasks completed just now. Example:
27/10/2022 23:21:28 | World Community Grid | Temporarily failed upload of OPN1_0120390_00986_0_r569887486_0: transient HTTP error
I guess that, between the network issues and the other challenges you updated us on, the WCG team must be really struggling to isolate specific causes of individual bugs, and testing fixes must be a nightmare given the complexity and the concurrent issues.

Good luck!
Phillip
[Oct 27, 2022 9:43:09 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 294
Status: Offline
Re: 2022-10-27 Update (Workunits & storage update)

Cyclops, thanks for the update; feedback is always appreciated!
[Oct 27, 2022 11:17:59 PM]