TigerLily
Senior Cruncher
Joined: May 26, 2023
Post Count: 280
2024-03-11 Update (Website errors and MCM1 work unit availability)

We are working to resolve the frequent 503 Service Unavailable errors that users encounter when the load balancer is unable to connect to the production webserver. To do so, we are adding more instances of the webserver to balance the load across, as well as an additional load balancer. We expect this work to be completed this week.

Regarding the lack of available MCM1 work units, we noticed last week that the createWork process was not loading new MCM1 work units. We were able to begin building batches of work units again on Friday evening (March 8). However, throughput is still poor, so we will be pushing additional changes to prod. These will reduce the batch size (in terms of enumerated work units) of the SQL queries that various BOINC components send to the BOINC database, reduce the LIMIT value in some queries, and adjust the parameters of the database and the database clients on the relevant servers to allow larger MySQL packets for these remote queries. We will communicate whether the issue was resolved as we expect; in this case, the result will be clearly reflected in the logs on the BOINC database server. As soon as we have made these changes and confirmed the result, we will update everyone here.
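
As an illustration only (this is not the WCG team's code), the idea of reducing the number of enumerated work units per query can be sketched in Python. The `workunit` table with `id` and `name` columns, the `%s` placeholder style, and the helper names `chunked` and `build_batched_queries` are all assumptions made for this example:

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(ids: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of ids, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def build_batched_queries(
    ids: Sequence[int], batch_size: int
) -> List[Tuple[str, Tuple[int, ...]]]:
    """Build one parameterized query per batch instead of one huge query.

    Table and column names are hypothetical; the point is only that each
    statement enumerates at most batch_size work unit ids.
    """
    queries = []
    for batch in chunked(ids, batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = f"SELECT id, name FROM workunit WHERE id IN ({placeholders})"
        queries.append((sql, tuple(batch)))
    return queries


# Ten ids with a batch size of four give three smaller statements.
queries = build_batched_queries(list(range(1, 11)), 4)
for sql, params in queries:
    print(len(params), sql)
```

A smaller batch size keeps each statement short, which trades more round trips to the database for smaller individual packets.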

Thank you for your support, and for your patience as we work to resolve these issues.

WCG Team
[Mar 11, 2024 5:33:56 PM]
Barnsley_Tatts
Senior Cruncher
Joined: Nov 3, 2005
Post Count: 291
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

MCM WUs seem to be flowing quite well at the moment thankfully.

Long may it continue! :)
[Mar 11, 2024 5:46:06 PM]
as1981
Advanced Cruncher
Joined: Dec 3, 2006
Post Count: 51
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

@TigerLily Thank you for the update.
I'm also able to download work units at the moment.
[Mar 11, 2024 7:39:43 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1296
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

Thank you for the detailed update TigerLily. Thanks to the tech team for the fixes.
[Mar 11, 2024 8:10:45 PM]
Blount
Veteran Cruncher
Joined: Aug 19, 2005
Post Count: 590
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

TigerLily, I am seeing a small trickle of MCM WUs. Any idea when we might see a ramp-up of WUs?
[Mar 13, 2024 7:49:44 PM]
mike0164
Cruncher
Joined: Apr 28, 2007
Post Count: 9
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

Quote (Blount): TigerLily, I am seeing a small trickle of MCM WUs. Any idea when we might see a ramp-up of WUs?


TigerLily,

Any more news on the MCM WU front? It has been over a week since your last message here. As others have seen, I'm only getting a trickle of WUs per day at this point. Per the other thread, I know you said the tech team is working on the issue. Could you give us more details?

Thank you.

Mike
[Edit 1 times, last edit by mike0164 at Mar 21, 2024 2:44:36 AM]
[Mar 21, 2024 2:11:55 AM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2173
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

Well,...

OPNG WUs are reduced to a trickle again, though MCM1 seems to be sending out adequate numbers of WUs. It sometimes looks a bit delayed, but no host is really running empty...

However, looking at my stats, I have noticed that the number of valid MCM1 WUs on the Results page seems to be increasing again. Instead of the usual (well, in good times ;) ) 3 days of retention before they get purged, they now go back at least a week, with the oldest valid WU returned 3/23.

I think it would be prudent to check whether the problem that caused those WUs to accumulate for 4 months is rearing its ugly head again, foreshadowing more problems to come.

thanks,

Ralf
[Apr 3, 2024 4:50:01 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

As far as I can tell from my returned results analysis, the assimilator still seems to be performing as it should but it looks as if d/b purge hasn't been doing anything since about 12:20 UTC on 2024-04-03.

Ralf - regarding long waits for purging: there seem to be two 24-hour lags between validation and purging[*1] -- my apologies if you already knew that :-) If something I return has to wait for a retry because a wingman missed the deadline, it might well take 8 or 9 days (or more under certain uncommon circumstances) for a task to disappear! I only start worrying if an MCM1 task sits in the Valid state for more than two days after the last returned result time :-)

Cheers - Al

[*1] One of those 24-hour intervals is between flagging for purge and the actual purge; the other is between assimilation and file deletion.
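
A rough Python sketch of the timing described in this post, for anyone who wants to check their own results. It assumes that assimilation follows validation immediately and that a result is flagged for purge as soon as its files are deleted; neither assumption is confirmed by the project, and the names `FILE_DELETE_LAG`, `PURGE_LAG`, `earliest_purge_time` and `is_overdue` are invented for this example:

```python
from datetime import datetime, timedelta, timezone

# The two 24-hour lags from footnote [*1]; the values are observations,
# not documented server settings.
FILE_DELETE_LAG = timedelta(hours=24)  # assimilation -> file deletion
PURGE_LAG = timedelta(hours=24)        # flagged for purge -> actual purge


def earliest_purge_time(validated_at: datetime) -> datetime:
    """Earliest time a validated result could vanish from the Results page.

    Assumes assimilation happens right after validation and flagging for
    purge happens right after file deletion.
    """
    file_deleted_at = validated_at + FILE_DELETE_LAG
    return file_deleted_at + PURGE_LAG


def is_overdue(validated_at: datetime, now: datetime) -> bool:
    """True if the result is still listed after both lags have elapsed."""
    return now > earliest_purge_time(validated_at)


validated = datetime(2024, 4, 1, 12, 20, tzinfo=timezone.utc)
print(earliest_purge_time(validated).isoformat())
print(is_overdue(validated, datetime(2024, 4, 4, 12, 20, tzinfo=timezone.utc)))
```

Under these assumptions the two lags add up to the "two days" threshold mentioned above; a wingman retry pushes the validation time itself later, which is how a task can stay visible for 8 or 9 days.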
[Apr 3, 2024 6:35:22 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

Further to my post immediately above: database purge appears to have restarted 24 hours after it stopped, and it took about 2 hours to get back to the normal "gone 24 hours after flagging for purge" routine...

Was it a scheduled [but unannounced] outage, or is there usually a restart each day at that time and yesterday it just didn't [re]start???

Cheers - Al.
[Apr 4, 2024 5:41:55 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2173
Re: 2024-03-11 Update (Website errors and MCM1 work unit availability)

Quote (alanb1951):
As far as I can tell from my returned results analysis, the assimilator still seems to be performing as it should but it looks as if d/b purge hasn't been doing anything since about 12:20 UTC on 2024-04-03.

Ralf - regarding long waits for purging: there seem to be two 24-hour lags between validation and purging[*1] -- my apologies if you already knew that :-) If something I return has to wait for a retry because a wingman missed the deadline, it might well take 8 or 9 days (or more under certain uncommon circumstances) for a task to disappear! I only start worrying if an MCM1 task sits in the Valid state for more than two days after the last returned result time :-)

Cheers - Al

[*1] One of those 24-hour intervals is between flagging for purge and the actual purge; the other is between assimilation and file deletion.
(end of quote)

The issue is not only that valid WUs are sitting for a week or longer, but also that I have noticed a sharp increase (about 4x by now) in the number of those WUs since last week. And with all those folks complaining that they don't get any WUs, there shouldn't be much of a problem with WUs sitting for a week or more, unless we really are dealing with a large number of people hogging WUs in multi-day caches for no logical, only selfish, reasons. I have tried to explain in the past why this habit is just counter-productive overall.

Besides that, the dreaded 503 errors are back quite frequently, which, as more than a year of experience shows, are a harbinger of bad news if not dealt with in a timely fashion... :(


Ralf
[Apr 4, 2024 10:17:36 PM]