savas
Cruncher
Joined: Sep 21, 2021
Post Count: 34
Status: Offline
Server issues, Feb 27, 2025

We are aware of the issue with sporadic generation of workunits. The cause is our job scheduling infrastructure; we are attempting to resolve this today and will update volunteers tomorrow on the result.
Several jobs that we schedule on the backend write lockfiles that prevent the job from running again if the previous run is still ongoing. However, there are several points in the scripts that launch these jobs where a non-critical failure leaves the lockfile in place, which prevents future work from being indexed and created. We began experiencing these failures after multiple server "crashes" in the past two weeks. These stem from a lingering issue in which the DHCP agent becomes "stuck", eventually resulting in non-responsive VMs as their DHCP leases expire and are not renewed. In the past week or so this has happened to several of the six workunit management servers, where the backend jobs that generate workunits run, and to one of the upload/download servers. It has caused several problems that we are trying to resolve to bring the system back into a stable state.
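To illustrate the lockfile failure mode described above, here is a minimal sketch. It is not the project's actual launch script: `run_with_lock`, `pid_alive`, and the idea of storing the owner's PID in the lockfile are all assumptions made for the example, and the liveness check relies on POSIX `kill` semantics. The point is that the lock is released in a `finally` block, so a failing job cannot leave it behind, and a lock left by a crashed run is treated as stale.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def run_with_lock(lock_path, job):
    """Run job() unless a live previous run holds the lock.

    Returns True if the job ran, False if it was skipped.
    """
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    try:
        fd = os.open(lock_path, flags)  # fails if the lockfile exists
    except FileExistsError:
        try:
            with open(lock_path) as f:
                owner = int(f.read().strip())
        except (OSError, ValueError):
            owner = None  # unreadable lock: treated as stale in this sketch
        if owner is not None and pid_alive(owner):
            return False  # previous run is still ongoing
        os.remove(lock_path)  # stale lock left by a crashed run
        fd = os.open(lock_path, flags)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        job()
        return True
    finally:
        os.remove(lock_path)  # released even if job() raises
```

A real scheduler would also need to handle two launchers racing to clear the same stale lock, which this sketch does not attempt.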
The "modulo 4" issue identified with the MCM1 validators, for example, was caused by one of these crashes. As surmised in the forums, the number of validators is scaled up by re-launching them as services on the Mesos cluster with a specific modulus. We have investigated the validator logs and, despite the other issues we are facing with the Mesos cluster at the moment, we are seeing validations go through. The rate of validation should improve as we restore stability to the Mesos cluster and to the services that the jobs running on Mesos interact with.
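As a rough illustration of how a modulus can split work across validators, here is a small sketch. It assumes each validator instance handles the workunits whose numeric ID leaves a given remainder when divided by the number of instances; the post only says the validators are launched "with a specific modulus", so that mapping and the names `shard_for` and `unvalidated` are hypothetical.

```python
def shard_for(workunit_id, num_validators):
    """Remainder that decides which validator instance owns a workunit."""
    return workunit_id % num_validators


def unvalidated(workunit_ids, num_validators, running):
    """IDs whose owning validator instance is not currently running."""
    return [
        wu for wu in workunit_ids
        if shard_for(wu, num_validators) not in running
    ]


# With 4 instances and instance 2 down, every fourth workunit is stuck.
stuck = unvalidated(range(8), 4, running={0, 1, 3})
print(stuck)  # [2, 6]
```

Under that assumption, losing one instance would stall only the workunits in its remainder class while the rest keep validating, which is the kind of pattern a "modulo 4" symptom suggests.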
Unfortunately, these problems are caused by part of the system running on the oldest servers (which we started to use in 2021), which we will move off of once the new system becomes available.
[Feb 27, 2025 9:27:01 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 926
Status: Recently Active
Re: Server issues, Feb 27, 2025

Thank you for the informative update, Savas. I wish your team luck in fixing the issues, and I'm happy the hardware issues will be resolved with new hardware in the future!
[Feb 27, 2025 9:45:29 PM]
Maxxina
Advanced Cruncher
Joined: Jan 5, 2008
Post Count: 124
Status: Offline
Re: Server issues, Feb 27, 2025

Is this problem related to the Africa project too?
[Feb 27, 2025 10:35:35 PM]
PowerFactor
Ace Cruncher
Joined: Dec 9, 2016
Post Count: 4025
Status: Offline
Re: Server issues, Feb 27, 2025

Thanks for the update, Savas.
[Feb 27, 2025 11:46:52 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7643
Status: Offline
Re: Server issues, Feb 27, 2025

Savas:
Thank you for the update. I appreciate the explanation and the steps you are taking to resolve the multiple issues.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Feb 28, 2025 12:58:58 AM]
puurome
Cruncher
Joined: Jan 6, 2024
Post Count: 28
Status: Offline
Re: Server issues, Feb 27, 2025

Savas:
Thank you for the update. I appreciate the comprehensive explanation and the steps you are taking to resolve the multiple issues.

Best Regards
[Feb 28, 2025 6:18:20 AM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 792
Status: Offline
Re: Server issues, Feb 27, 2025

+1, thank you Savas for the detailed update. Best of luck. Hope the new system is ready sooner rather than later.
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Mar 1, 2025 10:00:06 AM]
spRocket
Senior Cruncher
Joined: Mar 25, 2020
Post Count: 274
Status: Offline
Re: Server issues, Feb 27, 2025

Thanks for the update. I hope we can get past these issues soon.

I hate to say it, but it looks like something once again fell over. Things were working well last night, but this morning (March 1) the work units have once again dried up.
[Mar 1, 2025 2:31:58 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2139
Status: Offline
Re: Server issues, Feb 27, 2025

Yeah, something fell over again. Last MCM task received 2025-03-01 05:34:08 UTC
[Mar 1, 2025 2:42:01 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2139
Status: Offline
Re: Server issues, Feb 27, 2025

MCM tasks are beginning to come in now.
[Mar 1, 2025 7:46:27 PM]