World Community Grid Forums
Thread Status: Active Total posts in this thread: 21
savas
Cruncher Joined: Sep 21, 2021 Post Count: 34 Status: Offline |
We are aware of the issue with sporadic generation of workunits. The cause is our job scheduling infrastructure; we are attempting to resolve this today and will update volunteers tomorrow on the result.
Several jobs that we schedule on the backend write lockfiles that prevent the job from running again if the previous run is still ongoing. However, there are several points in the scripts that launch these jobs where a non-critical failure leaves the lockfile behind, preventing future work from being indexed and created.
We began experiencing these failures after multiple server "crashes" in the past two weeks. These stem from a lingering issue in which the DHCP agent becomes "stuck", eventually resulting in non-responsive VMs as their DHCP leases expire and are not renewed. In the past week or so, this has happened to several of the six workunit management servers, where the backend jobs that generate workunits run, and to one of the upload/download servers. It has caused several problems that we are trying to resolve to bring the system back into a stable state.
The "modulo 4" issue identified with the MCM1 validators, for example, was caused by one of these crashes. As surmised in the forums, the number of validators is scaled up by re-launching them as services on the Mesos cluster with a specific modulus. We have investigated the validator logs and, despite the other issues we are currently facing with the Mesos cluster, we are seeing validations go through. The rate of validation should improve as we restore stability to the Mesos cluster and to the services that the jobs running on Mesos interact with.
Unfortunately, these problems are caused by part of the system running on the oldest servers (which we started to use in 2021), which we will move off once the new system becomes available. |
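For readers curious how a leftover lockfile can block every later run, here is a minimal Python sketch. It is not the project's actual scripts, which are not shown in this thread. The names `acquire_lock`, `run_job`, `pid_alive`, and `stalled_workunits` are hypothetical, the approach of storing the owner's PID in the lockfile is an assumption, and so is the idea that each validator instance handles the workunit IDs with one particular remainder under the modulus.

```python
import os
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_lock(lock_path: Path) -> bool:
    """Take the lock, reclaiming it if the recorded owner is gone.

    Sketch only: PID reuse and two processes reclaiming at the same
    moment are not handled here.
    """
    for _ in range(2):  # the second pass runs only after a stale lock is removed
        try:
            # O_EXCL makes creation atomic: only one caller can succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            try:
                owner = int(lock_path.read_text().strip())
            except (ValueError, FileNotFoundError):
                owner = None  # unreadable lock, treated as stale (assumption)
            if owner is not None and pid_alive(owner):
                return False  # a previous run really is still ongoing
            lock_path.unlink(missing_ok=True)  # stale lock from a crashed run
            continue
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))
        return True
    return False


def run_job(lock_path: Path, job) -> bool:
    """Run job() under the lock; return False if another run holds it."""
    if not acquire_lock(lock_path):
        return False
    try:
        job()
    finally:
        # Release even when job() raises, so one failure cannot block
        # every future run.
        lock_path.unlink(missing_ok=True)
    return True


def stalled_workunits(workunit_ids, modulus, live_remainders):
    """List the IDs whose remainder has no running validator (assumed scheme)."""
    return [w for w in workunit_ids if w % modulus not in live_remainders]
```

Under these assumptions, a script that exits between creating the lock and removing it, without the `finally` clause, would leave later runs refusing to start until someone deletes the file by hand. Likewise, with a modulus of 4 and the instance for remainder 2 down, `stalled_workunits(range(8), 4, {0, 1, 3})` lists the IDs that would sit unvalidated.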
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 926 Status: Recently Active |
Thank you for the informative update, Savas. I wish your team luck in fixing the issues, and I'm happy the hardware issues will be resolved with new hardware in the future!
|
||
|
Maxxina
Advanced Cruncher Joined: Jan 5, 2008 Post Count: 124 Status: Offline |
Is this problem related to the Africa project too?
|
||
|
PowerFactor
Ace Cruncher Joined: Dec 9, 2016 Post Count: 4025 Status: Offline |
Thanks for the update, Savas.
|
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7643 Status: Offline |
Savas:
Thank you for the update. I appreciate the explanation and the steps you are taking to resolve the multiple issues. Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
||
|
puurome
Cruncher Joined: Jan 6, 2024 Post Count: 28 Status: Offline |
Savas:
Thank you for the update. I appreciate the comprehensive explanation and the steps you are taking to resolve the multiple issues. Best Regards |
||
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 792 Status: Offline |
+1, thank you Savas for the detailed update. Best of luck. Hope the new system is ready sooner rather than later.
|
||
|
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 274 Status: Offline |
Thanks for the update. I hope we can get past these issues soon.
I hate to say it, but it looks like something once again fell over. Things were working well last night, but this morning (March 1) the work units have once again dried up. |
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2139 Status: Offline |
Yeah, something fell over again. Last MCM task received 2025-03-01 05:34:08 UTC
|
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2139 Status: Offline |
MCM tasks are beginning to come in now.
|
||
|