| World Community Grid Forums
Thread Status: Active Total posts in this thread: 387
nekomi_ch
Cruncher Joined: Apr 23, 2024 Post Count: 18 Status: Offline Project Badges:
|
Server can't open database
Yeah something definitely has gone wrong |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
From around 10:00 to 12:05 UTC it was "can't open database" (uploads still working though). As seems to be common recently, it then took a couple of hours to move on, and at 12:20 UTC it was "feeder not running" -- I wonder if that's a built-in "I give up!" time-out...
----------------------------------------
I do hope there's no correlation between users killing off masses of MAM1 Beta tests (as per multiple threads in the Beta Tests forum) and the database problem :-)

[Edited to add...] For ARP1 progress watchers -- the midday stats run hasn't happened (no surprise there) and as at 14:20 UTC the three text files are empty; whether those will actually get filled when the database comes back remains to be seen.

Cheers - Al.
[Edit 2 times, last edit by alanb1951 at Apr 26, 2025 2:32:06 PM] |
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1295 Status: Offline Project Badges:
|
Thanks to Grumpy for calling it early and for everyone else who reported what they are seeing. We did indeed go "Boom"
----------------------------------------
"Feeder not running" error

Reminder: This is a good time for machine maintenance. Do you need a system update? Time to get the dust out of the box? Does it just need a good "turning it off and on again"? If all is in good order, then it is time to look at backup projects.
[Edit 1 times, last edit by Unixchick at Apr 26, 2025 3:40:58 PM] |
||
|
|
Hans Sveen
Veteran Cruncher Norge Joined: Feb 18, 2008 Post Count: 984 Status: Offline Project Badges:
|
Web sites seem to work again!
See the latest update from Jurisica Lab 🙂 Updates of done WUs still struggle! Hans S. |
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1295 Status: Offline Project Badges:
|
Thank you, Dylan! Nice to see the WUs flowing again. It will take a while for the system to catch up.
|
||
|
|
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 786 Status: Offline Project Badges:
|
April 26, 2025
----------------------------------------
12:40 ET - WCG database crashed. Dylan is rebooting the VM and hopefully we can recover without Sharcnet sys admins.

Server is up. Assuming DB crash recovery goes well, we should be back online in 1-2 hours.

BOINC database crash recovery was successful; service has been restored: DB recovered and restarted, all BOINC daemons and feeder have been restarted; reset the state of the scheduler coordinator and confirmed job submissions are working, verified download of new MCM1 and ARP1 workunits.

MAM1 Beta Batches w/ Unacceptable Error Rate and Runtime: A small 4-workunit batch (MAM1_9800035) to confirm the fix and a larger 100-workunit batch (MAM1_9800036) were released last night and this morning, respectively. So far no errors have been returned from either batch.

Reported Missing Beta Results; Lack of Results on Community Stats Page: Between 2025-04-24 and 2025-04-25, several beta batches were released for Linux/Windows MAM1 application version 7.04, numbered between 9800000-9800026. The intent was to vary parameters to optimize the quality of signatures returned based on previous runs of MCM1 and our local testing. Unfortunately, these batches revealed multiple issues and in some cases exploded the runtime, as many volunteers have reported in the forums. We updated the workunit records in the BOINC database on 2025-04-25 for all workunits in this range of batches to hopefully prevent resends and Server Abort any BETA workunits that were awaiting a free slot to begin execution on BOINC clients. For those who continue to run these long-running workunits, we will monitor those that remain outstanding and continue to extend deadlines.

New MAM1 Beta Batches 9800027-9800031 Released based on Low/No Error Rate Batches (9800001, 980008): We plotted outcomes for the 100-1000 workunit batches between 9800000-9800026, and released a series of batches 2025-04-25 from 9800027 onward that are based on those with no or low error rate. For these we have varied the length of the signature, and number of iterations, but left the model parameters known to be stable untouched. We will continue tuning the MAM1 model settings for the new dataset based on these preliminary distributions of signatures being returned by beta testers, and strictly confirm resource requirements and runtime of any changes to model parameters in the workunit settings file in all future MAM1 beta batches.

Fixing OOM after suspending or power-cycling the machine running the BOINC client: We will fix these issues in version 7.05 of the MAM1 application next week.

Paul.
|
||
|
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline Project Badges:
|
Thanks for the update!
|
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
Thanks for the update.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Boca Raton Community HS
Senior Cruncher Joined: Aug 27, 2021 Post Count: 209 Status: Offline Project Badges:
|
Great update! Looks like we have some of the batches from 9800027 onward on our systems, so we will see how they process.
|
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
Thank you for the update, and thanks to Dylan and/or savas and/or others for getting systems back up as well as fine-tuning the MAM1 beta! Hope you guys have a wonderful weekend without more work. Much appreciated.
----------------------------------------
|
||
|
|
|