World Community Grid Forums

Thread Status: Active | Total posts in this thread: 613
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1306 | Status: Offline
Happy Monday, everyone!

Just popping in to wish the tech team a good week of smashing bugs and inspired code writing. Sending thanks and good wishes to Dylan and the team. It looks like a dribble of resends has been going out, but I can't personally confirm that. I'm running my backup project in a limited way, so I'll be ready to jump back into WCG WUs when they're ready to send me resends or, hopefully soon, some fresh ones. I'm getting close to my MCM 2-year badge, so I'd like to have enough in PV jail to give me the new badge.
dylanht
World Community Grid Tech | Joined: Jul 1, 2021 | Post Count: 35 | Status: Offline
Thanks Unixchick, appreciate the kind words.
It has been resends only over the weekend. The mcm1_create_work daemons lost their database connection during BOINC database maintenance, and I realized they needed some code changes so they don't skip batches when the database connection won't allow BOINC to receive the new workunits defined by the batch plan. A few other BOINC daemons, like the batch assimilator (all the daemons live in the same container I've stuffed all our legacy code into), also needed some work to set up the validation fix. There will be a more comprehensive update posted shortly on the lab website's "Operational Status" tab (https://www.cs.toronto.edu/~juris/jlab/wcg.html), but the TL;DR is that today I plan to restart MCM1 batch production after I push a new build of the BOINC daemons and transitioner, and set up a Kafka broker on the BOINC database node to backfill the assimilators with resends and scheduler-reported tasks that didn't have the details needed to calculate credit when the assimilator first received the upload pair from the validator. If all of that goes well, then I am pretty sure I can finally piece together the full validation backlog from over the break and set the assimilators upon it.
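A minimal sketch of what that batch-skipping fix can look like on the consumer side: commit the Kafka offset only after the database write succeeds, and rewind to the same message when the database is unreachable. This is illustrative Python under assumptions (the confluent_kafka and MySQLdb client libraries, the topic and group names, and a single-row insert standing in for batch creation), not the actual WCG daemon code.

import time
import MySQLdb
from confluent_kafka import Consumer, KafkaException, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "mcm1_create_work",   # assumed name
    "enable.auto.commit": False,      # never auto-advance past an unprocessed batch plan
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mcm1_batch_plans"])  # assumed topic

while True:
    msg = consumer.poll(timeout=10.0)
    if msg is None:
        continue
    if msg.error():
        raise KafkaException(msg.error())
    try:
        # One insert stands in for creating every workunit in the batch plan.
        db = MySQLdb.connect(db="boinc")
        cur = db.cursor()
        cur.execute("INSERT INTO workunit (name) VALUES (%s)",
                    (msg.value().decode(),))
        db.commit()
        db.close()
    except MySQLdb.OperationalError:
        # Database is down: rewind to this batch plan and retry later,
        # instead of silently moving past it (the original bug).
        consumer.seek(TopicPartition(msg.topic(), msg.partition(), msg.offset()))
        time.sleep(30)
        continue
    consumer.commit(msg)  # advance the offset only after the database accepted the batch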
Hans Sveen
Veteran Cruncher | Norge | Joined: Feb 18, 2008 | Post Count: 1006 | Status: Recently Active
Thank you, Dylan, for the short update 👍🤓
Garrulus glandarius
Advanced Cruncher | Romania | Joined: Apr 10, 2025 | Post Count: 89 | Status: Offline
I think we've been on a "retries only" diet all weekend. I got some resends today as well, including some from the "old" ..9998 test batches.
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1327 | Status: Offline
Dylan,

Thanks for the update (and the hard work!)... I was about to post a list of WUs where one result was flagged Valid and the other is still Pending Validation (as they are still in that state!), but I hope that might get sorted out when the backfill happens -- if the list would still be useful, let me know :-)

Cheers - Al.
bfmorse
Senior Cruncher | US | Joined: Jul 26, 2009 | Post Count: 448 | Status: Offline
re: MCM1_0241872_3918, WU ID:
https://www.worldcommunitygrid.org/contribution/workunit/771396885

Adri, thanks for reminding me. No reset was done to the system that I recall. My Event Log rolling window starts at 11/5/2025 11:47:56 AM, which by my best guess is after that file would have been sent to my system. I've gone through the Event Log about 3-4 times but did not see any "_3918" WUs other than the ones about "expired" that I posted in my original message.

At the time, I had the "unlimited" setting and ~2.4 days of extra work on my profile, so that computer was overwhelmed with work.

Let me know if there is anything else I should share.
Hans Sveen
Veteran Cruncher | Norge | Joined: Feb 18, 2008 | Post Count: 1006 | Status: Recently Active
New update available, as Dylan promised earlier in this thread -- a lot to read!

Thank you again for the work and the update, Dylan 😊

Hans S.

P.S. Forgot the link/text; too late and a bit sleepy 😪 Thank you, Unixchick, for fixing the text!

[Edit 4 times, last edit by Hans Sveen at Nov 10, 2025 9:56:34 PM]
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1306 | Status: Offline
November 11, 2025

Database maintenance over Friday/Saturday completed without issue. We resolved an issue with the backup scripts, increased the memory used to service database queries, and added some new indices. We expect better performance from the BOINC database going forward. However, the disk remains slower than the initial benchmarking from when we stood up the database. We will monitor, and reach out to hosting to see whether the Ceph placement group expansion got stuck in a "peering" state; that expansion is what caused the stuck blocks on the disk holding the placement group the result table lives on. We were informed that we should expect temporary, possibly intermittent, slow IO during this Ceph maintenance window. If we can get faster disks for the BOINC database -- which would require restoring the database to a new volume, as we did for the migration -- we will consider a maintenance window. Right now, we are optimistic that the issues revealed in the new system by hanging database queries and database crashes can all be resolved with patches to the new BOINC daemons, and that current performance will be sufficient.

As mentioned, this event identified several issues with the new BOINC daemons. MCM1 workunit creation proceeds in the Kafka topic even when the database is down: the mcm1_create_work daemon for each Kafka partition on science01...science06 tries to commit its part of the batch, finds the database unavailable, and so writes nothing -- but it still commits its offset/pointer into the batch plan topic and moves on to consume the next batch plan. That means that every 10-15 minutes while the database is down, a batch is effectively skipped. We were able to fix that, and restarted MCM1 batch creation at roughly 5:00 p.m. EST, November 10th, 2025.

We believe we have finally architected a fix for the pending validation backlog issue. It requires some non-trivial plumbing in the MCM1 batch assimilator, a Kafka connector deployed on the BOINC database node, and transitioner code changes. Workunit supply may remain artificially low while we roll out the new batch assimilator builds and monitor the transitioner -> Kafka event consumption and its interaction with the result table.

We were able to resolve the issue with computing preferences not being propagated from the website to the BOINC client and vice versa. Generally, when the BOINC database goes down, so does the event listener that handles these messages on the webserver.

We are still working on resolving the validation backlog from over the break. With the result table bricked during the Ceph maintenance, we architected a "trust the filesystem" solution, and we are hopeful that this issue will be resolved this week.

MAM1 was initially planned to resume in beta30 last week, to see whether 7.07 fairly schedules work and respects --nthreads, which is a blocking issue for promoting the beta application to production. Depending on the error rate and behaviour on BOINC clients, we would then consider the stable code paths for the first production batches. Given our increased control over batch parameters with the new Kafka topic, which uses a protobuf schema to fill out the workunit and result table entries, we intend to run production work on Linux as soon as the beta30 application is stable with an error rate lower than MCM1's, excepting the GLIBC dependency, which is typically the only repeated error we see from clients on the current LibTorch code path.
We will then rely on iterating the beta30 application to 7.08 and 7.09 to get GPU and Windows support, and Parquet IO for input and uploaded results.

[Edit 5 times, last edit by Unixchick at Nov 10, 2025 9:34:37 PM]
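To make the backfill step concrete, here is a small sketch of the kind of replay producer the update describes: read valid-but-unassimilated results out of the BOINC database and publish their credit details for the assimilators to consume. Everything here is an assumption for illustration (confluent_kafka and MySQLdb, the topic name, a JSON payload where the update says the real topic uses a protobuf schema, and stock-BOINC column names); it is not the project's actual connector.

import json
import MySQLdb
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

db = MySQLdb.connect(db="boinc")
cur = db.cursor()
# Valid results that were never assimilated, i.e. the upload pair arrived
# before the details needed to calculate credit were available.
cur.execute(
    "SELECT id, workunitid, cpu_time, claimed_credit FROM result "
    "WHERE validate_state = 1 AND assimilate_state = 0"
)
for rid, wuid, cpu_time, claimed in cur.fetchall():
    event = {"result_id": rid, "workunit_id": wuid,
             "cpu_time": cpu_time, "claimed_credit": claimed}
    # Key by workunit id so both results of a pair land in the same partition.
    producer.produce("mcm1_backfill", key=str(wuid), value=json.dumps(event))
producer.flush()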
Boca Raton Community HS
Senior Cruncher | Joined: Aug 27, 2021 | Post Count: 209 | Status: Offline
Definitely good progress! Excited about the GPU port, but that is going to open so many new cans of worms/new issues. I am hoping everything will completely stabilize for MCM1 first, then the CPU version of MAM (and the ARP resume), before the GPU version is even in beta testing. Definitely exciting though!
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2347 | Status: Offline
Hi bfmorse, if you ask me, you never really crunched that task. In one of your earlier messages, you quoted this line from your Event Log:

11/8/2025 1:13:44 PM | World Community Grid | Didn't resend lost task MCM1_0241872_3918_1 (expired)

The server tried to send it to you, but never got around to doing so in a proper way. If the server had succeeded in resending(!) it, it would have said "Resent lost task". That is why the server marked your task "No Reply". For comparison, on the 31st of October my Event Log looked like this:

31-Oct-2025 16:10:43 [World Community Grid] Scheduler request completed: got 15 new tasks

You can see the difference here: my system received 15 new tasks, whereas in your situation it said "Scheduler request completed: got 0 new tasks" (on 11/8/2025 1:13:44 PM; you wrote that in post 707341).
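For anyone who wants to audit their own log for these two outcomes, a tiny sketch (assuming a saved copy of the client Event Log, e.g. stdoutdae.txt, and the exact message strings quoted above):

import re
import sys

# The two outcomes for lost tasks, as quoted in this thread.
pattern = re.compile(r"(Resent lost task|Didn't resend lost task) (\S+)")

path = sys.argv[1] if len(sys.argv) > 1 else "stdoutdae.txt"
with open(path, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = pattern.search(line)
        if m:
            print(f"{m.group(2)}: {m.group(1)}")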