Thread Status: Active
Total posts in this thread: 613
Posts: 613   Pages: 62   [ Previous Page | 35 36 37 38 39 40 41 42 43 44 | Next Page ]
This topic has been viewed 49210 times and has 612 replies
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1306
Status: Offline
Re: Project Status (First Post Updated)

Happy Monday, everyone!

Just popping in to wish the tech team a good week of smashing bugs and inspired code writing. Sending thanks and good wishes to Dylan and the team.

It looks like a dribble of resends has been going out, but I can't personally confirm that.

I'm running my backup project in a limited way, so I'll be ready to jump back into WCG WUs when they're ready to send me resends or, hopefully soon, some fresh ones. I'm getting close to my MCM 2-year badge, so I'd like to have enough in PV jail to earn the new badge.
[Nov 10, 2025 4:29:03 PM]
dylanht
World Community Grid Tech
Joined: Jul 1, 2021
Post Count: 35
Status: Offline
Re: Project Status (First Post Updated)

Thanks Unixchick, appreciate the kind words.

It has been resends only over the weekend. The mcm1_create_work daemons lost their database connection during BOINC database maintenance, and I realized they needed some code changes so they don't skip batches when the database connection won't allow BOINC to receive the new workunits defined by the batch plan. A few other BOINC daemons, like the batch assimilator (all the daemons live in the same container I've stuffed all our legacy code into), also needed some work to set up the validation fix. There will be a more comprehensive update posted shortly on the lab website's "Operational Status" tab (https://www.cs.toronto.edu/~juris/jlab/wcg.html), but the TLDR is that today I plan to restart MCM1 batch production after I push a new build of the BOINC daemons and transitioner, and set up a Kafka broker on the BOINC database node to backfill the assimilators with resends and scheduler-reported tasks that didn't have the details needed to calculate credit when the assimilator first received the upload pair from the validator. If all that goes well, I'm pretty sure I can finally piece together the full validation backlog from over the break and set the assimilators upon it.
[Nov 10, 2025 5:23:52 PM]
Hans Sveen
Veteran Cruncher
Norge
Joined: Feb 18, 2008
Post Count: 1006
Status: Recently Active
Re: Project Status (First Post Updated)

Thank you, Dylan, for the short update 👍🤓
[Nov 10, 2025 6:23:08 PM]
Garrulus glandarius
Advanced Cruncher
Romania
Joined: Apr 10, 2025
Post Count: 89
Status: Offline
Re: Project Status (First Post Updated)

I think we've been on a "retries only" diet all weekend


I got some resends today as well, including some from the "old" ..9998 test batches
----------------------------------------

[Nov 10, 2025 7:06:03 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1327
Status: Offline
Re: Project Status (First Post Updated)

Dylan,

Thanks for the update (and the hard work!)...

I was about to post a list of WUs where one result was flagged valid and the other still Pending Validation (as they are still in that state!), but I hope that might get sorted when the backfill happens -- if the list would still be useful, let me know :-)

Cheers - Al.
[Nov 10, 2025 7:32:40 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 448
Status: Offline
Re: Project Status (First Post Updated)

re: MCM1_0241872_3918 WU ID:
https://www.worldcommunitygrid.org/contribution/workunit/771396885

Adri, thanks for reminding me.

No reset was done to the system, that I recall.

My Event Log rolling window starts at 11/5/2025 11:47:56 AM, which, by my best guess, is after that file would have been sent to my system.

I've gone through the Event Log about 3-4 times but did not see any "_3918" WUs other than the ones about "expired" that I posted in my original message. At the time, I had the "unlimited" setting and ~2.4 days of extra work on my profile, so that computer was overwhelmed with work.

Let me know if there is anything else I should share.
[Nov 10, 2025 7:44:28 PM]
Hans Sveen
Veteran Cruncher
Norge
Joined: Feb 18, 2008
Post Count: 1006
Status: Recently Active
Re: Project Status (First Post Updated)

A new update is available, as Dylan promised earlier in this thread; a lot to read!

Thank you again for the work and the update, Dylan 😊

Hans S.


Ps..
Forgot the link/text; too late and a bit sleepy 😪

Thank you Unixchick for fixing the text!!
----------------------------------------
[Edit 4 times, last edit by Hans Sveen at Nov 10, 2025 9:56:34 PM]
[Nov 10, 2025 9:19:15 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1306
Status: Offline
Re: Project Status (First Post Updated)

November 11, 2025

Database maintenance over Friday/Saturday completed without issue. We have resolved an issue with the backup scripts, increased the memory used to service database queries, and added some new indices. We expect better performance from the BOINC database going forward.

However, the disk remains slower than the initial benchmarking from when we stood up the database. We will monitor, and reach out to hosting to see whether the Ceph placement group expansion (the expansion that caused the stuck blocks on that particular disk, which holds the placement group the result table lives on) got stuck in a "peering" state. We were informed that we should expect temporary, possibly intermittent slow IO during this Ceph maintenance window. If we can get faster disks for the BOINC database (which would require restoring the database to a new volume, as we did for the migration), we will consider a maintenance window. Right now, we are optimistic that the issues revealed in the new system by hanging database queries and database crashes can all be resolved with patches to the new BOINC daemons, and that current performance will be sufficient.

As mentioned, this event identified several issues with the new BOINC daemons.

MCM1 workunit creation proceeds in the Kafka topic even when the database is down: the mcm1_create_work daemon for its Kafka partition on science01...science06 tries to commit its part of the batch, and since the database isn't there, it doesn't write anything; but it still commits its offset/pointer into the batch-plan topic and moves on to consume the next batch plan. That means that every 10-15 minutes while the database is down, a batch is effectively skipped. We were able to fix that, and restarted MCM1 batch creation at roughly 5:00 p.m. EST on November 10th, 2025.
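The skip-on-commit behaviour described above can be sketched in plain Python. This is not WCG's actual code and uses no real Kafka client; the daemon loop, the `batch_plans` list, and the `db_up` flags are hypothetical stand-ins for the topic and the database:

```python
# Sketch of the offset-commit bug and its fix, using a plain list in
# place of the batch-plan topic and booleans in place of a live DB.

def consume(batch_plans, db_up_flags, commit_offset_only_on_success):
    """Walk the batch-plan 'topic'; return (offset, committed batches)."""
    offset = 0
    committed = []
    while offset < len(batch_plans):
        plan = batch_plans[offset]
        if db_up_flags[offset]:          # DB reachable for this attempt?
            committed.append(plan)       # batch lands in the database
            offset += 1                  # now safe to advance the offset
        elif commit_offset_only_on_success:
            break                        # fixed: hold the offset, retry later
        else:
            offset += 1                  # buggy: advance anyway -> batch skipped
    return offset, committed

plans = ["batch_1", "batch_2", "batch_3"]
db_up = [False, True, True]              # DB down during the first attempt

# Buggy daemon: batch_1 is silently skipped while the DB is down.
_, bad = consume(plans, db_up, commit_offset_only_on_success=False)

# Fixed daemon: the offset stays put, so nothing is lost once the DB returns.
off, good = consume(plans, db_up, commit_offset_only_on_success=True)
```

The design point is simply that the consumer's offset must only move forward after the database write succeeds; committing the offset unconditionally is what turned a database outage into skipped batches.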

We believe we have finally architected a fix for the pending validation backlog issue. This requires some non-trivial plumbing in the MCM1 batch assimilator, a Kafka connector deployed on the BOINC database node, and transitioner code changes.
Workunit supply may remain artificially low while we roll out the new batch assimilator builds and monitor the transitioner -> Kafka event consumption and result table interaction.

We were able to resolve the issue with computing preferences not being updated from the website to the BOINC client and vice versa. Generally, when the BOINC database goes down, so does the event listener that handles these messages on the webserver.
We are still working on resolving the validation backlog from over the break. With the result table bricked during the Ceph maintenance, we architected a "trust the filesystem" solution, and we are hopeful that this issue will be resolved this week.

MAM1 was initially planned to resume in beta30 last week, to see whether 7.07 fairly schedules work and respects --nthreads, which is a blocking issue for promoting the beta application to production. Depending on the error rate and behaviour on BOINC clients, we would then consider the stable code paths for the first production batches. Given our increased control over batch parameters with the new Kafka topic, which uses a protobuf schema to fill out the workunit and result table entries, we intend to run production work on Linux as soon as the beta30 application is stable with an error rate lower than MCM1's, excepting the GLIBC dependency, which is typically the only repeated error we see from clients on the current LibTorch code path. We will then iterate the beta30 application through 7.08 and 7.09 to add GPU and Windows support, and Parquet IO for inputs and uploaded results.
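The promotion criterion above (beta error rate lower than MCM1's, excepting the known GLIBC failure) can be illustrated with a small sketch. Everything here is hypothetical: the function, the error labels, and the numbers are illustrative, not the project's actual tooling:

```python
# Illustrative gate for promoting a beta application to production:
# its error rate, excluding the known GLIBC failure mode, must fall
# below the production app's observed rate.

def ok_to_promote(beta_results, mcm1_error_rate):
    """beta_results: list of (succeeded: bool, error_kind: str | None)."""
    considered = [r for r in beta_results if r[1] != "GLIBC"]
    if not considered:
        return False                      # no usable signal yet
    errors = sum(1 for ok, _ in considered if not ok)
    return errors / len(considered) < mcm1_error_rate

# 97 successes, 2 GLIBC failures (ignored), 1 other failure:
sample = [(True, None)] * 97 + [(False, "GLIBC")] * 2 + [(False, "segfault")]
decision = ok_to_promote(sample, mcm1_error_rate=0.02)
```

With those numbers, one non-GLIBC failure in 98 considered results is about 1%, under the assumed 2% MCM1 rate, so the gate passes.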
----------------------------------------
[Edit 5 times, last edit by Unixchick at Nov 10, 2025 9:34:37 PM]
[Nov 10, 2025 9:30:16 PM]
Boca Raton Community HS
Senior Cruncher
Joined: Aug 27, 2021
Post Count: 209
Status: Offline
Re: Project Status (First Post Updated)

Definitely good progress! Excited about the GPU port, but that is going to open so many new cans of worms. I am hoping everything completely stabilizes for MCM1, then the CPU version of MAM (and ARP resumes), before the GPU port is even in beta testing. Definitely exciting though!
[Nov 10, 2025 10:45:08 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2347
Status: Offline
Re: Project Status (First Post Updated)


Hi bfmorse,
if you ask me, you never really crunched that task. In one of your former messages, you quoted: "11/8/2025 1:13:44 PM | World Community Grid | Didn't resend lost task MCM1_0241872_3918_1 (expired)" from your Event Log. The server tried to send it to you, but never got around to doing so properly. If the server had succeeded in resending(!) it, it would have said "Resent lost task".

That is the reason why the server marked your task "No Reply".

For comparison, on the 31st of October my Event Log looked like this:
31-Oct-2025 16:10:43 [World Community Grid] Scheduler request completed: got 15 new tasks
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1123_1
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1212_1
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1268_1
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1339_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1380_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1413_1
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1519_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1544_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1637_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1675_1
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1696_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1749_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1765_1
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1770_0
31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1792_1

You can see the difference here: my system received 15 new tasks, and when you compare this to your situation, it said: "Scheduler request completed: got 0 new tasks" (on 11/8/2025 1:13:44 PM; you wrote that in post 707341).
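For anyone combing their own Event Log for the difference described above, a small Python sketch of the scan. The two message forms are the ones quoted in these posts; how you get the log lines into a list (saving the Event Log to a file, for example) is up to you:

```python
# Scan saved BOINC Event Log lines for lost-task resend outcomes:
# "Resent lost task <name>" vs. "Didn't resend lost task <name> (expired)".

def tally_resends(lines):
    """Return (resent task names, expired/not-resent task names)."""
    resent, expired = [], []
    for line in lines:
        if "Resent lost task" in line:
            resent.append(line.rsplit(None, 1)[-1])   # task name is last token
        elif "Didn't resend lost task" in line:
            # task name is the first token after "lost task"
            expired.append(line.split("lost task", 1)[1].split()[0])
    return resent, expired

log = [
    "31-Oct-2025 16:10:43 [World Community Grid] Resent lost task MCM1_0241516_1123_1",
    "11/8/2025 1:13:44 PM | World Community Grid | Didn't resend lost task MCM1_0241872_3918_1 (expired)",
]
resent, expired = tally_resends(log)
```

Anything in the `expired` list was never delivered to the host, which is exactly the "No Reply" situation described above.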
[Nov 10, 2025 11:38:42 PM]