Thread Status: Active
Total posts in this thread: 567
Posts: 567   Pages: 57   [ Previous Page | 48 49 50 51 52 53 54 55 56 57 ]
[ Jump to Last Post ]
Post new Thread
This topic has been viewed 42047 times and has 566 replies
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2492
Status: Offline
Re: Project Status (First Post Updated)

I don't want to jinx things, but the flow of work over the last couple of days has been very good. I've also noticed that tasks crunched, uploaded, and reported before the migration are now starting to validate. The same goes for cached tasks crunched during the migration and uploaded and reported after the system came back.

Good job Dylan, and the entire team.
[Dec 4, 2025 2:20:17 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1293
Status: Offline
Re: Project Status (First Post Updated)

Thanks for your post, Grumpy Swede. I thought I might be imagining things.

Kudos to the tech team. Here is to your continued success.
[Dec 4, 2025 3:56:09 AM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 442
Status: Offline
Re: Project Status (First Post Updated)

Selected snapshots, taken at the times below (counts are as of each snapshot time):

Snapshot (UTC)          In progress   Pending Validation   Valid
2025-11-12 23:59:48     229           13801                47260
2025-11-13 02:24:20     240           13853                47272
2025-11-13 04:29:58     260           13952                47306
2025-11-13 14:08:02     253           14319                47592
2025-11-13 16:45:52     242           14394                47702
2025-11-14 18:56:02     240           14897                48813
2025-11-16 02:33:09     241           15371                50384
2025-11-16 16:50:41     245           15820                50850
2025-11-18 05:12:14     241           16503                52512
2025-11-18 17:31:49     104           16614                53192
2025-11-20 05:27:00     312           17259                54597
2025-11-21 06:08:47     310           18139                56066
2025-11-23 00:06 [*]    21            19105                58811
2025-11-25 07:32:58     6             21846 [a]            59067 [c]
2025-11-26 12:06 [d]    0             21579 [b]            59345
2025-11-28 03:22:28     248           19792                63224
2025-12-04 17:55:59     249           21555                72886

[*] - placekeeper data, taken during the "Server error feeder not running" issues over several hours on 11/23; this timestamp is an "Account last updated" time
[a] - oldest 2025-08-21 09:36:32 UTC
[b] - most recent sent time 2025-11-24 19:30:42 UTC
[c] - oldest 2025-08-08
[d] - "Last updated" timestamp
[Dec 4, 2025 6:03:44 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2492
Status: Offline
Re: Project Status (First Post Updated)

New update from the WCG team: https://www.cs.toronto.edu/~juris/jlab/wcg.html (Operational Status tab)
[Dec 5, 2025 12:19:09 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1293
Status: Offline
Re: Project Status (First Post Updated)

December 4, 2025

The BOINC feeder/scheduler reporting "tasks committed to other platforms" is resolved; details on the resolution, and on future plans to keep this issue from coming back, are further down.

Validation backlog processing has begun for workunits that were held over the break and for workunits that fell through our new validation logic unvalidated. We intend to ramp up these passes in the coming days. Now that we know our scripting works for backfilling validations, we will report on progress next week, including expected dates for fully backfilling all such cases and finally catching validation up to in-flight work.

We will not restart the file_deleter or db_purge BOINC services until we have validated every file we possess that was uploaded before or after the break, including sending resends for some cases of "orphans".

What was the workaround for the feeder/scheduler blockage due to the hr_class mismatch between results for the same workunit? The resolution we chose for now was simply to purge stale feeder entries, effectively resetting their hr_class (homogeneous redundancy class) to 0 and allowing any host/platform to download a result that has sat in memory for too long. The feeder can be started with a CLI option specifying how long a result may occupy a slot before it considers this course.
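The purge-and-reset behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual feeder code; the function names, slot representation, and the occupancy window are all assumptions.

```python
# Hypothetical sketch of the stale-slot purge described above -- not WCG's
# actual feeder implementation. Assumes each feeder slot is a dict holding
# the result's id, its hr_class, and when it entered the slot.
import time

MAX_SLOT_OCCUPANCY_SECS = 3600  # illustrative stand-in for the CLI-specified window

def purge_stale_slots(slots, now=None):
    """Reset hr_class to 0 for results stuck in feeder slots too long.

    With hr_class=0 the scheduler treats the result like a fresh _0/_1
    result: any host/platform may claim and download it.
    """
    now = now if now is not None else time.time()
    purged = []
    for slot in slots:
        if now - slot["entered_at"] > MAX_SLOT_OCCUPANCY_SECS:
            slot["hr_class"] = 0      # any platform may now take this result
            slot["entered_at"] = now  # restart the occupancy clock
            purged.append(slot["result_id"])
    return purged
```

The key design point, per the post, is that nothing about the result itself changes; only the redundancy-class constraint is relaxed once a result has been stuck for the configured time.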

What does resetting hr_class=0 as a workaround accomplish? The hr_class=0 reset matches the value assigned to fresh workunit results being sent out for the first time, essentially telling the scheduler that any host/platform may claim and compute the result (i.e., _0 and _1 results have hr_class=0; resends consult the hr_class of the host that already reported results). There is some computational overhead, as a second tier of validation is then required to confirm that the exact gene signatures and their scores are "the same" between results computed on different platforms, in the case of purged resends that had their hr_class reset to 0. We intend to disable hr_class (homogeneous redundancy) completely for MCM1 at some point in the future, and instead rely directly on this currently secondary validation, recording the delta between exact scores and verifying that equivalent gene signatures were found for results sent to different platforms, to ensure they are within a reasonable error bound/tolerance as a rule.
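A minimal sketch of what such a secondary cross-platform check might look like, assuming each result is represented as a mapping of gene signature to score. The function name and the tolerance value are assumptions for illustration, not project values.

```python
# Illustrative sketch (not project code) of the secondary validation pass:
# two results for the same workunit, possibly computed on different
# platforms, are accepted when they report the same gene signatures and
# their scores agree within an error tolerance.
SCORE_TOLERANCE = 1e-6  # assumed bound on cross-platform floating-point drift

def results_equivalent(result_a, result_b, tol=SCORE_TOLERANCE):
    """Each result maps gene signature -> score."""
    if set(result_a) != set(result_b):
        return False  # signature sets differ: hard failure
    # record the per-signature delta and require each to be within tolerance
    deltas = {sig: abs(result_a[sig] - result_b[sig]) for sig in result_a}
    return all(d <= tol for d in deltas.values())
```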

Does this workaround affect the integrity of MCM1 results? No, but it does introduce new edge cases to account for. The score can vary, within the upper and lower bounds of possible floating-point error, between platforms for the same workunit. Ensuring that the floating-point calculations are not different enough to invalidate the computational result is a vastly easier problem when using the hr_class mechanism. However, because MCM1 produces a list of genes as well as a score, the only additional validation criteria we incur by disabling hr_class are ostensibly "score is just below the threshold on this system" exclusions and "score is just above the threshold on this system" inclusions, for specific signatures very close to the configured threshold. In these cases, we can take the union of these additional results slightly above or below the threshold score across all results for a workunit, provided the rest of the results above the threshold are equivalent.
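The near-threshold union rule described above could be sketched like this. The threshold, the margin defining "very close to the threshold", and all names here are hypothetical, chosen only to make the rule concrete.

```python
# Sketch of the near-threshold union rule (names and values are assumptions,
# not project constants). Signatures clearly above the threshold must agree
# between all results for the workunit; signatures within a small margin of
# the threshold may be included by either platform, so the final set is the
# union of those borderline inclusions.
THRESHOLD = 0.5
MARGIN = 1e-4  # hypothetical width of the "just above/below" band

def merge_results(results, threshold=THRESHOLD, margin=MARGIN):
    """results: list of {signature: score} dicts, one per platform."""
    core_sets = []
    borderline = set()
    for scores in results:
        # signatures clearly above threshold: must match across results
        core = {s for s, v in scores.items() if v >= threshold + margin}
        core_sets.append(core)
        # signatures within the margin of the threshold: union them
        borderline |= {s for s, v in scores.items()
                       if threshold - margin <= v < threshold + margin}
    first = core_sets[0]
    if any(core != first for core in core_sets[1:]):
        raise ValueError("core above-threshold signatures disagree")
    return first | borderline
```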

Why have hr_class at all for MCM1 then? Indeed. We intend to track the above cases, and any other cases among validation failures where we can discern an unforeseen effect of allowing resends to potentially go to different platforms. We will also try this "disable hr_class if the feeder gets stuck" system for MAM1, which does have a numerical optimization routine to explore the signature search space; floating-point error there could change the actual signatures under test, so it may not be a good candidate for this (and yet the calculations are valid, so any reasonable overlap, or a "canary" or "spike-in" validation system, might be considered sufficient validation...). If we are satisfied with the outcome of post-processing results that came from different platforms, we can disable it. This will accelerate throughput and discovery for MCM1, and possibly MAM1, while buying time to resolve this issue more permanently for applications such as ARP1, to which this thinking does not apply: there the floating-point calculations must be byte-wise equivalent between results or the result is simply invalid. Once we can confirm that newer 8.x+ BOINC clients permitting WSL on Windows hosts are the only source of this hr_class confusion bug, and possibly the "W"/"W" os_name and os_version truncation bug, we can apply a targeted fix.
----------------------------------------
[Edit 4 times, last edit by Unixchick at Dec 5, 2025 12:30:22 AM]
[Dec 5, 2025 12:22:43 AM]
Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 278
Status: Offline
Re: Project Status (First Post Updated)

A detailed and informative update. This part stuck out to me given the discussion this week:
"Once we can confirm that newer 8.x+ BOINC clients permitting WSL on Windows hosts is the only source of this hr_class confusion bug, and possibly the "W"/"W" os_name and os_version truncation bug, we can apply a targeted fix."

Good progress is continuing to be made.
----------------------------------------
“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
----------------------------------------
[Edit 1 times, last edit by Paul Schlaffer at Dec 5, 2025 2:42:23 AM]
[Dec 5, 2025 2:41:37 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1316
Status: Offline
Re: Project Status (First Post Updated)

That is a very informative update, especially their thoughts on whether MCM1/MAM1 actually need a non-zero hr_class. It's also good to know that the file deleter won't restart until there's nothing of the backlog left (a perfectly understandable call!).

The key thing is the action taken regarding hr_class stuff. It confirms my suspicion as to why I've had a significant number of retries where I've ended up with the "lowest common denominator" (32-bit) application[*1] -- every one had at least one "fake Linux" wingman with a download error or missed deadline. I can stop trying to monitor that now! :)

As for dealing with the misreporting hosts, if I were to allow my inner curmudgeon out for a while he would suggest:

  • getting the scheduler to quickly verify that the platform and/or alt_platform in the request tallies with what the O/S-based platform routines suggest is valid; if it passes that test, there's no potential problem with any application, but if it fails then what happens next will depend on whether there are any non-HR applications available;
  • if it fails at the first hurdle and no non-HR tasks are available, treat it as "unknown platform" and make sure the client gets a message reflecting why, rather than the [over-polite] "No tasks are available..." or "Tasks are committed to other platforms"...
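For illustration only, the first of those checks might look something like this sketch. Every name, the platform table, and the message strings are hypothetical; this is not actual BOINC scheduler code.

```python
# Toy sketch of the suggested scheduler-side sanity check: compare the
# platform/alt_platform strings in the request against the platforms the
# reported OS can legitimately run, and reply with an explicit message
# instead of the generic "No tasks are available...". All names are
# hypothetical illustrations.
VALID_PLATFORMS_BY_OS = {
    "Linux":   {"x86_64-pc-linux-gnu", "i686-pc-linux-gnu"},
    "Windows": {"windows_x86_64", "windows_intelx86"},
}

def check_request(os_name, platforms, non_hr_tasks_available=False):
    valid = VALID_PLATFORMS_BY_OS.get(os_name, set())
    if all(p in valid for p in platforms):
        return "ok"                      # no potential HR problem
    if non_hr_tasks_available:
        return "ok (non-HR tasks only)"  # HR mismatch doesn't matter here
    # fail loudly instead of the over-polite generic message
    return "unknown platform: request platforms do not match reported OS"
```

Under this sketch, a WSL client reporting a Windows OS but a Linux platform string would fail the table lookup and receive the explicit "unknown platform" message, which is the behaviour the bullet points above argue for.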
As always, thanks to the Tech Team for the work put in. It's going to be interesting to see how well the processes "ramp up" -- I notice that 2025-12-04 was another day with over 2 million results getting credit, so that's around 1 million WUs sorted (not sure how many would have had 3 or more valid results, so I can't just divide by 2...).

Cheers - Al.

P.S. Paul Schlaffer's post highlighting the mention of client issue and WSL landed while I was still forcing this post past Forbidden previews caused by use of parentheses...

[Most recent edit to add to the "ramp up" comment once the 2025-12-04 Project stats updated!...]

*1 I can live with doing retries for legitimate 32-bit cases and I don't want to disable alt_platform anyway...
----------------------------------------
[Edit 4 times, last edit by alanb1951 at Dec 5, 2025 7:51:59 AM]
[Dec 5, 2025 3:01:57 AM]