Total posts in this thread: 352
This topic has been viewed 30217 times and has 351 replies.
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline
Re: Project Status (First Post Updated)

I'm out of the loop. Is there any recent criticism on here against Krembil? (Just asking.)

I'd love to see WCG get a piece of some corporate sponsorships to take the burden of financial funding off their minds.

Look at https://www.foldingathome.org for another major distributed computing project of similar vintage to WCG (Folding@home launched in 2000, so it's slightly older). Scroll to the bottom and look at the mega-large corporate sponsorships.

I believe that WCG fights for a similar cause, but we need marketing professionals to really pitch the idea and handle that side, not a bunch of scientists who may not have the same kind of expertise (or time).

A lot more good will come when WCG is adequately funded.

And honestly, I don't believe Krembil contributes much to the effort other than slapping their name on it. Seems like Jurisica Lab -- a university enterprise? -- shoulders the burden along with UHN/SHARC university datacenter resources.

I have a thread in the MCM forum and also e-mailed WCG directly several months ago. Zero response; they don't seem to care about communication, and I think the silence speaks volumes. None of the researchers from ANY of the projects participate in the forums.
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Jun 16, 2025 7:15:22 AM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline
Re: Project Status (First Post Updated)

One of my crunchboxes has 4+ ARP1 tasks and is crunching along. Another crunchbox got 1 task recently. That's neat. Plugging along into Generation 147.
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Jun 16, 2025 7:16:16 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

hchc wrote: "ARPs are great because they run cooler and use slightly less electricity. Maybe the weather application doesn't take advantage of AVX2 CPU instruction sets or something."
ARP1 is much more memory-intensive than MCM1, so it is far more likely to have instructions stalling waiting for stuff from RAM (especially if there isn't much L3 cache or if the "page table" needs to be updated...). Enough of a pause and the core will reduce its power drain!...

The way MCM1 works [at present] results in less frequent changes of data analysis locations, so whilst there will be memory-related "pauses" they'll be more or less hidden amongst all the other things that reduce efficiency. Whether that will still be the same if/when MCM1 migrates to LibTorch as its "engine" remains to be seen...

Systems with two execution threads per CPU core may also see some slowdown (depending on threads used and on workload mix) if running lots of floating-point intensive stuff, but that won't apply to your 4C/4T systems :-)

Cheers - Al.
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jun 16, 2025 9:05:26 AM]
[Jun 16, 2025 9:03:12 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1294
Status: Offline
Re: Project Status (First Post Updated)

MCM is flowing well.

Looks like they've increased the flow of ARP.

I'm happy to have 2 ARP going at all times now. I don't have a full cache (1 day), but I have enough to keep my machine busy. I'll consider allowing more to run simultaneously if I get a longer queue.
[Jun 16, 2025 2:30:31 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 442
Status: Offline
Re: Project Status (First Post Updated)

As to cache values, I had increased mine from zero to 0.7 days a few weeks ago and ended up getting "SERVER ABORTED" on about 5 or 6 WUs each day. (Thanks to one or more volunteers running 5-7 day queues: WCG sends out a resend and then aborts it when the overdue WU is finally returned.)

So I have reset my queue back to ZERO DAYS and will live with the reduced efficiency; I'd rather have that than SERVER ABORTED.
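For anyone who wants to pin the queue down locally rather than via the website preferences: the BOINC client also reads a global_prefs_override.xml in its data directory, which overrides the web preferences. A minimal sketch, assuming the standard BOINC tag names (check the documentation for your client version):

```xml
<!-- global_prefs_override.xml: keep essentially no spare work buffered -->
<global_preferences>
    <!-- "Store at least X days of work" -->
    <work_buf_min_days>0.0</work_buf_min_days>
    <!-- "Store up to an additional X days of work" -->
    <work_buf_additional_days>0.0</work_buf_additional_days>
</global_preferences>
```

The client should pick this up via Options -> Read local prefs file in the BOINC Manager, or on restart.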

Is there a better way to deal with this?

one example:

MCM1_0237874_2330
Project name: Mapping Cancer Markers
Created: Jun. 10, 2025 - 00:30 UTC
Name: MCM1_0237874_2330
Minimum Quorum: 2
Replication: 2

Result name -- OS -- Status -- Sent time -- Return time -- CPU/Elapsed time -- Claimed/Granted credit

MCM1_0237874_2330_0 -- Windows 11 Pro for Workstations x64 (10.00.26100.00) -- Valid -- sent 2025-06-10 00:30:29 UTC -- returned 2025-06-10 06:48:42 UTC -- 1.97 / 1.99 -- 72.7 / 74.7

MCM1_0237874_2330_1 -- Windows 11 Professional x64 (10.00.26100.00) -- Valid -- sent 2025-06-10 00:30:29 UTC -- returned 2025-06-16 01:04:12 UTC -- 2.06 / 2.11 -- 76.8 / 74.7

MCM1_0237874_2330_2 -- Windows 11 Professional x64 (10.00.26100.00) -- Server Aborted -- sent 2025-06-16 00:30:33 UTC -- aborted 2025-06-16 01:45:54 UTC

The only editing done was reformatting the layout for readability and colorizing the area of interest.
----------------------------------------
[Edit 1 times, last edit by bfmorse at Jun 16, 2025 3:54:47 PM]
[Jun 16, 2025 3:47:37 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

bfmorse.

With MCM1 (and, I fear, MAM1 when it goes live) there isn't a lot that users hit by this can do about it. It's only truly expensive if one runs such a small buffer that one frequently runs out of work, requiring the re-download of the MCM1 master data file; otherwise it's just an annoyance as the job control file is very small. (Of course, ARP1 is a different matter; see later...)

For comparison, I typically handle 700 or so MCM1 tasks a day across 4 systems (none of which has a cache of more than 0.6 days), and tend to see an average of 4 or 5 Server Aborted tasks each day... I can live with that :-)

I suspect this was far less common in the days before the IBM->Krembil migration, because of a server-side feature that was disabled when they tried to clear out all the pending work before the move. This is the concept of a "grace period" as an extension to the due time that isn't reported to the client (so late, unstarted, tasks should still error out as "Not Started by Deadline") but it would allow some extra time before the retries went out for No Reply tasks. I'd be happy to see the return of grace periods for some projects, but only if combined with reduced deadlines -- in this day and age, a 6 day deadline seems a tad excessive for something like MCM1!...

Not all the late returning results are going to be because of excess caching, by the way.
  • There are [non-WCG] projects out there that consistently mis-estimate the amount of time their tasks will take. For instance, MilkyWay often underestimates by orders of magnitude for some tasks then massively overestimates for others, causing all sorts of scheduling issues! And yes, I mean orders of magnitude - e.g. estimate 10 minutes, actual over 24 hours; estimate 6 hours, actual 15 minutes!...
  • There are users out there who don't make allowance for the time their systems are turned off or tasks are suspended because BOINC thinks the system is busy.
  • There are users who might make those allowances but don't use "Leave applications in memory", so if they run multiple projects a lot of tasks will keep winding back to previous checkpoints!
(There are, of course, less "excusable" reasons why tasks arrive late...)

I have actually had an ARP1 task Server Aborted because an original wingman clocked in late; that doesn't happen very often, because I tend to be able to start new ARP1 tasks almost at once (and often return the retry before a late returner does!) but on this occasion the receiving host was already at its max_concurrent limit... When I looked at the stderr.txt for that task, it turned out that the wall-clock times for the checkpoints included two transitions past midnight; as the total CPU and elapsed times were both comfortably under 24 hours, that indicated either a lot of swapping between different BOINC tasks or use of system suspend/hibernation (but at least LAIM seemed to have been active if it was for the former reason!) -- stuff happens :-)

Cheers - Al.

P.S. during the initial MAM1 Beta testing I got so swamped with Beta tasks (some of them very long-running) that I had to micro-manage my systems to make sure I returned non-Beta tasks far enough before deadline for my satisfaction. Without the aforementioned grace periods, any late return (accidental or otherwise) would [of course] tee up a retry, which I always try to avoid... Hopefully the next Betas won't need as much run time and shouldn't cause so much trouble :-)
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jun 16, 2025 4:50:30 PM]
[Jun 16, 2025 4:41:40 PM]
TonyEllis
Senior Cruncher
Australia
Joined: Jul 9, 2008
Post Count: 286
Status: Offline
Re: Project Status (First Post Updated)

Following up on Al's comments: I run MCM with a buffer a tad over 1 day in an effort to reduce loss of crunch time resulting from Krembil issues. Looking at my "Results" page for MCM, there are 163 valid and 1 server aborted. From memory, this is about what has happened over the last many months, and it's very different to the experience of bfmorse.
I suspect the difference is the environment the WUs are run in. bfmorse is running Windows, and many Windows machines would be expected to be desktops/laptops, not running 24/7 and subject to the other problems Al detailed. Many Windows users would be just ordinary users.
The other major OS is Linux. I suspect a much greater percentage of those machines, especially servers, run 24/7, with a much bigger share being workstations and servers rather than desktops/laptops. With more servers and the like, I also wonder whether the technical knowledge of the Linux users running WCG (especially sysadmins and such) is higher on average, making them more likely to maintain a better return time.
----------------------------------------
[Jun 17, 2025 6:40:30 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

I tend to agree that the issues with missed deadlines are far more common on the Windows side; I've even had MCM1 retries where two initial Windows wingmen missed the deadlines with an Error (WCG doesn't explicitly identify "Not Started by Deadline", but it's very recognisable!) and because the retries were teed up at the same time and there were no active tasks Linux got them and polished them off!

I suspect there is a higher proportion of "fire and forget" configurations on Windows systems, as it requires a little more effort to set up BOINC properly on some Linux flavours... The obvious effects will be seen within the Windows user community -- those of us on Linux are more likely to notice when we can't get work because "tasks are committed to other platforms" (which I believe is usually associated with mass distribution of retries...)

Not all Linux users have workstation or server grade kit, of course -- for instance, I'm running an old i7-7700K (mostly useful as a GPU platform), two Ryzen-based mini-PCs and a Ryzen 7900 :-) However, the comment about technical knowledge certainly applies, as I spend quite a bit of time on a new system working out what I think is an acceptable workload; I "re-tune" systems whenever the application mix changes (e.g. when MCM1 switches targets) and have different app_config files for when Beta tasks are available or if ARP1 has another hiatus!
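For anyone who hasn't used one: an app_config.xml lives in the project's directory under the BOINC data directory (projects/www.worldcommunitygrid.org/). A minimal sketch capping simultaneous ARP1 tasks -- the short app name "arp1" is an assumption here, so check a task's properties or client_state.xml for the exact name:

```xml
<!-- app_config.xml: cap how many ARP1 tasks run at once -->
<!-- NOTE: the app name "arp1" is assumed; verify it in client_state.xml -->
<app_config>
  <app>
    <name>arp1</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>
```

Reload it with Options -> Read config files in the BOINC Manager; no client restart needed.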

I've just had a quick look at the data for tasks I processed that were due during May 2025 using my "wingmen" database for MCM1 (in which I keep basic information about every WU processed, both for my task and those of wingmen...)...

I processed tasks for 18433 WUs. 1333 of the WUs had at least one retry, 1093 of them with a missed deadline task. 796 of my tasks were successful retries (many of them for reasons other than missed deadlines!) and 75 were Server Aborted. 60 of the SA tasks were on systems which wouldn't necessarily have got round to the retries for several hours; if I ran (say) a 0.2 day buffer instead of the 0.6 day buffer I typically use, I'd hardly see any Server Aborts but I'd probably end up running more redundant tasks!

Amongst the wingmen tasks were 744 that never returned (NSD, No Reply or Too Late) and 41 that ended up Server Aborted. 186 WUs ended up with 3 valid results!

Those numbers (6% of WUs having a missed deadline, 15+% of missed deadline tasks returning late and 9..10% of my retries being Server Aborted) seem fairly consistent (unless there's a server outage that backs work up!)
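The percentages quoted above can be sanity-checked against the raw counts; a quick sketch (treating the 186 three-valid WUs as the late-returning missed-deadline cases is my reading, not necessarily Al's exact method):

```python
# Raw counts from the May 2025 wingman-database summary above
total_wus = 18433
missed_deadline_wus = 1093   # WUs with at least one missed-deadline task
my_retries = 796
my_server_aborted = 75
three_valid_wus = 186        # a late task returned after the retry went out

print(f"WUs with a missed deadline:   {missed_deadline_wus / total_wus:.1%}")        # ~5.9%
print(f"late returns among those WUs: {three_valid_wus / missed_deadline_wus:.1%}")  # ~17.0%
print(f"my retries Server Aborted:    {my_server_aborted / my_retries:.1%}")         # ~9.4%
```

Those come out at roughly 6%, 15+% and 9..10% respectively, consistent with the figures in the post.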

Cheers - Al.

P.S. I haven't used Windows for BOINC since before I retired (so nearly 20 years now!); I like systems that could run GPU work without my having to be logged in, and which don't update without my permission. I suspect I could've hammered Windows into submission, but I couldn't be bothered (and I'm not afraid of the command line...)
[Jun 17, 2025 10:49:07 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

MCM1 throughput seems to be a lot more consistent at the moment -- if they've managed to put the new workflow management in place I hope they tell us soon...

Also, I've just been checking my ARP1 overview and I noticed that yesterday's midday generations information shows that we passed 250,000 moved units this year, more than half of them in the last 60 days or so. The reported returned tasks total for this year passed 500,000 at mid-day on 2025-06-12 :-)

As Mike's reports show, we've been picking up the pace recently; that said, in my view (expressed elsewhere) the number of tasks taking 4 days or longer is a tad high...

Cheers - Al.

P.S. If someone checks the numbers and finds something significantly different, please let me know so I can try to check my numbers and put things right if I've made a howler of some sort :-)
[Jun 17, 2025 11:08:10 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1294
Status: Offline
Re: Project Status (First Post Updated)

The MCM flow is more consistent, like Al says. I've also noticed ARP having a similar pattern of resends only during certain periods and then going back to fresh WUs.

I don't think the return time length of ARP matters much as long as we have restricted flow. There is no need to rush completion of a WU only to have it sit a while before being sent out again.
Once we reach the point where all available ARP WUs are out at once, completion time might become an issue, but we need to balance project completion with involving as many people as possible.
[Jun 17, 2025 4:25:55 PM]