| World Community Grid Forums
Thread Status: Active | Total posts in this thread: 352
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
I'm out of the loop. Is there any recent criticism on here against Krembil? (Just asking.)
----------------------------------------
I'd love to see WCG get a piece of some corporate sponsorships to take the burden of funding off their minds. Look at https://www.foldingathome.org, another major distributed-computing project of similar age to WCG (it started around 2000, so it's slightly older). Scroll to the bottom and look at the mega-large corporate sponsorships. I believe WCG fights a similar cause, but we need marketing professionals to really pitch the idea and handle that, not a bunch of scientists who may not have the same kind of expertise (or time). A lot more good will come when WCG is adequately funded.

And honestly, I don't believe Krembil contributes much to the effort other than putting its name on it. It seems the Jurisica Lab (a university enterprise?) shoulders the burden, along with UHN/SHARC university datacenter resources. I have a thread in the MCM forum and also e-mailed WCG directly several months ago. Zero response. They don't seem to care about communication, and I think the silence speaks volumes. None of the researchers from ANY of the projects participate in the forums.
|
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
One of my crunchboxes has 4+ ARP1 tasks and is crunching along. Another crunchbox got 1 task recently. That's neat. Plugging along into Generation 147.
----------------------------------------
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
hchc wrote: "ARPs are great because they run cooler and use slightly less electricity. Maybe the weather application doesn't take advantage of AVX2 CPU instruction sets or something."
----------------------------------------
ARP1 is much more memory-intensive than MCM1, so it is far more likely to have instructions stalling while waiting for data from RAM (especially if there isn't much L3 cache, or if the page table needs to be updated...). Enough of a pause and the core will reduce its power draw!

The way MCM1 works [at present] results in less frequent changes of data-analysis locations, so whilst there will be memory-related "pauses" they'll be more or less hidden amongst all the other things that reduce efficiency. Whether that will still be the case if/when MCM1 migrates to LibTorch as its "engine" remains to be seen...

Systems with two execution threads per CPU core may also see some slowdown (depending on the threads used and the workload mix) if running lots of floating-point-intensive stuff, but that won't apply to your 4C/4T systems :-)

Cheers - Al.

[Edit 1 times, last edit by alanb1951 at Jun 16, 2025 9:05:26 AM]
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1294 Status: Offline Project Badges:
MCM is flowing well.
Looks like they've increased the flow of ARP. I'm happy to have 2 ARP tasks going at all times now. I don't have a full cache (1 day), but I have enough to keep my machine busy. I'll consider allowing more to run simultaneously if I get a longer queue.
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442 Status: Offline Project Badges:
As to cache values: I had increased mine from zero to 0.7 days a few weeks ago and ended up getting "SERVER ABORTED" on about 5 or 6 WUs each day. (Thanks to one or more volunteers running 5 - 7 day queues; WCG sends out the resends and then aborts a resend when the delayed WU is returned.)
----------------------------------------
So I have reset my queue back to ZERO DAYS and will live with the reduced efficiency; I would rather have that than SERVER ABORTED. Is there a better way to deal with this?

One example: MCM1_0237874_2330
Project name: Mapping Cancer Markers
Created: Jun. 10, 2025 - 00:30 UTC
Minimum Quorum: 2 / Replication: 2

Result: MCM1_0237874_2330_0
  OS: Microsoft Windows 11 Pro for Workstations x64 Edition (10.00.26100.00)
  Status: Valid
  Sent: 2025-06-10 00:30:29 UTC / Returned: 2025-06-10 06:48:42 UTC
  CPU time / Elapsed time: 1.97 / 1.99
  Claimed / Granted credit: 72.7 / 74.7

Result: MCM1_0237874_2330_1
  OS: Microsoft Windows 11 Professional x64 Edition (10.00.26100.00)
  Status: Valid
  Sent: 2025-06-10 00:30:29 UTC / Returned: 2025-06-16 01:04:12 UTC
  CPU time / Elapsed time: 2.06 / 2.11
  Claimed / Granted credit: 76.8 / 74.7

Result: MCM1_0237874_2330_2
  OS: Microsoft Windows 11 Professional x64 Edition (10.00.26100.00)
  Status: Server Aborted
  Sent: 2025-06-16 00:30:33 UTC / Aborted: 2025-06-16 01:45:54 UTC

The only editing done was formatting the layout for readability and colorizing the area of interest.

[Edit 1 times, last edit by bfmorse at Jun 16, 2025 3:54:47 PM]
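The timeline in that example can be modelled with a toy sketch. This is purely illustrative, not WCG's server code; the rule shown (abort outstanding replicas once a quorum of valid results exists) is an assumption inferred from the behaviour described:

```python
# Toy model of the "Server Aborted" behaviour seen above.
# NOT WCG's actual server logic; the quorum rule is an assumption
# inferred from the MCM1_0237874_2330 example.

def resolve_workunit(results, quorum=2):
    """Once `quorum` results are Valid, any replica still in progress
    is no longer needed and gets marked Server Aborted."""
    valid = sum(1 for r in results if r["status"] == "Valid")
    outcome = {}
    for r in results:
        if r["status"] == "In Progress" and valid >= quorum:
            outcome[r["name"]] = "Server Aborted"  # wasted download and work
        else:
            outcome[r["name"]] = r["status"]
    return outcome

# _2 was a resend sent after _1 missed its deadline; when the late _1
# came back Valid, the quorum was met and _2 was aborted.
wu = [
    {"name": "MCM1_0237874_2330_0", "status": "Valid"},
    {"name": "MCM1_0237874_2330_1", "status": "Valid"},        # late return
    {"name": "MCM1_0237874_2330_2", "status": "In Progress"},  # resend
]
print(resolve_workunit(wu)["MCM1_0237874_2330_2"])  # Server Aborted
```

The volunteer's dilemma follows directly from this rule: a bigger cache means your resends sit unstarted longer, which is exactly the window in which a late wingman can render them redundant.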
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
bfmorse.
----------------------------------------
With MCM1 (and, I fear, MAM1 when it goes live) there isn't a lot that users hit by this can do about it. It's only truly expensive if one runs such a small buffer that one frequently runs out of work, requiring a re-download of the MCM1 master data file; otherwise it's just an annoyance, as the job control file is very small. (Of course, ARP1 is a different matter; see later...)

For comparison, I typically handle 700 or so MCM1 tasks a day across 4 systems (none of which have a cache of more than 0.6 days) and tend to see an average of 4 or 5 Server Aborted each day... I can live with that :-)

I suspect this was far less common in the days before the IBM->Krembil migration, because of a server-side feature that was disabled when they tried to clear out all the pending work before the move: the concept of a "grace period", an extension to the due time that isn't reported to the client (so late, unstarted tasks should still error out as "Not Started by Deadline") but allows some extra time before the retries go out for No Reply tasks. I'd be happy to see the return of grace periods for some projects, but only if combined with reduced deadlines; in this day and age, a 6-day deadline seems a tad excessive for something like MCM1!...

Not all the late-returning results are going to be because of excess caching, by the way.

I have actually had an ARP1 task Server Aborted because an original wingman clocked in late; that doesn't happen very often, because I tend to be able to start new ARP1 tasks almost at once (and often return the retry before a late returner does!), but on this occasion the receiving host was already at its max_concurrent limit...

When I looked at the stderr.txt for that task, it turned out that the wall-clock times for the checkpoints included two transitions past midnight; as the total CPU and elapsed times were both comfortably under 24 hours, that indicated either a lot of swapping between different BOINC tasks or use of system suspend/hibernation (but at least LAIM seemed to have been active if it was the former!) -- stuff happens :-)

Cheers - Al.

P.S. During the initial MAM1 Beta testing I got so swamped with Beta tasks (some of them very long-running) that I had to micro-manage my systems to make sure I returned non-Beta tasks far enough before the deadline for my satisfaction. Without the aforementioned grace periods, any late return (accidental or otherwise) would [of course] tee up a retry, which I always try to avoid... Hopefully the next Betas won't need as much run time and shouldn't cause so much trouble :-)

[Edit 1 times, last edit by alanb1951 at Jun 16, 2025 4:50:30 PM]
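The grace-period mechanism described above can be sketched as follows. This is an illustrative model only; the function names and the 12-hour grace value are invented for the example, not BOINC's or WCG's actual parameters:

```python
from datetime import datetime, timedelta

# Illustrative sketch of a server-side "grace period" (NOT actual
# BOINC/WCG code; the 12-hour value is an invented example).
GRACE = timedelta(hours=12)

def client_sees_deadline(due):
    """The client only ever sees the nominal due time, so a late,
    unstarted task still errors out as "Not Started by Deadline"."""
    return due

def retry_should_go_out(due, now, grace=GRACE):
    """The server waits until due + grace before teeing up a retry for
    a No Reply task, giving slow-but-honest hosts time to report."""
    return now > due + grace

due = datetime(2025, 6, 10, 0, 30)
assert client_sees_deadline(due) == due
assert not retry_should_go_out(due, due + timedelta(hours=6))  # still waiting
assert retry_should_go_out(due, due + timedelta(hours=13))     # retry goes out
```

The point of hiding the grace from the client is that deadlines stay meaningful for scheduling, while the server avoids spraying out retries (and later Server Aborts) for tasks that are only a few hours late.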
TonyEllis
Senior Cruncher Australia Joined: Jul 9, 2008 Post Count: 286 Status: Offline Project Badges:
Following up on Al's comments: I run MCM with a buffer a tad over 1 day in an effort to reduce lost crunch time resulting from Krembil issues. Looking at my "Results" page for MCM, there are 163 valid and 1 server aborted. From memory, that is about what has happened over the last many months, and it is very different from bfmorse's experience.
----------------------------------------
I suspect the difference is the environment the WUs are run in. bfmorse is running Windows, and you would expect many Windows machines to be desktops/laptops, not running 24/7 and subject to the other problems Al detailed; many Windows machines will belong to just ordinary users. The other major OS is Linux. I suspect a much greater percentage of those, especially servers, will be running 24/7, with a much bigger share being workstations and servers rather than desktops/laptops. Also, with a bigger percentage of servers and the like, I wonder whether the technical knowledge of the Linux users running WCG, especially sysadmins and such, is higher percentage-wise, making them more likely to maintain a better return time.
Run Time Stats https://grassmere-productions.no-ip.biz/
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
I tend to agree that the issues with missed deadlines are far more common on the Windows side; I've even had MCM1 retries where two initial Windows wingmen missed the deadlines with an Error (WCG doesn't explicitly identify "Not Started by Deadline", but it's very recognisable!) and, because the retries were teed up at the same time and there were no active tasks, Linux got them and polished them off!
----------------------------------------
I suspect there is a higher proportion of "fire and forget" configurations on Windows systems, as it requires a little more effort to set up BOINC properly on some Linux flavours... The obvious effects will be seen within the Windows user community; those of us on Linux are more likely to notice when we can't get work because "tasks are committed to other platforms" (which I believe is usually associated with mass distribution of retries...)

Not all Linux users have workstation- or server-grade kit, of course; for instance, I'm running an old i7-7700K (mostly useful as a GPU platform), two Ryzen-based mini-PCs and a Ryzen 7900 :-) However, the comment about technical knowledge certainly applies, as I spend quite a bit of time on a new system working out what I think is an acceptable workload; I "re-tune" systems whenever the application mix changes (e.g. when MCM1 switches targets) and have different app_config files for when Beta tasks are available or if ARP1 has another hiatus!

I've just had a quick look at the data for tasks I processed that were due during May 2025, using my "wingmen" database for MCM1 (in which I keep basic information about every WU processed, both for my task and those of wingmen)...

- I processed tasks for 18433 WUs.
- 1333 of the WUs had at least one retry, 1093 of them with a missed-deadline task.
- 796 of my tasks were successful retries (many of them for reasons other than missed deadlines!) and 75 were Server Aborted.
- 60 of the SA tasks were on systems which wouldn't necessarily have got round to the retries for several hours; if I ran (say) a 0.2-day buffer instead of the 0.6-day buffer I typically use, I'd hardly see any Server Aborts, but I'd probably end up running more redundant tasks!
- Amongst the wingmen tasks were 744 that never returned (NSD, No Reply or Too Late) and 41 that ended up Server Aborted. 186 WUs ended up with 3 valid results!

Those numbers (6% of WUs having a missed deadline, 15+% of missed-deadline tasks returning late, and 9-10% of my retries being Server Aborted) seem fairly consistent (unless there's a server outage that backs work up!)

Cheers - Al.

P.S. I haven't used Windows for BOINC since before I retired (so nearly 20 years now!); I like systems that can run GPU work without my having to be logged in, and which don't update without my permission. I suspect I could've hammered Windows into submission, but I couldn't be bothered (and I'm not afraid of the command line...)
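As a quick sanity check on those percentages (note: treating the 186 triple-valid WUs as the "late returns" numerator is my assumption about how the 15+% figure was derived, not something stated outright):

```python
# Arithmetic check of the May 2025 MCM1 numbers quoted above.
total_wus = 18433          # WUs processed
missed_deadline = 1093     # WUs with a missed-deadline task
triple_valid = 186         # WUs that ended with 3 valid results
my_retries = 796           # successful retries run
server_aborted = 75        # of those retries, Server Aborted

pct_missed = 100 * missed_deadline / total_wus   # ~5.9%  (quoted as "6%")
pct_late = 100 * triple_valid / missed_deadline  # ~17%   (quoted as "15+%")
pct_sa = 100 * server_aborted / my_retries       # ~9.4%  (quoted as "9..10%")

print(f"{pct_missed:.1f}% {pct_late:.1f}% {pct_sa:.1f}%")  # 5.9% 17.0% 9.4%
```

All three computed values are consistent with the figures quoted in the post.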
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
MCM1 throughput seems to be a lot more consistent at the moment -- if they've managed to put the new workflow management in place I hope they tell us soon...
Also, I've just been checking my ARP1 overview, and I noticed that yesterday's midday generations information shows that we passed 250,000 moved units this year, more than half of them in the last 60 days or so. The reported returned-tasks total for this year passed 500,000 at midday on 2025-06-12 :-)

As Mike's reports show, we've been picking up the pace recently; that said, in my view (expressed elsewhere) the number of tasks taking 4 days or longer is a tad high...

Cheers - Al.

P.S. If someone checks the numbers and finds something significantly different, please let me know so I can try to check my numbers and put things right if I've made a howler of some sort :-)
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1294 Status: Offline Project Badges:
The MCM flow is more consistent, like Al says. I've also noticed ARP having a similar pattern of resends only during certain periods and then going back to fresh WUs.
I don't think the return-time length for ARP matters much as long as the flow is restricted. There is no need to rush completion of a WU only to have it sit a while before being sent out again. Once we reach the point where all available ARP work is out at once, completion time might become an issue, but we need to balance project completion with involving as many people as possible.