World Community Grid Forums
Thread Status: Active Total posts in this thread: 387
Author |
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1058 Status: Offline |
There's something amiss, as the three ARP1 stats files didn't update at mid-day today; I wonder if whatever caused that to happen also stalled work generation or the path to the feeder.
Also, I've noticed that quite a few MCM1 tasks with the second result returned today are ending up in true PVal jail. So far I've only seen examples with even-numbered WUs, but not all even-numbered WUs end up stalled, so I've no idea what that's about :-) Cheers - Al. |
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1114 Status: Offline |
Looks like someone kicked the machine. I'm getting ARP (new and resend) again.
|
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1114 Status: Offline |
I was checking the links in the first post when I noticed that Dr. Jurisica is biking to raise funds for cancer research. This will happen at the end of May, and I wanted to point it out for anyone who wants to support him.
From the donation page: To explore additional fundraising options - I have joined and participate in the Team Ian Ride. Supporting any of the riders, including Igor Jurisica would supplement the WCG-MCM budget to cover the deficit. Thank you in advance. Notably, these donations are related to the MCM cancer project - and thus are handled by the Princess Margaret Cancer Foundation. Here is the direct page for Dr. Jurisica https://supportthepmcf.ca/ui/DIY/p/TeamIanIJ |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1058 Status: Offline |
Further to my post yesterday that mentioned MCM1 tasks in PVal jail...
(Oh, for a server status page, even if it only reports what's running!)

It looks as if quite a few services/daemons suffered a bit during 2025-03-12; it appears that a fairly substantial backlog of MCM1 validations started then, and it also looks as if there were some file delete daemon problems (lots of my completed tasks ended up stuck in "delete state 1" for prolonged periods, finally being purged without being spotted in "delete state 2"...). Also, today I saw some ARP1 tasks end up in PVal jail...

As for the missed generations.txt file creation, there currently seems to be a substantial discrepancy between the results returned according to the web site and the cells that have shifted generations over the last two days. I don't think it can be put down to "natural variation" but as the trigger points for counting the items aren't explained anywhere I have to wonder if something in the grid cell management/new work production chain went down (and whether it's been fully restored yet)...

Finally, none of my MCM1 PVal jail tasks had escaped after 24 hours, and more have been "imprisoned"; however, one of the ARP1 tasks has now validated so that's not so bad...

All the above said, at least there's something to do most of the time :-). And when the new data centre stuff comes on stream, a lot of these issues should become "things of the past"...

Cheers - Al. |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12559 Status: Offline |
As for the missed generations.txt file creation, there currently seems to be a substantial discrepancy between the results returned according to the web site and the cells that have shifted generations over the last two days. I don't think it can be put down to "natural variation" but as the trigger points for counting the items aren't explained anywhere I have to wonder if something in the grid cell management/new work production chain went down (and whether it's been fully restored yet)...

Al

When you refer to results returned, I presume you are referring to the project history page. I have presumed that includes at least 2 copies per unit and includes more where late running copies have been validated. There is also a 12-hour disparity in the periods taken into account.

I think that there is also an interval (variable) between validation and the generation shift for checking against the actual weather and the predicted weather.

As such the different results should not be mixed. My progress reports use only generations.txt for compatibility.

Mike

[Edit 1 times, last edit by Mike.Gibson at Apr 14, 2025 1:24:15 AM] |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1058 Status: Offline |
Mike,
Your motivation is slightly different from what I'm trying to do, so some information in response...

Firstly, I have a script that captures the mid-day Project History, so I'm not 12 hours out of sync if/when I can attempt any [approximate] comparisons between the two sources of information.

Secondly, I don't "mix" the data for any other purpose than to try to get some sort of understanding of the data you don't need to consider for your analysis, such as how many tasks get within a day of the existing deadlines (or miss completely) and how many late running copies get validated! Those are statistics that completed.txt can't give us, although it does help to show up when there are lots of later returns in the system (right up to missed deadlines and beyond...)

My major interest in all of this is to try to establish whether we are actually slowing down the project by allowing such long deadlines :-) Unfortunately, I only have data for Linux (and I believe Adri is in the same position)[*1] so all I can do in detail is look at task performance amongst my [Linux] wingmen; anything else has to be a bit of a punt[*2]. Hence the interest in "dodgy" statistical work using the combinations you say should not be mixed[*3] :-)

I'd rather like to see this project complete before I die, and the way things are going at present I'll need to get way past 80 years old! :-) So I hope you'll forgive me for wondering if we could speed things up a bit despite some of the work-issue problems implicit in the form of the experiment...

Cheers - Al.

P.S. My [BSc] degree (many, many years ago!) was in Computing and Statistics, so I do know I'm aiming for approximation, not perfection :-)

*1 -- if anyone on the Windows side is doing wingman performance statistics for ARP1, please let me know!

*2 -- Anywhere else but WCG I could track down some "big hitters" and see what their return rates and time taken per task might be; the [understandable] WCG restrictions on access to detailed statistics for other users render that impossible (as I will not do a massive data dive in the [vain?] hope of getting something useful...)

*3 -- I can think of two or three fairly simple ways the "leaving a generation" shift might be counted, but I don't know what Kevin selected so I can only observe the [relative] consistency of linkages rather than being more assertive about how the data sets might align! |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12559 Status: Offline |
My major interest in all of this is to try to establish whether we are actually slowing down the project by allowing such long deadlines :-) Unfortunately, I only have data for Linux (and I believe Adri is in the same position)[*1] so all I can do in detail is look at task performance amongst my [Linux] wingmen; anything else has to be a bit of a punt[*2]. Hence the interest in "dodgy" statistical work using the combinations you say should not be mixed[*3] :-)

Al

The situation is that we have 1,435,296 ARP1 units to go as at 12:00 GMT (UTC) on 13 April 2025.

Approaching your quandary from a slightly different angle, I would say that we are only slowed down by any extra copies crunched in excess of the quorum. The longer the deadline, the less 'wasted' crunching and therefore the sooner the project would finish. The shorter deadlines were introduced to enable the laggards to catch up but caused more resends and therefore more wasted crunching. However, the present regime is doing that with less wasted crunching.

We don't want some people to take advantage of long deadlines, so I suggest that we keep the 6 days displayed deadline but allow up to 10 or even 14 days before the extra copies are sent out. Shorter deadlines would also reduce the number of machines crunching, which would slow down the project.

Mike

[Edit 2 times, last edit by Mike.Gibson at Apr 14, 2025 1:24:48 AM] |
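A minimal sketch of the resend argument above, in Python. The turnaround times are made-up illustrative values, not WCG data, and `count_resends` is a hypothetical helper. It simply assumes one extra copy is sent for every copy not returned by the cut-off, and it ignores the extra waiting a longer cut-off causes for tasks that never come back.

```python
def count_resends(turnaround_days, cutoff_days):
    """Count copies still outstanding at the cut-off (assumed: one resend each)."""
    return sum(1 for t in turnaround_days if t > cutoff_days)


# Hypothetical turnaround times in days (illustrative only, not WCG data).
sample_turnarounds = [0.8, 1.2, 1.5, 2.0, 3.5, 5.0, 6.5, 8.0, 12.0]

# Compare the 6-day deadline with the suggested 10- and 14-day resend cut-offs.
for cutoff in (6, 10, 14):
    print(cutoff, count_resends(sample_turnarounds, cutoff))
```

With these invented numbers the resend count falls from 3 at a 6-day cut-off to 1 at 10 days and 0 at 14 days; whether real returns behave like that would need actual wingman data.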
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1058 Status: Offline |
Mike,
Out of interest, some questions...

Regarding #1 above -- In all the time I've been gathering wingman data I've seen exactly eleven wingman tasks (out of over 4500 for around 3100 WUs) that used more than four days of CPU/elapsed time to finish a job, and 10 of those were during 2022! A few more have needed 48+ hours, and I suspect some of those are "too many at once" rather than extremely slow hardware! The vast majority have needed less than 24 hours (and typical run times are definitely getting lower as time goes by).

As for #2, I suppose I could switch to a strategy of asking for enough work for (say) three days but still only run the restricted numbers of simultaneous tasks that let me achieve the fast run times; however, I would then end up with hundreds of tasks at a time from Einstein@Home as my GPU project :-)

And regarding #4 above, I suspect that many of the users who regularly approach or miss the deadlines wouldn't notice the difference -- what I'd call "fire and forget" systems :-)

All in all, it's an interesting topic to think about, but I suspect there's no perfect answer.

Cheers - Al.

[Edited to correct and expand the long runners comment -- I hadn't been counting data from 2022, and that hid a few slower systems :-)]

[Edit 2 times, last edit by alanb1951 at Apr 14, 2025 3:59:43 AM] |
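For scale, the long-runner figure quoted above can be turned into a percentage. This is a small Python check using only the counts from the post; "over 4500" is treated as exactly 4500, which is an assumption, so the result is an upper bound on the real share.

```python
long_runners = 11    # wingman tasks that needed more than four days
total_tasks = 4500   # "over 4500" in the post, so the true share is slightly lower

share = long_runners / total_tasks
print(f"{share:.2%}")
```

So roughly a quarter of one percent of the observed wingman tasks needed more than four days.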
||
|
TLD
Veteran Cruncher USA Joined: Jul 22, 2005 Post Count: 824 Status: Offline |
My oldest system, an i3 with 2 cores and 4 threads, takes around 36 hours for ARP WUs.
I have it set with app_config.xml to run 2 ARP WUs. I have WCG set with app_config.xml for no more than 9 WUs. I don't see any reason why the deadline couldn't be shorter. |
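A sketch of what such a setup could look like, written as a small Python script that builds the contents of an app_config.xml. The `max_concurrent` and `project_max_concurrent` elements come from BOINC's app_config.xml format; the app name `arp1` and the helper `build_app_config` are assumptions for illustration, so check the app name your own client reports before using anything like this.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, app_limit: int, project_limit: int) -> str:
    """Return app_config.xml text limiting one app and the project as a whole."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    # Maximum number of tasks of this app running at the same time.
    ET.SubElement(app, "max_concurrent").text = str(app_limit)
    # Maximum number of tasks of the whole project running at the same time.
    ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    return ET.tostring(root, encoding="unicode")


# Assumed values: 2 ARP tasks at once, 9 WCG tasks in total ("arp1" is a guess).
config_text = build_app_config("arp1", 2, 9)
print(config_text)
```

The printed text would go into app_config.xml in the project's directory, and the client needs to re-read its config files (or be restarted) before the limits apply.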
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1058 Status: Offline |
Further to my posts of the last two days that mentioned tasks in PVal jail -- the ARP1 tasks cleared fairly quickly but the MCM1 tasks didn't clear until well after mid-day (UTC) on 2025-04-13.
File deletion has also caught up, so things seem back to normal now (for the current constrained-systems value of normal). Cheers - Al. [Edit 1 times, last edit by alanb1951 at Apr 14, 2025 4:08:37 AM] |
||
|