Total posts in this thread: 387
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1058
Re: Project Status (First Post Updated)

There's something amiss, as the three ARP1 stats files didn't update at mid-day today; I wonder if whatever caused that to happen also stalled work generation or the path to the feeder.

Also, I've noticed that quite a few MCM1 tasks with the second result returned today are ending up in true PVal jail. So far I've only seen examples with even-numbered WUs, but not all even-numbered WUs end up stalled, so I've no idea what that's about :-)

Cheers - Al.
[Apr 12, 2025 6:47:23 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1114
Re: Project Status (First Post Updated)

Looks like someone kicked the machine. I'm getting ARP (new and resend) again.
[Apr 12, 2025 9:00:35 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1114
Re: Project Status (First Post Updated)

I was checking the links in the first post when I noticed that Dr. Jurisica is biking to raise funds for cancer research. This will happen at the end of May, and I wanted to point it out for anyone who wants to support him.

From the donation page:
To explore additional fundraising options - I have joined and participate in the Team Ian Ride. Supporting any of the riders, including Igor Jurisica would supplement the WCG-MCM budget to cover the deficit. Thank you in advance. Notably, these donations are related to the MCM cancer project - and thus are handled by the Princess Margaret Cancer Foundation.

Here is the direct page for Dr. Jurisica https://supportthepmcf.ca/ui/DIY/p/TeamIanIJ
[Apr 13, 2025 3:41:04 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1058
Re: Project Status (First Post Updated)

Further to my post yesterday that mentioned MCM1 tasks in PVal jail...

(Oh, for a server status page, even if it only reports what's running!)

It looks as if quite a few services/daemons suffered a bit during 2025-04-12; it appears that a fairly substantial backlog of MCM1 validations started then, and it also looks as if there were some file delete daemon problems (lots of my completed tasks ended up stuck in "delete state 1" for prolonged periods, finally being purged without being spotted in "delete state 2"...). Also, today I saw some ARP1 tasks end up in PVal jail...

As for the missed generations.txt file creation, there currently seems to be a substantial discrepancy between the results returned according to the web site and the cells that have shifted generations over the last two days. I don't think it can be put down to "natural variation" but as the trigger points for counting the items aren't explained anywhere I have to wonder if something in the grid cell management/new work production chain went down (and whether it's been fully restored yet)...

Finally, none of my MCM1 PVal jail tasks had escaped after 24 hours, and more have been "imprisoned"; however, one of the ARP1 tasks has now validated so that's not so bad...

All the above said, at least there's something to do most of the time :-). And when the new data centre stuff comes on stream, a lot of these issues should become "things of the past"...

Cheers - Al.
[Apr 13, 2025 5:49:13 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12559
Re: Project Status (First Post Updated)


As for the missed generations.txt file creation, there currently seems to be a substantial discrepancy between the results returned according to the web site and the cells that have shifted generations over the last two days. I don't think it can be put down to "natural variation" but as the trigger points for counting the items aren't explained anywhere I have to wonder if something in the grid cell management/new work production chain went down (and whether it's been fully restored yet)...

Al

When you refer to results returned, I presume you are referring to the project history page. I have presumed that includes at least 2 copies per unit and includes more where late running copies have been validated.

There is also a 12 hour disparity in the periods taken into account.

I think that there is also a (variable) interval between validation and the generation shift, to allow the predicted weather to be checked against the actual weather.

As such the different results should not be mixed.

My progress reports use only generations.txt for compatibility.

Mike
----------------------------------------
[Edit 1 times, last edit by Mike.Gibson at Apr 14, 2025 1:24:15 AM]
[Apr 13, 2025 8:32:56 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1058
Re: Project Status (First Post Updated)

Mike,

Your motivation is slightly different from what I'm trying to do, so some information in response...

Firstly, I have a script that captures the mid-day Project History, so I'm not 12 hours out of sync if/when I can attempt any [approximate] comparisons between the two sources of information.

Secondly, I don't "mix" the data for any other purpose than to try to get some sort of understanding of the data you don't need to consider for your analysis, such as how many tasks get within a day of the existing deadlines (or miss completely) and how many late running copies get validated! Those are statistics that completed.txt can't give us, although it does help to show up when there are lots of later returns in the system (right up to missed deadlines and beyond...)

My major interest in all of this is to try to establish whether we are actually slowing down the project by allowing such long deadlines :-) Unfortunately, I only have data for Linux (and I believe Adri is in the same position)[*1] so all I can do in detail is look at task performance amongst my [Linux] wingmen; anything else has to be a bit of a punt[*2]. Hence the interest in "dodgy" statistical work using the combinations you say should not be mixed[*3] :-)

I'd rather like to see this project complete before I die, and the way things are going at present I'll need to get way past 80 years old! :-) So I hope you'll forgive me for wondering if we could speed things up a bit despite some of the work-issue problems implicit in the form of the experiment...

Cheers - Al.

P.S. My [BSc] degree (many, many years ago!) was in Computing and Statistics, so I do know I'm aiming for approximation, not perfection :-)

*1 -- if anyone on the Windows side is doing wingman performance statistics for ARP1, please let me know!

*2 -- Anywhere else but WCG I could track down some "big hitters" and see what their return rates and time taken per task might be; the [understandable] WCG restrictions on access to detailed statistics for other users render that impossible (as I will not do a massive data dive in the [vain?] hope of getting something useful...)

*3 -- I can think of two or three fairly simple ways the "leaving a generation" shift might be counted, but I don't know what Kevin selected so I can only observe the [relative] consistency of linkages rather than being more assertive about how the data sets might align!
[Apr 13, 2025 9:48:58 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12559
Re: Project Status (First Post Updated)


My major interest in all of this is to try to establish whether we are actually slowing down the project by allowing such long deadlines :-) Unfortunately, I only have data for Linux (and I believe Adri is in the same position)[*1] so all I can do in detail is look at task performance amongst my [Linux] wingmen; anything else has to be a bit of a punt[*2]. Hence the interest in "dodgy" statistical work using the combinations you say should not be mixed[*3] :-)


Al

The situation is that we have 1,435,296 ARP1 units to go as at 12:00 GMT (UTC) on 13 April 2025.

Approaching your quandary from a slightly different angle, I would say that we are only slowed down by any extra copies crunched in excess of the quorum.

The longer the deadline, the less 'wasted' crunching and therefore the sooner the project would finish.

The shorter deadlines were introduced to enable the laggards to catch up but caused more resends and therefore more wasted crunching. However, the present regime is doing that with less wasted crunching.

We don't want some people to take advantage of long deadlines so I suggest that we keep the 6 days displayed deadline but allow up to 10 or even 14 days before the extra copies are sent out.

Shorter deadlines would also reduce the number of machines crunching which would slow down the project.
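
The arithmetic behind this argument can be sketched roughly as follows. This is only an illustration: the quorum of 2 is taken from the earlier "at least 2 copies per unit" remark, while the daily throughput and the resend fractions are hypothetical placeholders, not measured WCG figures. It also holds throughput constant, which ignores the point above that shorter deadlines might reduce the number of machines crunching.

```python
# Rough model: every work unit needs `QUORUM` valid results, and each
# resend adds one extra result that does not advance the project.
REMAINING_UNITS = 1_435_296  # from the post: ARP1 units left on 13 April 2025
QUORUM = 2                   # assumption: two matching results per unit


def days_to_finish(results_per_day: float, resend_fraction: float) -> float:
    """Days needed when `resend_fraction` extra copies are crunched per unit."""
    results_needed = REMAINING_UNITS * (QUORUM + resend_fraction)
    return results_needed / results_per_day


if __name__ == "__main__":
    rate = 10_000.0  # hypothetical daily throughput, not a measured figure
    for resend in (0.0, 0.1, 0.3):  # hypothetical extra copies per unit
        days = days_to_finish(rate, resend)
        print(f"resend fraction {resend:.1f}: {days:,.0f} days")
```

Under this model the finishing time grows in direct proportion to the extra copies per unit, so the open question is how much a longer deadline really reduces resends.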

Mike
----------------------------------------
[Edit 2 times, last edit by Mike.Gibson at Apr 14, 2025 1:24:48 AM]
[Apr 14, 2025 1:21:20 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1058
Re: Project Status (First Post Updated)

Mike,

Out of interest, some questions...

  1. How many machines need 6 days to return a result? :-)
  2. How many systems like mine (that are set up to compute and return results in under 12 hours but can never get enough work to be fully occupied) are out there? (Come to that, substitute 18 hours and ask again!)
  3. How many of the systems running ARP1 might be running too many at a time, dramatically increasing their overall run time?
  4. Would reducing the deadline to 5 days and restoring the grace day drive users away?
  5. Is there anything that can be done to encourage users to only ask for enough ARP1 work to keep their system occupied for [say] two or three days maximum now that outages tend to be a lot shorter?

Without knowing the answer to questions such as those it isn't actually possible to work out an optimum strategy for reducing the overall time taken to clear an individual tranche of work. I would observe that at present there are probably lots of fast systems that can't get enough work, but that could change if [say] they make a big increase in the amount of work allowed to be in the field at any time! (However, that would also probably lead to many more tasks missing deadlines on other systems unless some sort of maximum limit was applied!)

Regarding #1 above -- In all the time I've been gathering wingman data I've seen exactly eleven wingman tasks (out of over 4500 for around 3100 WUs) that used more than four days of CPU/elapsed time to finish a job, and 10 of those were during 2022! A few more have needed 48+ hours, and I suspect some of those are "too many at once" rather than extremely slow hardware! The vast majority have needed less than 24 hours (and typical run times are definitely getting lower as time goes by).
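
To put those counts in proportion, here is a minimal sketch of the arithmetic. The figures are the ones quoted above; since "over 4500" and "around 3100" are approximate, the results are approximate too, and the variable names are purely illustrative.

```python
# Figures quoted in the post; both totals are approximate.
WINGMAN_TASKS = 4500   # "over 4500" wingman tasks observed
WORK_UNITS = 3100      # "around 3100" work units
OVER_FOUR_DAYS = 11    # tasks needing more than four days to finish


def percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


share_long = percent(OVER_FOUR_DAYS, WINGMAN_TASKS)
tasks_per_wu = WINGMAN_TASKS / WORK_UNITS

print(f"tasks over four days: {share_long:.2f}%")   # 0.24%
print(f"wingman tasks per WU: {tasks_per_wu:.2f}")  # 1.45
```

So roughly one wingman task in four hundred needed more than four days, which is at most about 0.24% given that the task total is a lower bound.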

As for #2, I suppose I could switch to a strategy of asking for enough work for (say) three days but still only run the restricted numbers of simultaneous tasks that let me achieve the fast run times; however, I would then end up with hundreds of tasks at a time from Einstein@Home as my GPU project :-)

And regarding #4 above, I suspect that many of the users who regularly approach or miss the deadlines wouldn't notice the difference -- what I'd call "fire and forget" systems :-)

All in all, it's an interesting topic to think about, but I suspect there's no perfect answer.

Cheers - Al.

[Edited to correct and expand the long runners comment -- I hadn't been counting data from 2022, and that hid a few slower systems :-)]
----------------------------------------
[Edit 2 times, last edit by alanb1951 at Apr 14, 2025 3:59:43 AM]
[Apr 14, 2025 2:17:54 AM]
TLD
Veteran Cruncher
USA
Joined: Jul 22, 2005
Post Count: 824
Re: Project Status (First Post Updated)

My oldest system, an i3 with 2 cores and 4 threads, takes around 36 hours for ARP WUs.

I have it set with app_config.xml to run 2 ARP WUs.
I have WCG set with app_config.xml for no more than 9 WUs.
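
For anyone wanting to try similar limits, here is a minimal sketch that writes out such a file with Python. It assumes the two limits map onto BOINC's `max_concurrent` (per application) and `project_max_concurrent` (whole project) settings, and that the ARP application's short name is `arp1`; both are assumptions rather than details given in the post, so check the names against your own client before using the output.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, app_limit: int, project_limit: int) -> str:
    """Build app_config.xml text limiting one app and the whole project."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # assumed short name
    ET.SubElement(app, "max_concurrent").text = str(app_limit)
    ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 2 ARP tasks at once, no more than 9 WCG tasks running in total.
    print(build_app_config("arp1", 2, 9))
```

The resulting text would be saved as app_config.xml in the project's folder under the BOINC data directory and picked up when the client re-reads its config files. Note that these settings cap tasks running at once, not the number of tasks downloaded.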

I don't see any reason why the deadline couldn't be shorter.
[Apr 14, 2025 2:38:39 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1058
Re: Project Status (First Post Updated)

Further to my posts of the last two days that mentioned tasks in PVal jail -- the ARP1 tasks cleared fairly quickly but the MCM1 tasks didn't clear until well after mid-day (UTC) on 2025-04-13.

File deletion has also caught up, so things seem back to normal now (for the current constrained-systems value of normal).

Cheers - Al.
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Apr 14, 2025 4:08:37 AM]
[Apr 14, 2025 4:06:34 AM]