Thread Status: Active
Total posts in this thread: 352
Posts: 352   Pages: 36   [ Previous Page | 1 2 3 4 5 6 7 8 9 10 | Next Page ]
This topic has been viewed 30203 times and has 351 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

I've just noticed that my ancient laptop managed to report 3 tasks and get 4 new tasks at about 14:35 UTC, three minutes after it saw the "scheduler instance" message for the last time :-)

However, it also had some issues downloading those .png files that always seem to show up after an outage, so I wasn't surprised to note that all my other systems got the "high load" message the next time they checked in after 14:35 UTC (one of them getting a timeout at 14:38 UTC so I think that one caught them restarting things!).

Now we wait to see when it gets fixed (and whether they'll let us know what happened...).

Cheers - Al.

P.S. Forum is acting up again (although not as bad as yesterday...) -- I wonder if something isn't right on their internal network...
[Jun 19, 2025 4:11:48 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1293
Status: Offline
Re: Project Status (First Post Updated)

I'm downloading (slowly) some fresh MCM and ARP.

as Al states... the png files are downloading too.

TECHS: remove the png files for done projects, please.
----------------------------------------
[Edit 1 times, last edit by Unixchick at Jun 19, 2025 6:13:34 PM]
[Jun 19, 2025 6:11:22 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Project Status (First Post Updated)

It was 1850 UTC before things started moving here. All seems fine now.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jun 19, 2025 8:23:03 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Project Status (First Post Updated)

18:46 GMT here.

Mike
[Jun 20, 2025 12:37:20 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

Al

With a bit of luck the No Replies that finish before the restart will get credited as the re-sends are also held up.

Mike
My only worry was about the possible build-up of resends (remembering some of the blockages we've had in the past!); as it happens, this outage was short enough that it's even possible that some of the retries will never go out! (I'll have a better picture of that once I look at my WU status reports at the end of today...)

[General information follows for those who may not already know...]

Late returners should always get credited unless they hit the Too Late gate. However, if they turn up after the canonical result has been decided their granted credit will be that computed when the canonical result was scored, rather than their own CPU time being taken into account -- sometimes that's to the benefit of the late returner!

WCG also has a [manual?] hack to credit returns that fail validation, which it seems to have applied to some of those Darwin ARP1 tasks that were having validation issues! I hadn't previously noticed whether it credited the [rare] Linux ARP1 Invalids that I've seen (none of them mine!)...

[Edit 2+ hours later...] And lo and behold, I've just this minute seen another one, and it did get credit (so it's probably automated after all, but I've no idea whether it also applies to MCM1 as it's been a very long time since I saw an Invalid wingman there...)

Cheers - Al.
----------------------------------------
[Edit 2 times, last edit by alanb1951 at Jun 20, 2025 9:38:26 AM]
[Jun 20, 2025 7:15:50 AM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 442
Status: Offline
Re: Project Status (First Post Updated)

With the proliferations of RESENDS, I'm glad I left my Queue where it was.
[Jun 20, 2025 10:12:33 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1293
Status: Offline
Re: Project Status (First Post Updated)

Wow. I really hope this is the new normal. I've got an almost full queue of ARP. I've also upped how many ARP WUs I'm doing at a time. I need to do more tests on my system to figure out what is best (2 at a time in 4 hours, 3 at a time in 6 hours...). I'm also considering heat issues, so I'm testing when my fan comes on. This is FUN.
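
One rough way to compare settings like those is to turn them into tasks per day. This is only a minimal sketch, assuming every batch of concurrent ARP tasks finishes together in the stated wall-clock time; the function name is made up, and the figures are just the illustrative ones from the paragraph above, not measurements.

```python
def tasks_per_day(concurrent: int, hours_per_batch: float) -> float:
    """Tasks completed per 24 hours if `concurrent` tasks finish every `hours_per_batch` hours."""
    return concurrent * 24.0 / hours_per_batch


# Example settings taken from the text: (tasks at a time, hours per batch)
for concurrent, hours in [(2, 4.0), (3, 6.0)]:
    rate = tasks_per_day(concurrent, hours)
    print(f"{concurrent} at a time, {hours:g} h per batch: {rate:g} tasks/day")
```

With those example figures both settings come out at the same throughput, so heat and fan behaviour would be the tie-breaker.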

I hope everyone is getting a full meal of WUs MCM and ARP.
[Jun 20, 2025 2:57:47 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline
Re: Project Status (First Post Updated)

Long post - people not interested in retry behaviour and/or empty caches might want to skip this :-)

As people may have noticed, I'm always interested in the effect of "missed deadline" retries on both the user experience and (in the case of ARP1) effective project performance. So I'm intrigued by this...
bfmorse wrote: "With the proliferations of RESENDS, I'm glad I left my Queue where it was."
Not a criticism, but I still fail to see what's the problem with getting the occasional MCM1 Server Aborted task (unless one has to keep downloading the master file on limited bandwidth - see later). But then (as TonyEllis and I mentioned) perhaps it's not such a massive issue on Linux as it is on Windows...

In general, the percentage of Server Aborted retries seems to be at its highest when there has been a serious server issue. I've seen 50%+ of retries killed on some such occasions, but even then I've never seen more than 2% of total daily tasks killed like that in all the time I've been keeping records. Normally, the killed task count is 0, 1 or 2 out of over 600 total tasks...

For the record, over the last week here's the number of tasks I dealt with each day, along with a count of how many of them were retries and how many of the retries ended up S(erver) A(borted). The Tasks count is the number of viable returns, and the Run count is the number of retries that didn't get aborted.

                     Retries
Return date  Tasks   Run  SA
-----------  -----   ---  --
2025-06-13     696    18   2
2025-06-14     739    11   4
2025-06-15     703    22   1
2025-06-16     686   106   5
2025-06-17     714   125   1
2025-06-18     306   108   1  [Task count affected by scheduler problems]
2025-06-19     503    64   0  [Task count affected by scheduler problems]
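
For anyone who wants the percentages rather than the raw counts, here is a minimal sketch that works them out from the table. It assumes that Tasks counts only viable returns, so Server Aborted tasks are added on top to get the day's total; that reading, and the function names, are assumptions rather than anything WCG defines.

```python
# (return date, Tasks, retries Run, retries Server Aborted), copied from the table above
WEEK = [
    ("2025-06-13", 696, 18, 2),
    ("2025-06-14", 739, 11, 4),
    ("2025-06-15", 703, 22, 1),
    ("2025-06-16", 686, 106, 5),
    ("2025-06-17", 714, 125, 1),
    ("2025-06-18", 306, 108, 1),
    ("2025-06-19", 503, 64, 0),
]


def retry_abort_share(run: int, sa: int) -> float:
    """Fraction of the retries received that ended up Server Aborted."""
    total = run + sa
    return sa / total if total else 0.0


def daily_abort_share(tasks: int, sa: int) -> float:
    """Fraction of all tasks received that were Server Aborted.

    Assumption: `tasks` excludes aborted tasks, so they are added to the total.
    """
    total = tasks + sa
    return sa / total if total else 0.0


for day, tasks, run, sa in WEEK:
    print(
        f"{day}  retries aborted: {retry_abort_share(run, sa):6.1%}"
        f"  of all tasks: {daily_abort_share(tasks, sa):5.2%}"
    )
```

On that reading, even the worst day of the week (2025-06-16, 5 aborted) stays under 1% of all tasks, although the share of retries aborted is much higher on a day with few retries (4 of 15 on 2025-06-14).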

Those huge retry counts over the last few days are mostly due to No Reply tasks from what appeared to be cloud instances that weren't shut down nicely (so the likelihood of any of their retries going SA was effectively zero!). The count for the 19th also includes retries for tasks that were flagged because they couldn't report in during the scheduler problems. I picked up most of those either on nearly-empty systems before the scheduler issue or after they would have validated on finally managing to report (so none were Server Aborted at all!).

Those numbers come from an old laptop, a pair of mini-PCs and a couple of "desktop" systems, all on Linux. The laptop runs one MCM1 task at a time (max_concurrent) and can't download more than 5 tasks (WCG profile). The other systems are set up using a similar approach to make sure they never have more than 5 or 6 hours of MCM1 work to run (though the queue limit on some of them is quite a bit higher to accommodate ARP1 and BETA tasks).
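
For those who haven't used it, max_concurrent is set in an app_config.xml file placed in the project's directory under the BOINC data folder. Below is a minimal sketch that writes out such a file's contents; the app short name "mcm1" is an assumption (check the name your client reports), and the helper function is purely illustrative. The 5-task download limit is a separate setting in the WCG profile and is not covered here.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text limiting one application to `max_concurrent` running tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# "mcm1" is an assumed short name for the MCM1 application.
print(build_app_config("mcm1", 1))
```

After saving the output as app_config.xml, the client has to re-read its config files (or be restarted) before the limit takes effect.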

The above is just my experience; obviously I can't speak for all Linux users and I can't speak for non-Linux users at all...

As I have said before, I'd be a tad more annoyed if I got lots of ARP1 retries that I never got to run because that is a waste of server bandwidth (my high-speed network isn't metered so it doesn't upset me as much on that account). However, I get really annoyed when I start running one and the No Reply returns late and validates -- that's both a waste of server bandwidth and my CPU resources!

As for that master file... If a user's MCM1 cache empties when there are only retries available, and the user has a very small (or 0) buffer, a classic "laundry instructions" moment could arise...
  • Run out of work
  • Wait for [scarce] work
  • Master file gets deleted as unneeded
  • Get one new task
  • Master file downloads
  • Task runs and completes
  • Rinse and repeat
I hope that when MAM1 goes live they can manage to make the master data file sticky, especially if it is bigger than the current MCM1 master!
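
To put a number on that cycle, here is a toy model of it. It is only a sketch under the behaviour described in the list above (the master file is deleted whenever the cache is empty and fetched again with the next task); the function name and the sticky flag are hypothetical and do not model the real BOINC client.

```python
def count_master_downloads(cache_sizes, sticky=False):
    """Count master-file downloads over a sequence of cache sizes.

    cache_sizes: number of MCM1 tasks on the host after each scheduler contact.
    sticky: if True, the master file is kept even when the cache is empty.
    """
    downloads = 0
    have_master = False
    for tasks_in_cache in cache_sizes:
        if tasks_in_cache > 0 and not have_master:
            downloads += 1        # new work arrived, master file must be fetched
            have_master = True
        elif tasks_in_cache == 0 and not sticky:
            have_master = False   # out of work, master file deleted as unneeded
    return downloads


scarce_work = [1, 0, 1, 0, 1, 0]  # one task at a time, cache empties in between
print(count_master_downloads(scarce_work))               # non-sticky master file
print(count_master_downloads(scarce_work, sticky=True))  # sticky master file
```

In this toy run the non-sticky case fetches the master file once per task, while the sticky case fetches it only once.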

Cheers - Al.
[Jun 20, 2025 4:16:23 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Project Status (First Post Updated)

I had 8754 work units completed and validated in May. Of that number 267 were retry 2, 58 were retry 3, and 2 were retry 4. Of these, 285 were various flavors of Linux and 43 were Windows 7. I have 56 threads running Linux and 8 threads running Windows. 308 were from MCM and 19 were ARP.

I can not slice and dice as well as Alan, but I only see intermittent batches of resends, apparently when something else goes haywire.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jun 21, 2025 1:29:30 AM]
TLD
Veteran Cruncher
USA
Joined: Jul 22, 2005
Post Count: 856
Status: Offline
Re: Project Status (First Post Updated)

Someone please fix the server.

World Community Grid 6/21/2025 10:57:04 AM Tasks are committed to other platforms
----------------------------------------

[Jun 21, 2025 6:12:53 PM]