| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 352
|
|
| Author |
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317
|
I've just noticed that my ancient laptop managed to report 3 tasks and get 4 new tasks at about 14:35 UTC, three minutes after it saw the "scheduler instance" message for the last time :-)
However, it also had some issues downloading those .png files that always seem to show up after an outage, so I wasn't surprised to note that all my other systems got the "high load" message the next time they checked in after 14:35 UTC (one of them getting a timeout at 14:38 UTC, so I think that one caught them restarting things!).
Now we wait to see when it gets fixed (and whether they'll let us know what happened...).
Cheers - Al.
P.S. Forum is acting up again (although not as bad as yesterday...) -- I wonder if something isn't right on their internal network... |
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1293
|
I'm downloading (slowly) some fresh MCM and ARP.
As Al states, the .png files are downloading too.
TECHS: please remove the .png files for completed projects.
[Edit 1 times, last edit by Unixchick at Jun 19, 2025 6:13:34 PM] |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846
|
It was 1850 UTC before things started moving here. All seems fine now.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594
|
18:46 GMT here.
Mike |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317
|
Al
My only worry was about the possible build-up of resends (remembering some of the blockages we've had in the past!); as it happens, this outage was short enough that it's even possible that some of the retries will never go out! (I'll have a better picture of that once I look at my WU status reports at the end of today...)
With a bit of luck the No Replies that finish before the restart will get credited as the re-sends are also held up.
Mike
[General information follows for those who may not already know...]
Late returners should always get credited unless they hit the Too Late gate. However, if they turn up after the canonical result has been decided their granted credit will be that computed when the canonical result was scored, rather than their own CPU time being taken into account -- sometimes that's to the benefit of the late returner!
WCG also has a [manual?] hack to credit returns that fail validation, which it seems to have applied to some of those Darwin ARP1 tasks that were having validation issues! I hadn't previously noticed whether it credited the [rare] Linux ARP1 Invalids that I've seen (none of them mine!)...
[Edit 2+ hours later...] And lo and behold, I've just this minute seen another one, and it did get credit (so it's probably automated after all, but I've no idea whether it also applies to MCM1 as it's been a very long time since I saw an Invalid wingman there...)
Cheers - Al.
[Edit 2 times, last edit by alanb1951 at Jun 20, 2025 9:38:26 AM] |
||
|
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 442
|
With the proliferations of RESENDS, I'm glad I left my Queue where it was.
|
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1293
|
Wow. I really hope this is the new normal. I've got an almost full queue of ARP. I've also upped how many ARP WUs I'm doing at a time. I need to do more tests on my system to figure out what is best (2 at a time in 4 hours, 3 at a time at 6 hours...) I'm also considering heat issues, so I'm also testing when my fan comes on. This is FUN.
I hope everyone is getting a full meal of WUs MCM and ARP. |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317
|
Long post - people not interested in retry behaviour and/or empty caches might want to skip this :-)
As people may have noticed, I'm always interested in the effect of "missed deadline" retries on both the user experience and (in the case of ARP1) effective project performance. So I'm intrigued by this...
With the proliferations of RESENDS, I'm glad I left my Queue where it was.
Not a criticism, but I still fail to see what the problem is with getting the occasional MCM1 Server Aborted task (unless one has to keep downloading the master file on limited bandwidth - see later). But then (as TonyEllis and I mentioned) perhaps it's not such a massive issue on Linux as it is on Windows...
In general, the percentage of Server Aborted retries seems to be at its highest when there has been a serious server issue. I've seen 50%+ of retries killed on some such occasions, but even then I've never seen more than 2% of total daily tasks killed like that in all the time I've been keeping records. Normally the killed task count is 0, 1 or 2 out of over 600 total tasks...
For the record, over the last week here's the number of tasks I dealt with each day, along with a count of how many of them were retries and how many of the retries ended up S(erver) A(borted). The Tasks count is the number of viable returns, and the Run count is the number of retries that didn't get aborted.
Those huge retry counts over the last few days are mostly due to No Reply tasks from what appeared to be cloud instances that weren't shut down nicely (so the likelihood of any of their retries going SA was effectively zero!), though the count for the 19th is also affected by some tasks being flagged because they couldn't report in during the scheduler problems; I picked up most of those either on nearly-empty systems before the scheduler issue or after they would have validated on finally managing to report (so none Server Aborted at all!)
Those numbers come from an old laptop, a pair of mini-PCs and a couple of "desktop" systems, all on Linux. The laptop runs one MCM1 task at a time (max_concurrent) and can't download more than 5 tasks (WCG profile). The other systems are set up using a similar approach to make sure they never have more than 5 or 6 hours of MCM1 work to run (though the queue limit on some of them is quite a bit higher to accommodate ARP1 and BETA tasks).
The above is just my experience; obviously I can't speak for all Linux users and I can't speak for non-Linux users at all...
As I have said before, I'd be a tad more annoyed if I got lots of ARP1 retries that I never got to run because that is a waste of server bandwidth (my high-speed network isn't metered so it doesn't upset me as much on that account). However, I get really annoyed when I start running one and the No Reply returns late and validates -- that's a waste of both server bandwidth and my CPU resources!
As for that master file... If a user's MCM1 cache empties when there are only retries available, and the user has a very small (or 0) buffer, a classic "laundry instructions" moment could arise...
Cheers - Al. |
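For anyone wanting to keep a similar daily tally, here is a minimal Python sketch of the Tasks / Retries / SA / Run bookkeeping described in the post above. It is not Al's actual method: the `TaskRecord` layout, the function names and the sample records are all hypothetical, and it assumes that a "viable return" means any task that was not Server Aborted and that only retries ever get Server Aborted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """One task as it might appear in a personal log (hypothetical layout)."""
    day: str              # reporting day label, e.g. "day-1"
    is_retry: bool        # True if the task was a resend
    server_aborted: bool  # True if the server killed it before it ran


def daily_summary(records):
    """Tally Tasks / Retries / SA / Run per day.

    Assumption: a "viable return" is any task that was not Server Aborted.
    """
    summary = {}
    for rec in records:
        row = summary.setdefault(
            rec.day, {"tasks": 0, "retries": 0, "sa": 0, "run": 0}
        )
        if not rec.server_aborted:
            row["tasks"] += 1
        if rec.is_retry:
            row["retries"] += 1
            if rec.server_aborted:
                row["sa"] += 1   # retry killed by the server
            else:
                row["run"] += 1  # retry that actually ran
    return summary


def sa_percentages(row):
    """Return (SA as % of retries, SA as % of all tasks received that day).

    Assumption: only retries get Server Aborted, so tasks received
    equals viable returns plus SA retries.
    """
    received = row["tasks"] + row["sa"]
    pct_of_retries = 100.0 * row["sa"] / row["retries"] if row["retries"] else 0.0
    pct_of_received = 100.0 * row["sa"] / received if received else 0.0
    return pct_of_retries, pct_of_received


if __name__ == "__main__":
    # Made-up records purely to exercise the functions; not real WCG data.
    sample = (
        [TaskRecord("day-1", False, False)] * 8
        + [TaskRecord("day-1", True, False)]
        + [TaskRecord("day-1", True, True)]
    )
    for day, row in daily_summary(sample).items():
        print(day, row, sa_percentages(row))
```

Keeping the two percentages separate mirrors the distinction in the post: a large share of retries can be killed on a bad day while the share of total daily tasks killed stays small.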
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846
|
I had 8754 work units completed and validated in May. Of that number 267 were retry 2, 58 were retry 3, and 2 were retry 4. Of these, 285 were various flavors of Linux and 43 were Windows 7. I have 56 threads running Linux and 8 threads running Windows. 308 were from MCM and 19 were ARP.
I cannot slice and dice as well as Alan, but I only see intermittent batches of resends, apparently when something else goes haywire.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
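The arithmetic behind the May figures in the post above can be checked with a short Python sketch. The numbers are copied from the post; the variable and function names are hypothetical. Note that the OS split as posted sums to 328, one more than the retry-level and project splits (327 each); the sketch reports that as-is rather than guessing which figure is off.

```python
# Figures as posted for May (validated work units).
COMPLETED = 8754
BY_ATTEMPT = {"retry 2": 267, "retry 3": 58, "retry 4": 2}
BY_PROJECT = {"MCM": 308, "ARP": 19}
BY_OS = {"Linux": 285, "Windows 7": 43}


def retry_share(completed, by_attempt):
    """Return (total retries, retries as a percentage of completed WUs)."""
    total = sum(by_attempt.values())
    return total, 100.0 * total / completed


if __name__ == "__main__":
    total, pct = retry_share(COMPLETED, BY_ATTEMPT)
    print(f"retries: {total} ({pct:.2f}% of {COMPLETED} validated WUs)")
    print("sum by project:", sum(BY_PROJECT.values()))
    print("sum by OS:", sum(BY_OS.values()))
```

On these figures retries come to under 4% of the month's validated work units, which fits the "intermittent batches" description.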
||
|
|
TLD
Veteran Cruncher USA Joined: Jul 22, 2005 Post Count: 856
|
Someone please fix the server.
World Community Grid 6/21/2025 10:57:04 AM Tasks are committed to other platforms |
||
|
|
|