World Community Grid Forums
Thread Status: Active | Total posts in this thread: 352
GB033533
Senior Cruncher, UK | Joined: Dec 8, 2004 | Post Count: 206 | Status: Offline
Is anyone else getting this error msg again:

19/06/2025 10:30:43 | World Community Grid | Another scheduler instance is running for this host

or is it just me? I don't think I've really got another instance running.
Warped@RSA
Senior Cruncher, South Africa | Joined: Jan 15, 2006 | Post Count: 440 | Status: Offline
> Is anyone else getting this error msg again:
> 19/06/2025 10:30:43 | World Community Grid | Another scheduler instance is running for this host
> or is it just me? I don't think I've really got another instance running.

Yes, I am getting the same. I tried exiting BOINC, which did not help. I then rebooted the machine, still without success. The tasks have been uploaded but remain "Ready to Report".
Dave
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1316 | Status: Recently Active
Hey, ho; here we go again... Haven't seen this one for a while. It seems to have started at about 08:45 UTC today.

That's a server-side message, so there's nothing users can do to resolve it :-( -- in brief, it means that the server is having user-specific lock-file problems, probably because of filestore connectivity issues.

[Edit] Search, search, found it! I dug the following out of a post I made in late 2023...

> ...scheduler requests use a per-host lock file to ensure that there aren't two concurrent requests from one host. The file is created at the start of the request, holds the PID of the scheduler instance, and is deleted at the end of the request.
>
> There are two possible error conditions: one is that the lock file can't be acquired in the first place; the other is that there is an existing lock. Unfortunately, although the message written to the server log distinguishes the two cases, the message sent to the client does not.

In this case, I suspect the issue is an inability to create the lock file in the first place :-(

Cheers - Al.

[Final edits to add the time the problem started, then to correct the time I'd entered to UTC!]

[Edit 3 times, last edit by alanb1951 at Jun 19, 2025 11:04:45 AM]
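To make the mechanism concrete, here is a minimal sketch of the kind of PID-holding lock file Al describes -- my own illustration of a create-with-O_EXCL scheme, not BOINC's actual scheduler code; the function names and error text are assumptions:

```cpp
// Minimal sketch of a per-host lock file (illustrative, not BOINC's code).
#include <cerrno>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Try to take the per-host lock; returns the fd on success, -1 on failure.
// Note the two distinct failure modes Al mentions:
//   1) open() fails for some other reason -> the lock file can't be
//      created at all (e.g. filestore connectivity trouble);
//   2) errno == EEXIST -> a lock file already exists, i.e. another
//      scheduler instance appears to be handling this host.
int acquire_host_lock(const char* path) {
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) {
        if (errno == EEXIST)
            fprintf(stderr, "another scheduler instance is running for this host\n");
        else
            perror("can't create host lock file");
        return -1;
    }
    dprintf(fd, "%d\n", (int)getpid());  // record this instance's PID
    return fd;
}

// Release the lock at the end of the request.
void release_host_lock(const char* path, int fd) {
    close(fd);
    unlink(path);
}
```

Either failure path ends in the same message on the client side, which matches Al's point that users can't tell the two cases apart from their logs.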
ATHANASIOS PAN. GKOLIARAS
Cruncher | Joined: Dec 10, 2006 | Post Count: 10 | Status: Offline
The beloved WCG brings new problems every time :D
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1316 | Status: Recently Active
One of the annoying issues when we have a scheduler outage is that tasks which don't finish until near the deadline[*1] send in their results but are unable to report them -- a fair few tasks will end up as late returners (with redundant retries that may or may not get flushed before they get sent out -- see the sketch below...)

The build-up of No Reply tasks seems to have started, and is currently accompanied by retries being marked as "Waiting to be sent" (again, because of the scheduler outage)... Of course, this will sort itself out once the scheduler issue is resolved, but it might be a bit messy at first because of the number of out-of-work users who are going to need to download the (non-sticky) MCM1 master data file again :-(

At least events of this severity have been far less frequent recently. Roll on July, when the new data centre stuff should go live and the infrastructure used by WCG may improve...

Cheers - Al.

*1 Why some systems seem to need nearly 6 days to return tasks that should only take a few hours to run is a complex (and, I suspect, emotive) issue with no simple answer...
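For anyone picturing the mechanics, here is a rough sketch of how a deadline miss turns into a retry -- purely illustrative types, a made-up 6-day deadline, and my own reading of the thread, not WCG's actual transitioner code:

```cpp
// Illustrative sketch of server-side retry creation after a deadline
// miss; all types and names here are hypothetical, not BOINC's code.
#include <ctime>
#include <vector>

enum class ResultState { InProgress, NoReply, WaitingToSend };

struct Result {
    time_t deadline;
    ResultState state;
};

struct Workunit {
    std::vector<Result> results;
};

// Periodic pass over a workunit: any in-progress result past its
// deadline is marked No Reply and a retry is queued as "Waiting to
// be sent" -- during a scheduler outage those retries simply pile
// up in that state.
void transition(Workunit& wu, time_t now) {
    const std::size_t n = wu.results.size();  // don't walk newly added retries
    for (std::size_t i = 0; i < n; ++i) {
        Result& r = wu.results[i];
        if (r.state == ResultState::InProgress && now > r.deadline) {
            r.state = ResultState::NoReply;
            // New copy of the task with a fresh ~6-day deadline.
            wu.results.push_back({now + 6 * 24 * 3600, ResultState::WaitingToSend});
        }
    }
}
```

If the original host reports before its retry gets sent, the late result can still be credited and the unsent retry dropped -- the silver lining Mike points out a couple of posts down.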
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1293 | Status: Recently Active
Wow. We haven't seen that error in a while. Hopefully the tech team will soon be in to fix this issue.

Thanks to everyone who reported on the problem and added extra details.
catchercradle
Senior Cruncher, England | Joined: Jan 16, 2009 | Post Count: 167 | Status: Offline
Thanks. Completely new one to me. Shame it had to happen at the same time as CPDN's servers went down in Oxford due to power going out in a server room. (That was about 24 hours ago, but there has been no update since Andy's email.)
Mike.Gibson
Ace Cruncher, England | Joined: Aug 23, 2007 | Post Count: 12594 | Status: Offline
Al,

With a bit of luck, the No Replies that finish before the restart will get credited, as the re-sends are also held up.

Mike
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 1293 | Status: Recently Active
> Message from server: We are currently experiencing high load and are temporarily deferring your scheduler request. Your client will automatically try again later.

I'm now getting this message in the logs, so they are working on the issue.
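For context on the "try again later" part: the client doesn't retry immediately but backs off between deferred scheduler requests. A minimal sketch of randomized exponential backoff -- the constants and function name are my own illustration, not the BOINC client's actual values:

```cpp
// Illustrative randomized exponential backoff for deferred scheduler
// requests; constants and names are made up, not BOINC's actual code.
#include <algorithm>
#include <cstdlib>

double next_defer_seconds(int consecutive_deferrals) {
    const double base = 60.0;          // first retry after about a minute
    const double cap  = 4.0 * 3600.0;  // never wait more than ~4 hours
    // Double the wait per consecutive deferral, up to the cap.
    double backoff = std::min(cap, base * double(1 << std::min(consecutive_deferrals, 12)));
    // Jitter (0.5x .. 1.5x) so a fleet of clients doesn't retry in lockstep.
    double jitter = 0.5 + std::rand() / (double)RAND_MAX;
    return backoff * jitter;
}
```

The jitter is the important bit here: once the scheduler comes back, thousands of clients all holding "Ready to Report" tasks would otherwise hit it at the same instant.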
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1316 | Status: Recently Active
[Edited to reflect that, by the time I'd wrangled this into shape, Unixchick had made a much shorter post on the subject...]

I notice that this has happened in the past, and in the days of IBM at that... Here's a thread from March 2014 -- I've linked to uplinger's initial reply.

It is quite interesting reading the rest of that thread from that point -- I wonder if Sgt. Joe remembers it (and I saw some other familiar names too...) Whilst it isn't about lock file problems, it does give an insight into "what happened next"...

Unfortunately, results can't be reported (even if one sets No New Tasks the request gets deferred...) so we're still stuck...

Cheers - Al.

P.S. Not necessarily connected to the lock file issue, but of note given some of the speculation in that thread... Over the last three days I've seen over 300 MCM1 No Reply tasks (about 20% of the tasks I processed) that looked likely to be from cloud instances that had been turned off without tidying up first, and that was before the scheduler became unable to send out retries... I wonder how many more of those might still be lurking? And is that only going to be a Linux issue??

Ah, well...

[Edit 1 times, last edit by alanb1951 at Jun 19, 2025 3:46:35 PM]