Thread Status: Active
Total posts in this thread: 352
This topic has been viewed 30120 times and has 351 replies
GB033533
Senior Cruncher
UK
Joined: Dec 8, 2004
Post Count: 206
Status: Offline
Re: Project Status (First Post Updated)

Is anyone else getting this error message again:

19/06/2025 10:30:43 | World Community Grid | Another scheduler instance is running for this host

or is it just me? I don't think I've really got another instance running.
----------------------------------------

[Jun 19, 2025 9:37:17 AM]
Warped@RSA
Senior Cruncher
South Africa
Joined: Jan 15, 2006
Post Count: 440
Status: Offline
Re: Project Status (First Post Updated)

GB033533 wrote:
> Is anyone else getting this error message again:
>
> 19/06/2025 10:30:43 | World Community Grid | Another scheduler instance is running for this host
>
> or is it just me? I don't think I've really got another instance running.

Yes, I am getting the same.
I tried exiting BOINC which did not help.
I then rebooted the machine, still without success.
The tasks have been uploaded but remain "Ready to Report".
----------------------------------------
Dave
[Jun 19, 2025 9:55:51 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1316
Status: Recently Active
Re: Project Status (First Post Updated)

Hey, ho; here we go again... Haven't seen this one for a while. It seems to have started at about 08:45 UTC today.

That's a server-side message so there's nothing users can do to resolve it :-( -- in brief, it means that the server is having problems with its per-host lock files, probably because of filestore connectivity issues.

[Edit]Search, search, found it! I dug the following out of a post I made in late 2023...
...scheduler requests use a per-host lock file to ensure that there aren't two concurrent requests from one host. The file is created at the start of the request, holds the PID of the scheduler instance, and is deleted at the end of the request.

There are two possible error conditions, one of which is that the lock file can't be acquired in the first place, the other that there is an existing lock. Unfortunately, although the message written to the server log distinguishes the two cases, the message sent to the client does not.

In this case, I suspect the issue is an inability to create the lock file in the first place :-(
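
For anyone who wants to see the idea in code, here is a minimal sketch of a per-host lock file as described above. It is not taken from the BOINC server source: the function names, the lock file naming scheme and the use of exclusive file creation are all assumptions made purely for illustration. It shows how two different failures (lock already held, lock file cannot be created) can be told apart on the server while the client gets the same message for both.

```python
import os
from enum import Enum
from typing import Tuple


class LockResult(Enum):
    ACQUIRED = "acquired"
    ALREADY_LOCKED = "already locked"   # another request holds the lock
    CANNOT_CREATE = "cannot create"     # e.g. the filestore is unreachable


def acquire_host_lock(lock_dir: str, host_id: int) -> Tuple[LockResult, str]:
    """Try to create a per-host lock file holding this process's PID.

    The file name scheme is hypothetical, chosen only for this sketch.
    """
    path = os.path.join(lock_dir, f"host_{host_id}.lock")
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o660)
    except FileExistsError:
        return LockResult.ALREADY_LOCKED, path
    except OSError:
        # Any other failure: missing directory, permissions, I/O error...
        return LockResult.CANNOT_CREATE, path
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return LockResult.ACQUIRED, path


def release_host_lock(path: str) -> None:
    """Delete the lock file at the end of the request."""
    os.remove(path)


def client_message(result: LockResult) -> str:
    """Both failure cases collapse into the same text sent to the client."""
    if result is LockResult.ACQUIRED:
        return "ok"
    return "Another scheduler instance is running for this host"
```

In this sketch a server log could record the LockResult value to distinguish the two cases, whereas client_message() hides the difference, which is the behaviour described in the quoted post.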
Cheers - Al.

[Final edits to add the time the problem started, then to correct the time I'd entered to UTC!]
----------------------------------------
[Edit 3 times, last edit by alanb1951 at Jun 19, 2025 11:04:45 AM]
[Jun 19, 2025 10:44:43 AM]
ATHANASIOS PAN. GKOLIARAS
Cruncher
Joined: Dec 10, 2006
Post Count: 10
Status: Offline
Re: Project Status (First Post Updated)

The beloved WCG brings new problems every time :D
[Jun 19, 2025 12:02:27 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1316
Status: Recently Active
Re: Project Status (First Post Updated)

One of the annoying issues during a scheduler outage is that tasks which don't finish until near the deadline[*1] upload their results but are unable to report them -- a fair few tasks will end up as late returners (with redundant retries that may or may not get flushed before they get sent out...)

The build-up of No Reply tasks seems to have started, and is currently accompanied by retries being marked as "Waiting to be sent" (again, because of the scheduler outage)...

Of course, this will sort itself out once the scheduler issue is resolved, but it might be a bit messy at first because of the number of out-of-work users who are going to need to download the (non-sticky) MCM1 master data file again :-(

At least events of this severity have been far less frequent recently. Roll on July, when the new data centre stuff should go live and, hopefully, improve the infrastructure used by WCG...

Cheers - Al.

*1 Why some systems seem to need nearly 6 days to return tasks that should only take a few hours to run is a complex (and, I suspect, emotive) issue with no simple answer...
[Jun 19, 2025 1:36:17 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1293
Status: Recently Active
Re: Project Status (First Post Updated)

Wow. We haven't seen that error in a while. Hopefully the tech team will soon be in to fix this issue.
Thanks to everyone who reported on the problem and added extra details.
[Jun 19, 2025 1:53:39 PM]
catchercradle
Senior Cruncher
England
Joined: Jan 16, 2009
Post Count: 167
Status: Offline
Re: Project Status (First Post Updated)

Thanks. Completely new one to me. Shame it had to happen at the same time as CPDN's servers went down in Oxford due to power going out in a server room. (That was about 24 hours ago, but there has been no update since Andy's email.)
[Jun 19, 2025 2:26:38 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: Project Status (First Post Updated)

Al

With a bit of luck the No Replies that finish before the restart will get credited as the re-sends are also held up.

Mike
[Jun 19, 2025 2:29:55 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1293
Status: Recently Active
Re: Project Status (First Post Updated)

Message from server: We are currently experiencing high load and are temporarily deferring your scheduler request. Your client will automatically try again later.

I'm now getting this message in the logs, so they are working on the issue.
[Jun 19, 2025 2:45:27 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1316
Status: Recently Active
Re: Project Status (First Post Updated)

[Edited to reflect that, by the time I'd wrangled this into shape, Unixchick had made a much shorter post on the subject...]

I notice that this has happened in the past, and in the days of IBM at that... Here's a thread from March 2014 -- I've linked to uplinger's initial reply.

It is quite interesting reading the rest of that thread from that point -- I wonder if Sgt. Joe remembers it (and I saw some other familiar names too...) Whilst it isn't about lock file problems, it does give an insight into "what happened next"...

Unfortunately, results can't be reported (even if one sets No New Tasks the request gets deferred...) so we're still stuck...

Cheers - Al.

P.S. Not necessarily connected to the lock file issue but of note given some of the speculation in that thread... Over the last three days I've seen over 300 MCM1 No Reply tasks (about 20% of the tasks I processed) that looked likely to be from cloud instances that had been turned off without tidying up first, and that was before the scheduler became unable to send out retries... I wonder how many more of those might still be lurking? And is that only going to be a Linux issue?? Ah, well...
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jun 19, 2025 3:46:35 PM]
[Jun 19, 2025 3:43:31 PM]