Thread Status: Active
Total posts in this thread: 352
Posts: 352   Pages: 36
This topic has been viewed 30224 times and has 351 replies
huntsmj
Cruncher
U.S.A.
Joined: Dec 31, 2008
Post Count: 12
Re: Project Status (First Post Updated)

This happens every single weekend! It's almost like the cleaning people are unplugging the servers to run the vacuum.
[Jun 21, 2025 6:24:37 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1294
Re: Project Status (First Post Updated)

My thoughts on "tasks committed to other platforms": I think in the past we had a problem of long periods of time where it would get stuck in this status and none of us would get any WUs.
The techs have been playing with this, and I think there is a cycle of new WUs and resends. I think this is a feature, and not a bug. Why do they need to do this? Is it to ensure resends get a chance to go out? Is it to pause the new WU generators for load reasons? Is it to keep the size of the db under some limit? I have no idea.
Just when I think things are broken, the system starts sending out new WUs again. It is hard for me to judge lately.

This is probably a weekend outage though.

Update: no sooner do I post this than I get fresh MCM and ARP. I think it is just cycles.
----------------------------------------
[Edit 1 times, last edit by Unixchick at Jun 21, 2025 7:01:10 PM]
[Jun 21, 2025 6:52:36 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Re: Project Status (First Post Updated)

Unixchick

Re your query on how many threads to run: your figures both say 6 in 12 hours, so if you are also running MCM they would be the tiebreaker.

Mike
[Jun 21, 2025 9:57:45 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: Project Status (First Post Updated)

Unixchick,

I'm inclined to agree that the intervals of no new work are usually the result of something planned or calculated on the fly -- unfortunately, the underlying infrastructure still seems to be playing up occasionally (including today!) so it's also possible that sometimes something relevant to this stops working and needs an [automated?] restart.

I seem to recall reference to something that tried to moderate new work based on work already out there, so that's one possibility. (If that is what they are doing, I think it is still in need of tuning if [almost] none of the longer intervals with no new work but not much sign of "other platforms" messages are due to service node issues!)

A case could also be made for watching the quantity of pending resends (and the lowest WU numbers associated with them) and stopping new work feeding if it seemed likely that there might be a build up of "Waiting to send" -- recall that when that was happening a lot, retries were getting out but they all tended to have higher WU IDs...

Another possibility might be that something monitors the platforms to which resends might go; if a huge backlog starts to build up for one or more platforms, slamming the brakes on new work might enable those resends to clear out faster. (Dealing with possible backed-up Android retries cropped up in the MCM1 forum very recently, with an attempted explanation of why WCG might not want to send tasks for a single WU to a mix of platforms! I'm still half-expecting the user I answered to come back and argue...)
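
For what it's worth, the kind of backlog check speculated about above could be sketched roughly as follows. This is purely illustrative: the function name, the per-platform counts and both limits are made up for the example, and nothing here reflects how the WCG feeder actually decides.

```python
from typing import Mapping


def should_feed_new_work(pending_resends: Mapping[str, int],
                         per_platform_limit: int,
                         total_limit: int) -> bool:
    """Decide whether to release new work (hypothetical policy).

    pending_resends maps a platform name to the number of resends
    waiting for that platform. New work is held back if the total
    backlog, or the backlog for any single platform, is over its limit.
    """
    if sum(pending_resends.values()) > total_limit:
        return False
    return all(count <= per_platform_limit
               for count in pending_resends.values())
```

With made-up numbers, a call such as `should_feed_new_work({"linux": 10, "android": 500}, per_platform_limit=100, total_limit=1000)` would hold new work back because of the Android backlog alone, which is the "slamming the brakes" case.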

On the previous page of this thread I posted a link to a very old thread where there was reference to a problem with a big backlog of retries... I found it when I was looking for references to that "high load" message.

On a different note, but related to retries: the 653 MCM1 WUs for which I processed work on the 20th had 31 WUs that needed retries, and 15 of them were because an initial task failed "can't write init file" - sadly, all of those appeared to be for the same system! I struggle to understand how that can happen without the user being aware unless they never check their system. That's far from the only instance of individual systems appearing to have repeated instances of a single type of failure; stuff happens...
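
Putting the figures above into proportions (only the three counts come from the post; the variable names are just for illustration):

```python
wus_processed = 653        # MCM1 WUs with work processed on the 20th
wus_needing_retry = 31     # WUs among those that needed retries
init_file_failures = 15    # retries caused by "can't write init file"

retry_rate = wus_needing_retry / wus_processed
init_file_share = init_file_failures / wus_needing_retry

print(f"WUs needing a retry: {retry_rate:.1%}")               # 4.7%
print(f"Retries due to init file error: {init_file_share:.1%}")  # 48.4%
```

So under 5% of the WUs needed a retry, but nearly half of those retries trace back to one failure type.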

(And, for what it's worth, 4 of my retries for yesterday ended up Server Aborted, another recent topic here!)

Cheers - Al.
----------------------------------------
[Edit 2 times, last edit by alanb1951 at Jun 21, 2025 11:09:35 PM]
[Jun 21, 2025 11:05:17 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: Project Status (First Post Updated)

More forum and web page issues today, ranging from failure to load style-sheets (messy pages!), through apparently lost authentication/authorization tokens to complete failure to load pages. The issue comes and goes -- if lucky it'll just seem sluggish. End result - quiet time on the forums (which may not be a good thing if users can't get on to report issues...)

I'm not sure whether I'm imagining this, but it seems this has been getting worse over the last few days.

It doesn't seem to have been affecting direct usage of APIs badly, though -- my scripts haven't [yet] failed to "log in", though sometimes it has taken a couple of retries (suggesting a root cause for problems!) and I've had a couple of managed script shutdowns (no data loss) when there are too many successive "503" errors. Some aspects of the web site seem far less well able to cope, probably because they were built for (and had been running in) a less error-prone environment... :-(
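
A minimal sketch of the retry-then-shut-down behaviour described above, not the actual scripts: `fetch` is a hypothetical stand-in for whatever makes the API request and returns an HTTP status code, and the limit of three successive 503s is an arbitrary choice for the example.

```python
import time
from typing import Callable


class TooManyFailures(Exception):
    """Raised when too many successive 503 responses are seen."""


def fetch_with_retries(fetch: Callable[[], int],
                       max_successive_503: int = 3,
                       delay_seconds: float = 0.0) -> int:
    """Call fetch() until it returns something other than 503.

    Returns the number of attempts made, including the one that got
    through. Raises TooManyFailures once max_successive_503 responses
    in a row were 503, so the caller can shut down cleanly.
    """
    successive = 0
    attempts = 0
    while True:
        attempts += 1
        status = fetch()
        if status != 503:
            return attempts
        successive += 1
        if successive >= max_successive_503:
            raise TooManyFailures(f"{successive} successive 503 responses")
        time.sleep(delay_seconds)  # pause before the next attempt
```

A caller would catch `TooManyFailures`, flush whatever it has collected so far and exit, which is one way to get a "managed" shutdown with no data loss.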

Cheers - Al.
[Jun 22, 2025 10:16:22 PM]
Hans Sveen
Veteran Cruncher
Norge
Joined: Feb 18, 2008
Post Count: 983
Re: Project Status (First Post Updated)

I've had the same problems, but now, at 0800 local time (UTC+2), all pages are loading without any problems!

Thank you, staff, for fixing this late evening/early night (Toronto time!) 🥱

Hans S.
[Jun 24, 2025 6:54:31 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1294
Re: Project Status (First Post Updated)

No official updates yet. I'm grateful for the work the techs are doing to keep systems up and improve the system.

MCM is flowing well. ARP is only sending out resends at this point. We are at the 29k point of 35k (roughly), so we still have some WUs in gen 147 to do when they start sending out fresh WUs again.

Keep sharing what you are seeing.
[Jun 24, 2025 3:18:29 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Re: Project Status (First Post Updated)

ARP1_0030777_147_0.
No resends here.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jun 25, 2025 3:14:17 PM]
TLD
Veteran Cruncher
USA
Joined: Jul 22, 2005
Post Count: 856
Re: Project Status (First Post Updated)

ARP1_0031321_147_1
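
Task names like the ones reported in this thread can be split into their parts with a few lines of Python. This is a sketch under an assumed layout (project, unit number, generation, and a trailing copy index); that reading of the fields is inferred from the names and the "gen 147" talk here, not taken from any official description, and `ArpTask` and `parse_arp_task` are names invented for the example.

```python
import re
from typing import NamedTuple


class ArpTask(NamedTuple):
    project: str
    unit: int        # assumed: position within the generation
    generation: int  # assumed: e.g. 147 for "gen 147"
    copy: int        # assumed: index of this copy of the work unit


_TASK_NAME = re.compile(r"^(ARP1)_(\d+)_(\d+)_(\d+)$")


def parse_arp_task(name: str) -> ArpTask:
    """Parse a name such as 'ARP1_0031321_147_1' into its fields."""
    # Tolerate surrounding whitespace and a trailing full stop,
    # as often happens when names are pasted into a forum post.
    match = _TASK_NAME.match(name.strip().rstrip("."))
    if match is None:
        raise ValueError(f"not an ARP1 task name: {name!r}")
    project, unit, generation, copy = match.groups()
    return ArpTask(project, int(unit), int(generation), int(copy))
```

Comparing the `unit` value against the rough 35k total mentioned earlier in the thread gives a quick idea of how far through the generation the fresh work has got.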
----------------------------------------

[Jun 25, 2025 4:43:17 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 1294
Re: Project Status (First Post Updated)

Thanks for the info. I'm really happy to see ARP sending out fresh WUs. Looks like we are getting close to finishing gen 147.
[Jun 25, 2025 5:48:11 PM]