Thread Status: Active
Total posts in this thread: 3152
This topic has been viewed 2502842 times and has 3151 replies
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12324
Status: Offline
Re: Work Available

Adri

You may be right, but I think that would postpone the end of the project by at least 6 months assuming all went well with the restart.

Perhaps they are waiting until they have enough spare capacity (machines and/or people) to do the work needed. They may be prioritising the start of MAM1 at present.

Mike
[Mar 11, 2025 3:50:54 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12324
Status: Offline
Re: Work Available

We have just had our first day without any movement of extremes. There are now 318, all of which appear to be stuck.

8 accelerated moved of which 4 escaped to normal. There may only be 4 accelerated moving out of 452.

1,478 normals moved out of 28,258 in the generations being released.

There are now 5,033 held up in generation 143.

Mike
[Mar 11, 2025 4:01:23 PM]
Speedy51
Veteran Cruncher
New Zealand
Joined: Nov 4, 2005
Post Count: 1277
Status: Offline
Re: Work Available

I think that would postpone the end of the project by at least 6 months

Is there any harm in extending the end of the project by 6 months?

Perhaps they are waiting until they have enough spare capacity (machines and/or people) to do the work needed. They may be prioritising the start of MAM1 at present.

I believe the following answers the question about the MAM1 launch:
March 5, 2025
The DHCP lease issue, or whatever the root cause of our production VMs losing all network access at an increasing rate such that we are almost sure to experience a server crash multiple times a week, is being investigated by hosting.
Our plan to resolve this regardless of the outcome of the investigation is to fully migrate most production boxes to Kubernetes including the DB2, Websphere, and IBM MQ "axis" of the website/forums and webservices provided by WCG.
Previously, we had only provisioned QA on the Kubernetes cluster, and intended to further provision and deploy containers running Mesos workers as our first production boxes orchestrated by Kubernetes on the new hardware to blue/green deploy and eventually move the coordinator responsibilities and finally all workunit management pipeline responsibilities to Kubernetes running Mesos, which would give us fault tolerance at last as we pick apart all the old Mesos job descriptions and crontabs to fully migrate to Kubernetes, Slurm, Redpanda, and distributed postgres (Citus-Data).
Once finished, we will decommission Aurora/Mesos and the old CentOS 7 boxes that run and coordinate the Mesos cluster, and provision new VMs with an LTS version of Ubuntu as we have on the new hardware to add that capacity to the Kubernetes cluster.
We apologize for the delays to the start of the MAM project; we did not account for the Sword of Damocles hanging over every production server and falling with increasing frequency, nor for the reduced capacity of the environment in the new year. Thank you for your patience and understanding. We will be starting MAM shortly, as soon as we are through this issue.

[Mar 11, 2025 9:56:38 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12324
Status: Offline
Re: Work Available

Again no movement of extremes. There are 318 all of which appear to be stuck.

2 accelerated moved of which 1 escaped to normal. There may only be 3 accelerated moving out of 451.

1,348 normals moved out of 27,335 in the generations being released.

There are now 5,957 held up in generation 143.

Mike
[Mar 12, 2025 2:50:19 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12324
Status: Offline
Re: Work Available

Again no movement of extremes. There are 318 all of which appear to be stuck.

2 accelerated moved. There may only be 2 accelerated moving out of 451.

913 normals moved out of 26,769 in the generations being released.

There are now 6,523 held up in generation 143.
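
For anyone who wants to put percentages on the last three daily reports, here is a minimal Python sketch. The figures are copied from the reports dated Mar 11 to Mar 13 in this thread; the variable names, function names, and tuple layout are only illustrative and are not anything WCG publishes.

```python
# Daily figures copied from the three reports in this thread (Mar 11-13, 2025).
# Illustrative tuple layout:
# (date, normals moved, normals in the generations being released, held up in generation 143)
reports = [
    ("Mar 11", 1478, 28258, 5033),
    ("Mar 12", 1348, 27335, 5957),
    ("Mar 13", 913, 26769, 6523),
]


def moved_percent(moved, total):
    """Share of normals that moved that day, as a percentage."""
    return 100.0 * moved / total


def daily_growth(values):
    """Day-over-day change in a list of counts."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


for date, moved, total, held in reports:
    print(f"{date}: {moved_percent(moved, total):.2f}% of normals moved, "
          f"{held} held up in generation 143")

# Change in the generation 143 backlog between consecutive reports.
print("Growth in generation 143 backlog:", daily_growth([r[3] for r in reports]))
```

On these numbers the share of normals moving falls from roughly 5.2% to roughly 3.4% over the three days, while the generation 143 backlog grows by 924 and then 566.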

Mike
[Mar 13, 2025 2:11:51 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 924
Status: Offline
Re: Work Available

I felt like I won the lottery: I got a 142 resend. Neither of the 2 original WUs were sent back. I'm still waiting on my wingman.
[Mar 13, 2025 3:52:41 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 937
Status: Recently Active
Re: Work Available

Neither of the 2 original WUs were sent back.
Why am I not surprised by this :-)

Judging by the daily excess of returned results (Project Statistics mid-day to mid-day) over processed work units (generations.txt), there have been (and still may be) hundreds of users missing deadlines for one reason or another but returning and validating after the retries created on their "failure" had started (so someone else's CPU time could be considered wasted...) -- at the moment, one in 10 WUs seems to have three valid results :-) [and that's without any Extremes (3 tasks anyway) and hardly any Accelerated tasks (shorter deadlines)...]
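
The arithmetic behind that estimate can be sketched roughly as follows. This is only an illustration under stated assumptions: every returned result is taken to be valid, each processed work unit is assumed to need 2 valid results, and any excess is assumed to be one extra result per affected work unit. The function name and the counts in the example are hypothetical, not real Project Statistics or generations.txt figures.

```python
def third_result_fraction(returned_results, processed_work_units, quorum=2):
    """Estimate the share of work units that got one valid result beyond the quorum.

    Assumptions (not confirmed by WCG): all returned results are valid, and any
    excess over quorum * processed_work_units is one extra result per work unit.
    """
    if processed_work_units <= 0:
        raise ValueError("processed_work_units must be positive")
    excess = returned_results - quorum * processed_work_units
    return excess / processed_work_units


# Hypothetical daily counts, chosen only to illustrate a "one in 10" outcome.
print(third_result_fraction(returned_results=2100, processed_work_units=1000))
```

With 2,100 returned results against 1,000 processed work units, the excess is 100 results, which works out to one work unit in 10 carrying a third valid result.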

As for the lack of new work, it would be interesting to know if this is a side-effect of the currently reduced resources available to WCG, or whether it is just a natural part of the normal processing cycle.

Cheers - Al.
[Mar 13, 2025 4:31:17 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12324
Status: Offline
Re: Work Available

Al

I don't think we have ever been informed as to what constitutes a returned result. However, they certainly include all valid results, so at least 2 per processed work unit.

There are far too many copies that miss their deadline and have to be resent. As I understand it, if a copy has achieved at least 1 checkpoint when an extra copy is sent out it is allowed to continue and might validate, hence the 10% that you refer to. Any that have not reached a checkpoint are Aborted By Server.

Now that we are working almost entirely on normals, there should not be a problem with slower machines finishing within deadlines, but maybe the problem is with machines that are only working intermittently.

The lack of work could be down to the hardware problems but could also be down to prioritising MCM, or could be that the hardware problems are causing more units to get stuck.

Mike
[Mar 13, 2025 6:07:53 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7642
Status: Offline
Re: Work Available

The lack of work could be down to the hardware problems but could also be down to prioritising MCM, or could be that the hardware problems are causing more units to get stuck.

They are probably not prioritizing MCM, at least at the moment, because they are now not sending out work from either project.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Mar 13, 2025 9:33:07 PM]
Speedy51
Veteran Cruncher
New Zealand
Joined: Nov 4, 2005
Post Count: 1277
Status: Offline
Re: Work Available

Any that have not reached a checkpoint are Aborted By Server.

How does the server know whether a task/work unit has reached a checkpoint? To my knowledge, when a checkpoint is made, no information is sent to the server.
[Mar 13, 2025 10:09:54 PM]