| World Community Grid Forums
Thread Status: Active Total posts in this thread: 387
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
It's not entirely clear to me what the hold-up is with the "stuck" tasks, not just the 21 and 21 Ultras but the other Extreme stragglers. Can't they be manually kicked off? Were they all tasks with errors or something or just ones that took 30+ hours to compute? Not sure if we ever got a root cause analysis.
gl82854 said: "WCG should ignore the ultras. Let Delft run those on their systems."
Indeed. Running them 24/7 through an automated process on a couple machines would eliminate the hassle of deadlines and really speed them up.
|
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
hchc
I don't think we ever had a reason for the ultras. Many units that initially got stuck were restarted by shortening the TimeStep, which meant slicing the 2 day period into smaller slices, and it worked. How much shorter they tried I don't know. It did have the effect of lengthening the CPU time needed. It could be that they kept the shortened deadline for resends while the CPU time increased because of the TimeStep. That would mean failing to meet deadlines. If the TimeStep is reduced, the deadline should be increased.
My guess is that the problem might stem from difficult terrain where the patch is located, causing problems calculating the weather.
This is a matter which is specific to ARP so should be posted on the ARP forum. This is a general forum.
Mike |
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1294 Status: Recently Active Project Badges:
|
MCM is flowing well. Hopefully everyone has as many MCM as they want and caches are full.
ARP was increasing its flow, but seems to have topped out (reference number between 400 and 500). |
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1294 Status: Recently Active Project Badges:
|
MCM is flowing well.
ARP may be stopped or have a lighter than usual flow. We have been working up to a reference number of 400-500 and now it is in the 300s, so hopefully someone will throw more ARPs into the hopper. |
||
|
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2494 Status: Offline Project Badges:
|
MCM tasks have dried up totally. I do not know about ARP tasks though, since I do not crunch those.
|
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1294 Status: Recently Active Project Badges:
|
MCM seems to be flowing again. I did run out of my short queue of MCM WUs and got the dreaded "Tasks are committed to other platforms" message, but I've got MCM again.
ARP seems to be not flowing at all except for the occasional resend. Hopefully the team will fix that on Monday morning. |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
Between about 17:30 and 22:00 (UTC) on 2025-03-23 I was getting nothing but retries for missed deadline cases (and, of course, about 20% of them got Server Aborted).
Fortunately, with the retries my buffers were just large enough to not have completely run out anyway, but it's irritating to have idle cores because they [currently] don't seem to be able to send out missed deadline retries for MCM1 unless they stop feeding new work. Hopefully, once all the new hardware is in place it might be possible to do something about that, though it might require some nifty scheduling of multiple feeders to mix tasks for the complete range of active WU numbers at any given time :-) Cheers - Al. |
||
|
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 1294 Status: Recently Active Project Badges:
|
MCM flow is higher than I've seen it in a while. Reference number in the 70k range.
Al is right that the system has a problem sending out MCM resends integrated with new WUs, which, when the resends go out in a batch, causes some of us (non-Windows people??) to get the dreaded "committed to other platforms" message.
ARP has some retries rattling around, but no fresh WUs in a while. I hope the techs are getting some time to work on good things and not band-aiding the old system stuff. |
||
|
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2494 Status: Offline Project Badges:
|
It's not only non-Windows people that get the dreaded "committed to other platforms" message. I'm Windows only, and I often get those messages too.
|
||
|
|
StFreddy
Cruncher Joined: Oct 2, 2019 Post Count: 3 Status: Offline Project Badges:
|
I recently noticed that new MCM workunits now take 2 hours 30 min to complete on my machine, 1 hour longer than before. For years, it took 1:30 on my machine to complete 1 WU. Has anyone else noticed this difference?
|
||
|
|
|