| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 65
|
|
| Author |
|
|
NixChix
Veteran Cruncher United States Joined: Apr 29, 2007 Post Count: 1187 Status: Offline Project Badges:
|
My company uses virtual servers. I think in this case my admins would just re-allocate some processors to the virtual validator from somewhere else until everything was re-balanced.
|
||
|
|
supdood
Senior Cruncher USA Joined: Aug 6, 2015 Post Count: 333 Status: Offline Project Badges:
|
While my total PVs are still increasing, I'm finally starting to see some tasks returned from those Linux 5.15.107+ systems. There is either something wrong with their setup or they are trying to grab CPU cycles in between other loads. Here are some examples:
CPU time / Elapsed time
0.6 / 5.89 (Linux 5.15.107+)
1.3 / 1.31 (me)
0.46 / 4.7 (Linux 5.15.107+)
0.95 / 0.96 (me)
0.39 / 4.21 (Linux 5.15.107+)
0.88 / 0.88 (me)

Thankfully I've now achieved reliable status on SCC and can run without a wingman, but there doesn't seem to be much hope of that cluster getting through all the work with such terrible CPU utilization rates. |
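To put a number on those utilization rates, here is a minimal Python sketch that divides CPU time by elapsed time for each pair quoted in the post. The figures are copied from the post; the `samples` list and the `cpu_utilization` helper are names made up for illustration, and the units do not matter because only the ratio is used.

```python
# (cpu_time, elapsed_time, host) pairs copied from the post above.
# Units are whatever the results page reports; only the ratio matters.
samples = [
    (0.60, 5.89, "Linux 5.15.107+"),
    (1.30, 1.31, "me"),
    (0.46, 4.70, "Linux 5.15.107+"),
    (0.95, 0.96, "me"),
    (0.39, 4.21, "Linux 5.15.107+"),
    (0.88, 0.88, "me"),
]


def cpu_utilization(cpu_time, elapsed_time):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    return cpu_time / elapsed_time


# Print one line per result: host label and utilization as a percentage.
for cpu, elapsed, host in samples:
    print(f"{host:<16} {cpu_utilization(cpu, elapsed):6.1%}")
```

On these six results the Linux 5.15.107+ hosts come out at roughly 10% utilization, while the poster's own machine is at 99% or better.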
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
@supdood,
Neat observation -- I was wondering if/when we'd get any evidence that those systems weren't effectively a black hole :-)

I haven't seen a single one return a task for SCC1 yet, but after seeing your message I went through my MCM1 records and found three from the [apparent] early days of those systems firing up - two of them managed to have elapsed time less than double the CPU time, but this wonderful example also showed up (output is from one of my wingman monitoring scripts, slightly edited...)

Task MCM1_0203538_9998_1 was returned by [My Ryzen 5600H] at 2023-09-12T15:59:37+0000:

That is one seriously stalling CPU!

I'm in total agreement about "getting through the work" -- I currently have over 250 MCM1 tasks and 550 SCC1 tasks Pending Validations, and 720 of them are waiting on replies from the cluster :-( The first 100 or so of those tasks aren't far off initial deadline now...

I have this horrible vision of the BOINC transitioner setting tens or hundreds of thousands of No Reply markers over the next few days as more of the cluster nodes start missing deadlines, with the same sort of consequences we saw a few weeks ago (in particular, retries taking ages to be sent out because there were just so many of them...)

Here's hoping the WCG folks are aware of this and can reach out to whoever is running those systems[*1]. There might even be expertise amongst forum members that could help resolve any issues regarding configuration of those systems (but as we can't identify the owners we can't offer to do it ourselves!)

Cheers - Al.

[*1] Whilst finding a way to simply block them would reduce the problem, that wouldn't be polite, would it? :-) |
||
|
|
Bryn Mawr
Senior Cruncher Joined: Dec 26, 2018 Post Count: 384 Status: Offline Project Badges:
|
*1) Sadly, they are not being polite grabbing several thousand WUs and not processing them.
|
||
|
|
supdood
Senior Cruncher USA Joined: Aug 6, 2015 Post Count: 333 Status: Offline Project Badges:
|
"Thankfully I've now achieved reliable status on SCC and can run without a wingman"

Or not. Some of my tasks are going through as reliable quorum 1, but most are still unreliable quorum 2 and getting this cluster as wingman. PV tasks still climbing... |
||
|
|
Vester
Senior Cruncher USA Joined: Nov 18, 2004 Post Count: 325 Status: Offline Project Badges:
|
I have about 75 SCC work units "Pending Validation" with a quorum of 2. Examples are SCC1_0004350_KLF15-A_35102 and SCC1_0004417_brachyury_33359.
|
||
|
|
Bryn Mawr
Senior Cruncher Joined: Dec 26, 2018 Post Count: 384 Status: Offline Project Badges:
|
And, sadly, the majority of the tasks released as “no reply” are now “waiting to be sent”.
|
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7844 Status: Offline Project Badges:
|
Concerning SCC, I had not bothered to look at the OS of the wingman for any of the pending validation tasks I have (over 4000), but there is a proliferation of Linux 5.15.107+. If they are grabbing thousands of these and basically never completing them, that could be quite a bottleneck. I did get a rash of re-sends yesterday, both 2's and 3's (more than 500), but have not seen very many today. I thought some tech may have broken the logjam yesterday, but it may require some periodic monitoring. If they are not working on the weekend, it may be Monday before we see another burst of re-sends.

The pending validations seem to be holding steady at the moment and I only have about 80 pending verifications. Thanks to supdood and AlanB1951 for the information.

Edit: Upon a little more poking about I noticed all of the SCC work units ending in zero are minimum quorum 1, whereas all of the SCC units ending in 1 are minimum quorum 2. All of the latter that I looked at were from the notorious OS - Linux 5.15.107+. This tells me that system or cluster has been deemed "unreliable." This leads me to believe there is now a lot more work which has become necessary because someone has gobbled up untold numbers of work units and probably not returned any substantial number of valid results. Even if they have returned some valid results, all of their results are subject to a validation check by someone else, creating extra work which is not necessary.

And here is an interesting one: 5.15.107

Cheers
Sgt. Joe
*Minnesota Crunchers*
[Edit 2 times, last edit by Sgt.Joe at Sep 23, 2023 9:34:43 PM] |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Just a thought, but in the latter IBM days, Kevin or Keith said that there was a big mainframe that crunched intermittently when it didn't have any of its official work to do.
Could this be the problem?
Mike |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7844 Status: Offline Project Badges:
|
"Just a thought, but in the latter IBM days, Kevin or Keith said that there was a big mainframe that crunched intermittently when it didn't have any of its official work to do. Could this be the problem? Mike"

If that is the case, it is just churning and creating more work than necessary.

Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
|