Thread Status: Active
Total posts in this thread: 65
This topic has been viewed 135451 times and has 64 replies
NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Re: Is there any way of finding your wingman’s host-Id?

My company uses virtual servers. I think in this case my admins would just re-allocate some processors to the virtual validator from somewhere else until everything was re-balanced.

coffee
----------------------------------------

[Sep 20, 2023 3:28:18 PM]
supdood
Senior Cruncher
USA
Joined: Aug 6, 2015
Post Count: 333
Re: Is there any way of finding your wingman’s host-Id?

While my total PVs are still increasing, I'm finally starting to see some tasks returned from those Linux 5.15.107+ systems. There is either something wrong with their setup or they are trying to grab CPU cycles in between other loads. Here are some examples:

CPU time / Elapsed time
0.6 / 5.89 (Linux 5.15.107+)
1.3 / 1.31 (me)

0.46 / 4.7 (Linux 5.15.107+)
0.95 / 0.96 (me)

0.39 / 4.21 (Linux 5.15.107+)
0.88 / 0.88 (me)


Thankfully I've now achieved reliable status on SCC and can run without a wingman, but there doesn't seem to be much hope of that cluster getting through all the work with such terrible CPU utilization rates.
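To put a number on "terrible CPU utilization", the pairs above can be turned into a CPU-time-to-elapsed-time fraction. This is only a minimal Python sketch; `cpu_efficiency` is a hypothetical helper name, not anything provided by BOINC or WCG, and the figures are copied from the list above.

```python
def cpu_efficiency(cpu_time: float, elapsed_time: float) -> float:
    """Return CPU time as a fraction of elapsed (wall-clock) time."""
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    return cpu_time / elapsed_time


# (host label, CPU time, elapsed time) copied from the pairs listed above
pairs = [
    ("Linux 5.15.107+", 0.6, 5.89),
    ("me", 1.3, 1.31),
    ("Linux 5.15.107+", 0.46, 4.7),
    ("me", 0.95, 0.96),
    ("Linux 5.15.107+", 0.39, 4.21),
    ("me", 0.88, 0.88),
]

for host, cpu, elapsed in pairs:
    # Print each host with its CPU fraction as a percentage
    print(f"{host:16s} {cpu_efficiency(cpu, elapsed):6.1%}")
```

On these figures the cluster tasks come out at roughly 10% of wall-clock time spent on the CPU, against about 99 to 100% for the other host.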
----------------------------------------
Crunch with BOINC team USA
www.boincusa.com

[Sep 20, 2023 9:56:46 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Re: Is there any way of finding your wingman’s host-Id?

@supdood,

Neat observation -- I was wondering if/when we'd get any evidence that those systems weren't effectively a black hole :-)

I haven't seen a single one return a task for SCC1 yet, but after seeing your message I went through my MCM1 records and found three from the [apparent] early days of those systems firing up - two of them managed to have elapsed time less than double the CPU time, but this wonderful example also showed up (output is from one of my wingman monitoring scripts, slightly edited...)

Task MCM1_0203538_9998_1 was returned by [My Ryzen 5600H] at 2023-09-12T15:59:37+0000:
Work-unit 374236575 created 2023-09-12T11:53:58+0000
Sent date 2023-09-12T11:54:00+0000, deadline 2023-09-18T11:54:00+0000.
CPU time 1.54008 hours, elapsed time 1.54102 hours,
status is Valid
The workunit has 2 potential results: wingman data follows.
MCM1_0203538_9998_0 assigned to [redacted] on Linux
O/S version is 5.15.107+
time sent was 2023-09-12T11:54:00+0000
due time was 2023-09-18T11:54:00+0000
returned time was 2023-09-14T11:56:16+0000
status is Valid
CPU time 5.52996 hours, elapsed time 28.67212 hours

That is one seriously stalling CPU!
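For anyone wanting to pull the same figures out of a report like the one above, here is a rough Python sketch. It assumes the line layout shown in the output (the "CPU time ... hours, elapsed time ... hours" wording and the `+0000` timestamp style); the function names are hypothetical and this is not the monitoring script itself.

```python
import re
from datetime import datetime

# Matches lines such as "CPU time 5.52996 hours, elapsed time 28.67212 hours"
TIMES = re.compile(r"CPU time ([\d.]+) hours, elapsed time ([\d.]+) hours")


def cpu_fractions(report: str) -> list[float]:
    """Return the CPU/elapsed ratio for every matching line in the report."""
    return [float(cpu) / float(elapsed) for cpu, elapsed in TIMES.findall(report)]


def turnaround_hours(sent: str, returned: str) -> float:
    """Hours between the sent and returned timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S%z"
    delta = datetime.strptime(returned, fmt) - datetime.strptime(sent, fmt)
    return delta.total_seconds() / 3600


# The two timing lines from the report above
report = """\
CPU time 1.54008 hours, elapsed time 1.54102 hours,
CPU time 5.52996 hours, elapsed time 28.67212 hours
"""

print([round(f, 3) for f in cpu_fractions(report)])
print(round(turnaround_hours("2023-09-12T11:54:00+0000",
                             "2023-09-14T11:56:16+0000"), 2))
```

On the figures quoted, the wingman spent about 19% of its 28.7 elapsed hours on the CPU, inside a turnaround of roughly 48 hours from send to return.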

I'm in total agreement about "getting through the work" -- I currently have over 250 MCM1 tasks and 550 SCC1 tasks in Pending Validation, and 720 of them are waiting on replies from the cluster :-( The first 100 or so of those tasks aren't far off initial deadline now...

I have this horrible vision of the BOINC transitioner setting tens or hundreds of thousands of No Reply markers over the next few days as more of the cluster nodes start missing deadlines, with the same sort of consequences we saw a few weeks ago (in particular, retries taking ages to be sent out because there were just so many of them...)

Here's hoping the WCG folks are aware of this and can reach out to whoever is running those systems[*1]. There might even be expertise amongst forum members that could help resolve any issues regarding configuration of those systems (but as we can't identify the owners we can't offer to do it ourselves!)

Cheers - Al.

[*1] Whilst finding a way to simply block them would reduce the problem, that wouldn't be polite, would it? :-)
[Sep 21, 2023 6:15:17 AM]
Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 384
Re: Is there any way of finding your wingman’s host-Id?

*1) Sadly, they are not being polite grabbing several thousand WUs and not processing them.
[Sep 21, 2023 6:40:34 AM]
supdood
Senior Cruncher
USA
Joined: Aug 6, 2015
Post Count: 333
Re: Is there any way of finding your wingman’s host-Id?

"Thankfully I've now achieved reliable status on SCC and can run without a wingman"

Or not. Some of my tasks are going through as reliable quorum 1, but most are still unreliable quorum 2 and getting this cluster as wingman. PV tasks still climbing...
----------------------------------------
Crunch with BOINC team USA
www.boincusa.com

[Sep 21, 2023 1:40:23 PM]
Vester
Senior Cruncher
USA
Joined: Nov 18, 2004
Post Count: 325
Re: Is there any way of finding your wingman’s host-Id?

I have about 75 SCC work units "Pending Validation" with a quorum of 2. Examples are SCC1_0004350_KLF15-A_35102 and SCC1_0004417_brachyury_33359.
----------------------------------------

[Sep 23, 2023 10:13:18 AM]
Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 384
Re: Is there any way of finding your wingman’s host-Id?

And, sadly, the majority of the tasks released as “no reply” are now “waiting to be sent”.
[Sep 23, 2023 11:11:45 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Re: Is there any way of finding your wingman’s host-Id?

Concerning SCC, I had not bothered to look at the OS of the wingman for any of the pending validation tasks I have (over 4000), but there is a proliferation of Linux 5.15.107+. If they are grabbing thousands of these and basically never completing them, that could be quite a bottleneck. I did get a rash of re-sends yesterday, both 2's and 3's (more than 500), but have not seen very many today. I thought some tech may have broken the logjam yesterday, but it may require some periodic monitoring. If they are not working on the weekend, it may be Monday before we see another burst of re-sends.
The pending validations seem to be holding steady at the moment and I only have about 80 pending verifications.
Thanks to supdood and alanb1951 for the information.

Edit: Upon a little more poking about I noticed all of the SCC work units ending in zero are minimum quorum 1, whereas all of the SCC units ending in 1 are minimum quorum 2. All of the latter that I looked at were from the notorious OS, Linux 5.15.107+. This tells me that system or cluster has been deemed "unreliable." That leads me to believe there is now a lot more work which has become necessary because someone has gobbled up untold numbers of work units and probably not returned any substantial number of valid results. Even if they have returned some valid results, all of their results are subject to a validation check by someone else, creating extra work which is not necessary.
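One way to check that suffix pattern against a larger list of tasks is to group the minimum quorum values by the trailing part of each task name. This is only a sketch: `quorum_by_suffix` is a hypothetical helper, and the sample rows are made-up placeholders, not real results.

```python
from collections import defaultdict


def quorum_by_suffix(tasks):
    """Map each trailing task-name suffix to the set of minimum quorums seen."""
    seen = defaultdict(set)
    for name, min_quorum in tasks:
        # The suffix is whatever follows the last underscore, e.g. "0" or "1"
        suffix = name.rsplit("_", 1)[-1]
        seen[suffix].add(min_quorum)
    return dict(seen)


# Hypothetical sample rows (task name, minimum quorum); not real results
sample = [
    ("SCC1_EXAMPLE_A_0", 1),
    ("SCC1_EXAMPLE_B_0", 1),
    ("SCC1_EXAMPLE_C_1", 2),
]

print(quorum_by_suffix(sample))
```

If the pattern holds, each suffix maps to a single quorum value; a suffix that maps to more than one value would be a counterexample.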

And here is an interesting one: 5.15.107

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
----------------------------------------
[Edit 2 times, last edit by Sgt.Joe at Sep 23, 2023 9:34:43 PM]
[Sep 23, 2023 3:47:06 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Re: Is there any way of finding your wingman’s host-Id?

Just a thought, but in the latter IBM days, Kevin or Keith said that there was a big mainframe that crunched intermittently when it didn't have any of its official work to do.

Could this be the problem?

Mike
[Sep 23, 2023 10:42:00 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Re: Is there any way of finding your wingman’s host-Id?

Just a thought, but in the latter IBM days, Kevin or Keith said that there was a big mainframe that crunched intermittently when it didn't have any of its official work to do.

Could this be the problem?

Mike

If that is the case it is just churning and creating more work than necessary.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 24, 2023 1:30:36 AM]