World Community Grid - View Thread - Is there any way of finding your wingman’s host-Id?

World Community Grid Forums

Category: Completed Research

Forum: Smash Childhood Cancer

Thread: Is there any way of finding your wingman’s host-Id?


Thread Status: Active
Total posts in this thread: 65


This topic has been viewed 135451 times and has 64 replies

NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

My company uses virtual servers. I think in this case my admins would just re-allocate some processors to the virtual validator from somewhere else until everything was re-balanced.


----------------------------------------

[Sep 20, 2023 3:28:18 PM]

supdood
Senior Cruncher
USA
Joined: Aug 6, 2015
Post Count: 333
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

While my total PVs are still increasing, I'm finally starting to see some tasks returned from those Linux 5.15.107+ systems. There is either something wrong with their setup or they are trying to grab CPU cycles in between other loads. Here are some examples:

CPU time / Elapsed time
0.6 / 5.89 (Linux 5.15.107+)
1.3 / 1.31 (me)

0.46 / 4.7 (Linux 5.15.107+)
0.95 / 0.96 (me)

0.39 / 4.21 (Linux 5.15.107+)
0.88 / 0.88 (me)

Thankfully I've now achieved reliable status on SCC and can run without a wingman, but there doesn't seem to be much hope of that cluster getting through all the work with such terrible CPU utilization rates.
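The ratios in the figures above can be checked with a short calculation. This is only a sketch: the numbers are copied from the list in this post, the units are whatever the results page reported, and the function name `cpu_utilization` is made up for illustration.

```python
# CPU time / elapsed time pairs quoted above, one cluster result and one
# of my own per work unit: (label, cpu_time, elapsed_time).
pairs = [
    ("Linux 5.15.107+", 0.60, 5.89),
    ("me", 1.30, 1.31),
    ("Linux 5.15.107+", 0.46, 4.70),
    ("me", 0.95, 0.96),
    ("Linux 5.15.107+", 0.39, 4.21),
    ("me", 0.88, 0.88),
]


def cpu_utilization(cpu_time: float, elapsed_time: float) -> float:
    """Fraction of the elapsed time that the task actually spent on the CPU."""
    return cpu_time / elapsed_time


ratios = [(label, cpu_utilization(cpu, elapsed)) for label, cpu, elapsed in pairs]
for label, ratio in ratios:
    print(f"{label:<16} {ratio:6.1%}")
```

On these three work units the cluster results come out at around 10% CPU utilization, against 99% or better for my own.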

----------------------------------------

Crunch with BOINC team USA
www.boincusa.com

[Sep 20, 2023 9:56:46 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1317
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

@supdood,

Neat observation -- I was wondering if/when we'd get any evidence that those systems weren't effectively a black hole :-)

I haven't seen a single one return a task for SCC1 yet, but after seeing your message I went through my MCM1 records and found three from the [apparent] early days of those systems firing up - two of them managed to have elapsed time less than double the CPU time, but this wonderful example also showed up (output is from one of my wingman monitoring scripts, slightly edited...)

Task MCM1_0203538_9998_1 was returned by [My Ryzen 5600H] at 2023-09-12T15:59:37+0000:
  Work-unit 374236575 created 2023-09-12T11:53:58+0000
  Sent date 2023-09-12T11:54:00+0000,  deadline 2023-09-18T11:54:00+0000.
  CPU time 1.54008 hours,  elapsed time 1.54102 hours,
  status is Valid
  The workunit has 2 potential results:  wingman data follows.
  MCM1_0203538_9998_0 assigned to [redacted] on Linux
    O/S version is 5.15.107+
    time sent was 2023-09-12T11:54:00+0000
    due time was 2023-09-18T11:54:00+0000
    returned time was 2023-09-14T11:56:16+0000
    status is Valid
    CPU time 5.52996 hours,  elapsed time 28.67212 hours

That is one seriously stalling CPU!
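For anyone who wants to put numbers on that, here is a minimal sketch that works out the wingman's turnaround and CPU efficiency from the record above. It is not the monitoring script itself; the variable names are chosen just for this example, and the values are copied from the output shown.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S%z"  # matches timestamps such as 2023-09-12T11:54:00+0000

# Values copied from the wingman section of the record above.
sent = datetime.strptime("2023-09-12T11:54:00+0000", FMT)
returned = datetime.strptime("2023-09-14T11:56:16+0000", FMT)
cpu_hours = 5.52996
elapsed_hours = 28.67212

# Wall-clock time between the task being sent and being returned.
turnaround_hours = (returned - sent).total_seconds() / 3600
# Share of the elapsed (running) time that was real CPU time.
cpu_efficiency = cpu_hours / elapsed_hours
# Share of the turnaround during which the task was counted as running.
running_share = elapsed_hours / turnaround_hours

print(f"turnaround     : {turnaround_hours:.2f} h")
print(f"CPU efficiency : {cpu_efficiency:.1%}")
print(f"running share  : {running_share:.1%}")
```

The same three figures for my own result would need the returned time and sent date from the top of the record.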

I'm in total agreement about "getting through the work" -- I currently have over 250 MCM1 tasks and 550 SCC1 tasks in Pending Validation, and 720 of them are waiting on replies from the cluster :-( The first 100 or so of those tasks aren't far off their initial deadline now...

I have this horrible vision of the BOINC transitioner setting tens or hundreds of thousands of No Reply markers over the next few days as more of the cluster nodes start missing deadlines, with the same sort of consequences we saw a few weeks ago (in particular, retries taking ages to be sent out because there were just so many of them...)

Here's hoping the WCG folks are aware of this and can reach out to whoever is running those systems[*1]. There might even be expertise amongst forum members that could help resolve any issues regarding configuration of those systems (but as we can't identify the owners we can't offer to do it ourselves!)

Cheers - Al.

[*1] Whilst finding a way to simply block them would reduce the problem, that wouldn't be polite, would it? :-)

[Sep 21, 2023 6:15:17 AM]

Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 384
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

*1) Sadly, they are not being polite grabbing several thousand WUs and not processing them.

[Sep 21, 2023 6:40:34 AM]

supdood
Senior Cruncher
USA
Joined: Aug 6, 2015
Post Count: 333
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

Thankfully I've now achieved reliable status on SCC and can run without a wingman

Or not. Some of my tasks are going through as reliable quorum 1, but most are still unreliable quorum 2 and getting this cluster as wingman. PV tasks still climbing...

----------------------------------------

Crunch with BOINC team USA
www.boincusa.com

[Sep 21, 2023 1:40:23 PM]

Vester
Senior Cruncher
USA
Joined: Nov 18, 2004
Post Count: 325
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

I have about 75 SCC work units "Pending Validation" with a quorum of 2. Examples are SCC1_0004350_KLF15-A_35102 and SCC1_0004417_brachyury_33359.

----------------------------------------

[Sep 23, 2023 10:13:18 AM]

Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 384
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

And, sadly, the majority of the tasks released as “no reply” are now “waiting to be sent”.

[Sep 23, 2023 11:11:45 AM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

Concerning SCC, I had not bothered to look at the OS of the wingman for any of the pending validation tasks I have (over 4000), but there is a proliferation of Linux 5.15.107+. If they are grabbing thousands of these and basically never completing them, that could be quite a bottleneck. I did get a rash of re-sends yesterday, both 2's and 3's (more than 500), but have not seen very many today. I thought some tech may have broken the logjam yesterday, but it may require some periodic monitoring. If they are not working on the weekend, it may be Monday before we see another burst of re-sends.
The pending validations seem to be holding steady at the moment and I only have about 80 pending verifications.
Thanks to supdood and AlanB1951 for the information.

Edit: Upon a little more poking about I noticed all of the SCC work units ending in zero are minimum quorum 1, whereas all of the SCC units ending in 1 are minimum quorum 2. All of the latter that I looked at were from the notorious OS, Linux 5.15.107+. This tells me that system or cluster has been deemed "unreliable." That leads me to believe there is now a lot more work which has become necessary because someone has gobbled up untold numbers of work units and probably not returned any substantial number of valid results. Even if they have returned some valid results, all of their results are subject to a validation check by someone else, creating extra work which would otherwise not be necessary.
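For anyone wanting to tally their own tasks by that trailing digit, here is a rough sketch. It assumes result names have the form work-unit name plus `_N`, as in the MCM1 names quoted earlier in the thread; the function names are invented for this example, and nothing here reads the quorum itself, which still has to come from the results page.

```python
from collections import Counter


def split_result_name(result_name: str) -> tuple[str, int]:
    """Split 'WORKUNIT_N' into (work-unit name, trailing replica number N)."""
    workunit, _, replica = result_name.rpartition("_")
    if not workunit or not replica.isdecimal():
        raise ValueError(f"unexpected result name: {result_name!r}")
    return workunit, int(replica)


def count_by_replica(result_names: list[str]) -> Counter:
    """Count how many result names end in _0, _1, _2 and so on."""
    return Counter(split_result_name(name)[1] for name in result_names)


# The two result names quoted earlier in the thread for one MCM1 work unit.
names = ["MCM1_0203538_9998_0", "MCM1_0203538_9998_1"]
print(count_by_replica(names))
```

Work-unit names without the trailing number (such as the SCC1 examples quoted above) should not be fed to this, since their last field is not a replica number.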

And here is an interesting one: 5.15.107

Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

----------------------------------------
[Edit 2 times, last edit by Sgt.Joe at Sep 23, 2023 9:34:43 PM]

[Sep 23, 2023 3:47:06 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

Just a thought, but in the latter IBM days, Kevin or Keith said that there was a big mainframe that crunched intermittently when it didn't have any of its official work to do.

Could this be the problem?

Mike

[Sep 23, 2023 10:42:00 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7844
Status: Offline

Re: Is there any way of finding your wingman’s host-Id?

If that is the case it is just churning and creating more work than necessary.
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Sep 24, 2023 1:30:36 AM]
