Thread Status: Active
Total posts in this thread: 65
This topic has been viewed 90161 times and has 64 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline
Re: Is there any way of finding your wingman’s host-Id?

I note that after a brief pause in SCC1 I'm now getting nothing but retries, and those retries are all for systems from the [hypothetical?] cluster that have gone No Reply. Now there's a surprise :-)

I also note that right up to the [apparent] end of new work for SCC1 there were nodes of that cluster grabbing new tasks, so that suggests that new nodes are still being fired up. Over this period, counting both MCM1 and SCC1, I have seen just under 1,500 tasks from those nodes, with nearly 1,400 different device names.

Of the 3,600 SCC1 tasks I've processed this month, just over 1,200 didn't validate without a wingman, and 850 of those had at least one task from that cluster (and only one of those has returned something to validate!). Over the same period, over 40% of my [just over] 1,400 MCM1 tasks have had at least one task from that cluster (and only 3 returned something to validate, right at the beginning...). I dread to think what the numbers would be for people with substantial numbers of CPUs, as I'm seeing those numbers when only offering a total of 16 threads to MCM1/SCC1. There could be huge numbers of retries out there!

If those nodes with "No Reply" tasks were still active they should've been returning something (probably "Not Started by Deadline", which WCG reports as "Error"), so it seems these nodes either shut down or go offline for some other reason. Surely a well-behaved node should detach before shutting down?

What I worry about now that the "Waiting to be sent" retries are getting out is that some of them might be picked up by new nodes from that group, and we end up with cascading No Reply tasks. As mentioned earlier, someone at WCG needs to find out what is happening out there, and perhaps the time has come for some "polite but forceful" communication :-)

Cheers - Al.
[Sep 25, 2023 6:03:09 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7580
Status: Recently Active
Re: Is there any way of finding your wingman’s host-Id?

Al, my thoughts exactly.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 25, 2023 7:49:59 PM]
thunder7
Senior Cruncher
Netherlands
Joined: Mar 6, 2013
Post Count: 232
Status: Offline
Re: Is there any way of finding your wingman’s host-Id?

Since about mid-September, my runtime has not been registered correctly. For example, this 88-core machine is running at 100%, 24x7, and I see:

2023-09-25 0:036:17:21:16 151600 316
2023-09-24 0:028:07:40:54 169338 501
2023-09-23 0:036:14:37:18 145140 413
2023-09-22 0:057:14:35:14 187800 519
2023-09-21 0:043:22:57:02 172834 615
2023-09-20 0:043:06:51:00 167228 521
2023-09-19 0:056:23:45:14 212153 670
2023-09-18 0:072:09:49:56 244179 879
2023-09-17 0:079:03:01:17 298888 472
2023-09-16 0:092:12:03:55 235380 387
2023-09-15 0:085:03:02:34 159144 238
2023-09-14 0:086:03:11:34 183005 252
2023-09-13 0:079:20:43:50 322765 521
2023-09-12 0:094:16:34:23 255929 425
2023-09-11 0:082:18:21:29 302790 509
2023-09-10 0:095:13:55:08 342080 611
2023-09-09 0:089:13:57:47 175861 263
2023-09-08 0:087:07:16:53 162507 246

Is that caused by these thieving clusters? I'm running both MCM and SCC, with no preference. I have no units with errors, but 7114 units 'pending validation', and a random sample shows this 5.15 cluster all the time.
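
For anyone who wants to sanity-check figures like these, here is a rough Python sketch. It assumes the second column is the daily run time in WCG's y:ddd:hh:mm:ss format and treats a year as 365 days; the helper names `parse_runtime` and `busy_cores` are made up for this example. An 88-core machine running flat out would be expected to accrue roughly 88 days of run time per calendar day, give or take, since run time is presumably credited when results are reported rather than when the work was done.

```python
from datetime import timedelta


def parse_runtime(text):
    """Parse a 'y:ddd:hh:mm:ss' string into a timedelta (assumes 365-day years)."""
    years, days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=365 * years + days, hours=hours,
                     minutes=minutes, seconds=seconds)


def busy_cores(runtime_text):
    """Run time credited for one calendar day, expressed in core-days."""
    return parse_runtime(runtime_text) / timedelta(days=1)


# A few rows copied from the list above: (date, run time).
rows = [
    ("2023-09-25", "0:036:17:21:16"),
    ("2023-09-18", "0:072:09:49:56"),
    ("2023-09-10", "0:095:13:55:08"),
]

for date, runtime in rows:
    print(f"{date}: {busy_cores(runtime):5.1f} core-days credited")
```

On these rows the credited run time drops from roughly 95 core-days on Sep 10 to under 37 on Sep 25, which fits the picture of results sitting in Pending Validation rather than the machine doing less work.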
----------------------------------------
[Edit 1 times, last edit by thunder7 at Sep 26, 2023 7:54:25 PM]
[Sep 26, 2023 7:32:08 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Offline
Re: Is there any way of finding your wingman’s host-Id?

thunder7,
Yesterday I surveyed just one of my machines that has been running SCC1 tasks lately, to find out whether there were any Valid tasks from the 5.15.107+-cluster.

3565 workunits were examined (in which my device participated).
3257 (91%) were coupled with a device from the 5.15.107+-cluster.
There were 239 (6%) singletons (without the need for a wingman). Of these singletons, 43 were In Progress and 196 were Valid.

The final results in short:
5.15.107+      : 3257/3565
Their statuses:
In Progress    : 1198
No Reply       : 2059

In this case, each task that is still In Progress means that a device from the 5.15.107+-cluster hasn't replied yet while my device is Pending Validation for that particular task.

So, I couldn't find any Valid (or even Pending) task from the 5.15.107+-cluster.
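
The tallies above can be cross-checked with a few lines of Python. This is only a sketch using the counts quoted in this post; the variable names are made up for the example, and the "other" bucket is simply what is left over after subtracting the cluster-paired workunits and the singletons (the survey does not break those out).

```python
total = 3565                                         # workunits examined
cluster = {"In Progress": 1198, "No Reply": 2059}    # paired with the 5.15.107+-cluster
singletons = {"In Progress": 43, "Valid": 196}       # no wingman needed

cluster_total = sum(cluster.values())
singleton_total = sum(singletons.values())
other = total - cluster_total - singleton_total      # leftover, not broken out above


def share(count, whole=total):
    """Percentage of `whole` represented by `count`."""
    return 100 * count / whole


print(f"cluster-paired : {cluster_total} ({share(cluster_total):.1f}%)")
print(f"singletons     : {singleton_total} ({share(singleton_total):.1f}%)")
print(f"other          : {other} ({share(other):.1f}%)")
print(f"No Reply share of cluster tasks: {share(cluster['No Reply'], cluster_total):.1f}%")
```

The status counts add up to the 3257 and 239 quoted, and the shares come out at about 91.4% and 6.7%, so the rounded figures above look like truncated percentages.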

Adri
[Sep 26, 2023 12:38:39 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7580
Status: Recently Active
Re: Is there any way of finding your wingman’s host-Id?

adriverhoef wrote: "So, I couldn't find any Valid (or even Pending) task from the 5.15.107+-cluster."


I suspected as much. That cluster is a black hole, vacuuming up work units which are never seen again. Once their deadline is reached, two work units are issued, whereas if the first one had been completed successfully, the second and third work units would never have had to be issued. Thus, twice as much work is being required. Whoever they are, they either need to be contacted and cut off, or be required to mend their ways and start completing work units.
Definitely a rogue installation.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 26, 2023 1:09:46 PM]
TigerLily
Senior Cruncher
Joined: May 26, 2023
Post Count: 280
Status: Offline
Re: Is there any way of finding your wingman’s host-Id?

Hi alanb1951,

I have passed your concerns about this hypothetical cluster on to the rest of the team.
[Sep 26, 2023 1:32:29 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7580
Status: Recently Active
Re: Is there any way of finding your wingman’s host-Id?

TigerLily wrote: "Hi alanb1951, I have passed your concerns about this hypothetical cluster on to the rest of the team."

Thank you.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 26, 2023 4:52:09 PM]
Blount
Senior Cruncher
Joined: Aug 19, 2005
Post Count: 399
Status: Offline
Re: Is there any way of finding your wingman’s host-Id?

Thunder7: did you upgrade the BOINC client? Later versions take lots of 20-second 'CPU busy' suspends, which add up to a strange total runtime being reported. I went back to version 7.16.11 (the version on the WCG download page, for Windows machines).
[Sep 27, 2023 11:31:44 AM]
thunder7
Senior Cruncher
Netherlands
Joined: Mar 6, 2013
Post Count: 232
Status: Offline
Re: Is there any way of finding your wingman’s host-Id?

There is no newer version for Linux - 7.4.22 is still the stable, recommended version.
[Sep 27, 2023 4:07:36 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12146
Status: Offline
Re: Is there any way of finding your wingman’s host-Id?

Tiger Lily

There is another aspect to this cluster problem. The cluster would seem to be running Linux and, as a consequence, all those re-sends have to go to Linux machines. Those of us with Windows or Android machines do not get a chance at the re-sends.

Mike
[Sep 27, 2023 10:43:20 PM]