World Community Grid Forums
Category: Active Research Forum: Smash Childhood Cancer Thread: Is there any way of finding your wingman’s host-Id? |
Thread Status: Active Total posts in this thread: 65
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 873 Status: Offline Project Badges: |
I note that after a brief pause in SCC1 I'm now getting nothing but retries, and those retries are all for systems from the [hypothetical?] cluster that have gone No Reply. Now there's a surprise :-)
I also note that right up to the [apparent] end of new work for SCC1 there were nodes of that cluster grabbing new tasks, so that suggests that new nodes are still being fired up. Over this period, counting both MCM1 and SCC1, I have seen just under 1,500 tasks from those nodes, with nearly 1,400 different device names.
Of the 3,600 SCC1 tasks I've processed this month, just over 1,200 didn't validate without a wingman, and 850 of those had at least one task from that cluster (and only one of those has returned something to validate!) Over the same time period, over 40% of my [just over] 1400 MCM1 tasks have had at least one task from that cluster (and only 3 returned something to validate, right at the beginning...)
I dread to think what the numbers would be for people with substantial numbers of CPUs, as I'm seeing those numbers when only offering a total of 16 threads to MCM1/SCC1 -- there could be huge numbers of retries out there!
If those nodes with "No Reply" tasks were still active they should've been returning something (probably "Not Started by Deadline", which WCG reports as "Error"), so it seems these nodes either shut down or go offline for some other reason. Surely a well-behaved node should detach before shutting down?
What I worry about now that the "Waiting to be sent" retries are getting out is that some of them might be picked up by new nodes from that group, and we end up with cascading No Reply tasks.
As mentioned earlier, someone at WCG needs to find out what is happening out there, and perhaps the time has come for some "polite but forceful" communication :-)
Cheers - Al. |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7580 Status: Recently Active Project Badges: |
Al, My thoughts exactly.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
thunder7
Senior Cruncher Netherlands Joined: Mar 6, 2013 Post Count: 232 Status: Offline Project Badges: |
Since about mid-September, my runtime has not been registered correctly. For example, this 88-core machine is running at 100% 24x7, and I see:
2023-09-25  0:036:17:21:16  151600  316
2023-09-24  0:028:07:40:54  169338  501
2023-09-23  0:036:14:37:18  145140  413
2023-09-22  0:057:14:35:14  187800  519
2023-09-21  0:043:22:57:02  172834  615
2023-09-20  0:043:06:51:00  167228  521
2023-09-19  0:056:23:45:14  212153  670
2023-09-18  0:072:09:49:56  244179  879
2023-09-17  0:079:03:01:17  298888  472
2023-09-16  0:092:12:03:55  235380  387
2023-09-15  0:085:03:02:34  159144  238
2023-09-14  0:086:03:11:34  183005  252
2023-09-13  0:079:20:43:50  322765  521
2023-09-12  0:094:16:34:23  255929  425
2023-09-11  0:082:18:21:29  302790  509
2023-09-10  0:095:13:55:08  342080  611
2023-09-09  0:089:13:57:47  175861  263
2023-09-08  0:087:07:16:53  162507  246
Is that caused by these thieving clusters? I'm running both MCM and SCC, with no preference. I have no units with errors, but 7114 units 'pending validation', and a random sampling shows this 5.15 cluster all the time.
[Edit 1 times, last edit by thunder7 at Sep 26, 2023 7:54:25 PM] |
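For anyone wanting to check similar figures, here is a rough Python sketch of the arithmetic behind the complaint above. It assumes (the post does not say so) that the second value in each row is runtime written as years:days:hours:minutes:seconds, and that a fully loaded 88-thread machine should earn about 88 thread-days per calendar day. The function names and the 365-day year are illustrative choices, not anything provided by WCG.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption; every sample row here has 0 years anyway


def runtime_to_days(runtime: str) -> float:
    """Convert an assumed 'y:ddd:hh:mm:ss' runtime string to days."""
    years, days, hours, minutes, seconds = (int(p) for p in runtime.split(":"))
    partial_day = (hours * 3600 + minutes * 60 + seconds) / SECONDS_PER_DAY
    return years * DAYS_PER_YEAR + days + partial_day


def utilisation(runtime: str, threads: int) -> float:
    """Credited runtime as a fraction of threads * 1 day."""
    return runtime_to_days(runtime) / threads


# Two rows copied from the post: one early in the month, one late.
samples = {"2023-09-10": "0:095:13:55:08", "2023-09-25": "0:036:17:21:16"}
for date, runtime in samples.items():
    print(f"{date}: {runtime_to_days(runtime):6.2f} days credited, "
          f"{utilisation(runtime, 88):.0%} of 88 thread-days")
```

Under those assumptions the late-September rows come out well below the machine's capacity, which would fit with a large share of returned results still waiting on a wingman.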
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2089 Status: Offline Project Badges: |
thunder7,
Yesterday, I did a survey for only one of my machines that has been running SCC1-tasks lately and what I wanted to know was if I could find any Valid tasks from the 5.15.107+-cluster.
3565 workunits were examined (in which my device participated).
3257 (91%) were coupled with a device from the 5.15.107+-cluster.
There were 239 (6%) singletons (without the need for a wingman). Of these singletons, 43 were In Progress and 196 were Valid.
The final results in short:
5.15.107+ : 3257/3565
Their statuses:
In Progress : 1198
In this case, each task that is still In Progress means that a device from the 5.15.107+-cluster hasn't replied yet while my device is Pending Validation for that particular task.
So, I couldn't find any Valid (or even Pending) task from the 5.15.107+-cluster.
Adri |
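The shares in the survey above can be rechecked with a few lines of Python. This is only a sketch using the counts quoted in the post; the variable names are made up for illustration, and nothing here queries WCG.

```python
# Counts copied from the survey in the post above.
examined = 3565
with_cluster = 3257          # workunits paired with a 5.15.107+ device
singletons = 239             # workunits that needed no wingman
singletons_in_progress = 43
singletons_valid = 196
cluster_in_progress = 1198   # the only cluster status listed in the post


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# The singleton breakdown should add up to the singleton total.
assert singletons_in_progress + singletons_valid == singletons

print(f"paired with cluster: {share(with_cluster, examined):.1f}%")
print(f"singletons:          {share(singletons, examined):.1f}%")
print(f"cluster In Progress: {share(cluster_in_progress, with_cluster):.1f}%")
print(f"not classified here: {examined - with_cluster - singletons}")
```

The percentages in the post appear to be rounded down to whole numbers. The last line shows how many examined workunits fall into neither group; the post does not say how those are classified.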
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7580 Status: Recently Active Project Badges: |
"So, I couldn't find any Valid (or even Pending) task from the 5.15.107+-cluster. Adri"
I suspected as much. That cluster is a black hole vacuuming up work units which are never to be seen again. Once their deadline passes, two more copies of the work unit are issued, whereas if the first one had been completed successfully, the second and third would never have had to be issued. Thus, twice as much work is being required. Whoever they are, they either need to be contacted and cut off, or be required to mend their ways and start completing work units. Definitely a rogue installation.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
TigerLily
Senior Cruncher Joined: May 26, 2023 Post Count: 280 Status: Offline Project Badges: |
Hi alanb1951,
I have passed your concerns about this hypothetical cluster on to the rest of the team. |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7580 Status: Recently Active Project Badges: |
"Hi alanb1951, I have passed your concerns about this hypothetical cluster on to the rest of the team."
Thank you.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Blount
Senior Cruncher Joined: Aug 19, 2005 Post Count: 399 Status: Offline Project Badges: |
Thunder7: did you upgrade the BOINC client? Later versions take lots of 20-second 'cpu busy' suspends. This adds up to a strange amount of total runtime reported. I went back to level 7.16.11 (the level on the WCG download page, for Windows machines).
|
||
|
thunder7
Senior Cruncher Netherlands Joined: Mar 6, 2013 Post Count: 232 Status: Offline Project Badges: |
There is no newer version for Linux - 7.4.22 is still the stable, recommended version.
|
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12146 Status: Offline Project Badges: |
TigerLily,
There is another aspect to this cluster problem. The cluster would seem to be running Linux and, as a consequence, all those re-sends have to go to Linux machines. Those of us with Windows or Android machines do not get a chance at the re-sends.
Mike |
||
|
|