Total posts in this thread: 77
This topic has been viewed 299524 times and has 76 replies
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 874
Are all MCM1 assimilators running?

I've noticed that since the assimilator(s) restarted on 2024-01-11 I haven't seen a single result for an even-numbered workunit being removed. Indeed, after a couple of days it got even more odd than that -- all WUs removed had IDs that give remainder 1 when divided by 4. If WCG normally deploy 4 assimilators for MCM1 and some of them haven't been working, that would cause what I've observed.

Of course, I have no idea whether there's anything in the WCG set-up that means the wrappers for the validator and assimilator are non-standard (e.g. changed database queries), so some (or all?) of the analysis that led to this question may not be valid after all. However, as I note that I'm not the only person seeing a new build-up of unassimilated work, I thought I'd ask :-)

There is also another interesting aspect to recent assimilations -- huge sequences of WU IDs seem to be ignored (which didn't seem to be happening before the backlog built up, so perhaps there's some correlation there...)

Cheers - Al.

P.S. The above is based on monitoring I've been doing for a while. I'll put some "evidence" in a follow-up...
[Jan 26, 2024 2:12:54 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 874
Re: Are all MCM1 assimilators running?

Following up with explanations and examples...

As I've mentioned elsewhere when the topic of stalled assimilation has come up, I've been trying to watch assimilator behaviour using data from a daily results statistics collection script I run using the old API, which happens to provide information on the "File Delete State" flag on each result.

Over the months since I started looking at that information, I've refined the toolbox to the point where I can produce a summary of the data about my MCM1 tasks that are still shown as Valid for each day's data collection. Normally, the data is collected at about 02:00 UTC; as well as the summaries (examples below) it allows me to observe daily changes...

Once I realized that since 2024-01-11 only odd-numbered WUs were going away, I adapted one of my scripts to report how many odd-numbered WUs were amongst those removed and those still held. It then became apparent that the condition for removal seemed to be
   Workunit ID modulo 4 = 1
so I modified it again to show four counts, one for each remainder.

I will show a few sample reports, ranging from normal behaviour before the backlog built up to what is now happening. As mentioned above, data is collected early in the day so it mostly reflects activity on the day before. (This post is getting long, so most of them will be in follow-ups...)

The mod-4 counts indicate how many WU IDs have each remainder modulo 4. So (for example) "mod-4 counts [1, 1, 0, 4]" in the second report shown below indicates 1 WU in each of the "remainder 0" and "remainder 1" groups, none in the "remainder 2" group and 4 WUs in the "remainder 3" group.
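For anyone wanting to reproduce this kind of tally, here is a minimal Python sketch. The function name and the small ID list are made up for illustration; this is not the script used for the reports in this thread.

```python
def mod4_counts(wu_ids):
    """Return [n0, n1, n2, n3]: how many IDs leave each remainder modulo 4."""
    counts = [0, 0, 0, 0]
    for wu_id in wu_ids:
        counts[wu_id % 4] += 1
    return counts


# Illustrative IDs only (not real workunit IDs):
# remainders are 0, 1, 3, 3, 3, 3
print(mod4_counts([8, 9, 11, 15, 19, 23]))  # [1, 1, 0, 4]
```

The list position is the remainder, so the third entry (index 2) being 0 means no ID in the sample left remainder 2.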

The ModTime values quoted are from the result record in its "file delete state 2" phase; such results are normally purged by the next day...

(Note that during this period I typically returned between 140 and 180 results a day, so if the assimilators handle 200+ of my results in a day they are [just] ahead of the game. Of course, the validation rates are uneven at the best of times, for various reasons[*1])

Here is a sample from before the problems began -- note the low numbers of WUs and their fairly even distribution:

Examined data collected on 2023-10-01

Found 213 MCM1 items: mod-4 counts [54, 61, 46, 52]

There were 129 items in delete state 2: mod-4 counts [31, 36, 28, 34]
Lowest ModTime was 2023-09-30 02:03:06 (1696039386): WU Id was 387741771
Highest ModTime was 2023-10-01 01:15:38 (1696122938): WU Id was 388835091
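The number in brackets after each ModTime looks like the same instant expressed as seconds since the Unix epoch. Assuming that is what it is, a short Python check suggests the printed times are in UTC (the function name is hypothetical):

```python
from datetime import datetime, timezone


def modtime_to_utc(epoch_seconds):
    """Format a Unix timestamp as a UTC date and time string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(modtime_to_utc(1696039386))  # 2023-09-30 02:03:06
print(modtime_to_utc(1696122938))  # 2023-10-01 01:15:38
```

Both values match the times shown in the report above, so the two forms can be used interchangeably when comparing reports.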

And one from just before the assimilator(s) seemed to die completely -- note the low assimilated items count and a small build-up in waiting tasks.

Examined data collected on 2023-10-24

Found 598 MCM1 items: mod-4 counts [87, 94, 321, 96]

There were 6 items in delete state 2: mod-4 counts [1, 1, 0, 4]
Lowest ModTime was 2023-10-23 04:45:15 (1698036315): WU Id was 404088191
Highest ModTime was 2023-10-23 21:59:30 (1698098370): WU Id was 404161780

More in another follow-up post -- I fear this may already be too long for some :-)

Cheers - Al.

*1 The need for retries seems to vary in intensity, so the number of results that need validation can climb quite steeply on occasion, then drop back down two or three days later (though the current "Waiting to be sent" problem can turn that into 8+ days later...) This is not the right thread for debating why that happens; hence the footnote :-)
[Jan 26, 2024 2:35:37 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 874
Re: Are all MCM1 assimilators running?

After about 3 weeks there were a couple of days with some activity. The first report is for the first day of the two, which did quite well; the next day seemed to stop activity around 11:00 UTC, with a lower number of processed results...

Examined data collected on 2023-10-24

Found 598 MCM1 items: mod-4 counts [87, 94, 321, 96]

There were 6 items in delete state 2: mod-4 counts [1, 1, 0, 4]
Lowest ModTime was 2023-10-23 04:45:15 (1698036315): WU Id was 404088191
Highest ModTime was 2023-10-23 21:59:30 (1698098370): WU Id was 404161780

And that was that until 2024 :-(

There was a brief flurry of assimilations early in January; here's the report on the first day's activity. Note that it seemed to do quite well. The next day did 535 items, also not bad, then back to no activity again...

Examined data collected on 2024-01-05

Found 11942 MCM1 items: mod-4 counts [3000, 2983, 2938, 3021]

There were 792 items in delete state 2: mod-4 counts [186, 220, 195, 191]
Lowest ModTime was 2024-01-04 11:51:48 (1704369108): WU Id was 408691018
Highest ModTime was 2024-01-05 05:12:30 (1704431550): WU Id was 411858465


Now, when the next assimilator restart happened, things were different. It is from this point on that I've not seen a single even-numbered WU disappear :-(
I'll show the first two reports, to show the "WU ID modulo 4 = 3" case working then not working...

Examined data collected on 2024-01-13

Found 12152 MCM1 items: mod-4 counts [3039, 3019, 3013, 3081]

There were 449 items in delete state 2: mod-4 counts [0, 231, 0, 218]
Lowest ModTime was 2024-01-12 02:38:39 (1705027119): WU Id was 411126651
Highest ModTime was 2024-01-13 01:56:28 (1705110988): WU Id was 421492205

Examined data collected on 2024-01-14

Found 11878 MCM1 items: mod-4 counts [3087, 2811, 3060, 2920]

There were 141 items in delete state 2: mod-4 counts [0, 141, 0, 0]
Lowest ModTime was 2024-01-13 05:01:08 (1705122068): WU Id was 422314421
Highest ModTime was 2024-01-14 04:57:03 (1705208223): WU Id was 426275757

From this point on, it wasn't removing my old results as quickly as I was returning new results. This is obviously just how my results are being handled -- it could be that some of the users with much higher daily returns aren't seeing as much of an issue.

To finish up, here's the result of this morning's collection...

Examined data collected on 2024-01-26

Found 12673 MCM1 items: mod-4 counts [3595, 2099, 3574, 3405]

There were 45 items in delete state 2: mod-4 counts [0, 45, 0, 0]
Lowest ModTime was 2024-01-25 02:06:22 (1706148382): WU Id was 454872533
Highest ModTime was 2024-01-26 01:51:38 (1706233898): WU Id was 456580293
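As rough arithmetic on the figures quoted in the 2024-01-14 and 2024-01-26 reports, the following Python sketch estimates how fast the held items grew and how small the "remainder 1" group had become. The function name is hypothetical and only numbers already shown above are used.

```python
from datetime import date


def average_daily_change(day_a, count_a, day_b, count_b):
    """Average change in held items per day between two snapshots."""
    return (count_b - count_a) / (day_b - day_a).days


# Totals of held MCM1 items from the 2024-01-14 and 2024-01-26 reports
growth = average_daily_change(date(2024, 1, 14), 11878, date(2024, 1, 26), 12673)
print(f"{growth:.2f} items/day")  # 66.25 items/day

# mod-4 counts of held items on 2024-01-26
remainder_counts = [3595, 2099, 3574, 3405]
share = remainder_counts[1] / sum(remainder_counts)
print(f"{share:.1%}")  # 16.6%
```

With an even spread each group would hold about 25% of the items, so the "remainder 1" share of roughly 16.6% is consistent with that group alone being drained.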


I hope all this has been of some interest (and might be of some use)...

Cheers - Al.
[Jan 26, 2024 3:02:28 PM]
Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 337
Re: Are all MCM1 assimilators running?

A lovely piece of analysis, thank you.
[Jan 26, 2024 5:02:21 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2091
Re: Are all MCM1 assimilators running?

A (hopefully) short followup. ;-)
My research should look concise, thanks to Al's valuable preceding efforts.

I've collected all my results, including results still in my queues, into a file that I downloaded an hour ago. They are all MCM1 tasks, 68,758 in total.

After reading Al's messages, I decided to run a quick command:

$ sed 's/.*<Result>//' wcgresults.2024-01-26T18\:59\:01.5081.0 |
perl -w00ne 'print if /<FileDeleteState>2/' | sed -n 's/.*\(..\)<\/WorkunitId>/\1/p' | sort | uniq -c
This command first selects all results with FileDeleteState = 2, then extracts the two characters immediately before the closing WorkunitId tag. Those two characters are the last two digits of the workunit ID, and the number of results for each distinct value (00 - 99) is then counted.
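For readers more at home in Python, here is a loose equivalent of that pipeline. It is only a sketch: the tag names come from the command above, but the record layout, the function name and the sample data are assumptions made for illustration, and it matches two digits where the sed expression matches any two characters.

```python
import re
from collections import Counter


def last_two_digit_counts(text):
    """Count results in delete state 2 by the last two digits of their workunit ID."""
    counts = Counter()
    # Assumption: everything between one <Result> tag and the next is one record
    for record in text.split("<Result>")[1:]:
        if "<FileDeleteState>2" not in record:
            continue
        match = re.search(r"(\d\d)</WorkunitId>", record)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample records, not real API output
sample = (
    "<Result><FileDeleteState>2</FileDeleteState><WorkunitId>1001</WorkunitId></Result>"
    "<Result><FileDeleteState>0</FileDeleteState><WorkunitId>1002</WorkunitId></Result>"
    "<Result><FileDeleteState>2</FileDeleteState><WorkunitId>2001</WorkunitId></Result>"
    "<Result><FileDeleteState>2</FileDeleteState><WorkunitId>3005</WorkunitId></Result>"
)

# Print in the same "count value" layout as sort | uniq -c
for digits, n in sorted(last_two_digit_counts(sample).items()):
    print(f"{n:7d} {digits}")
```

On the sample this reports two results ending in 01 and one ending in 05; the record in delete state 0 is skipped.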

You would then expect about - but not more than - 100 unique values (00 to 99). Let's have a look.

The output should show two columns:
- to the right, the last two digits of a workunit-ID;
- to the left, the number of results whose workunit ID ends in those two digits.

And this is the output:
25 01
14 05
17 09
23 13
22 17
15 21
15 25
12 29
14 33
17 37
14 41
18 45
14 49
15 53
13 57
13 61
15 65
24 69
12 73
16 77
13 81
15 85
17 89
16 93
25 97

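This output shows that only workunit IDs with remainder 1 modulo 4 are reaching FileDeleteState = 2. (Because 100 is divisible by 4, the last two digits of an ID determine its remainder modulo 4.)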
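The claim can be checked mechanically. The Python sketch below takes the 25 two-digit endings from the output above and confirms that each leaves remainder 1 when divided by 4, and that together they are exactly the endings between 00 and 99 with that remainder.

```python
# The two-digit endings listed in the output above
observed = (
    "01 05 09 13 17 21 25 29 33 37 41 45 49 "
    "53 57 61 65 69 73 77 81 85 89 93 97"
).split()

# 100 is a multiple of 4, so an ID's last two digits fix its remainder modulo 4
remainders = {int(ending) % 4 for ending in observed}
expected = {f"{n:02d}" for n in range(100) if n % 4 == 1}

print(remainders)                 # {1}
print(set(observed) == expected)  # True
```

So no ending is missing from the "remainder 1" group, and none from the other three groups appears at all.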

So my conclusion would be that 75% of the workunit(-ID)s are not being assimilated at the moment. Am I about right?

Adri
----------------------------------------
[Edit 4 times, last edit by adriverhoef at Jan 26, 2024 7:32:58 PM]
[Jan 26, 2024 7:22:58 PM]
TigerLily
Senior Cruncher
Joined: May 26, 2023
Post Count: 280
Re: Are all MCM1 assimilators running?

Hi everyone,

I forwarded your analyses and insights to the tech team and they provided the following update:

Correct, as deduced by alanb1951 and adriverhoef, the workunits assigned to an instance of the MCM1 assimilator are determined by applying the modulo operator to the workunit ID. So 3 of the 4 assimilators for MCM1 are not currently running, because problems with workunits in their assigned ranges trigger a non-zero exit condition in the assimilator. After resolving the initial issue, we encountered another issue that triggers a different non-zero exit condition in the MCM1 assimilator code.

We have tested a modified version of the assimilator that is able to handle these errors and continue processing the workunits in its assigned range that are not affected. The new assimilator will mark the workunits that are flagged as problematic during assimilation with an appropriate error_mask, so that they can be handled separately as a group, and once we get through them all we should be in the clear. In the interim, by catching these two error conditions but allowing the assimilator to continue, we should be able to process the vast majority of workunits, i.e. those that pass the checks in the assimilator.

The new version of the assimilator should be up and running Monday, and the expected outcome is that we will return to the "flurry of assimilations" from early January. If all goes well, we should be able to double the rate again, and then again, essentially by starting more assimilators with a higher modulo, after allowing 48h in between to monitor.
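To make the description above concrete, here is a toy Python sketch of the idea: each assimilator instance takes the workunits whose ID leaves a given remainder, and a failure on one workunit is recorded rather than stopping the whole run. This is not WCG's or BOINC's actual code; every name is hypothetical, and the "flagged" list merely stands in for setting an error_mask.

```python
def is_assigned(wu_id, modulus, remainder):
    """True if this instance is responsible for the given workunit ID."""
    return wu_id % modulus == remainder


def run_assimilator(wu_ids, modulus, remainder, assimilate):
    """Process assigned workunits, flagging failures instead of exiting."""
    done, flagged = [], []
    for wu_id in wu_ids:
        if not is_assigned(wu_id, modulus, remainder):
            continue
        try:
            assimilate(wu_id)
        except Exception:
            # Stand-in for marking the workunit with an error_mask
            flagged.append(wu_id)
            continue
        done.append(wu_id)
    return done, flagged


def fake_assimilate(wu_id):
    """Pretend workunit 9 holds bad data."""
    if wu_id == 9:
        raise ValueError("bad workunit")


done, flagged = run_assimilator(range(1, 21), 4, 1, fake_assimilate)
print(done)     # [1, 5, 13, 17]
print(flagged)  # [9]
```

In this model, moving from modulus 4 to modulus 8 splits each group in two: the IDs with remainder 1 modulo 4 become those with remainder 1 or 5 modulo 8, which is one way to read "starting more assimilators with a higher modulo".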
[Jan 26, 2024 8:28:17 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 874
Re: Are all MCM1 assimilators running?

TigerLily - thanks for passing our analysis on and getting a response. Communication at its best :-)

Please thank the tech team for that prompt and detailed response! As a retired Computing person, I appreciate technical information...

It's sad to realize that there is such a severe problem but good to know a solution is near. I also hope that recognizing what the problems were (presumably more "bad data") will enable them to work out what caused said problems so they don't recur after the clean-up.

Cheers - Al.

P.S. I do realize that tech team is very busy(!) and could probably do without us "armchair SysAdmins" second-guessing what might be problems. However, I find it is fun to work out what might be broken, and possibly even why :-)
[Jan 26, 2024 8:54:19 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 874
Re: Are all MCM1 assimilators running?

Adri - thanks for the confirmation report. Hope you liked the response!

Now we play "Watch this space"...

Cheers - Al.
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Jan 26, 2024 8:56:48 PM]
[Jan 26, 2024 8:56:09 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2091
Re: Are all MCM1 assimilators running?

"Adri - thanks for the confirmation report. Hope you liked the response!"

It was certainly fun to do (no trouble at all) and the friendly response from TigerLily (who "forwarded [our] analyses and insights to the tech team") feels really rewarding ("Correct, as deduced by alanb1951 and adriverhoef") as it was prompt and contained a promising solution from the tech team.

Let's not forget your sublime investigations that you didn't keep private - as opposed to these geezers. ;-) (Sound doesn't work on this laptop, so I have to assume the video plays what it says.)

Adri
[Jan 26, 2024 10:00:35 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 860
Re: Are all MCM1 assimilators running?

Thank you TigerLily for communicating the great information provided in this thread to the tech team, and for letting us know the response. It truly makes me happy to see the community working together so well.
[Jan 27, 2024 5:51:07 AM]