KQL self-join pattern

In my day job, I spend a lot of time querying and analysing vast amounts of security data in Azure Data Explorer (ADX)/Kusto to find interesting patterns and prototype detections.

A common KQL pattern that I find particularly useful is the self-join - where a cached, pre-filtered table is joined back to itself to further filter.

Taxi Example

For this example, we’ll re-use the Seaborn Taxis.csv sample dataset (loaded with externaldata).

let TaxiData = externaldata(
    Pickup:datetime,
    Dropoff:datetime,
    Passengers:int,
    Distance:real,
    Fare:real,
    Tip:real,
    Tolls:real,
    Total:real,
    Color:string,
    Payment:string,
    PickupZone:string,
    DropoffZone:string,
    PickupBorough:string,
    DropoffBorough:string
) [
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/taxis.csv"
] with(format='csv', ignoreFirstRecord=true);
TaxiData
| where isnotempty(PickupBorough)
| where PickupBorough == DropoffBorough

Suppose (for the sake of an example) that we’re interested in statistics for journeys within a single borough, that start and end in ‘frequently used’ zones (so, we’ll filter out unpopular zones).

Here is the distribution of journeys by PickupZone and DropoffZone:

Taxi Journeys by Pickup Zone

Taxi Journeys by Dropoff Zone

So, ~20 feels like a good threshold to use.

Summarisation

The KQL summarize operator is the best way to generate summary statistics for rows of data. So, we can do something like the following:

TaxiData
| where isnotempty(PickupBorough)
| where PickupBorough == DropoffBorough
| summarize Count = count() by PickupZone
| where Count >= 20
| order by Count desc

Which gives us our set of PickupZones with >= 20 journeys:

Taxi Journeys by Pickup Zone

However, because we’ve summarised, we’ve lost the original data!

Self-Join

This is where self-join is particularly useful - we can summarize within an inner join clause to select the original rows that meet our summarisation criteria - but without losing the original rows.

Typically, the query will proceed as follows:

  1. Get some large set of rows and pre-filter these. Save these pre-filtered rows with a let X = ....
  2. inner join X against itself, based on some summarize ... criteria to select interesting rows.

Simple Self-Join

A basic self-join will look like:

X
| join kind=inner (
    X
    | summarize Count = count() by Y
    | where Count >= 10
  ) on Y

This will give us all rows in X that have a value for Y where that Y appears in >= 10 rows.

Taxi Self-Join

So, as an example for our Taxi data, we can do something like the following:

let TaxiData = externaldata(
    Pickup:datetime,
    Dropoff:datetime,
    Passengers:int,
    Distance:real,
    Fare:real,
    Tip:real,
    Tolls:real,
    Total:real,
    Color:string,
    Payment:string,
    PickupZone:string,
    DropoffZone:string,
    PickupBorough:string,
    DropoffBorough:string
) [
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/taxis.csv"
] with(format='csv', ignoreFirstRecord=true);
let SameBoroughTaxiData = TaxiData
| where isnotempty(PickupBorough)
| where PickupBorough == DropoffBorough;
SameBoroughTaxiData
// Select rows that pickup from a common zone
| join kind=inner (
    SameBoroughTaxiData
    | summarize PickupZoneJourneyCount = count() by PickupZone
    | where PickupZoneJourneyCount >= 20
  ) on PickupZone
| project-away PickupZone1
// Select rows that dropoff to a common zone
| join kind=inner (
    SameBoroughTaxiData
    | summarize DropoffZoneJourneyCount = count() by DropoffZone
    | where DropoffZoneJourneyCount >= 20
  ) on DropoffZone
| project-away DropoffZone1
// Continue analysis ...

Which gives us a table of individual journey rows, where the journey is within the same borough, and where the start and end zones are frequently used. We can now continue our analysis to produce interesting stats based on just these rows.

Interesting analysis left as an exercise for the reader…

Additional Notes

  1. Rather than let X = ..., an alternative is to use the | as operator to capture an intermediate table state.
  2. materialize() or | as hint.materialized=true will come in useful for large or complex datasets.
  3. leftsemi or leftanti are also interesting join types that are often useful with this pattern.