Calculating K-Anonymity Thresholds for Mobile Tracking

The minimum spatial radius and temporal window that make each GPS ping indistinguishable from at least k-1 other devices is a dynamic function of empirical device density, sampling frequency, and the acceptable loss of location precision — not a fixed integer chosen arbitrarily.

Core Calculation

K-anonymity grouping for location traces is grounded in a spatial-temporal density formula that links the physical parameters you control (radius r, temporal window Δt) to the empirical device density ρ of your dataset:

$$k \approx \rho \times \pi r^{2} \times \Delta t$$

Symbol	Meaning	Typical range
$k$	Target anonymity threshold	5 – 20
$\rho$	Device density (devices per km² per hour)	5 (rural) – 2 000 (dense urban)
$r$	Spatial masking radius (km)	0.05 – 5
$\Delta t$	Temporal aggregation window (hours)	0.25 – 4

The formula assumes devices are uniformly distributed in space and time, which rarely holds. Treat the radius it yields as a lower bound on the radius you will actually need, and validate it against the empirical trace distribution as described in the Verification Snippet section.
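As a quick illustration, the density formula can be evaluated in both directions: the expected device count for a given radius, and the radius needed for a target k. The function names below are hypothetical, and the uniform-density assumption is carried over unchanged, so the outputs are starting estimates only.

```python
import math


def expected_k(rho: float, radius_km: float, window_hours: float) -> float:
    """Expected distinct devices: k ≈ ρ · π r² · Δt (uniform-density assumption)."""
    return rho * math.pi * radius_km ** 2 * window_hours


def radius_for_k(target_k: float, rho: float, window_hours: float) -> float:
    """Invert the formula for r (km); rejects non-positive density or window."""
    if rho <= 0 or window_hours <= 0:
        raise ValueError("rho and window_hours must be positive")
    return math.sqrt(target_k / (rho * math.pi * window_hours))


if __name__ == "__main__":
    # Illustrative inputs only: 50 devices/km²/hr, 1-hour window, k = 10.
    r = radius_for_k(target_k=10, rho=50, window_hours=1.0)
    print(f"radius: {r:.3f} km, round-trip k: {expected_k(50, r, 1.0):.2f}")
```

Keeping both directions side by side makes it easy to check that a chosen radius and window really do recover the intended k before any empirical validation.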

Worked numeric example

Suppose you have an urban mobility dataset with $\rho = 200$ devices/km²/hr, a 30-minute window ($\Delta t = 0.5$ hr), and a target of $k = 5$:

$$r = \sqrt{\frac{k}{\rho \times \pi \times \Delta t}} = \sqrt{\frac{5}{200 \times \pi \times 0.5}} \approx 0.126 \text{ km (126 m)}$$

With $\Delta t = 0.5$ hr (30 minutes) and a 126 m radius, the formula predicts an expected cluster size of exactly 5. That figure is a mean, not a guarantee: observed cluster sizes vary, and some will fall below $k$. Iterate $r$ upward until a target fraction of records (typically 90–95 %) has at least $k$ distinct devices in its cluster.
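To see why an expected cluster size of exactly 5 is not enough, one can model the device count in a cell as Poisson-distributed. That model is an assumption added here for illustration, not something the formula or any dataset guarantees, and the helper names are hypothetical. Under it, a cell with mean 5 reaches k = 5 only a little more than half the time, and the mean has to rise to roughly 9 before 95 % of cells comply.

```python
import math


def prob_at_least_k(mean_count: float, k: int) -> float:
    """P(N >= k) for N ~ Poisson(mean_count)."""
    below = sum(
        math.exp(-mean_count) * mean_count ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


def mean_for_coverage(k: int, coverage: float) -> float:
    """Smallest Poisson mean with P(N >= k) >= coverage (coverage < 1), by bisection."""
    lo, hi = 0.0, float(k)
    while prob_at_least_k(hi, k) < coverage:  # widen the bracket until it holds the answer
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if prob_at_least_k(mid, k) >= coverage:
            hi = mid
        else:
            lo = mid
    return hi


def radius_for_coverage(k: int, rho: float, window_hours: float, coverage: float) -> float:
    """Radius (km) whose expected count equals mean_for_coverage(k, coverage)."""
    return math.sqrt(mean_for_coverage(k, coverage) / (rho * math.pi * window_hours))
```

With the worked example's inputs, `radius_for_coverage(5, 200, 0.5, 0.95)` comes out near 0.17 km, about a third larger than the 126 m point estimate. Real traces are clustered rather than Poisson, so the empirical search may land higher still.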

Diagram: threshold calculation workflow

Python Implementation

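The function below finds the smallest spatial radius at which a user-specified fraction of pings (coverage_target) has at least target_k distinct devices nearby. Coordinates are converted to metres with a cosine-corrected degree-to-metre approximation suitable for regional datasets; for continental-scale work, replace that step with a pyproj projection to the relevant UTM zone. The implementation uses scipy.spatial.cKDTree, whose radius queries cost roughly O(log n) per ping plus the number of neighbours returned, and applies the temporal window by comparing pandas.Timestamp values directly, so no pre-binning is required at this scale. Note that time_window_hours is a half-width: each ping is compared with neighbours within ±time_window_hours, so the effective window is twice that value. To match $\Delta t$ in the formula above, pass time_window_hours = $\Delta t / 2$.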

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
from typing import Tuple

def calculate_k_anonymity_threshold(
    df: pd.DataFrame,
    target_k: int = 5,
    time_window_hours: float = 1.0,
    max_radius_km: float = 10.0,
    radius_step_km: float = 0.25,
    coverage_target: float = 0.95,
) -> Tuple[float, dict]:
    """
    Find the minimum spatial radius (km) that achieves k-anonymity for
    mobile GPS pings within a temporal window.

    Args:
        df: DataFrame with columns ['lat', 'lon', 'timestamp', 'device_id'].
            'timestamp' must be parseable by pd.to_datetime.
            Coordinates must be in WGS84 (EPSG:4326).
        target_k: Minimum number of distinct devices per spatial-temporal cell.
        time_window_hours: Half-width of the temporal window (Δt) in hours.
            Each ping is compared with neighbours whose timestamps fall within
            ±time_window_hours of the centre ping.
        max_radius_km: Upper bound on the search radius; if no radius achieves
            coverage_target the function returns max_radius_km with a warning.
        radius_step_km: Radius increment for each iteration.
        coverage_target: Fraction of pings that must satisfy unique_devices >= k.

    Returns:
        (optimal_radius_km, stats_dict)
        stats_dict keys are candidate radii (float); values are dicts with
        'coverage' (float) and 'valid_count' (int).
    """
    df = df.copy()

    # --- CRS note ---
    # Approximate metre coordinates via cosine-corrected degree scaling.
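    # Accuracy degrades beyond ~500 km; for larger datasets project with pyproj
    # to the UTM zone covering the data (EPSG:32633, UTM 33N, is only an example):
    #   from pyproj import Transformer
    #   t = Transformer.from_crs("EPSG:4326", "EPSG:32633", always_xy=True)
    #   df["x"], df["y"] = t.transform(df["lon"].values, df["lat"].values)
    # Avoid EPSG:3857 for distance work: Web Mercator stretches distances by
    # roughly 1/cos(latitude).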
    lat_centre = np.radians(df["lat"].mean())
    df["x"] = df["lon"] * 111_320 * np.cos(lat_centre)
    df["y"] = df["lat"] * 111_320

    df = df.sort_values("timestamp").reset_index(drop=True)
    timestamps = pd.to_datetime(df["timestamp"])
    half_window = pd.Timedelta(hours=time_window_hours)
    coords = df[["x", "y"]].values

    # Build the tree once; reuse it for every candidate radius.
    tree = cKDTree(coords)

    optimal_r: float = max_radius_km
    stats: dict = {}

    for r_km in np.arange(radius_step_km, max_radius_km + radius_step_km, radius_step_km):
        r_m = r_km * 1000.0
        # query_ball_point returns all neighbours within r_m for every point.
        neighbour_lists = tree.query_ball_point(coords, r_m)

        valid_count = 0
        for i, neighbour_idxs in enumerate(neighbour_lists):
            t_centre = timestamps.iloc[i]
            # Temporal filter: keep only pings within the ±Δt window.
            in_window = (
                (timestamps.iloc[neighbour_idxs] >= t_centre - half_window) &
                (timestamps.iloc[neighbour_idxs] <= t_centre + half_window)
            )
            windowed_idxs = np.array(neighbour_idxs)[in_window.values]

            # Privacy check: count *distinct* device IDs, not raw pings.
            # A single device appearing multiple times adds no anonymity.
            unique_devices = df.iloc[windowed_idxs]["device_id"].nunique()
            if unique_devices >= target_k:
                valid_count += 1

        coverage = valid_count / len(df)
        stats[round(r_km, 4)] = {"coverage": coverage, "valid_count": valid_count}

        if coverage >= coverage_target:
            optimal_r = r_km
            break

    if optimal_r == max_radius_km and stats.get(round(max_radius_km, 4), {}).get("coverage", 0) < coverage_target:
        import warnings
        warnings.warn(
            f"Coverage target {coverage_target:.0%} not reached at max_radius={max_radius_km} km. "
            "Consider widening time_window_hours or lowering target_k for sparse zones.",
            UserWarning,
            stacklevel=2,
        )

    return optimal_r, stats

Verification Snippet

After finding optimal_r, confirm the result before writing generalised output to any downstream system:

def verify_k_anonymity(
    df: pd.DataFrame,
    optimal_r_km: float,
    target_k: int,
    time_window_hours: float,
) -> dict:
    """
    Audit the k-anonymity result: report coverage, mean cluster size,
    and the fraction of records that must be suppressed.
    """
    r_m = optimal_r_km * 1000.0
    lat_centre = np.radians(df["lat"].mean())
    x = df["lon"] * 111_320 * np.cos(lat_centre)
    y = df["lat"] * 111_320
    coords = np.column_stack([x, y])
    timestamps = pd.to_datetime(df["timestamp"])
    half_window = pd.Timedelta(hours=time_window_hours)

    tree = cKDTree(coords)
    neighbour_lists = tree.query_ball_point(coords, r_m)

    cluster_sizes = []
    compliant = 0
    for i, idxs in enumerate(neighbour_lists):
        t_c = timestamps.iloc[i]
        in_win = (
            (timestamps.iloc[idxs] >= t_c - half_window) &
            (timestamps.iloc[idxs] <= t_c + half_window)
        )
        n_devices = df.iloc[np.array(idxs)[in_win.values]]["device_id"].nunique()
        cluster_sizes.append(n_devices)
        if n_devices >= target_k:
            compliant += 1

    total = len(df)
    return {
        "optimal_radius_km": optimal_r_km,
        "coverage_fraction": compliant / total,
        "suppression_fraction": (total - compliant) / total,
        "mean_cluster_size": float(np.mean(cluster_sizes)),
        "min_cluster_size": int(np.min(cluster_sizes)),
        "records_to_suppress": total - compliant,
    }

# Expected output for a well-calibrated threshold:
# coverage_fraction    >= 0.95
# suppression_fraction <= 0.05
# min_cluster_size is taken over all records, so it falls below target_k
# whenever any record is non-compliant; those are the records to suppress.

A healthy result shows coverage_fraction >= 0.95, mean_cluster_size well above target_k (indicating headroom for sparse sub-zones), and suppression_fraction low enough to preserve analytic utility. If suppression exceeds 10 %, investigate whether a non-uniform density zone is pulling results down and consider splitting the dataset by administrative boundary before rerunning.
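One way to run that investigation is to break the suppression fraction down by zone. The sketch below assumes you already have, for each record, a zone label and the distinct-device count of its cluster. Both inputs and the function name are hypothetical, since verify_k_anonymity as written returns only aggregate figures.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def suppression_by_zone(
    records: Iterable[Tuple[str, int]],
    target_k: int,
) -> Dict[str, float]:
    """Map each zone to the fraction of its records whose cluster has < target_k devices."""
    totals: Dict[str, int] = defaultdict(int)
    failing: Dict[str, int] = defaultdict(int)
    for zone, n_devices in records:
        totals[zone] += 1
        if n_devices < target_k:
            failing[zone] += 1
    return {zone: failing[zone] / totals[zone] for zone in totals}


if __name__ == "__main__":
    # Made-up (zone, distinct-device count) pairs for illustration.
    sample = [("urban", 12), ("urban", 9), ("urban", 4), ("rural", 2), ("rural", 6)]
    for zone, frac in suppression_by_zone(sample, target_k=5).items():
        flag = "investigate" if frac > 0.10 else "ok"
        print(f"{zone}: {frac:.0%} suppressed ({flag})")
```

A zone whose fraction sits far above the dataset-wide figure is a natural candidate for its own radius.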

Edge Cases & Adjustments

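High-frequency sampling bias. Sub-minute GPS pings create autocorrelated artificial density. In any pipeline that counts raw pings rather than distinct devices, k then appears satisfied by the same device sampled many times rather than by truly distinct individuals. The functions above count distinct device_id values, so repeated pings do not raise the per-cluster device count, but they still inflate any ρ estimated from raw ping counts and weight coverage_fraction toward heavily sampled devices. Downsample trajectories to 1–5 minute intervals before running the threshold calculation.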
Non-uniform density zones. A single city-wide r value will over-generalise dense urban areas and under-protect sparse rural ones. Segment the dataset by H3 hexagonal grid resolution 5 or by administrative boundary, compute a separate optimal_r per segment, and apply zone-specific masking. This approach aligns with grid aggregation and spatial binning strategies that partition space before applying privacy transforms.
Projection accuracy for large extents. The cosine-corrected degree-to-metre conversion in the code above accumulates error beyond ~500 km. For national or continental datasets, project to the relevant UTM zone using pyproj.Transformer before computing distances. Treat EPSG:3857 as a last resort: Web Mercator stretches distances by roughly 1/cos(latitude), so a radius measured in its units is shorter on the ground than it appears. An incorrect CRS can cause the computed r to satisfy k in projected space while failing in on-the-ground distances.
Re-identification risk from auxiliary data. K-anonymity guarantees indistinguishability within the anonymised dataset but does not bound re-identification risk from auxiliary sources. When the anonymised traces could be joined with public venue check-ins or road-network topology, supplement spatial k-anonymity with a calibrated privacy budget (ε) applied through the Laplace or Gaussian mechanism to bound worst-case disclosure.
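For the projection adjustment, the UTM zone has to be chosen from the data's location. A minimal sketch of that choice follows. utm_epsg_for is a hypothetical helper; it applies the standard 6-degree zone rule and ignores the special-case zones around Norway and Svalbard. The returned string is meant to be passed as the target CRS to pyproj.Transformer.from_crs.

```python
import math


def utm_epsg_for(lon: float, lat: float) -> str:
    """Return the WGS84 UTM EPSG code (326xx north, 327xx south) for a point."""
    if not -180.0 <= lon <= 180.0 or not -90.0 <= lat <= 90.0:
        raise ValueError("lon/lat out of range")
    # Zones are 6 degrees wide, numbered 1-60 eastward from 180°W.
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)  # lon = 180 maps to zone 60
    base = 32600 if lat >= 0 else 32700
    return f"EPSG:{base + zone}"


if __name__ == "__main__":
    print(utm_epsg_for(13.4, 52.5))
```

For a dataset that spans several zones, one common compromise is to use the zone of the dataset centroid and accept growing distortion toward the edges; splitting by zone is the stricter alternative.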

FAQ

What k value satisfies GDPR for mobile location data?

GDPR does not mandate a specific k value, but Data Protection Authorities in Germany and France have informally accepted k ≥ 5 for aggregate mobility statistics, with k ≥ 10 recommended for sensitive categories such as health or home-location traces. Document your reasoning and the empirical device density used to derive the threshold in your Data Protection Impact Assessment (DPIA).

How does device density affect the required masking radius?

Higher density allows a smaller radius to collect k distinct devices. Under the density formula, $r \propto 1/\sqrt{\rho}$ at fixed k and $\Delta t$, so a 100-fold drop in density implies a 10-fold larger radius. In a dense urban core ($\rho \approx 500$ devices/km²/hr) a 250 m radius can satisfy k = 5; a rural zone ($\rho \approx 5$ devices/km²/hr) may require r > 2 km for the same k, which can make generalization too coarse for pedestrian-scale analytics. In that case, either reduce k for the sparse zone (with documented justification) or suppress records that cannot meet the threshold.
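The trade-off described above can be sketched as a per-zone decision rule. The function is illustrative only: plan_zone and max_radius_km are hypothetical names, and the radius comes from the uniform-density formula, so real zones will usually need more than it suggests.

```python
import math
from typing import Tuple


def plan_zone(
    rho: float,
    target_k: int,
    window_hours: float,
    max_radius_km: float,
) -> Tuple[str, float]:
    """
    Decide how to treat one zone.

    Returns ("mask", radius_km) when the formula radius fits within
    max_radius_km; otherwise ("lower_k_or_suppress", achievable_k), where
    achievable_k is the expected device count at max_radius_km.
    """
    needed_r = math.sqrt(target_k / (rho * math.pi * window_hours))
    if needed_r <= max_radius_km:
        return "mask", needed_r
    achievable_k = rho * math.pi * max_radius_km ** 2 * window_hours
    return "lower_k_or_suppress", achievable_k
```

The second return value tells you how far short the zone falls, which is the figure to record when documenting a reduced k.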

Can I use k-anonymity alone, or do I also need differential privacy?

K-anonymity alone is vulnerable to homogeneity and background-knowledge attacks. For high-sensitivity traces — home-location inference, medical facility visits, or commute-pattern reconstruction — combine spatial k-anonymity with a calibrated privacy budget via the Laplace and Gaussian noise mechanisms for coordinate data to bound worst-case re-identification risk beyond what grouping alone provides.
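As a rough illustration of the Laplace mechanism on coordinates, the sketch below adds independent Laplace noise to projected x/y values in metres. It rests on stated assumptions: l1_sensitivity_m (the L1 sensitivity of the released coordinate pair) and epsilon must come from your own privacy analysis, the function name is hypothetical, and nothing here calibrates the budget for you.

```python
import random
from typing import Optional, Tuple


def laplace_perturb(
    x_m: float,
    y_m: float,
    epsilon: float,
    l1_sensitivity_m: float,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Add Laplace(0, l1_sensitivity_m / epsilon) noise to each projected coordinate."""
    if epsilon <= 0 or l1_sensitivity_m <= 0:
        raise ValueError("epsilon and l1_sensitivity_m must be positive")
    rng = rng or random.Random()
    scale = l1_sensitivity_m / epsilon

    def laplace() -> float:
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with location 0 and that scale.
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

    return x_m + laplace(), y_m + laplace()
```

Python's random module is used only to keep the sketch dependency-free; a production release would draw noise from a cryptographically secure source and account for floating-point effects.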

How do high-frequency pings distort the threshold calculation?

Sub-minute samples create autocorrelated density that inflates the apparent neighbour count without adding independent devices. Downsample to 1–5 minute intervals before running the threshold calculation to avoid overestimating k and inadvertently under-masking the dataset. Validate by checking that mean_cluster_size / target_k stays above ~1.5 after downsampling; a ratio comfortably above that level indicates the downsampled density still provides headroom.
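The downsampling step can be sketched with pandas as follows. This is one possible approach, assuming the same device_id and timestamp columns used by the functions above: it keeps the earliest ping per device in each fixed interval. downsample_pings and its interval parameter are hypothetical names.

```python
import pandas as pd


def downsample_pings(df: pd.DataFrame, interval: str = "1min") -> pd.DataFrame:
    """Keep the earliest ping per device within each fixed time bucket."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    # Floor each timestamp to its bucket start, e.g. 00:00:50 -> 00:00:00.
    out["_bucket"] = out["timestamp"].dt.floor(interval)
    out = (
        out.sort_values("timestamp")
        .drop_duplicates(subset=["device_id", "_bucket"], keep="first")
        .drop(columns="_bucket")
        .reset_index(drop=True)
    )
    return out
```

Pass `interval="5min"` for the coarser end of the 1–5 minute range. Keeping the first ping is a simplification; averaging positions within a bucket is an alternative if a representative location matters more than an observed one.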

Related

K-Anonymity Grouping for Location Traces: the parent technique, covering algorithmic specification, prerequisites, and step-by-step implementation
Re-identification Risk Assessment for Geospatial Datasets: how to quantify auxiliary-join attack risk after applying k-anonymity
Privacy Budget Allocation for Spatial Queries: composing ε across multiple spatial releases to complement k-anonymity grouping
Grid Aggregation & Spatial Binning Strategies: zone-based partitioning strategies that pair with per-zone k-anonymity thresholds
Coordinate Jittering & Noise Injection Methods: perturbation alternative when k-anonymity spatial generalization is too coarse for the analytic use case
