How to Calculate Re-identification Risk for GPS Logs

To calculate re-identification risk for GPS logs, discretize continuous spatiotemporal coordinates into equivalence classes, compute the size distribution of those classes, and derive risk as the inverse of the smallest class size (1/k) or as the proportion of uniquely identifiable trajectories. The standard workflow applies spatial-temporal binning (e.g., hexagonal grids + hourly windows), groups trajectories by identical bin sequences, and measures how many records fall into each group. If an adversary possesses partial background knowledge, risk is estimated via Monte Carlo sampling that simulates partial queries against the binned dataset and records the match rate. This quantitative approach aligns with established Re-identification Risk Assessment for Geospatial Datasets methodologies used by public-sector tech teams and compliance officers.

Core Calculation Methodology

GPS logs are high-dimensional and temporally continuous, so raw coordinates almost never repeat exactly and exact coordinate matching is impractical for risk modeling. The calculation proceeds in three steps:

  1. Discretization: Convert raw (lat, lon, timestamp) tuples into categorical bins. Spatial resolution typically ranges from 100m to 1km depending on use case; temporal resolution ranges from 15 minutes to 24 hours. Coarser bins reduce precision but increase anonymity.
  2. Equivalence Class Formation: Group trajectories by identical bin sequences. Each unique sequence forms an equivalence class. Users sharing the same sequence are indistinguishable at the chosen resolution.
  3. Risk Quantification:
  • k-anonymity risk: Risk = 1 / min(class_sizes)
  • Uniqueness rate: Risk = count(class_size == 1) / total_trajectories
  • Adversarial simulation: Sample m random spatiotemporal points per trajectory (m is the number of points the adversary is assumed to know), query the dataset, and measure the probability of a singleton match.
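
As a worked illustration of the first two metrics, the sketch below uses five made-up trajectories that are assumed to be already reduced to bin sequences. The user labels, cell labels, and variable names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

# Hypothetical binned trajectories: user -> sequence of (cell, hour) bins.
trajectories = {
    "u1": (("cell_A", 8), ("cell_B", 9)),
    "u2": (("cell_A", 8), ("cell_B", 9)),
    "u3": (("cell_A", 8), ("cell_B", 9)),
    "u4": (("cell_A", 8), ("cell_C", 9)),
    "u5": (("cell_D", 7), ("cell_B", 9)),
}

# Equivalence classes: identical bin sequences fall into the same class.
class_sizes = Counter(trajectories.values())
sizes = list(class_sizes.values())  # one class of 3, two classes of 1

# k-anonymity risk: inverse of the smallest class size.
k_anonymity_risk = 1 / min(sizes)

# Uniqueness rate: trajectories in singleton classes over all trajectories.
uniqueness_rate = sum(1 for s in sizes if s == 1) / sum(sizes)

print(k_anonymity_risk, uniqueness_rate)  # 1.0 0.4
```

Note that the uniqueness rate divides by the number of trajectories (5), not by the number of classes (3); dividing by classes would overstate it as 2/3.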

The foundational threat model assumes adversaries possess auxiliary datasets (e.g., social media check-ins, public transit records, or commercial mobility feeds). As outlined in Spatial Privacy Fundamentals & Threat Modeling, risk thresholds must be defined before release. k ≥ 5 is a common floor for low-risk internal sharing, while public releases usually call for k ≥ 10 or a differential privacy guarantee. These values are conventions rather than statutory limits: NIST SP 800-122, the guide to protecting the confidentiality of PII, treats de-identification as a safeguard but does not prescribe a specific k, so record the chosen threshold and its rationale before any cross-agency data sharing.

Python Implementation

The following script calculates baseline re-identification risk using spatial-temporal binning and equivalence class analysis. It assumes a pandas DataFrame with user_id, lat, lon, and timestamp columns.

import pandas as pd
import numpy as np
from collections import Counter

def calculate_gps_reidentification_risk(df, spatial_res_km=0.5, temporal_res_h=1.0):
    """
    Calculate baseline re-identification risk for GPS logs.
    df must contain: user_id, lat, lon, timestamp
    Returns dict with risk metrics and class distribution.
    """
    required_cols = {"user_id", "lat", "lon", "timestamp"}
    if not required_cols.issubset(df.columns):
        raise ValueError(f"DataFrame must contain {required_cols}")

    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # 1. Spatial binning: 1° latitude ≈ 111.32 km
    lat_bin_size = spatial_res_km / 111.32
    # Longitude degrees shrink toward poles; use mean latitude for approximation
    lon_bin_size = spatial_res_km / (111.32 * np.cos(np.radians(df["lat"].mean())))

    df["lat_bin"] = np.floor(df["lat"] / lat_bin_size)
    df["lon_bin"] = np.floor(df["lon"] / lon_bin_size)

    # 2. Temporal binning
    df["time_bin"] = df["timestamp"].dt.floor(f"{temporal_res_h}h")

    # 3. Create trajectory sequence per user (sorted chronologically)
    df = df.sort_values(["user_id", "timestamp"])
    trajectory_seq = (
        df.groupby("user_id")[["lat_bin", "lon_bin", "time_bin"]]
        .apply(lambda g: tuple(g.itertuples(index=False, name=None)))
    )

    # 4. Equivalence class distribution
    class_counts = Counter(trajectory_seq)
    sizes = np.array(list(class_counts.values()))

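    # 5. Risk metrics
    total_trajectories = int(sizes.sum()) if len(sizes) > 0 else 0
    min_k = int(sizes.min()) if len(sizes) > 0 else 0
    # Worst case: a member of the smallest class (empty input is reported as 1.0)
    k_anonymity_risk = 1.0 / min_k if min_k > 0 else 1.0
    # Share of trajectories (not of classes) that are alone in their class
    uniqueness_rate = (
        float(np.sum(sizes == 1)) / total_trajectories if total_trajectories > 0 else 0.0
    )

    return {
        "k_anonymity_risk": round(k_anonymity_risk, 4),
        "uniqueness_rate": round(uniqueness_rate, 4),
        "min_class_size": min_k,
        "total_trajectories": total_trajectories,
        "total_unique_trajectories": len(sizes),
        # Histogram: class size -> number of classes of that size
        "class_size_distribution": dict(Counter(sizes.tolist())),
    }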

Production Notes:

  • The grid approximation uses mean latitude for longitude scaling. For global datasets or high-precision requirements, replace manual binning with H3 hexagonal indexing or Google’s S2 library.
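  • 1/k represents worst-case risk: it describes members of the smallest equivalence class, not the average record. If your dataset contains 10,000 trajectories and the smallest equivalence class contains 4 users, the baseline re-identification probability for those users is 25%, regardless of how large the other classes are.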
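  • Replace user_id with a keyed hash or random pseudonym before risk calculation to prevent accidental linkage during analysis. Do not drop the column outright: the calculation still needs a stable per-user key to group records into trajectories.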
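
One way to pseudonymize identifiers before analysis is a keyed hash from the Python standard library. This is a minimal sketch, not a vetted key-management scheme: the pseudonymize helper, the sample identifiers, and the choice of HMAC-SHA-256 are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

def pseudonymize(user_id, key):
    """Return a keyed SHA-256 digest that is stable for a given key."""
    return hmac.new(key, str(user_id).encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical secret key: generate once per analysis and never store it
# alongside the output, otherwise the mapping can be recomputed.
key = secrets.token_bytes(32)

raw_ids = ["alice@example.org", "bob@example.org", "alice@example.org"]
pseudonyms = [pseudonymize(u, key) for u in raw_ids]

# The same user always maps to the same pseudonym, so per-user grouping
# still works, while different users map to different pseudonyms.
print(pseudonyms[0] == pseudonyms[2], pseudonyms[0] == pseudonyms[1])
```

With a DataFrame, the same helper could be applied as df["user_id"] = df["user_id"].map(lambda u: pseudonymize(u, key)) before the risk function is called. A keyed hash is preferred over a plain hash here because unsalted hashes of guessable identifiers can be reversed by enumeration.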

Adversarial Simulation & Partial Knowledge

Baseline k-anonymity assumes an adversary knows the full trajectory sequence. Real-world attackers typically possess partial background knowledge (e.g., two timestamps and approximate locations). To model this, implement a Monte Carlo simulation:

  1. Randomly sample m spatiotemporal points from each trajectory (typically m = 2 or 3).
  2. Query the binned dataset for all records matching those points.
  3. Record whether the query returns exactly one user (singleton match).
  4. Repeat across 10,000+ iterations to estimate the empirical match rate.
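
The steps above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function name, the dict-of-bins input format, and the choice to draw one random target trajectory per iteration are all hypothetical, and the linear scan over users is only practical for small datasets.

```python
import random

def estimate_singleton_match_rate(trajectories, m=2, iterations=10_000, seed=0):
    """Estimate how often m known points single out exactly one user.

    trajectories: dict mapping user -> iterable of hashable, sortable bins,
    e.g. (cell, time_bin) tuples produced by the discretization step.
    """
    rng = random.Random(seed)
    point_sets = {user: frozenset(points) for user, points in trajectories.items()}
    # Sorted lists make sampling reproducible for a fixed seed.
    point_lists = {user: sorted(points) for user, points in point_sets.items()}
    users = sorted(point_sets)
    if not users:
        return 0.0

    singleton_hits = 0
    for _ in range(iterations):
        # 1. Pick a target and sample up to m of its points as adversary knowledge.
        target = rng.choice(users)
        points = point_lists[target]
        known = rng.sample(points, min(m, len(points)))
        # 2. Query: count users whose trajectory contains every known point.
        matches = sum(1 for pts in point_sets.values() if pts.issuperset(known))
        # 3. Record a hit only when the query isolates a single user.
        if matches == 1:
            singleton_hits += 1
    # 4. The empirical match rate over all iterations.
    return singleton_hits / iterations

# Toy data: u1 and u3 are identical, u2 differs only in its evening bin.
toy = {
    "u1": [("A", 8), ("B", 9), ("C", 18)],
    "u2": [("A", 8), ("B", 9), ("D", 18)],
    "u3": [("A", 8), ("B", 9), ("C", 18)],
}
rate = estimate_singleton_match_rate(toy, m=2, iterations=2_000)
print(rate)
```

In this toy dataset only u2 can be singled out, and only when the sample includes its evening bin, so the estimate should hover around 2/9. For realistic data volumes, an inverted index from bin to users would replace the linear scan.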

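This approach shows how few known points are enough to single out a trajectory, and it is in keeping with the attacker-oriented framing of ISO/IEC 20889 on privacy-enhancing data de-identification techniques, where the question is what an adversary can do rather than whether a static metric threshold is met. Because a partial query matches every trajectory that contains the sampled points, the singleton match rate for small m cannot exceed the full-knowledge uniqueness rate. In practice it often approaches that rate with only a handful of points, especially in sparsely populated suburban or rural regions where trajectory overlap is naturally low.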

Compliance Thresholds & Operational Guidance

Risk thresholds are not universal; they depend on data sensitivity, jurisdiction, and intended use. Public-sector releases generally require stricter controls than internal analytics. Follow these operational guardrails:

  • Define k before processing: Fix the acceptable risk threshold during the data governance phase. Choosing or relaxing the threshold after seeing the results, or hand-picking bin sizes that happen to pass, introduces selection bias and tends to reduce utility.
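  • Report distribution, not just minimum: A single summary figure hides the shape of the data. A dataset in which 60% of trajectories sit in classes of five or more still fails practical anonymity if the other 40% are singletons (its min(k) is then 1, not 5). Always publish the full class-size histogram alongside summary metrics.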
  • Apply suppression or generalization: If risk exceeds thresholds, suppress trajectories with k < threshold or coarsen spatial/temporal resolution iteratively until compliance is met.
  • Document auxiliary data assumptions: Explicitly state which external datasets were considered during threat modeling. Compliance officers require this context to validate risk acceptance.
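
The suppression and reporting guardrails above can be sketched as follows. The function name, the input format (a dict from user to a hashable bin sequence), and the toy data are assumptions made for illustration; a production pipeline would also need to log what was suppressed and why.

```python
from collections import Counter

def suppress_small_classes(trajectories, k):
    """Drop trajectories whose equivalence class has fewer than k members.

    trajectories: dict mapping user -> hashable bin sequence (e.g. a tuple).
    Returns (kept, histogram), where histogram maps class size -> number of
    classes of that size, measured before suppression.
    """
    class_sizes = Counter(trajectories.values())
    histogram = dict(Counter(class_sizes.values()))
    kept = {u: seq for u, seq in trajectories.items() if class_sizes[seq] >= k}
    return kept, histogram

# Toy data: two users share a sequence, one user is alone in their class.
toy = {
    "u1": ("A8", "B9"),
    "u2": ("A8", "B9"),
    "u3": ("A8", "C9"),
}
kept, histogram = suppress_small_classes(toy, k=2)
print(sorted(kept), histogram)  # ['u1', 'u2'] {2: 1, 1: 1}
```

Reporting the pre-suppression histogram alongside the released data makes the cost of suppression visible: here one of three trajectories is withheld to reach k = 2.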

By combining deterministic equivalence class analysis with adversarial simulation, teams can quantify re-identification risk transparently and align geospatial data releases with modern privacy engineering standards.