Implementing FedAvg for Spatial Time-Series

Federated Averaging (FedAvg) over spatial time-series is the point where a global model is rebuilt from per-silo weight updates without any raw coordinate, timestamp, or trajectory ever leaving its jurisdiction. It is the canonical synchronization primitive inside model synchronization strategies, the operational layer of federated learning workflows for geospatial data. Spatial time-series make the textbook FedAvg loop unsafe out of the box: clients poll at irregular cadences, sit in different time zones, and carry sharp spatial autocorrelation, so the IID assumption behind plain gradient averaging breaks down, producing convergence instability and re-identification risk. This guide pins down the tunable knobs for a temporally-aligned, spatially-weighted FedAvg round, gives one reference coordinator with a runnable validation harness, and walks through the failures (leakage spikes, variance divergence, and budget exhaustion) that break synchronization across regulated silos.

Parameter Configuration and Calibration

Every knob below moves either convergence stability or the privacy guarantee, so calibrate each against your silo cadence and the sensitivity of the underlying mobility layer rather than copying defaults. Two formulas anchor the round. The decoupled local learning rate scales inversely with the square root of the sensor count, so dense urban silos do not dominate the descent:

$$\eta^{k}_{\text{local}} = \frac{\eta_0}{\sqrt{n^{k}_{\text{sensors}}}}, \qquad \eta_0 = 10^{-3}$$

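The server reconstructs the global weights as a trust-weighted mean rather than the sample-count-weighted average of vanilla FedAvg, where $\tau_k$ is the administrative-boundary trust score of silo $k$ and $w^{k}_{t}$ is that silo's locally trained weights in round $t$: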

$$w_{t+1} = \sum_{k=1}^{K} \frac{\tau_k}{\sum_{j=1}^{K} \tau_j}\, w^{k}_{t}$$
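
As a quick sanity check, the two formulas can be evaluated by hand on toy numbers. The sketch below is illustrative only: the helper names `local_lr` and `trust_weighted_mean`, the two-silo sensor counts, and the weight vectors are assumptions made for the example, not part of any library.

```python
import numpy as np

ETA_0 = 1e-3  # base learning rate eta_0 from the formula above


def local_lr(n_sensors: int, eta_0: float = ETA_0) -> float:
    """eta_0 / sqrt(n_sensors); an empty silo is treated as one sensor."""
    return eta_0 / float(np.sqrt(max(n_sensors, 1)))


def trust_weighted_mean(client_weights: np.ndarray, trust: np.ndarray) -> np.ndarray:
    """sum_k (tau_k / sum_j tau_j) * w_k, taken over the leading client axis."""
    trust = np.asarray(trust, dtype=np.float64)
    if trust.sum() <= 0:
        raise ValueError("trust scores must have a positive sum")
    normalized = trust / trust.sum()
    return np.tensordot(normalized, client_weights, axes=1)


# Hypothetical silos: 400 urban sensors versus 16 rural sensors.
urban_lr = local_lr(400)  # 1e-3 / 20 = 5e-5
rural_lr = local_lr(16)   # 1e-3 / 4  = 2.5e-4

# Two clients, each holding a 2-parameter weight vector.
w = np.array([[1.0, 3.0],
              [3.0, 1.0]])
merged = trust_weighted_mean(w, np.array([0.6, 0.4]))
# 0.6 * [1, 3] + 0.4 * [3, 1] = [1.8, 2.2]
```

The rural silo ends up with a step size five times larger than the urban one, while the merge leans toward the higher-trust client regardless of how many sensors it has.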

eta_0 — base learning rate (typ. 1e-3). The pre-density step size. Spatial time-series are noisier than tabular streams, so start an order of magnitude below a centralized baseline and raise only if per-silo loss plateaus before the variance gate trips.
n_sensors — density scaling. Dividing by $\sqrt{n_{\text{sensors}}}$ down-weights high-density silos whose gradients would otherwise steer the global objective toward dense regions. This is the per-round expression of the geographic balancing enforced upstream by client selection algorithms.
momentum = 0.0. Disable optimizer momentum on the client. Across asynchronous, irregularly-sampled silos, momentum carries stale temporal direction into the next window and produces oscillation rather than convergence — a non-IID failure mode covered in depth under handling non-IID geospatial data.
clip_norm (C, typ. 1.0). The per-client $\ell_2$ gradient-clipping bound, applied before transmission. It caps any single silo’s influence (bounding the gradient-poisoning blast radius) and fixes the sensitivity $\Delta = C$ that the differential privacy noise is calibrated against.
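noise_multiplier (z, typ. 1.1). The Gaussian noise scale is $\sigma = z \cdot C$. Under the classical Gaussian-mechanism bound, a single clipped release is $(\varepsilon, \delta)$-private when $z \ge \frac{\sqrt{2\ln(1.25/\delta)}}{\varepsilon}$, a bound derived for $\varepsilon < 1$. At $\delta = 10^{-5}$ the numerator is about 4.84, so plugging in $\varepsilon = 2.0$ gives $z \approx 2.42$; the typical $z = 1.1$ therefore does not meet that target under this bound alone. Choosing the $\varepsilon$ regime is a deliberate trade documented in the central vs local differential privacy comparison.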
window_hours — temporal alignment (typ. 24, UTC). Clients must aggregate to a common rolling window normalized to UTC before training. Skipping normalization lets a silo’s local-midnight boundary leak its timezone_offset, and misaligned windows average gradients computed over non-overlapping intervals.
trust_score (τ). The spatial weighting factor — inverse-distance or administrative-boundary trust — that replaces the sample-count weighting of vanilla FedAvg, so a low-trust or sparsely-validated jurisdiction cannot dominate the merge.
loss_variance_threshold (typ. 0.05 over 3 rounds). The convergence gate. Sustained global-loss variance above this band signals divergence or overfitting to a dense sensor cluster and triggers a client-selection audit.
morans_i_threshold (typ. 0.3). Moran’s I on residual gradients measures spatial autocorrelation leakage; above this cutoff the residuals encode recoverable geography and the round is treated as a disclosure incident.
checkpoint_interval (k, typ. 5 rounds). How often the global state is snapshotted so a failed gate or exhausted budget can roll back to a known-stable model rather than ship a poisoned one.
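
The noise_multiplier floor in the list above is easy to evaluate directly. The following is a minimal sketch of the formula $z \ge \sqrt{2\ln(1.25/\delta)}/\varepsilon$ using only the standard library; the function name `gaussian_noise_multiplier` is an assumption for illustration, and the bound is the classical single-release one, not a composition-aware accountant.

```python
import math


def gaussian_noise_multiplier(epsilon: float, delta: float) -> float:
    """Classical Gaussian-mechanism floor: sqrt(2 * ln(1.25 / delta)) / epsilon."""
    if epsilon <= 0 or not (0.0 < delta < 1.0):
        raise ValueError("epsilon must be positive and delta must lie in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


CLIP_NORM = 1.0  # C, the per-client clipping bound

z_floor = gaussian_noise_multiplier(epsilon=2.0, delta=1e-5)  # about 2.42
sigma = z_floor * CLIP_NORM                                   # noise std = z * C
```

Running the calculation before fixing `noise_multiplier` in a config makes it obvious when a chosen value sits below the floor implied by the target $(\varepsilon, \delta)$.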

Validate the configuration against a replay of historical rounds before promotion: if more than 5% of windows fail UTC alignment or the trust weights form a degenerate distribution (for example, they sum to zero), fix calibration before any gradient is exchanged.
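
One way to run the UTC-alignment part of that replay is sketched below with the standard library only. The helper names `utc_window_start` and `misaligned_fraction`, the sample timestamps, and the choice to floor windows from the Unix epoch are all assumptions for illustration; a real pipeline may define window boundaries differently.

```python
from datetime import datetime, timedelta, timezone
from typing import List

WINDOW_HOURS = 24
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def utc_window_start(ts: datetime, window_hours: int = WINDOW_HOURS) -> datetime:
    """Floor a timezone-aware timestamp to the start of its UTC window."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    window = timedelta(hours=window_hours)
    elapsed = ts.astimezone(timezone.utc) - EPOCH
    return EPOCH + (elapsed // window) * window


def misaligned_fraction(window_starts: List[datetime], round_start: datetime) -> float:
    """Share of client windows whose start differs from the global round's."""
    if not window_starts:
        return 0.0
    bad = sum(1 for w in window_starts if w != round_start)
    return bad / len(window_starts)


# A silo at UTC-5 and a silo at UTC land in the same 24 h UTC window.
tz_minus5 = timezone(timedelta(hours=-5))
start_a = utc_window_start(datetime(2024, 3, 1, 22, 30, tzinfo=tz_minus5))
start_b = utc_window_start(datetime(2024, 3, 2, 10, 0, tzinfo=timezone.utc))
round_start = datetime(2024, 3, 2, tzinfo=timezone.utc)

# A third silo that wrongly floored to its local midnight (05:00 UTC).
start_c = datetime(2024, 3, 2, 5, 0, tzinfo=timezone.utc)
fraction = misaligned_fraction([start_a, start_b, start_c], round_start)
needs_recalibration = fraction > 0.05  # the 5% replay threshold from the text
```

Here one of three windows is misaligned, well above the 5% threshold, so the replay would block promotion.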

Reference Implementation

The coordinator below runs one spatially-weighted FedAvg round end to end: density-scaled local training with clipping on each client, trust-weighted aggregation that broadcasts across any parameter rank, calibrated server-side Gaussian noise, and periodic rollback snapshots. Dtype handling and every privacy-relevant choice are marked inline.

python

from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np
import torch
from torch.nn.utils import clip_grad_norm_


@dataclass
class SpatialClientConfig:
    """Per-silo metadata the round needs to weight and align an update."""
    client_id: str
    n_sensors: int          # sensor density; drives the decoupled learning rate
    trust_score: float      # administrative-boundary / inverse-distance weight (tau)
    timezone_offset: float  # hours from UTC; used only to align windows, never shared


class SpatialFedClient:
    """Runs density-scaled, gradient-clipped local training on aligned windows."""

    def __init__(self, config: SpatialClientConfig, model: torch.nn.Module,
                 eta_0: float = 1e-3, clip_norm: float = 1.0) -> None:
        self.config = config
        self.model = model
        self.clip_norm = clip_norm
        # Decoupled learning rate: eta_0 / sqrt(n_sensors) so dense silos do not
        # steer the global objective toward high-density geography.
        self.lr = eta_0 / np.sqrt(max(config.n_sensors, 1))
        # momentum=0.0: stale temporal direction across irregular windows
        # oscillates instead of converging on spatial time-series.
        self.optimizer = torch.optim.SGD(
            self.model.parameters(), lr=self.lr, momentum=0.0
        )

    def train_step(
        self, batch: Tuple[torch.Tensor, torch.Tensor]
    ) -> Dict[str, torch.Tensor]:
        self.model.train()
        inputs, targets = batch
        self.optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(self.model(inputs), targets)
        loss.backward()
        # Clip BEFORE the step so the gradient's L2 norm is at most clip_norm;
        # with plain SGD the resulting weight change is at most lr * clip_norm.
        clip_grad_norm_(self.model.parameters(), max_norm=self.clip_norm)
        self.optimizer.step()
        # Detached copy of the post-step state for trust-weighted aggregation.
        return {k: v.clone().detach() for k, v in self.model.state_dict().items()}


class SpatialFedServer:
    """Trust-weighted FedAvg aggregator with DP noise and rollback snapshots."""

    def __init__(self, global_model: torch.nn.Module,
                 clip_norm: float = 1.0, noise_multiplier: float = 1.1,
                 checkpoint_interval: int = 5) -> None:
        self.global_model = global_model
        self.clip_norm = clip_norm
        self.noise_multiplier = noise_multiplier
        self.checkpoint_interval = checkpoint_interval
        self.round_idx = 0
        self.global_loss_history: List[float] = []
        self.rollback_checkpoints: Dict[int, Dict[str, torch.Tensor]] = {}

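    def aggregate(self, client_states: List[Dict[str, torch.Tensor]],
                  configs: List[SpatialClientConfig]) -> None:
        if not client_states or len(client_states) != len(configs):
            raise ValueError("need at least one client state and one config per state")
        # Trust-weighted mean: tau_k / sum(tau) replaces vanilla sample weighting
        # so a low-trust jurisdiction cannot dominate the merged model.
        weights = np.array([c.trust_score for c in configs], dtype=np.float64)
        if (weights < 0).any() or weights.sum() <= 0:
            raise ValueError("trust weights must be non-negative with a positive "
                             "sum; degenerate distribution")
        weights = weights / weights.sum()
        # Advance the round counter only after the inputs are validated, so a
        # rejected round does not shift the checkpoint schedule.
        self.round_idx += 1

        aggregated: Dict[str, torch.Tensor] = {}
        for key in client_states[0]:
            stacked = torch.stack([s[key] for s in client_states])
            if not stacked.is_floating_point():
                # Integer buffers (e.g. BatchNorm's num_batches_tracked) cannot be
                # averaged with fractional weights; keep the first client's copy.
                aggregated[key] = stacked[0].clone()
                continue
            w = torch.tensor(weights, dtype=stacked.dtype, device=stacked.device)
            # Broadcast across the parameter's full rank (1-D biases, 2-D linears,
            # 4-D conv kernels) instead of hard-coding trailing singleton dims.
            view = [-1] + [1] * (stacked.dim() - 1)
            aggregated[key] = torch.sum(stacked * w.view(view), dim=0)

        self.global_model.load_state_dict(aggregated)
        self._apply_dp_noise()
        self._snapshot()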

    def _apply_dp_noise(self) -> None:
        # Gaussian mechanism: sigma = z * C, added AFTER aggregation. The
        # (epsilon, delta) guarantee holds only if each client's contribution
        # to the aggregate is bounded by C in L2 norm.
        sigma = self.noise_multiplier * self.clip_norm
        with torch.no_grad():
            for param in self.global_model.parameters():
                param.add_(torch.randn_like(param) * sigma)

    def _snapshot(self) -> None:
        # Keep a clean rollback target every k rounds regardless of gate status.
        if self.round_idx % self.checkpoint_interval == 0:
            self.rollback_checkpoints[self.round_idx] = {
                k: v.clone() for k, v in self.global_model.state_dict().items()
            }
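
Snapshots only help if there is a restore path. The sketch below shows one possible rollback helper operating on a checkpoint dictionary shaped like `rollback_checkpoints` (round index mapped to a cloned state dict). The function name `rollback_to_latest` and the rule of restoring the newest snapshot taken strictly before the failed round are assumptions for illustration, not part of the coordinator above.

```python
from typing import Dict

import torch


def rollback_to_latest(model: torch.nn.Module,
                       checkpoints: Dict[int, Dict[str, torch.Tensor]],
                       failed_round: int) -> int:
    """Restore the newest snapshot taken strictly before `failed_round`."""
    eligible = [r for r in checkpoints if r < failed_round]
    if not eligible:
        raise LookupError("no checkpoint precedes the failed round")
    target = max(eligible)
    model.load_state_dict(checkpoints[target])
    return target


torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
# Snapshot at round 5; clone so later in-place updates cannot alter it.
checkpoints = {5: {k: v.clone() for k, v in model.state_dict().items()}}

# Simulate a bad round 7 that corrupts the weights in place.
with torch.no_grad():
    for p in model.parameters():
        p.add_(100.0)

restored_round = rollback_to_latest(model, checkpoints, failed_round=7)
weights_match = all(
    torch.equal(model.state_dict()[k], v) for k, v in checkpoints[5].items()
)
```

Excluding the failed round itself matters because its snapshot, if one was taken, already contains the update that tripped the gate.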

Validation Checkpoint

Synchronization logic must never reach production untested — a dropped clip, a degenerate weight vector, or a missing rollback is a direct path to either divergence or an unprotected release. The convergence gate and the harness below assert the invariants that matter on synthetic rounds; run them in CI.

python

def validate_convergence(server: SpatialFedServer,
                         threshold: float = 0.05) -> bool:
    """True while the last 3 rounds' global-loss variance stays under threshold."""
    if len(server.global_loss_history) < 3:
        return True
    recent_variance = float(np.var(server.global_loss_history[-3:]))
    return recent_variance <= threshold


def _test_spatial_fedavg() -> None:
    torch.manual_seed(0)

    def make_model() -> torch.nn.Module:
        return torch.nn.Linear(4, 1)

    configs = [
        SpatialClientConfig("urban", n_sensors=400, trust_score=0.6, timezone_offset=0.0),
        SpatialClientConfig("rural", n_sensors=16, trust_score=0.4, timezone_offset=-5.0),
    ]
    server = SpatialFedServer(make_model(), clip_norm=1.0, noise_multiplier=0.0,
                              checkpoint_interval=1)

    # 1. The dense silo's learning rate is strictly smaller than the sparse one's.
    clients = [SpatialFedClient(c, make_model()) for c in configs]
    assert clients[0].lr < clients[1].lr, "density scaling did not down-weight urban lr"

    # 2. Aggregation produces a finite, correctly shaped global state.
    batch = (torch.randn(8, 4), torch.randn(8, 1))
    states = [c.train_step(batch) for c in clients]
    server.aggregate(states, configs)
    merged = server.global_model.state_dict()["weight"]
    assert merged.shape == (1, 4) and torch.isfinite(merged).all(), "bad aggregate"

    # 3. A rollback checkpoint exists after a snapshot round.
    assert server.round_idx in server.rollback_checkpoints, "no rollback target saved"

    # 4. Degenerate (all-zero) trust weights must fail loudly, never average blindly.
    bad = [SpatialClientConfig("a", 10, 0.0, 0.0), SpatialClientConfig("b", 10, 0.0, 0.0)]
    try:
        server.aggregate(states, bad)
        raise AssertionError("zero trust weights were accepted")
    except ValueError:
        pass

    # 5. The convergence gate trips once variance exceeds the threshold.
    server.global_loss_history = [0.10, 0.55, 0.05]
    assert validate_convergence(server, threshold=0.05) is False, "variance gate missed divergence"

    print("All spatial FedAvg invariants hold.")


if __name__ == "__main__":
    _test_spatial_fedavg()

Incident Response and Edge Cases

A correct weighting formula does not guarantee a safe or convergent round. The failures below surface across real cross-silo deployments; each maps to a vector in the threat mapping for GIS data catalogue.

Spatial autocorrelation leakage in residual gradients. Moran’s I on the residual gradients exceeds 0.3, meaning the updates still encode recoverable geography and expose membership-inference or attribute-disclosure risk. Remediation: halt aggregation for the round, re-apply the Gaussian mechanism at the calibrated $\sigma = zC$, freeze any spatial_embedding layer (requires_grad = False) to block gradient-inversion on coordinates, and roll back to the last checkpoint if the next round still breaches the cutoff.
Convergence variance divergence. Global-loss variance stays above 0.05 for three consecutive rounds because a dense sensor cluster is dominating descent. Remediation: run a client-selection audit, cap participation from high-variance regions at 30% per round, and guarantee minimum representation from low-density zones so geographic bias cannot capture the objective — the same balancing logic applied in client selection algorithms.
Privacy budget exhausted mid-training. A long-running deployment silently crosses its cumulative $(\varepsilon, \delta)$ ceiling, after which further releases carry no real guarantee. Remediation: account composition per round and hold parameters when the running $\varepsilon$ reaches the ceiling set by your compliance framework mapping; resume only after a budget reset or by raising $z$ on a coarser model layer.
Window misalignment after a DST or timezone shift. A silo whose timezone_offset changes mid-deployment submits a gradient computed over a window that no longer overlaps its peers, dragging the merge toward a stale interval. Remediation: normalize every window to UTC before training, reject any delta whose window boundary disagrees with the global round’s, and route stragglers into the staleness-aware buffer described under async execution patterns rather than averaging them blindly. Cross-silo healthcare and finance pipelines should additionally wrap the exchange in the protocols described under secure multi-party computation in spatial analytics so gradients are never readable in transit.
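
For the budget-exhaustion case above, per-round accounting can be as simple as a running total. The sketch below uses basic sequential composition, where the per-round $\varepsilon$ and $\delta$ values add up; this is a deliberately conservative assumption, and tighter accountants exist. The class name `PrivacyBudget`, the ceiling, and the per-round cost are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PrivacyBudget:
    """Running (epsilon, delta) total under basic sequential composition."""
    epsilon_ceiling: float
    delta_ceiling: float
    epsilon_spent: float = 0.0
    delta_spent: float = 0.0

    def can_release(self, epsilon: float, delta: float) -> bool:
        return (self.epsilon_spent + epsilon <= self.epsilon_ceiling
                and self.delta_spent + delta <= self.delta_ceiling)

    def charge(self, epsilon: float, delta: float) -> None:
        if not self.can_release(epsilon, delta):
            raise RuntimeError("privacy budget exhausted; hold parameters")
        self.epsilon_spent += epsilon
        self.delta_spent += delta


# Hypothetical ceiling and per-round cost.
budget = PrivacyBudget(epsilon_ceiling=2.0, delta_ceiling=1e-5)
rounds_released = 0
for _ in range(10):
    if not budget.can_release(0.5, 2e-6):
        break  # hold parameters instead of publishing an unprotected round
    budget.charge(0.5, 2e-6)
    rounds_released += 1
```

With these numbers the loop publishes four rounds and then holds, which is the behaviour the remediation calls for: stop releasing rather than silently cross the ceiling.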

Frequently Asked Questions

Why divide the local learning rate by the square root of sensor density?

Dense silos generate larger, more confident gradients simply because they have more samples, and plain FedAvg lets that volume steer the global model toward dense geography. Setting the step size to $\eta_0/\sqrt{n_{\text{sensors}}}$ damps that pull so a sparse rural silo still shapes the shared model. It is a per-round complement to upstream client selection, not a replacement: selection decides who participates, the decoupled rate decides how hard each participant pushes.

How is FedAvg for spatial time-series different from plain FedAvg?

Three additions. First, every client aggregates to a common UTC rolling window so gradients are computed over overlapping intervals. Second, aggregation is weighted by an administrative-trust score rather than raw sample count, because spatial silos differ in validation quality, not just size. Third, the round carries a Moran’s-I leakage gate and rollback snapshots, since spatial autocorrelation makes residual gradients far more re-identifying than tabular ones.

Why disable optimizer momentum on the clients?

Momentum accumulates gradient direction across steps, which assumes the underlying signal is stationary. Spatial time-series sampled at irregular cadences are not: carrying last window’s direction into a non-overlapping next window injects stale temporal bias and produces oscillation around the optimum instead of convergence. Setting momentum=0.0 keeps each client’s update tied to its current aligned window.
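
The effect is easy to see on a single weight. The toy sketch below feeds `torch.optim.SGD` a gradient of 1.0 for one step and 0.0 for the next, standing in for a window whose signal has vanished; the helper name `two_step_position` and the chosen values are assumptions for illustration.

```python
import torch


def two_step_position(momentum: float, lr: float = 0.1) -> float:
    """Apply a gradient of 1.0, then a gradient of 0.0, and return the weight."""
    p = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.SGD([p], lr=lr, momentum=momentum)
    for grad in (1.0, 0.0):
        p.grad = torch.tensor([grad])
        opt.step()
    return float(p.item())


plain = two_step_position(momentum=0.0)  # moves once, then stays at -0.1
heavy = two_step_position(momentum=0.9)  # drifts on to -0.19 with no new signal
```

With momentum enabled the weight keeps moving in the second step even though the current gradient is zero, which is the stale direction the answer above describes.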

When does Moran’s I on residual gradients indicate a privacy incident?

Moran’s I measures spatial autocorrelation; computed on the residual gradients after aggregation, a value above 0.3 means neighbouring regions still share structure that an adversary can invert back into geography. That is the trigger to halt the round, re-noise, freeze spatial embeddings, and — if it persists — roll back. Values near zero mean the residuals are spatially unstructured and the round is safe to publish.
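
A minimal sketch of the statistic itself, assuming each region contributes one scalar residual and that neighbourhood structure is given as a symmetric weight matrix. The function name `morans_i`, the four-region chain adjacency, and the sample values are illustrative assumptions; how residual gradients are reduced to one value per region is left to the deployment.

```python
import numpy as np


def morans_i(values: np.ndarray, weights: np.ndarray) -> float:
    """Global Moran's I: (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    x = np.asarray(values, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    n = x.size
    if w.shape != (n, n):
        raise ValueError("weights must be an N x N matrix")
    z = x - x.mean()
    denom = float((z ** 2).sum())
    w_sum = float(w.sum())
    if denom == 0.0 or w_sum == 0.0:
        return 0.0  # constant field or no neighbours: no measurable structure
    return float((n / w_sum) * (z @ w @ z) / denom)


# Four regions in a row; each is a neighbour of the regions next to it.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=np.float64)

smooth = morans_i(np.array([1.0, 2.0, 3.0, 4.0]), adjacency)          # 1/3
alternating = morans_i(np.array([1.0, -1.0, 1.0, -1.0]), adjacency)   # -1.0
breach = smooth > 0.3  # the morans_i_threshold cutoff from the text
```

The smoothly varying residuals score about 0.33 and would trip the 0.3 cutoff, while the alternating pattern scores -1.0, indicating neighbours that differ rather than cluster.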

Model synchronization strategies — the broader synchronization workflow where the global round and staleness window are defined.
Handling non-IID geospatial data in federated learning — diagnosing and correcting the spatial drift that destabilizes FedAvg.
Client selection algorithms — upstream balancing that caps high-variance regions before they reach aggregation.
Async execution patterns — where stragglers and misaligned windows are buffered instead of averaged.
Comparing central vs local differential privacy for GIS — choosing the ε regime that sets the noise multiplier.
