Ekko extended deep dive

Academic-oriented technical narrative from internal notes: problem formulation, mechanism details, optimization framing, safety controls, and deployment evidence.

Reader guide and abstract

This page is written for readers who want a systems-level understanding first, then detailed mechanisms and constraints.

Abstract (short version)

  • Ekko targets second-level model freshness for large-scale recommender systems under geo-distributed deployment.
  • Core system idea: replace heavy checkpoint and validation path with direct P2P dissemination plus inference-time safety controls.
  • Main technical axes: version-vector (VV) and Dominator Version Vector (DVV) based dissemination, SLO-aware update scheduling, optimization-based storage placement, and fast rollback.

How to read this page

  • Five-minute pass: Read sections 1, 3, and 7 for problem, key mechanism, and results.
  • Systems deep read: Follow sections 2 through 6 sequentially; each section maps to one contribution.
  • Replication mindset: Use sections 4 to 6 for formulas, constraints, and rollout caveats.

Notation quick reference

Symbol     Meaning in this page
N          Total parameter keys considered during anti-entropy comparison.
Nr         Recently touched or cached hot keys; Nr is much smaller than N.
pu         Freshness priority signal from policy admission.
pg         Update significance signal from normalized gradient magnitude.
pm         Model priority signal from request share.
x_{p,s}    Binary variable: partition p assigned to server s in MILP layout.
d_{t,s}    Aggregated table-to-server variable used to reduce solver scale.

1) Problem statement and scaling challenge

Early scale-up improved offline metrics, but online engagement dropped because model updates became stale under non-stationary workloads.

Workload context

  • WeChat DLRS supports short videos, games, articles, and e-shop workloads.
  • User growth moved from millions to over one billion.
  • Content supply reached billions of short videos.
  • Training data exceeded 10 billion samples per day.
  • User action sequence length grew beyond 10,000.

Project targets

  • Scale individual models from a few GB to tens of TB (about 10,000x).
  • Support 10,000+ simultaneous models for large A/B experiment spaces.
  • Keep model freshness in geo-distributed serving under strict SLOs.

System model

  • Embedding tables dominate model size (tens of TB); DNN blocks are typically GB level.
  • Parameters are stored in geo-distributed parameter servers for low-latency inference and local fault tolerance.
  • Update loop: collect training data -> train online -> disseminate updates -> serve inference.

2) Update pipeline diagram

Traditional production paths often include checkpointing, validation, and staged broadcast. Ekko compresses this path to reduce freshness delay, then compensates with stronger inference-time monitoring.

Traditional path

Checkpoint -> Validation -> Broadcast -> Serve

Ekko path

Online training updates -> Leaderless P2P sync -> SLO-aware scheduling -> Serve with rollback guard

Why this path change matters

  • Each extra gate in update dissemination increases end-to-end freshness delay in non-stationary recommendation workloads.
  • The notes map this bottleneck to prior checkpoint and deployment pipelines (for example Check-N-Run, Conveyor, and Owl references).
  • Ekko therefore shifts complexity from staged rollout to selective dissemination plus rollback-safe serving.

3) P2P dissemination and DVV cache

Contribution 1 combines leaderless version-vector synchronization with workload-aware caching to reduce anti-entropy overhead when updates concentrate on a small set of hot parameters.

Design highlights

  • Version-vector synchronization removes a central leader bottleneck.
  • Overlay-friendly P2P exchange uses available WAN paths.
  • Only necessary parameters are transferred between replicas.
  • Dominator Version Vector (DVV) summarizes non-cached keys and cuts comparison cost from O(N) to O(Nr), where Nr << N.
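
A minimal Python sketch of this cache-filtered comparison, assuming per-key version counters on each replica plus a single DVV summary for non-cached keys; the function and field names are illustrative, not Ekko's wire protocol:

# Hedged sketch: per-key versions for hot cached keys, one DVV summary that
# dominates every non-cached key on a replica.
def updates_to_send(local_versions, remote_versions, hot_keys,
                    local_dvv, remote_dvv):
    # Compare only the cached hot keys: O(Nr) work per anti-entropy round.
    needed = [k for k in hot_keys
              if local_versions.get(k, 0) > remote_versions.get(k, 0)]
    # Fall back to a full O(N) scan only when the local summary is not
    # already dominated by the remote summary.
    full_scan_needed = local_dvv > remote_dvv
    return needed, full_scan_needed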

Reported results

  • Only 0.13% to 0.2% of parameters kept in cache in production.
  • Cache hit ratio around 99.4% in production.
  • About 7x faster dissemination than Project Adam in a 10-DC evaluation setting.

Diagram: VV exchange with hot-parameter cache

Server A and Server B exchange version vectors and compare only hot cached keys, while the DVV summarizes non-cached keys; comparison cost shifts from O(N) to O(Nr), and only the needed updates are transferred.

4) SLO-aware update prioritization

Contribution 2 schedules updates by impact instead of sending every update immediately. This reduces WAN pressure while preserving online quality.

Freshness signal

Policy-admitted urgent parameters receive immediate priority (pu).

Gradient significance signal

Larger normalized gradient magnitude implies larger expected impact (pg).

Model traffic signal

Models with higher request share get higher update priority (pm).

SLO-critical updates in practice

  • Newly created embedding items after sufficient training data accumulation.
  • Updates with large gradients, which often carry larger loss impact.
  • High-traffic models that influence more online requests.

Priority formulation (high-level)

A practical scoring view from the notes is:

pu = freshness gate (policy admitted updates are highest priority)
pg = |g| / mean(|g|) within a model
pm = cm / sum(ci) across models, where cm is the request count of model m
priorityScore = wu * pu + wg * pg + wm * pm
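
A minimal Python transcription of the scoring view above; the default weights and dictionary fields are illustrative placeholders, not production values:

def priority_score(update, mean_abs_grad, model_requests, total_requests,
                   wu=1.0, wg=1.0, wm=1.0):
    # pu: freshness gate from policy admission (urgent updates rank highest).
    pu = 1.0 if update["policy_admitted"] else 0.0
    # pg: gradient magnitude normalized within the model.
    pg = abs(update["grad"]) / max(mean_abs_grad, 1e-12)
    # pm: the model's share of online requests.
    pm = model_requests / max(total_requests, 1)
    return wu * pu + wg * pg + wm * pm

In the scheduler dataflow described later in this section, updates with higher scores are drained first from the constrained queue.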

A later refinement in production prioritizes by estimated loss impact over an update window; under a local approximation, delta loss scales with squared parameter drift.
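
Read as a hedged second-order Taylor sketch (the notes do not spell out the exact production estimator), the approximation for parameter drift \Delta w accumulated over the window is:

\Delta L \approx g^{\top} \Delta w + \tfrac{1}{2}\, \Delta w^{\top} H \,\Delta w \approx \tfrac{1}{2}\, \Delta w^{\top} H \,\Delta w \propto \lVert \Delta w \rVert^{2}

since the first-order term is small for parameters near a local optimum and H is the local curvature, so the estimated loss impact grows with the squared drift.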

Reported outcomes

  • Network usage reduced by about 92% without visible quality degradation in reported results.
  • SLO drops were reduced versus non-prioritized dissemination.
  • A later production refinement prioritized by loss-estimate windows, as described in the priority formulation above.
  • Fewer update writes also improve the viability of storage devices with asymmetric read and write performance.

Diagram: scheduler dataflow

Freshness (pu), gradient (pg), and traffic (pm) signals feed a priority scorer that ranks updates by expected impact and urgency; a constrained queue under a WAN budget and SLO guard then emits updates to the P2P layer priority-first.

5) SLO-aware shard manager (storage)

Contribution 3 uses optimization-based scheduling to balance resource efficiency, inference latency, and operational stability under dynamic traffic.

Optimization objectives

  • Load balancing across CPU, memory, and network resources.
  • Strict capacity constraints during scheduling and rescheduling.
  • System stability by minimizing data migration.
  • Low-latency inference by limiting over-fragmented table span.
  • Scalability and reservation by capping per-node footprint of each model.

Hard constraints

  • Within one replica, each scheduling unit is assigned to exactly one machine.
  • Per-machine CPU, memory, and network utilization stay under strict quotas.
  • Table span bounds are enforced so one model does not over-fragment across machines.
  • Moving-data volume is bounded to avoid instability during rescheduling.

Solver acceleration details

  • Naive MILP construction can reach about 4.6 billion variables and 370 billion constraints.
  • A distributed caching design reduces hotspots, making per-unit resource coefficients within one table more homogeneous.
  • Homogeneous coefficients allow replacing binary x_{p,s} variables with aggregated d_{t,s} variables (see the count sketch after this list).
  • Divide-and-conquer and heuristics are used for smaller models to accelerate scheduling cycles.
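
Purely illustrative arithmetic for that reduction; the partition, table, and server counts below are hypothetical round numbers chosen only to land near the reported order of magnitude, not Ekko's real deployment sizes:

# Hypothetical counts to illustrate the x_{p,s} -> d_{t,s} reduction factor.
num_servers = 1000            # hypothetical
num_tables = 500              # hypothetical
partitions_per_table = 9200   # hypothetical
num_partitions = num_tables * partitions_per_table

naive_binary_vars = num_partitions * num_servers   # one x_{p,s} per partition-server pair
aggregated_vars = num_tables * num_servers         # one d_{t,s} per table-server pair

print(f"naive: {naive_binary_vars:,} vars, aggregated: {aggregated_vars:,} vars")

The reduction factor equals the number of partitions per table, which is why the aggregated formulation becomes tractable at production scale.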

Reported outcomes

  • Variable reduction and heuristics improved solve time at production scale.
  • Solver produced satisfactory plans in about 180 seconds in reported production usage.
  • Storage cost reduction reached about 49% versus Slicer baseline in reported comparison.

MILP framing (simplified)

minimize: migrationCost + overloadPenalty(cpu, mem, net) + spanPenalty
subject to:
  each scheduling unit is assigned once
  per-node quota constraints are respected
  model span is bounded for low-latency inference
  migration pressure remains under an operation threshold
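
A hedged, toy-scale sketch of this framing with the open-source PuLP modeler, using the aggregated d_{t,s} variables from the acceleration notes above; every table, quota, weight, and the current layout below is invented for illustration:

import pulp

# Toy data (hypothetical): partitions per table, per-partition memory cost,
# per-server memory quota, and the current placement before rescheduling.
tables = ["t0", "t1", "t2"]
servers = ["s0", "s1"]
partitions = {"t0": 4, "t1": 6, "t2": 2}
mem_per_partition = {"t0": 1.0, "t1": 0.5, "t2": 2.0}
mem_quota = {"s0": 6.0, "s1": 6.0}
current = {("t0", "s0"): 4, ("t1", "s1"): 6, ("t2", "s0"): 2}  # s0 starts over quota

prob = pulp.LpProblem("shard_layout", pulp.LpMinimize)
pairs = [(t, s) for t in tables for s in servers]

# Aggregated d_{t,s}: how many partitions of table t end up on server s.
d = pulp.LpVariable.dicts("d", pairs, lowBound=0, cat="Integer")
# use_{t,s}: whether table t touches server s at all (for the span bound).
use = pulp.LpVariable.dicts("use", pairs, cat="Binary")
# moved_{t,s}: partitions moved off their current server (migration proxy).
moved = pulp.LpVariable.dicts("moved", pairs, lowBound=0)

# Objective: migration cost plus a small span penalty (illustrative weights).
prob += pulp.lpSum(moved[p] for p in pairs) + 0.1 * pulp.lpSum(use[p] for p in pairs)

for t in tables:
    # Each table is fully placed ("assigned once" at the aggregate level).
    prob += pulp.lpSum(d[(t, s)] for s in servers) == partitions[t]
    # Span bound: a table may touch at most 2 servers in this toy instance.
    prob += pulp.lpSum(use[(t, s)] for s in servers) <= 2
    for s in servers:
        prob += d[(t, s)] <= partitions[t] * use[(t, s)]
        prob += moved[(t, s)] >= current.get((t, s), 0) - d[(t, s)]

for s in servers:
    # Per-server memory quota (homogeneous per-partition coefficients).
    prob += pulp.lpSum(d[(t, s)] * mem_per_partition[t] for t in tables) <= mem_quota[s]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (t, s) in pairs:
    if d[(t, s)].value():
        print(f"{t} -> {s}: {int(d[(t, s)].value())} partitions")

The production problem keeps this shape but relies on the coefficient homogenization, variable aggregation, and divide-and-conquer heuristics described above to reach the reported roughly 180-second solve times.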

Diagram: MILP scaling and reduction path

Workload and resource profiles feed a naive MILP build (about 4.6B variables and 370B constraints); coefficient homogenization and variable reduction from x_{p,s} to d_{t,s} shrink the problem before an approximately 180-second solve.

6) Inference model state manager

Contribution 4 protects online quality after skipping validation by detecting harmful updates and rolling back rapidly.

Common anomaly triggers

  • Gradient overflow under low precision training.
  • Biased or outlier data from the data pipeline.
  • Because the low-latency path skips the validation stage, harmful updates can surface online.

Safety outcomes

  • Baseline models are used as clean references for health monitoring.
  • Usually less than 1% of traffic is routed to baseline models.
  • If an anomaly is detected, traffic is redirected first, then rollback is triggered (sketched after this list).
  • Reported rollback speed: 6.4 seconds for 113 GB parameters.
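
A minimal control-loop sketch of that redirect-first, rollback-second order; the metric, threshold, and client method names (traffic_router, param_store) are hypothetical, not Ekko's actual interfaces:

def guard_step(online_metric, baseline_metric, traffic_router, param_store,
               tolerated_drop=0.05):
    # Compare the online model against the clean baseline reference.
    degraded = online_metric < baseline_metric * (1.0 - tolerated_drop)
    if degraded:
        # Step 1: protect requests immediately by rerouting to a safe model.
        traffic_router.redirect_to_baseline()
        # Step 2: roll parameters back to the last healthy state (seconds-level).
        param_store.rollback_to_last_healthy_state()
    return degraded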

Diagram: safety control path

Online and baseline models are monitored; when the anomaly detector flags corruption, traffic is redirected to a safe alternative, parameters are rolled back, and serving continues.

7) Deployment impact and interpretation

Reported outcomes indicate large operational gains. Some product metrics should be read as rollout-correlated rather than purely system-caused.

Selected outcomes

  • Model-update latency was reported at 2.4 seconds in the OSDI 2022 deployment context.
  • Update-path optimizations and storage scheduling helped push scaling from the early 100x range to the later 10,000x model-size range.
  • WeChat Channels DAU and VV growth (40% and 87% over six months) was reported alongside rollout, product iteration, and operations.
  • Retrospective A/B tests reported 1.30% to 3.82% SLO gain in specific reranking-delay setups.

Deployment milestones

  • OSDI 2022: Published Ekko with second-level freshness results and early large-scale deployment.
  • Post OSDI rollout: ML-aware dissemination and scheduling refinements expanded model scale from about 100x to about 10,000x.
  • Ongoing production: Storage scheduling and inference safety controls remained active under dynamic traffic and frequent model updates.

Interpretation caveats

  • Product metrics are rollout-correlated and should not be interpreted as single-cause system effects.
  • Cross-system comparisons should be read with scenario differences in mind (workload, latency target, and deployment assumptions).

Retrospective A/B numbers in the notes come from a controlled setup where reranking freshness is delayed while other stages remain fresh.

8) References and evidence

The primary paper and public write-ups are referenced inline in the deployment sections above; the list below gives related systems context mentioned in the notes.

Related works cited in notes

  • Monolith: Real-Time Recommendation System With Collisionless Embedding Table (RecSys 2022).
  • Check-N-Run: checkpointing for deep recommendation training (NSDI 2022).
  • Conveyor: continuous software deployment at scale (OSDI 2023).
  • Owl: large-scale hot content distribution (OSDI 2022).
  • Project Adam: efficient distributed deep learning training (OSDI 2014).
  • Slicer: auto-sharding for datacenter applications (OSDI 2016).
  • QuickUpdate: real-time personalization for large recommendation models.