An identity graph that works at 10M lookups a day is a different system from one that works at 100M+. Same principles, different engineering. The problems at small scale are correctness problems. The problems at large scale are choreography problems, and they are harder.
I am George, founder of Leadpipe. This is an architectural walkthrough of how we think about scale on Leadpipe’s identity graph: 280M profiles, 60B intent signals, 5M websites contributing through the Orbit pixel network, 24-hour refresh cycle. The goal of this post is not to expose proprietary internals. The goal is to explain the principles that drive the architecture, the tradeoffs we live with, and what changes when the volume grows.
The scale, stated plainly
The verified numbers we run at:
| Metric | Current |
|---|---|
| Verified profiles | 280M |
| Behavioral signals ingested | 60B |
| Websites contributing through Orbit pixel | 5M |
| Refresh cycle | 24 hours |
| API surface | 23 REST endpoints, real-time webhooks, MCP server with 27 tools |
Every one of those numbers drives an architectural decision. Not every one maps to a single datastore or a single cluster. The graph is distributed across several systems, each tuned to its own access pattern.
The four scaling axes
Every identity graph has four scaling axes that tighten as volume grows. The architecture has to address all four. Pick three and the fourth eventually breaks.
1. Ingest throughput
The Orbit pixel fires across 5M sites. Partner feeds land on a schedule. Batch uploads come in from data partnerships. All of it has to be ingested without dropping events and without head-of-line blocking during traffic spikes.
The principle: ingest is a streaming problem, not a batch problem, even when most of the data was historically batched. Backpressure is designed in, not bolted on.
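To make "designed in, not bolted on" concrete, here is a minimal sketch in Python. This is the shape of the idea, not our actual ingest code, and the numbers are illustrative: the bound on the queue is the design decision.

```python
import asyncio

# A bounded queue is the whole trick: when consumers fall behind,
# producers block on put() instead of the queue growing without limit,
# so pressure propagates back to the edge instead of dropping events.

QUEUE_DEPTH = 10_000  # illustrative bound; tune to consumer throughput

async def producer(queue: asyncio.Queue, events):
    for event in events:
        # Blocks when the queue is full -- this await IS the backpressure.
        await queue.put(event)

async def consumer(queue: asyncio.Queue, write_batch):
    batch = []
    while True:
        batch.append(await queue.get())
        if len(batch) >= 500:  # amortize the downstream write
            await write_batch(batch)
            batch.clear()
```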
2. Storage volume and access patterns
280M profiles is not a large dataset by modern standards. 60B signals indexed by multiple keys (browser, device, cookie, HEM, IP) is. A single storage engine cannot serve both bulk analytical scans for the nightly refresh and per-key point lookups during real-time match.
The principle: split storage by access pattern, not by data type. The same record may live in three different engines, each tuned to a different read path.
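Here is a sketch of what "split by access pattern" looks like in practice. The store clients are hypothetical stand-ins, not our actual stack; the point is that one logical write fans out to engines tuned for different read paths.

```python
# One logical profile write lands in three engines, each serving a
# different access pattern. Store interfaces here are hypothetical.

def write_profile(person_id: str, record: dict, stores) -> None:
    # Point lookups on the serve path: key-value, person_id -> record.
    stores.kv.put(key=person_id, value=record)

    # Bulk analytical scans for the nightly refresh: columnar append.
    stores.columnar.append(table="profiles", row=record)

    # Identifier resolution: secondary indexes from each key type back
    # to the person, so a cookie or HEM finds the record in one hop.
    for id_type in ("cookie", "device", "hem"):
        if id_type in record:
            stores.index.put(key=(id_type, record[id_type]), value=person_id)
```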
3. Match latency
A visitor pixel fires. A webhook has to deliver within a window the customer considers real-time. Meeting that target across 100M+ daily matches means the match path cannot touch any system that is not hot.
The principle: the hot path is its own engineering problem. Anything that is not pre-computed, indexed, or cached gets evicted from the hot path.
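The fail-fast discipline is easier to see in code. A minimal sketch, with an illustrative 50ms budget rather than our actual SLA:

```python
import asyncio

MATCH_BUDGET_SECONDS = 0.050  # illustrative, not our real budget

async def match_or_nothing(match_index, identifiers) -> str | None:
    try:
        return await asyncio.wait_for(
            match_index.lookup(identifiers), timeout=MATCH_BUDGET_SECONDS
        )
    except asyncio.TimeoutError:
        # A late answer is worth less than no answer on the hot path.
        return None
```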
4. Refresh cycle
Every record re-verified, every signal re-scored, every audience re-materialized within a 24-hour window that cannot bleed into the next day’s ingest. This is the deadline nobody else talks about. Miss it and the graph starts drifting.
The principle: the refresh job has its own compute and its own SLA. It does not share resources with serve-path traffic.
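A sketch of what that isolation looks like, with an illustrative pool size and deadline check. The real job is far more involved, but the shape is the point: its own compute, its own explicit SLA.

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime, timedelta, timezone

REFRESH_WINDOW = timedelta(hours=24)

def run_refresh(partitions, refresh_partition):
    # Dedicated pool: refresh never borrows serve-path compute.
    deadline = datetime.now(timezone.utc) + REFRESH_WINDOW
    with ProcessPoolExecutor(max_workers=32) as pool:
        for _ in pool.map(refresh_partition, partitions):
            if datetime.now(timezone.utc) > deadline:
                # Bleeding into the next ingest day is a failure, not a delay.
                raise RuntimeError("refresh missed its 24-hour SLA")
```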
The two paths: ingest and serve
A simplified view of how data flows. Every identity graph at scale separates these paths because they have opposite optimization goals.
```
INGEST PATH
───────────
Orbit pixels  ─┐
Partner feeds ─┼─> Ingest -> Normalize -> Cluster -> Verify -> Graph Write
Batch uploads ─┘                                                    │
                                                                    v
                                                            ┌───────────────┐
                                                            │ Graph Storage │
                                                            │  (multi-tier) │
                                                            └───────┬───────┘
SERVE PATH                                                          │
──────────                                                          │
Customer pixel -> Match index -> Person fetch -> Enrich -> Webhook <┘
                                                         ▲
                                                         │
                                                 Suppression check
```
The ingest path is throughput-dominated. It can be a few seconds slow on an individual event without anyone noticing. It cannot be slow in the aggregate.
The serve path is latency-dominated. It has to answer in milliseconds, every time, even while the ingest path is running at full tilt. They share data but not compute.
For the conceptual primer on how the graph itself is structured, see how identity graphs work and the graph API explanation.
The hot path
The hot path is the serve path: customer pixel fires, match happens, webhook dispatches. Everything on this path has to be pre-computed, indexed, or cached. No batch queries, no full scans, no cross-shard fanout.
The components in order:
- Match index. Keyed on the identifiers a pixel can provide (cookie, device ID, IP range, fingerprint). Returns a person ID or nothing, in microseconds.
- Person fetch. Keyed on person ID. Returns the full person record with firmographics and behavior. The match index is the latency-critical part. The person fetch is the correctness-critical part.
- Enrichment. Company data, industry classification, intent score, and the 100+ downstream data points. Most of this is pre-joined into the person record at refresh time, not computed on demand.
- Suppression check. Applied before webhook dispatch. See the suppression lists post for why suppression has to live on the hot path, not in a UI filter.
- Webhook dispatch. Fire-and-track with exponential backoff retry. The retry queue lives off the hot path so a failing customer endpoint cannot cascade.
The rule: every step on the hot path costs latency. Adding a step means something else has to come off it, or the SLA gets missed. For the latency engineering side specifically, see sub-second webhook delivery engineering.
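Put together, the serve chain looks roughly like this. Component interfaces are hypothetical stand-ins; the point is that every call hits something pre-computed.

```python
# Serve chain sketch: match, fetch, suppress, dispatch. Nothing here
# computes anything -- the match index and person record are built at
# refresh time, the suppression set is an in-memory lookup, and
# dispatch is an enqueue. Retries live on a separate queue.

async def serve(pixel_event, match_index, people, suppressed, webhook_queue):
    person_id = await match_index.lookup(pixel_event.identifiers)
    if person_id is None:
        return  # no match; the hot path never goes digging

    person = await people.get(person_id)  # pre-joined at refresh time

    if person_id in suppressed:           # suppression before dispatch
        return

    # Fire-and-track: enqueue and move on. Exponential backoff on
    # failure happens on the retry queue's own compute, off this path.
    await webhook_queue.enqueue(person)
```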
Sharding without picking a fight you cannot finish
No single machine holds the whole graph. No single database does either. The data is sharded across multiple storage instances, and the sharding scheme is one of the most consequential decisions in the whole system.
The choices, in the abstract:
| Strategy | Best for | Weakness |
|---|---|---|
| Hash-based (e.g., by person ID) | Uniform load distribution, simple placement | Cross-shard joins are expensive |
| Range-based (e.g., by creation date) | Time-bounded scans, archiving | Hot ranges can become bottlenecks |
| Hybrid | Flexibility | More complex ops |
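The two pure strategies are short enough to sketch. Shard counts and key choices here are illustrative, not ours.

```python
import hashlib

NUM_SHARDS = 64  # illustrative

def hash_shard(person_id: str) -> int:
    # Hash-based: uniform load, trivial placement, painful cross-shard joins.
    digest = hashlib.sha256(person_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def range_shard(created_at_epoch_day: int, days_per_shard: int = 90) -> int:
    # Range-based: time-bounded scans and archiving are cheap, but the
    # newest range takes all the writes and can run hot.
    return created_at_epoch_day // days_per_shard
```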
The principle that matters more than the choice itself: sharding decisions compound. The scheme you pick at launch locks you into certain access patterns for years, because rebalancing a live graph is genuinely hard, and you do it as rarely as possible. Anyone who tells you they got the sharding right on the first try is either unusually lucky or running at smaller volume than they think.
Decay: the work nobody talks about
A graph that only ingests, never decays, turns into a landfill. People change jobs, change emails, change companies. A record from 18 months ago that has not been touched is almost certainly stale, and serving it to customers is worse than not serving anything.
The decay structure:
| Record state | Trigger | What happens |
|---|---|---|
| Active | Fresh signal in recent window | Primary tier, served at match time |
| Warm | Older but still re-verifiable | Secondary tier, may serve with reduced confidence |
| Archived | No signal for an extended window | Off the hot path, not served unless re-verified |
Decay is an engineering problem too. Moving records between tiers on a 24-hour cadence, without disrupting serve latency, is most of the complexity in the nightly refresh job. The principle: decay is policy, not an accident. Define it explicitly, automate it, audit it.
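Here is decay-as-policy in sketch form. The window sizes are illustrative, not our actual thresholds; what matters is that tier assignment is a pure function of signal age, which is what makes it automatable and auditable.

```python
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=30)    # illustrative
WARM_WINDOW = timedelta(days=180)     # illustrative

def tier_for(last_signal_at: datetime, now: datetime) -> str:
    age = now - last_signal_at
    if age <= ACTIVE_WINDOW:
        return "active"    # primary tier, served at match time
    if age <= WARM_WINDOW:
        return "warm"      # may serve with reduced confidence
    return "archived"      # off the hot path until re-verified
```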
For the argument on why the refresh cadence has to be daily, see why intent data needs daily refresh.
Backpressure and graceful failure
At 100M+ matches a day, failure modes are not “the system crashes.” They are “one shard gets hot,” “the ingest queue backs up,” “the webhook retry queue fills up.” The architecture has to degrade gracefully.
A few designed-in behaviors that any large-scale graph needs:
- Ingest backpressure. If the ingest queue depth crosses a threshold, the system sheds lower-priority signals (duplicates, already-seen sessions) before it drops primary events; see the sketch below.
- Serve fail-fast. If the match index takes longer than a configurable timeout, the match is abandoned rather than delayed. We would rather return “no match” than “a match two seconds late.”
- Webhook retry queue isolation. Failed webhook deliveries go to a separate queue with its own compute. A broken customer endpoint cannot cascade into serve-path latency for other customers.
- Nightly refresh isolation. The refresh job runs on dedicated compute, not shared with serve. A slow refresh does not slow down customer-facing matches.
The general rule: each system has its own failure domain, and one domain’s problem cannot become another’s. This is a design discipline, not a feature.
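To make the first of those behaviors concrete, a minimal sketch of priority-aware shedding. The threshold, the dedupe check, and the event fields are all illustrative.

```python
SHED_THRESHOLD = 50_000  # illustrative queue-depth trigger

def admit(event, queue_depth: int, seen_sessions: set) -> bool:
    if queue_depth < SHED_THRESHOLD:
        return True  # normal operation: everything gets in
    # Under pressure: duplicates and repeat sessions are the first to go.
    if event.session_id in seen_sessions:
        return False
    return event.priority == "primary"
```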
The tradeoffs you do not get to avoid
Every architectural decision closes one door and opens another. These are the honest tradeoffs at this scale.
Match rate vs accuracy
You can push match rate higher by accepting probabilistic signals. The cost is accuracy. We chose deterministic matching, which means our match rate lands at 30-40%+ rather than the 50-70% some probabilistic tools advertise. The payoff is the 8.7/10 accuracy result that comes with deterministic verification.
Freshness vs stability
A 24-hour refresh window is aggressive. Customers occasionally notice a person’s title changed between two webhook deliveries. That is working as intended. Daily truth beats quarterly stability for the go-to-market use case.
Coverage vs latency
The graph could return more matches by searching deeper tiers of evidence. It does not. The serving layer is tuned to return fast or return nothing. A slow match that arrives after the webhook has stopped being useful is worse than no match at all.
US-first vs global
US B2B is where the deepest first-party partnerships exist and where most of our customers operate. GDPR defaults to company-level for EU/UK visitors. Person-level coverage outside the US would mean different partner relationships and a different compliance posture, and we have chosen not to rush it.
Shared infrastructure vs per-customer
Every customer queries the same graph. The $147/mo Pro plan and the enterprise plan hit the same infrastructure, the same 23 REST endpoints, the same real-time webhooks, the same 27-tool MCP server. The economics work because the graph is a shared asset with shared cost. Per-customer graphs would price out everyone except enterprise.
The architectural pivots a graph goes through
Scaling is a series of pivots, each one swapping one tradeoff for another. Three patterns I have seen in our own evolution and in conversations with other infrastructure teams.
Single-shard to multi-shard. Early graphs run on a single logical database. Works until it does not. The migration to multi-shard is one of the most expensive engineering projects you will ever run, because the graph is a live system that you cannot stop while you change it.
Batch-only refresh to hybrid. Refresh starts as a pure batch job. That breaks as soon as you have customers who need sub-daily freshness on specific records. The hot-path verify flow is the fix.
Company-scoped match to person-scoped. Early match indexes optimize for company-level lookups because that is the easier problem. Person-level is a harder index problem, and the index gets rebuilt later when customer demand pulls you that way.
Each pivot is not a clean rewrite. It is a careful migration with a lot of dual-writing and validation. You do not get to stop the graph while you redesign it.
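The dual-writing pattern is worth sketching because it has the same shape whatever the migration. Interfaces here are hypothetical.

```python
# Dual-write phase of a live migration: every write lands in both
# layouts, reads stay on the old one, and a validator diffs the two
# until the mismatch list stays empty run after run. Only then cut over.

def dual_write(record, old_store, new_store):
    old_store.write(record)   # old layout remains the source of truth
    new_store.write(record)   # new layout shadows every write

def validate_sample(person_ids, old_store, new_store):
    # Cutover waits until this comes back empty, consistently.
    return [pid for pid in person_ids
            if old_store.read(pid) != new_store.read(pid)]
```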
What this means for customers
As covered above, every customer queries the same graph, and that is the economic story: one shared asset with shared cost, not per-customer infrastructure.
At the scale we are at, the engineering is not a marketing story. It is how we can offer 30-40%+ match rates, 8.7/10 accuracy, and daily refresh at a price point that undercuts enterprise vendors by an order of magnitude. The graph has to work at 100M+ matches a day, because if it did not, none of the rest of it would be affordable.
The four-axis principle is the one to take away. If you are evaluating an identity vendor, ask them about all four scaling axes: ingest, storage, match latency, refresh. A vendor who can only describe three of them clearly is missing one, and the missing one is the one that will eventually break.
Every plan ships with the same identity graph, 23 REST endpoints, webhooks, and a 27-tool MCP server. Start in 5 minutes →