
How We Built the Leadpipe Identity Graph

The principles behind 280M verified profiles and 30-40%+ match rates: deterministic matching, first-party signals, daily refresh, build vs license.

George Gogidze · 11 min read

Most visitor identification tools do not build an identity graph. They license one. That is the single biggest reason match rates cluster around 10-20% across the category, and why tools running on the same licensed data return roughly the same people.

I am George, founder of Leadpipe. Building our own graph was the most expensive decision we ever made. It is also the only decision that made the product competitive. This post is a plain walkthrough of the principles behind how we did it, why we chose to build instead of license, the tradeoffs we live with, and what the verified scale looks like today.


The problem, stated plainly

A B2B identity graph has to answer one question at query time. Given a set of anonymous browser signals captured on a website pixel, which real person do they belong to, and what do we know about that person?

The verified scale we run at:

Metric                                              Current
Verified profiles                                   280M
Behavioral signals ingested                         60B
Websites contributing to the Orbit pixel network    5M
Refresh cycle                                       24 hours
API surface                                         23 REST endpoints, real-time webhooks, 27-tool MCP server
Setup time                                          2-5 minute JavaScript pixel

The shape of a single lookup looks simple. The work behind the lookup is where every shortcut costs you match rate, accuracy, or both.

INPUT                              GRAPH                         OUTPUT
cookie id, device id, ip,    ->    signal -> cluster    ->      person +
hashed email, fingerprint          cluster -> person             company +
                                                                 100+ fields

If that lookup takes too long, the webhook is late. If the cluster boundaries are sloppy, you return someone’s coworker. If the person record is stale, the email bounces and the customer blames you. Every architectural decision in the rest of this post is a decision about one of those three failure modes.
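
To make the shape concrete, here is a minimal TypeScript sketch of the lookup contract. The type and field names are illustrative, not Leadpipe's actual API:

```typescript
// Illustrative lookup contract; names and fields are assumptions.
interface LookupInput {
  cookieId?: string;
  deviceId?: string;
  ip?: string;
  hashedEmail?: string; // HEM: a SHA256/SHA1/MD5 hash of a normalized email
  fingerprint?: string;
}

interface LookupResult {
  person: { name: string; email: string; title: string; /* ...100+ fields */ };
  company: { name: string; domain: string };
}

// Signals resolve to a cluster; the cluster resolves to a person.
declare function resolveVisitor(input: LookupInput): Promise<LookupResult | null>;
```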


What a graph actually is

A graph is three layers of data with a matching engine on top.

Layer            What lives here                                                        Why it matters
Signal layer     Cookies, device IDs, IP ranges, hashed emails (HEMs), fingerprints     The fuel. Without rich signals, no match is possible.
Cluster layer    Groups of signals we believe belong to one person across devices       The hard part. Boundaries here drive accuracy.
Identity layer   The person record: name, emails, phone, employer, job, demographics    The output. Customers only see this.

The identity graph primer covers the concept at a high level. This post is about the principles that drive how the layers stack together.
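
One way to picture the stack is as three nested types. A hypothetical sketch, with every field name assumed:

```typescript
// Hypothetical shapes for the three layers; field names are assumptions.
type Signal =
  | { kind: "cookie"; id: string }
  | { kind: "device"; id: string }
  | { kind: "ip"; range: string }
  | { kind: "hem"; hash: string }          // hashed email
  | { kind: "fingerprint"; hash: string };

interface Cluster {
  id: string;
  signals: Signal[]; // everything we believe belongs to one person
}

interface Identity {
  clusterId: string;
  name: string;
  emails: string[];
  phone?: string;
  employer: string;
  jobTitle: string;
  // demographics and firmographics: the only layer customers ever see
}
```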


Build vs license: the strategic choice

The commodity path is to buy access to a shared identity graph from a third-party data provider. Five visitor-identification tools all license the same graph, and five tools return roughly the same people because they are reading from the same database. The pixel implementation differs. The pricing differs. The dashboard differs. The underlying coverage does not.

Building means starting from zero. Direct data partnerships, proprietary signal collection, custom matching logic, continuous verification, an in-house compliance posture. Years of relationship work and engineering investment before the first match rate climbs into competitive territory.

We built. Three reasons.

Match rate ceiling. Licensed graphs hit a ceiling because they are constrained by the same upstream data. A built graph can keep adding partnerships, signals, and verification paths that licensed competitors do not have access to.

Accuracy floor. A licensed graph inherits the matching policy of the upstream provider, including the share of probabilistic versus deterministic matches. Building gives you control over the accuracy floor. We chose deterministic-by-default. That is the policy decision behind the 8.7/10 accuracy result versus 5.2 (RB2B) and 4.0 (Warmly) in independent testing.

Compounding moat. Licensed graphs do not compound. Built graphs do. Every new partnership, every new signal, every new customer who identifies on your pixel makes the graph slightly better. Five years in, the gap between built and licensed is wider than at year one. That is the long bet.

The cost is real. Building means you are wrong on the first try at almost everything, and you have to redo expensive infrastructure when the volume grows. The payoff is a graph nobody else can replicate by writing a check.


The four foundations

We did not set out to build “an identity graph.” We set out to answer “who is this visitor.” The graph is the shape that answer took after four foundations stacked on top of each other.

1. Direct data partnerships, not licensed feeds

The commodity path is to pay LiveRamp or Tapad and resell the same graph everyone else is reselling. We took the harder route. Partnerships with publishers, apps, and platforms that produce first-party identity data with consent. Each partnership is months of legal, technical, and QA work.

The payoff: our graph sees signals that licensed graphs do not, because those signals never leave the partner ecosystems that feed us.

2. A proprietary pixel network

The Orbit pixel sits across 5M websites. Every page view produces a lightweight signal event. Over time, a browser reveals a cluster: the same cookie across sites, the same IP range at business hours, the same device fingerprint on mobile and desktop.
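
As a rough sketch of what "lightweight" means here, a pixel beacon can be little more than a non-blocking POST. The collector URL and fields below are placeholders, not Orbit's actual implementation:

```typescript
// Hypothetical pixel beacon: collect coarse signals, send, get out of the way.
function firePixel(siteId: string): void {
  const event = {
    siteId,
    url: location.href,
    referrer: document.referrer,
    ts: Date.now(),
    // cookie, IP, and fingerprint signals attach client- or server-side
  };
  // sendBeacon is async, survives page unload, and never blocks the page
  navigator.sendBeacon("https://collector.example/events", JSON.stringify(event));
}
```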

The pixel network is what lets us close the loop between “unknown browser session” and “known person.” The inside Orbit post goes deeper on that side.

3. Deterministic matching as the default

We do not guess. A match enters the graph when there is verified linkage. A hashed email from a consented signup that ties to a cookie. A device ID registered during an app install with a real name. A login event on a partner property. If the evidence is statistical rather than verified, it goes into a lower tier and never drives primary identifications.

This is why we score 8.7/10 in the independent accuracy test while probabilistic tools score 4.0 to 5.2. The difference is not the pixel. It is the policy on what counts as a match. See deterministic vs probabilistic matching for the long version.
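
The policy is simple enough to state as code. A minimal sketch; the evidence kinds are taken from the examples above, and the names are assumptions, not our production schema:

```typescript
// Sketch of the tiering policy: verified linkage or it does not drive a match.
type Evidence =
  | { kind: "consented-signup"; hem: string; cookieId: string }
  | { kind: "app-install"; deviceId: string; name: string }
  | { kind: "partner-login"; accountId: string; cookieId: string }
  | { kind: "statistical"; score: number }; // co-occurrence, IP overlap, etc.

function isDeterministic(e: Evidence): boolean {
  return e.kind !== "statistical";
}

function canDrivePrimaryIdentification(evidence: Evidence[]): boolean {
  // A high statistical score never substitutes for verified linkage.
  return evidence.some(isDeterministic);
}
```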

4. Continuous verification

A graph that never re-verifies its records becomes a lie within a year. People change jobs, change emails, change phone numbers, move states. We re-verify records on a 24-hour cycle wherever we have a fresh signal. Old records without fresh signals fall out of primary serving and into archive.

The 24-hour refresh cadence is not a marketing line. It is an engineering constraint. The nightly refresh job has to complete inside its budget every night, or the graph starts drifting. That deadline drives almost as many decisions as the match-time latency target.
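
The serving rule falls out of the cadence. A sketch, with an assumed staleness window; the 30-day figure is illustrative, not our actual threshold:

```typescript
// Staleness rule sketch: the 30-day window is an assumption.
const STALE_AFTER_DAYS = 30;
const MS_PER_DAY = 86_400_000;

interface PersonRecord {
  lastVerifiedAt: number; // epoch ms of the most recent verifying signal
}

function servingTier(rec: PersonRecord, now = Date.now()): "primary" | "archive" {
  const ageDays = (now - rec.lastVerifiedAt) / MS_PER_DAY;
  return ageDays <= STALE_AFTER_DAYS ? "primary" : "archive";
}
```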


The five-stage pipeline

At a high level, every new signal goes through five stages before it affects a customer identification.

INGEST   ->   NORMALIZE   ->   CLUSTER   ->   VERIFY   ->   SERVE
  1. Ingest. Events land from the Orbit pixel, partner feeds, and batch uploads. The ingest layer is throughput-dominated and has to absorb traffic spikes without dropping primary events.
  2. Normalize. IP addresses checked against known VPN and proxy ranges. Bot traffic filtered. User agents canonicalized. Hashed emails checked against known hash types (SHA256, SHA1, MD5).
  3. Cluster. The hardest step. Signals are grouped into person-level clusters using verified linkages. A weak linkage does not merge two clusters. A strong linkage does (a minimal merge sketch follows this list). Cluster boundaries drive accuracy more than any other single decision.
  4. Verify. Each cluster is checked against existing identity records. If a match is found, the new signals attach to the existing person. If not, a new person record is created, but only if there is enough evidence to warrant one.
  5. Serve. The cluster is indexed for query time. When a customer’s pixel fires, the query hits the index, returns the matched person, and wraps the result with firmographics and the 100+ data points customers expect.
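
The merge rule in stage 3 can be sketched as a union-find over signal keys. A minimal version, with linkage strength reduced to a two-level label; this is an illustration of the rule, not how the production clusterer is built:

```typescript
// Union-find over signal keys: only verified linkages merge clusters.
class Clusters {
  private parent = new Map<string, string>();

  private find(x: string): string {
    if (!this.parent.has(x)) this.parent.set(x, x);
    const p = this.parent.get(x)!;
    if (p === x) return x;
    const root = this.find(p);
    this.parent.set(x, root); // path compression
    return root;
  }

  link(a: string, b: string, strength: "verified" | "statistical"): void {
    // A weak linkage does not merge two clusters. A strong one does.
    if (strength !== "verified") return;
    this.parent.set(this.find(a), this.find(b));
  }

  sameCluster(a: string, b: string): boolean {
    return this.find(a) === this.find(b);
  }
}
```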

For the scaling architecture across these stages, see scaling the identity graph.


Tradeoffs we made

Every design choice closed one door and opened another. The honest tradeoffs.

Match rate vs accuracy

We could push match rate higher by letting probabilistic signals drive matches. We do not. We would rather deliver 100 verified visitors than 200 “probably” visitors. The independent test is the exhibit. Our match rate of 30-40%+ on US B2B traffic comes with 8.7/10 accuracy, and that is the ratio we optimize.

Freshness vs stability

A 24-hour refresh window is aggressive. Customers occasionally notice that a person’s title changed between two webhook deliveries. That is working as intended. Daily truth beats quarterly stability for the go-to-market use case. See why intent data needs daily refresh for the full argument.

Coverage vs latency

The graph could return more matches by searching deeper tiers of evidence. It does not. The serving layer is tuned to return fast or return nothing. A late match that arrives after the webhook stops being useful is worse than no match.
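
A sketch of that serving policy: race the lookup against a deadline and return nothing if the deadline wins. The budget below is illustrative, not our actual number:

```typescript
// "Return fast or return nothing": race the lookup against a deadline.
async function lookupWithDeadline<T>(
  lookup: Promise<T>,
  budgetMs = 150, // assumed budget for illustration
): Promise<T | null> {
  const deadline = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), budgetMs),
  );
  return Promise.race([lookup, deadline]);
}
```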

US-first vs global

We are US-first because US B2B is where our customers are and where first-party partnerships are densest. The GDPR default is company-level for EU/UK visitors. Person-level identification requires affirmative consent there. Expanding person-level coverage outside the US would mean different partner relationships and a different compliance posture, and we have chosen not to rush it.

Shared infrastructure vs per-customer

Every customer queries the same graph. The $147/mo Pro plan and the enterprise plan hit the same infrastructure, the same 23 REST endpoints, the same real-time webhooks, the same 27-tool MCP server. The economics work because the graph is a shared asset with shared cost. Per-customer infrastructure would price out everyone except enterprise.


What I would prioritize today

Three things that would be different if we were starting in 2026 instead of compounding from an earlier stack.

  1. Cluster boundaries as a first-class data model. Today clusters are a byproduct of signals. A cluster-native data model would let us reason about confidence, evidence, and decay at the cluster level rather than rebuilding the logic across every downstream job.
  2. Real-time verification path. A portion of verification is still batch. The moments where verification needs to be hot (a pixel fire for a customer we have not seen in 30 days) deserve their own hot path.
  3. Explainability surfaces. When a customer asks “why did you identify this person as Sarah Chen?”, I want the answer to come back as a short evidence trail. Today that answer lives in logs that only the engineering team can read. Explainability is a product surface, not a debug surface.
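
If we shipped that surface, the trail might serialize to something like this. Every field name here is hypothetical:

```typescript
// One possible shape for an explainability payload (purely illustrative).
interface EvidenceTrail {
  personId: string;
  matchedAt: string; // ISO timestamp
  evidence: Array<{
    kind: "consented-signup" | "app-install" | "partner-login";
    observedAt: string;
    signal: string; // e.g. "hem:sha256:ab12..."
    confidence: "verified";
  }>;
}
```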

What this means for customers

Because every customer queries the same graph, we can offer self-serve identification at price points that undercut enterprise vendors: the graph is a shared asset, built once, refreshed daily, served to everyone.

The practical implications:

  • Match rates of 30-40%+ on US B2B traffic, deterministic by default. See the accuracy test.
  • 100+ data points per identified person, delivered as a single webhook payload. See the webhook payload reference.
  • 23 REST endpoints for programmatic access, plus a 27-tool MCP server for AI agents. See the developer guide.
  • Daily refresh across the entire graph. Yesterday’s record is not today’s record, and that is by design.
  • Suppression and consent at the graph layer, not the dashboard. See suppression lists.

The graph is the product. Everything else is delivery.


Every plan ships with the same identity graph, 23 REST endpoints, webhooks, and a 27-tool MCP server. Start in 5 minutes →