
How to Parse JavaScript-Rendered Intent Signals

Most of the modern web renders via JavaScript. How a serious intent pipeline parses SPA pages, headless rendering tradeoffs, and what naive crawlers miss.

George Gogidze · 10 min read

The open web used to be a pile of server-rendered HTML you could parse with a few regex rules. It is not anymore. Most of the sites that matter for intent data render content through JavaScript, and treating them like static pages means missing most of the signal.

I am George, founder of Leadpipe. Parsing intent from JavaScript-rendered pages is one of those problems nobody talks about publicly, but it is the difference between intent data that reflects 2016 and intent data that reflects 2026. This post is the conceptual walkthrough: the two parsing problems, why the easy approach is not enough, the headless rendering tradeoffs, and how a pixel network and a crawler complement each other.


The problem

A static HTML page is easy. You fetch the URL, parse the body, classify the content, store the signal. Done.

A JavaScript-rendered page looks like this to a naive crawler:

<!DOCTYPE html>
<html>
<head><title>Loading…</title></head>
<body>
  <div id="root"></div>
  <script src="/bundle.js"></script>
</body>
</html>

There is no content in the body. The content is produced by executing bundle.js against a backend API. A crawler that only reads the HTML sees nothing. If you have 5 million websites in your pixel network and a material chunk of them are single-page applications (SPAs), you are losing the classification data that drives topic scoring.


Two signal sources, two parsing problems

A modern intent pipeline has two different kinds of signal sources, each with its own parsing problem.

  • Pixel: captures client-side page events on partner sites. Parsing challenge: the pixel runs in the browser after JS has already rendered, so the DOM is available.
  • Crawler: classifies page content server-side to keep the topic taxonomy updated. Parsing challenge: the crawler has to execute JavaScript itself to see the rendered content.

The pixel has it easier. By the time the pixel fires, the visitor’s browser has already executed the site’s JavaScript. The DOM is real, the text is real, the URL pattern is stable. The pixel observes the rendered page directly.

The crawler has it harder. To maintain the topic taxonomy, you have to render pages the way a browser would, then read the result. That is more expensive per page than a plain HTTP fetch, and at the scale of the modern web, the cost adds up.


What the pixel captures on a rendered page

A well-designed intent pixel is browser-side, lightweight, and respectful of the host site’s performance. What it captures:

  • URL with normalized path. The router may have updated the URL client-side. The pixel captures the visible URL.
  • Page title. Already populated by the time the pixel fires.
  • Visible text of structural elements. Enough context to classify into the topic taxonomy without capturing form content.
  • Referrer. Where the visitor came from.
  • Session timing. When the page was entered and how long it was in view.

The pixel does not capture form content, keystrokes, DOM mutations beyond the initial render, or anything a visitor types. That is a deliberate boundary for privacy and for performance. Capturing every DOM mutation on a modern SPA would be a measurable performance cost on the host site, and the pixel is a guest there.
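To make that shape concrete, here is a minimal sketch of a browser-side payload along these lines. The field names, selectors, and truncation limit are illustrative assumptions, not Leadpipe's production pixel.

// Hypothetical payload shape and collection logic; illustrative only.
interface PixelPayload {
  url: string;              // visible URL after client-side routing, query stripped
  title: string;            // document.title at fire time
  structuralText: string[]; // visible text of headings and nav links, truncated
  referrer: string;
  enteredAt: number;        // epoch ms when the page was entered
  dwellMs: number;          // how long the page has been in view
}

function collectPayload(enteredAt: number): PixelPayload {
  const structuralText = Array.from(document.querySelectorAll('h1, h2, nav a'))
    .map((el) => (el.textContent ?? '').trim())
    .filter(Boolean)
    .slice(0, 20); // enough context to classify, nothing a visitor typed
  return {
    url: location.origin + location.pathname,
    title: document.title,
    structuralText,
    referrer: document.referrer,
    enteredAt,
    dwellMs: Date.now() - enteredAt,
  };
}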

For the architectural context on Orbit’s pixel network, see inside the Orbit pixel network.


The crawler: headless browsers at scale

The crawler side is harder. You cannot depend on a user’s browser to render pages. You have to render them yourself.

Why a plain HTTP fetcher is not enough

A plain HTTP fetcher (curl, a Go HTTP client, anything that just reads the raw response) gets the pre-render HTML. For an SPA, that is a shell with no content. For a server-rendered site, it is fine. The problem is the crawler cannot tell ahead of time which is which, and getting it wrong means silently losing classification signal on a chunk of the web.

Headless browser execution

The solution is a headless browser. Chromium running in a server environment, with no visible window, controlled via something like Puppeteer or Playwright. It loads the page, executes JavaScript, waits for the DOM to settle, and hands back the rendered HTML.

That works. It is also expensive. A headless browser instance uses far more memory and CPU than an HTTP fetcher, and it is slower per page. Running at scale means managing a pool of browsers, recycling them to prevent memory leaks, and load-balancing across them. The operational complexity is real.
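As a rough sketch of the render step, assuming Playwright driving Chromium (the same idea works with Puppeteer); the timeout and wait condition are illustrative defaults, not tuned values:

import { chromium } from 'playwright';

// Render a page the way a browser would and return the settled HTML.
async function renderPage(url: string): Promise<string> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle', timeout: 30_000 });
    return await page.content(); // DOM after JavaScript execution
  } finally {
    await browser.close(); // at scale you would recycle pooled browsers instead
  }
}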

When to render, when to skip

Rendering every page through a headless browser is cost-prohibitive. The crawler has to decide, per URL, whether to render.

The heuristic stack:

  1. Pre-fetch the raw HTML. Cheap, fast.
  2. Check for SPA markers. Minimal <body>, common SPA framework signatures in <script> tags, routing library markers.
  3. Check content length after raw fetch. If the raw HTML has meaningful text already, skip headless.
  4. Fall back to headless only when the raw HTML is insufficient.
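A minimal sketch of that decision, with made-up marker strings and thresholds; the real marker list and cutoffs are tuned, not fixed values like these:

// Decide whether a URL needs the headless pool, given the raw pre-fetch HTML.
const SPA_MARKERS = ['id="root"', 'id="app"', 'data-reactroot', 'ng-version'];

function needsHeadlessRender(rawHtml: string): boolean {
  const bodyMatch = rawHtml.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const visibleText = (bodyMatch?.[1] ?? '')
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .trim();

  // Raw HTML already carries meaningful text: skip headless, classify directly.
  if (visibleText.length > 500) return false;

  // Thin body plus SPA framework signatures: route to the headless pool.
  return SPA_MARKERS.some((marker) => rawHtml.includes(marker));
}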

The economics work because most of the web is still cheap to crawl. The expensive minority is exactly the part where intent signal lives, so the headless investment pays off where it counts.


Topic classification on rendered content

Once a page is rendered (either by the pixel in the visitor’s browser or by the crawler in a headless environment), the content goes into the topic classifier.

The classifier maps page text and structure into entries in the 20,000+ topic taxonomy. The pieces:

Structural signals

Heading hierarchy, navigation labels, meta tags. These are strong hints about what kind of page it is. A page with an <h1> of “CRM Software Comparison” and navigation links to “Pricing” and “Features” is probably a vendor evaluation page.

Textual signals

Visible paragraph text, product names, category keywords. These are the fine-grained topic signals.

URL patterns

URL paths are surprisingly informative. /blog/crm-software-alternatives/ is cleaner signal than a lot of the page body on many sites.
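As an illustration of how cheap that signal is to extract (the segmentation rules here are assumptions, not the production logic):

// Pull coarse topic hints out of a URL path.
function topicHintsFromPath(path: string): string[] {
  return path
    .split('/')
    .filter(Boolean)
    .filter((segment) => !/^\d+$/.test(segment)) // drop numeric ids
    .flatMap((segment) => segment.split('-'))
    .filter((word) => word.length > 2);
}

// topicHintsFromPath('/blog/crm-software-alternatives/')
// → ['blog', 'crm', 'software', 'alternatives']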

Exclusions

Gated content, login walls, error pages, 404s, admin dashboards. These are filtered out before classification so they do not pollute the taxonomy.
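A simplified version of that pre-classification gate might look like the following; the patterns and length threshold are illustrative, not the production rules:

// Filter out pages that would pollute the taxonomy if classified.
const EXCLUDED_TITLES = [/\b404\b/i, /page not found/i, /sign in/i, /log ?in/i];
const EXCLUDED_PATHS = [/^\/admin\b/, /^\/login\b/, /^\/wp-admin\b/];

function shouldClassify(path: string, title: string, visibleText: string): boolean {
  if (EXCLUDED_PATHS.some((pattern) => pattern.test(path))) return false;
  if (EXCLUDED_TITLES.some((pattern) => pattern.test(title))) return false;
  // Gated or thin pages carry too little text to classify reliably.
  return visibleText.trim().length > 200;
}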

The output is not “visited a page.” It is “showed interest in cloud cost optimization” or “researched CRM alternatives.” That is what makes intent data actionable downstream.


Dynamic content: the edge cases

A few specific cases that cost real engineering work.

Infinite scroll

Pages that load more content as you scroll. The initial render captures only the top of the page. We do not emulate scroll in the crawler because the cost-benefit is poor at scale; we accept partial classification on these pages and lean on repeat visits for signal.

Lazy-loaded media

Images and embeds that load as they become visible. Same tradeoff. Not worth emulating viewport events at crawl scale.

Heavily personalized content

Pages that render differently based on cookies, geography, or login state. The crawler sees one version; another crawler would see another. We accept the version our crawler sees and note the personalization exposure in the data.

Content behind authentication

Do not log in to sites you are crawling. If content is behind a login wall, it is not classified. This is a deliberate scope limit for ethical and operational reasons.

Sites that block headless browsers

Some sites detect headless browsers and serve different content or block them entirely. Respect robots.txt and site-level blocks. Do not work around detection systems. If a site does not want to be crawled, do not crawl it.


The feedback loop with the pixel network

The crawler and the pixel do not live in separate worlds. The pixel provides the volume signal (what pages are actually getting visited), and the crawler provides the depth signal (what those pages actually contain).

The loop:

  1. Pixel fires on a partner site page the crawler has not seen yet.
  2. URL is queued for crawling. If it looks like an SPA, it gets routed to the headless pool.
  3. Crawler renders and classifies. The result lands in the topic taxonomy with a signal attached.
  4. Pixel events that happened before the classification get retroactively associated with the topic.
  5. Topic scores for people who visited the page update on the next refresh.
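In code, the handoff looks roughly like the sketch below. Every name here is hypothetical; it shows the shape of the loop, not Leadpipe's internals.

interface PixelEvent { url: string; visitorId: string; seenAt: number; }

const crawlQueue: string[] = [];                       // URLs waiting for the crawler
const topicsByUrl = new Map<string, string[]>();       // classification results
const pendingEvents = new Map<string, PixelEvent[]>(); // events seen before classification

function onPixelEvent(event: PixelEvent): void {
  const topics = topicsByUrl.get(event.url);
  if (topics) {
    updateTopicScores(event.visitorId, topics);        // URL already classified
    return;
  }
  // Unseen URL: queue it for crawling and hold the event for later.
  crawlQueue.push(event.url);
  pendingEvents.set(event.url, [...(pendingEvents.get(event.url) ?? []), event]);
}

function onPageClassified(url: string, topics: string[]): void {
  topicsByUrl.set(url, topics);
  // Retroactively associate events that fired before classification landed.
  for (const event of pendingEvents.get(url) ?? []) {
    updateTopicScores(event.visitorId, topics);
  }
  pendingEvents.delete(url);
}

function updateTopicScores(visitorId: string, topics: string[]): void {
  // Placeholder: scores refresh on the next cycle.
}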

This loop is how the taxonomy stays current with content changes. A site that redesigned last week is reclassified within days, not quarters.

For the intent-scoring side of this loop, see person-level intent data: how it works and the intent API post.


Tradeoffs we made

Headless rendering over script-execution shortcuts

You could approximate JS rendering with lighter tools: extracting JSON from initial-state script tags, or scraping API endpoints directly. We do not. Those shortcuts break as soon as a site changes its internals, and at 5M sites, any shortcut that only works on a subset is operational debt. Full headless rendering is more expensive but more robust.

Respect for robots.txt and rate limits

The crawler obeys robots.txt. It rate-limits per domain. It identifies itself via user-agent. This is not a competitive advantage. It is table stakes. A crawler that ignores these norms is one that gets banned from large parts of the web and ends up with worse data.
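For flavor, a per-domain rate limiter is only a few lines. The interval and user-agent string below are placeholders, and the robots.txt check that would run before any fetch is omitted:

// Space requests to the same domain at least MIN_INTERVAL_MS apart.
const MIN_INTERVAL_MS = 2_000; // illustrative, not the production value
const lastFetchAt = new Map<string, number>();

async function politeFetch(url: string): Promise<Response> {
  const domain = new URL(url).hostname;
  const waitMs = Math.max(0, (lastFetchAt.get(domain) ?? 0) + MIN_INTERVAL_MS - Date.now());
  if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
  lastFetchAt.set(domain, Date.now());
  // Identify the crawler explicitly; the UA string is a placeholder.
  return fetch(url, { headers: { 'User-Agent': 'ExampleIntentCrawler/1.0' } });
}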

Classification over extraction

We classify pages into topics. We do not extract structured data (product prices, review counts) from classified pages. That is a different product with different tradeoffs. The scope is intent. Detailed content extraction belongs elsewhere.

Curated taxonomy over free-form

20,000+ topics in a curated taxonomy versus free-form keyword tagging. The curated taxonomy is harder to maintain (someone has to add new topics as they emerge) but easier to filter and reason about downstream. Free-form keywords scale automatically but produce a dataset that is hard to use without heavy post-processing.


What I would prioritize today

  1. Classification as a service, not a stage. Today classification happens inside the crawl pipeline. Exposing it as its own service would let the pixel trigger on-demand classification for edge cases, instead of waiting for the batch to reach the URL.
  2. Browser pool auto-scaling. Headless rendering volume is uneven across the day. A smarter auto-scaling policy would shed cost during off-peak and scale up during peak without hand-tuning.
  3. Explainable classifications. A page gets classified into topic X. The customer asks why. Today the answer lives in debug logs. Exposing classification evidence as a queryable surface would help both internal debugging and external trust.

What this means for customers

You do not see the crawler. You see the topic scores, the audiences, the intent signals across 20,000+ topics. The crawler is the piece that makes the topic taxonomy reflect the actual web, not the web of ten years ago when static HTML was the default.

When you query Orbit intent data, you are querying a taxonomy that was built against JavaScript-rendered pages, updated from both the pixel network and an active crawler. That is why coverage is current and why niche topics show up in the data. If the crawler cut corners on rendering, the coverage would look like 2018.

The principle behind the architecture is the same one behind the rest of Leadpipe. Build the harder version that gets the data right, instead of the easier version that gets the headline metric.


Orbit resolves intent at the person level against a deterministic identity graph — the difference between “an account is in-market” and “this director at this account is researching today.” Try Orbit →