
How Much of B2B Traffic Is Bots? The Dirty Number

Bot traffic on B2B sites is higher than your analytics tool reports. Why GA4's default filter misses most of it, what AI crawlers changed, and how to measure cleanly.

George Gogidze · 10 min read

Your Google Analytics report says you had 80,000 monthly visitors. Your paid ad platforms say you bought 20,000 clicks. Your server logs say 4 million requests. Somewhere in that stack, reality lives. The question is how much of it is humans.

I am George, founder of Leadpipe. We run a deterministic identity graph against 280M verified profiles and observe traffic across 5M sites. Because matching requires real identity signal, we see directly which traffic is matchable and which is structurally not. That is not a perfect bot detector, but it is a second opinion to GA4’s filter, and the gap between the two is consistently larger than teams expect.

This post is the framework for thinking about how dirty the B2B traffic number really is, why GA4’s default filter misses most of it, and what changed when AI crawlers showed up.

The honest framing

There is no single “B2B bot traffic %” that applies to every site. The number depends on your traffic mix, your paid-vs-organic ratio, your geography, and how aggressively your team has hardened bot defenses. What is consistent across every B2B site I have looked at is the gap between GA4’s default bot filter and a composite detection method.

GA4’s filter catches roughly the published IAB/ABC spider list. That covers Googlebot, Bingbot, and a handful of other declared crawlers. It does not cover:

  • Scripted sessions from cloud VMs hitting your site to scrape content or check your pricing.
  • AI crawlers that emerged faster than the spider list is updated.
  • Commercial scrapers running headless Chromium with real-looking headers.
  • Click farms hitting paid campaigns.
  • Browser automation tools that rotate user-agents.

In other words, GA4's blind spots are exactly the bot categories that have grown fastest in the last 24 months.

Why this matters more than it sounds

Every metric you compute from “total visitors” is off by the bot share. If your dashboard says you have a 2% form conversion rate on 100K sessions, the math changes substantially when you discover a quarter of those sessions were not human.

| Metric | Computed on raw sessions | Computed on human sessions |
| --- | --- | --- |
| Form CVR | Looks low | Looks reasonable |
| Match rate | Looks low | Looks normal |
| Bounce rate | Looks high | Looks normal |
| Time on page | Looks short | Looks longer |
| Effective paid CPC | Looks better than reality | Looks worse |

The same site, the same campaigns, the same numerators. Different denominators. The denominator question is the one most teams skip.

A cleaner framework: composite detection

The way to measure bot share more honestly is composite detection. A session is flagged as a bot if it meets at least two of these criteria:

  1. User-agent signature matches a known bot or crawler.
  2. IP belongs to a known data center or cloud hosting range (AWS, GCP, Azure, OVH, Hetzner, etc.).
  3. Behavioral signals indicate non-human. Zero scroll on a multi-viewport page, impossibly short session durations, perfectly regular request intervals, headless browser fingerprints.
  4. Identity-graph mismatch. The session presents identity signals that fail to resolve against a deterministic graph of real people, in patterns inconsistent with a privacy-conscious human.

No single signal is a silver bullet. The composite is robust because real bots usually trip two or three of these criteria at once, and real humans almost never trip two. The point is not perfectly accurate classification. The point is that the gap between composite detection and GA4’s default filter is large and consistent enough that even imperfect detection beats no detection.
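To make the two-of-four rule concrete, here is a minimal sketch in Python. It assumes you have already extracted per-session signals from your logs and IP enrichment; the field names, user-agent tokens, data-center ASNs, and behavioral thresholds are illustrative, not a reference implementation.

```python
# Minimal sketch of the two-of-four composite rule. Field names, UA tokens,
# ASNs, and thresholds are illustrative; tune them against your own traffic.
from dataclasses import dataclass

KNOWN_BOT_UA_TOKENS = ("bot", "crawler", "spider", "headless")
DATACENTER_ASNS = {16509, 15169, 8075, 16276, 24940}  # AWS, GCP, Azure, OVH, Hetzner


@dataclass
class Session:
    user_agent: str
    asn: int                  # autonomous system of the client IP
    max_scroll_depth: float   # 0.0 to 1.0 across the session
    duration_seconds: float
    identity_resolved: bool   # did the session resolve against the identity graph?


def bot_signals(s: Session) -> int:
    """Count how many of the four composite criteria the session trips."""
    ua_hit = any(tok in s.user_agent.lower() for tok in KNOWN_BOT_UA_TOKENS)
    datacenter_hit = s.asn in DATACENTER_ASNS
    behavior_hit = s.max_scroll_depth == 0.0 and s.duration_seconds < 2.0
    identity_hit = not s.identity_resolved
    return sum((ua_hit, datacenter_hit, behavior_hit, identity_hit))


def is_bot(s: Session) -> bool:
    """Flag as bot only when at least two independent criteria agree."""
    return bot_signals(s) >= 2
```

Bot share is then simply the flagged fraction of your sessions, and the number worth watching is how far it sits above whatever GA4’s filter removed.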

What GA4 misses

GA4’s default “Exclude known bots and spiders” filter is keyed primarily off the IAB/ABC spider list. It does a decent job on declared, well-behaved crawlers. It does a poor job on:

  • Scripted sessions from cloud VMs. Anyone with a $10/month server can run a scraper that GA4 does not recognize.
  • AI crawlers that appeared faster than the list was updated. Until very recently, GPTBot, ClaudeBot, and PerplexityBot were not reliably filtered.
  • Commercial scrapers running headless Chromium. They run real browsers with real-looking headers. GA4 sees them as humans.
  • Click farms hitting paid campaigns. Paid clicks are a separate beast (see below).

If you need a reliable human-only view, layer a second detection method on top of GA4. Either a commercial bot detection service or an identity-graph cross-reference.

AI crawlers specifically

AI crawler traffic is the fastest-growing bot category. It is legitimate traffic in a narrow sense (it is how LLMs learn and retrieve from your content). It is not human traffic, and you should not count it as engaged sessions in a marketing dashboard.

The major declared AI crawlers as of 2026:

| AI crawler | Operator | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Training data for GPT models |
| ClaudeBot / Claude-Web | Anthropic | Training and retrieval for Claude |
| PerplexityBot | Perplexity | Real-time retrieval for Perplexity answers |
| Google-Extended | Google | Gemini training opt-in signal |
| Bytespider | ByteDance | Doubao and TikTok-related models |
| CCBot | Common Crawl | Open-source web archive used by many AI labs |
| Applebot-Extended | Apple | Apple Intelligence training |

For sellers who care about LLM-referred traffic, this is also a signal. AI crawlers are the proxy for how visible your content is to LLM-assisted buyers. Two years ago this category was effectively zero. It is now a meaningful share of bot sessions on most B2B sites, and trending up.

For the broader argument on why analytics tools mishandle attribution in this environment, see Google Analytics is lying about pipeline.
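If you want to know how much of your own traffic these declared crawlers account for, the quickest check is your server logs, not GA4. A minimal sketch, assuming a plain access-log format where the user-agent string appears somewhere in each line; the tokens come from the table above.

```python
# Count requests from the declared AI crawlers listed in the table above.
# Assumes a plain access log with the user-agent string present in each line.
import re
from collections import Counter

AI_CRAWLER_TOKENS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "PerplexityBot",
    "Google-Extended", "Bytespider", "CCBot", "Applebot-Extended",
]
AI_CRAWLER_RE = re.compile("|".join(re.escape(tok) for tok in AI_CRAWLER_TOKENS))


def ai_crawler_hits(log_lines) -> Counter:
    """Tally requests per declared AI crawler from raw access-log lines."""
    counts = Counter()
    for line in log_lines:
        match = AI_CRAWLER_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Usage:
# with open("access.log") as f:
#     print(ai_crawler_hits(f).most_common())
```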

Paid clicks specifically

The most uncomfortable finding for paid-marketing teams: paid clicks consistently have higher bot share than organic.

The structural reasons:

  • The incentive is to click. People and scripts get paid per click. Click farms exist. Competitor click-farming exists.
  • Ad platforms’ invalid-click filters catch a share, not all. Especially on display and programmatic.
  • Display and programmatic are the worst. Bot share on display is consistently higher than on search or social.
  • LinkedIn Ads tend to be the cleanest paid channel. Professional context, harder to bot-stuff, but not bot-free.

If you measure paid performance against total click volume, your numerator (conversions) is divided by an inflated denominator (clicks, bots included). Your true human-click CVR is higher than your report says, and your effective CPC is higher too. See our broader analysis in what % of paid-ad clicks are returning visitors for the related identity-side problem.
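A quick way to see the denominator effect is to recompute the per-click numbers against human clicks only. The figures below are made up for illustration; the point is the swap, not the specific values.

```python
# Illustrative arithmetic only: how paid-click metrics shift when you divide
# by human clicks instead of all billed clicks. All inputs are made up.
spend = 10_000.00          # campaign spend
billed_clicks = 20_000     # what the ad platform charged for
bot_share = 0.25           # share of those clicks flagged by composite detection
conversions = 400          # forms / demos attributed to the campaign

human_clicks = billed_clicks * (1 - bot_share)   # 15,000

reported_cvr = conversions / billed_clicks       # 2.0%
human_cvr = conversions / human_clicks           # ~2.7%

reported_cpc = spend / billed_clicks             # $0.50 per billed click
effective_cpc = spend / human_clicks             # ~$0.67 per human click

print(f"CVR: {reported_cvr:.1%} reported vs {human_cvr:.1%} on human clicks")
print(f"CPC: ${reported_cpc:.2f} billed vs ${effective_cpc:.2f} per human click")
```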

Implications for the reader

Clean your denominator before debugging your funnel. If your form conversion rate looks bad, your match rate looks bad, or your bounce rate looks high, check your bot share first. A 2% form CVR on 100K raw sessions might be a 2.8% CVR on 72K human sessions after subtracting 28K bots. That is the difference between “our page is broken” and “our page is fine.”

Do not trust GA4’s bot filter. It is better than nothing. It is not close to comprehensive. Layer a second method on top, either a commercial bot detection service or an identity-graph cross-reference.

Report traffic metrics net of bots, not gross. Executive dashboards that show raw session counts overstate reality. Every team I know that has transitioned to human-only session reporting has seen executives immediately recalibrate what “traffic growth” means.

Check your paid-click data monthly. Paid-click bot share is volatile. A sudden jump usually means a campaign is being scraped or click-farmed. Catching it in week one instead of month three saves real budget.

Audit your CRM for bot-originated leads. Any form that is not captcha-protected is getting some bot form-fills. Those end up in your CRM as MQLs and pollute pipeline metrics. See Salesforce is full of bad data for the downstream version of the same hygiene problem.

Why this matters for visitor identification specifically

Two sides.

First, your match rate denominator. If you are using visitor identification and your match rate looks underwhelming, check whether the denominator includes bot sessions. Match rates go up several points when you compute against human sessions only. The match rate by industry guide assumes a clean denominator.

Second, a deterministic visitor identification system naturally filters bot traffic as a byproduct of matching. Deterministic matching requires a real identity signal, and bots do not carry real identity signals. Sessions that fail to match are often sessions that were never human. This is not a replacement for a dedicated bot detection layer, but it is a useful cross-check.

If your current vendor is reporting suspiciously high match rates, one thing to check is which denominator they are dividing by. A 65% match rate computed on human-only sessions can be a 45% match rate computed on total sessions. Or the inverse: if a vendor’s match rate looks low, check whether they are computing against a bot-inflated denominator.
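A back-of-envelope version of that check, with made-up numbers chosen to be consistent with the 65-versus-45 example above:

```python
# Back-of-envelope denominator check on a vendor's reported match rate.
# All numbers are made up; plug in your own session and match counts.
total_sessions = 100_000
bot_sessions = 31_000        # from your own composite detection, not the vendor's
human_sessions = total_sessions - bot_sessions   # 69,000
matched = 44_850             # identities the vendor returned

rate_on_human = matched / human_sessions   # ~65%
rate_on_total = matched / total_sessions   # ~45%

print(f"{rate_on_human:.0%} of human sessions, {rate_on_total:.0%} of total sessions")
```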

What changed in 2025-2026

Three structural shifts pushed bot share up.

AI crawler adoption. GPTBot and ClaudeBot are now common, declared, and widely deployed. PerplexityBot retrieves in real time when users ask Perplexity questions. These are not bad bots, but they are bots, and counting them as humans skews everything downstream.

LLM-driven scraping. Smaller AI projects scrape the web to build datasets. Many do not declare themselves. They show up as headless browser sessions with rotating user-agents and look more human than legacy scrapers ever did.

Cheap automation. A motivated competitor can scrape your pricing page, your case studies, and your product pages with a $20/month server. The cost of running a low-grade scraper has collapsed.

The trend is not reversing. Bot share is going to be higher in 2027 than in 2026. The framework you adopt now determines whether your dashboards keep up.

Limitations

  • Composite detection is heuristic. Over-flagging and under-flagging both happen. The magnitude of the gap versus GA4 is the robust finding, not the precise share.
  • Sample skew. B2B SaaS, services, and fintech behave differently from media, e-commerce, and consumer properties. Bot profiles are industry-dependent.
  • AI crawler attribution. Detection relies on declared user-agents. Non-declaring scrapers (common) get bucketed into “other bot” rather than “AI crawler.”
  • Point-in-time. Bot ecosystems change quarterly. AI crawler share in particular keeps moving.

Leadpipe identifies 30-40%+ of your US B2B visitors with full contact data on the Pro plan at $147/mo. No credit card to start the 500-lead trial. Start identifying visitors →