14% of the Web Is Actually Dead — But Not How You Think (We Scanned 10M Domains)

Originally published at crawlora.net.

When you hit a dead URL in production, do you know whether the domain is gone, or whether an anti-bot system just blocked your crawler? Both surface as a failed request, but they’re completely different failures, and most tools don’t tell them apart.

We scanned the DomCop top 10 million domains to find out how much of the popular web is actually dead. The short version: about 14% — not the ~27% you’ve probably seen quoted.

Dead and blocked are not the same failure

A domain that won’t load failed for one of two reasons:

  • It’s gone. No DNS record, or nothing accepts a TCP connection. Genuinely dead.
  • It’s alive and blocking you. A real server returning a 403 or 429 to anything that looks like a bot.

Most “dead web” studies count both as dead. They shouldn’t, because the two failures call for opposite responses:

  • A dead domain never comes back. Retrying it — rotating proxies, escalating clients — is wasted compute.
  • A blocked domain is live. It needs a different client, not more retries.
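As a rough sketch of how the split above could be encoded, the following Python separates "dead" from "blocked" and maps each to a next step. The function names, the outcome strings, and the "unknown" bucket are illustrative assumptions, not the scanner's actual code.

```python
from typing import Optional

# Statuses treated as an anti-bot block, per the definition above.
BLOCK_STATUSES = {403, 429}


def classify(dns_resolved: bool, tcp_connected: bool, status: Optional[int]) -> str:
    """Classify one probe result (hypothetical helper)."""
    # No DNS record, or nothing accepted the TCP connection: genuinely dead.
    if not dns_resolved or not tcp_connected:
        return "dead"
    # Connected but no HTTP status came back. The post does not cover this
    # case, so it gets its own bucket here (an assumption).
    if status is None:
        return "unknown"
    if status in BLOCK_STATUSES:
        return "blocked"
    if 300 <= status < 400:
        return "redirect"
    # Everything else, including 404 and 5xx, is a live server answering.
    return "alive"


def next_action(outcome: str) -> str:
    """Map an outcome to a crawler decision (labels are illustrative)."""
    if outcome == "dead":
        return "skip"            # retries are wasted compute
    if outcome == "blocked":
        return "switch_client"   # needs a different client, not more retries
    return "none"
```

For example, `next_action(classify(True, True, 429))` yields `"switch_client"`, while a domain with no DNS record yields `"skip"`.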

The numbers

Probing every domain over HTTP and classifying each as alive / redirect / blocked / dead:

  • 14.1% genuinely dead — overwhelmingly vanished DNS (76% of the dead bucket). The server is gone.
  • 8.9% blocked — live servers returning 403/429 to automated clients.
  • 76.6% alive, 0.3% redirect.

The widely-cited “~27% of the web has rotted” figure conflates blocked-but-live servers (and 404/5xx responses — still a live server answering) with the genuinely gone. Separate them honestly and the truly-dead web is about half what people assume.
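To make the arithmetic behind these shares concrete, here is a small worked calculation. It assumes all 10 million domains received a classification and works from the rounded percentages quoted above, so the counts are approximations rather than figures from the dataset.

```python
TOTAL = 10_000_000  # assumption: every domain in the list was classified

# Rounded shares quoted above, in percent.
shares = {"dead": 14.1, "blocked": 8.9, "alive": 76.6, "redirect": 0.3}


def approx_count(pct: float, total: int = TOTAL) -> int:
    """Convert a percentage share into an approximate domain count."""
    return round(total * pct / 100)


dead = approx_count(shares["dead"])               # about 1.41 million domains
dns_dead = round(dead * 0.76)                     # 76% of the dead bucket is vanished DNS
conflated = shares["dead"] + shares["blocked"]    # share if blocked is counted as dead
ratio_to_quoted = shares["dead"] / 27.0           # versus the ~27% figure

print(f"dead: {dead:,}  dns-dead: {dns_dead:,}")
print(f"dead + blocked: {conflated:.1f}%  dead / 27%: {ratio_to_quoted:.2f}")
```

The four shares sum to 99.9% because of rounding. Dead plus blocked comes to 23.0%, and 14.1 / 27 is roughly 0.52, which is where "about half" comes from.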

Proof: same domains, different client

To show the 8.9% “blocked” really are alive, we re-probed them with a real Chrome TLS/JA3 fingerprint — an HTTP client that speaks Chrome’s exact TLS handshake and header order (not a headless browser, no canvas/WebGL).

~72,000 of the blocked domains served content normally. Same URLs, same network — the only thing “dead” was the wall. That dropped the blocked rate from 8.9% to 8.2%.
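The drop from 8.9% to 8.2% can be checked with a few lines of arithmetic. This sketch assumes a 10 million domain denominator and treats the approximate "~72,000" as an exact count, so the results are indicative only.

```python
TOTAL = 10_000_000                            # assumption: full list was probed
blocked_before = round(TOTAL * 8.9 / 100)     # domains blocked in the first arm
recovered = 72_000                            # approximate figure quoted above

blocked_after = blocked_before - recovered
rate_after = 100 * blocked_after / TOTAL            # percent of all domains
recovered_share = 100 * recovered / blocked_before  # percent of the blocked bucket

print(f"blocked after re-probe: {rate_after:.2f}%")
print(f"share of blocked domains recovered: {recovered_share:.1f}%")
```

Under these assumptions the re-probe recovers roughly 8% of the blocked bucket, or about 0.7 percentage points of the whole list. The remaining blocked domains still answered with a 403 or 429, so they are live servers too.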

The takeaway for anyone building crawlers or link-checkers: about 9% of the domains in this scan were live servers with anti-bot deployed, so a tool that counts both buckets as dead would be wrong on roughly 39% of its “dead” reports (8.9 of 23.0 percentage points). NXDOMAIN/REFUSED → dead, skip it. 403/429 → alive, recheck with a real browser TLS context before you mark it dead.
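One way the "dead" half of that rule might be checked with only the Python standard library is sketched below. The function name, the return labels, and the injectable `resolve` and `connect` parameters (added so the logic can be exercised without network access) are assumptions for illustration. Note that `socket.gaierror` also covers transient resolver failures, so a production checker would likely retry before treating it as NXDOMAIN.

```python
import socket


def probe_reachability(domain, port=443, timeout=5.0,
                       resolve=socket.getaddrinfo,
                       connect=socket.create_connection):
    """Return 'dead:dns', 'dead:refused', 'dead:unreachable', or 'reachable'.

    Only the DNS and TCP layers are checked; a 'reachable' host still
    needs an HTTP request to tell alive from blocked.
    """
    try:
        resolve(domain, port)
    except socket.gaierror:
        # Name did not resolve (may also be a temporary resolver failure).
        return "dead:dns"
    try:
        conn = connect((domain, port), timeout)
    except ConnectionRefusedError:
        return "dead:refused"
    except OSError:
        # Timeouts and other network errors: nothing accepted the connection.
        return "dead:unreachable"
    conn.close()
    return "reachable"
```

Calling `probe_reachability("example.com")` uses the real resolver and a real TCP connection. Passing stub callables for `resolve` and `connect` keeps unit tests offline.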

The web rots unevenly

Death rate isn’t uniform. By country-code TLD:

  • China’s .cn: 33% dead
  • Germany’s .de: 7.6% dead

A 4× gap. Institutional TLDs fare badly too — .gov 26%, .edu 22% — matching Pew Research’s finding that government and reference pages suffer the worst link rot.
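A per-TLD death rate like the ones above could be computed along these lines. The sketch assumes each row is a dict with `tld` and `outcome` keys and that dead domains carry the outcome value `"dead"`. The key names follow the dataset fields listed later in this post, but the exact values are assumptions.

```python
from collections import Counter


def dead_rate_by_tld(rows):
    """Return {tld: fraction of that TLD's rows whose outcome is 'dead'}."""
    total = Counter()
    dead = Counter()
    for row in rows:
        total[row["tld"]] += 1
        if row["outcome"] == "dead":   # assumed label for dead domains
            dead[row["tld"]] += 1
    return {tld: dead[tld] / n for tld, n in total.items()}


# Tiny made-up sample, purely to show the shape of the input.
sample = [
    {"tld": "cn", "outcome": "dead"},
    {"tld": "cn", "outcome": "alive"},
    {"tld": "cn", "outcome": "alive"},
    {"tld": "de", "outcome": "alive"},
]
rates = dead_rate_by_tld(sample)
```

With the real figures quoted above, the ratio between .cn and .de is 33 / 7.6, or roughly 4.3.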

The famous dead

The casualties are all in the data: Grooveshark, Gfycat, del.icio.us, Yahoo Pipes, AddThis, DMOZ, OpenSolaris, GeoCities. Two decades of the social and developer web’s graveyard.

The open dataset

Every domain, in both probe arms, is published as open data under CC BY 4.0 (one JSON row per domain per arm: domain, tld, rank, mode, outcome, reason, HTTP statuses, redirect hops, parked flag).
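As a rough sketch of how rows in this shape could be consumed, the following tallies outcome shares for one probe arm. It assumes the file is laid out as JSON Lines (one JSON object per line) and that `mode` identifies the arm. The mode values used in the example are hypothetical, so check them against the actual data.

```python
import json
from collections import Counter


def outcome_shares(lines, mode):
    """Return {outcome: fraction} over the rows of one probe arm."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        row = json.loads(line)
        if row.get("mode") == mode:
            counts[row["outcome"]] += 1
    n = sum(counts.values())
    return {k: v / n for k, v in counts.items()} if n else {}


# Made-up rows; "plain" and "chrome" are placeholder mode names.
example_lines = [
    '{"domain": "a.example", "tld": "example", "mode": "plain", "outcome": "blocked"}',
    '{"domain": "a.example", "tld": "example", "mode": "chrome", "outcome": "alive"}',
    '{"domain": "b.example", "tld": "example", "mode": "plain", "outcome": "dead"}',
]
plain_shares = outcome_shares(example_lines, "plain")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which avoids loading the whole dataset into memory.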

(Disclosure: we build a web-scraping API, which is why the dead-vs-blocked distinction bites us daily.)
