Most LLM pipelines have a data quality problem that nobody talks about at conferences.

You’re fetching web content — documentation, knowledge base articles, competitor pages, product data — and feeding it directly into your AI pipeline. The content looks fine in your browser. But what your model actually receives is something else entirely.

It’s a context window full of <div class="wrapper"><div class="inner"><div class="content">, navigation menus, cookie banners, JavaScript snippets, tracking pixels, and somewhere in the middle, the three paragraphs of actual content you wanted.

The Token Math

A typical web page runs about 40,000 characters of HTML for maybe 3,000 characters of readable content. At roughly four characters per token, you’re paying for about 10,000 tokens per page to extract about 750 tokens of signal. With GPT-4 class models at $10–30 per million tokens, that’s not a rounding error; it’s a 13× cost multiplier on your retrieval layer.
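
The ratio is easy to check with back-of-the-envelope arithmetic. The sketch below assumes roughly four characters per token, which is only a rule of thumb (real tokenizers vary, and markup often tokenizes worse than prose); the helper names are illustrative.

```python
# Rough token and cost estimate for one fetched page.
# Assumption: ~4 characters per token (rule of thumb, not a measurement).
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars: int) -> int:
    """Approximate token count from a character count."""
    return num_chars // CHARS_PER_TOKEN


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input cost for a given number of tokens at a per-million price."""
    return tokens / 1_000_000 * usd_per_million


raw_tokens = estimate_tokens(40_000)   # the full HTML page
clean_tokens = estimate_tokens(3_000)  # the readable content only
multiplier = raw_tokens / clean_tokens

print(raw_tokens, clean_tokens, round(multiplier, 1))  # 10000 750 13.3
```

The multiplier does not depend on the characters-per-token assumption, since it cancels out of the ratio. Only the absolute token counts do.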

But cost isn’t even the main problem. Context windows are finite. When you’re building a RAG system that needs to reason across 20 documents simultaneously, spending 90% of your context budget on <nav> and </nav> and aria-label="breadcrumb" is a design failure. You’re not giving your model room to think.

Why “Just Strip the Tags” Doesn’t Work

The obvious fix is to strip HTML tags client-side. This is what most teams do first. It doesn’t work well.

Naively stripped HTML loses structure. A table of pricing tiers becomes a run-on paragraph. A numbered troubleshooting sequence loses its ordering. Code blocks merge with surrounding prose. The semantic relationships that make content useful — this is a heading, this is a list, this is a code example — disappear with the tags.

Good HTML-to-Markdown conversion isn’t stripping. It’s translation. <h2> becomes ##. <code> becomes backticks. <table> becomes a Markdown table. <strong> becomes **bold**. The structure is preserved, the noise is removed, and the output is something an LLM can actually reason about.
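
Here is a small sketch of the difference between stripping and translating, using only Python’s standard library. The names `naive_strip` and `MiniMarkdown` are made up for this example, and the translator handles just headings, lists, inline code, and bold; it is a toy, nowhere near a production converter.

```python
import re
from html.parser import HTMLParser

SAMPLE = (
    "<h2>Install</h2>"
    "<ol><li>Run <code>pip install foo</code></li>"
    "<li>Restart the <strong>server</strong></li></ol>"
)


def naive_strip(html: str) -> str:
    """Delete every tag and keep only the text between them."""
    return re.sub(r"<[^>]+>", "", html)


class MiniMarkdown(HTMLParser):
    """Toy translator covering only h2, ol/ul/li, code, and strong."""

    INLINE = {"code": "`", "strong": "**"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.lists = []  # stack of [list kind, item counter]

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.out.append("\n## ")
        elif tag in ("ol", "ul"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            current = self.lists[-1]
            current[1] += 1
            marker = f"{current[1]}." if current[0] == "ol" else "-"
            self.out.append(f"\n{marker} ")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag == "h2":
            self.out.append("\n")
        elif tag in ("ol", "ul") and self.lists:
            self.lists.pop()
            self.out.append("\n")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_data(self, data):
        self.out.append(data)


def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


print(naive_strip(SAMPLE))
# InstallRun pip install fooRestart the server

print(to_markdown(SAMPLE))
# ## Install
#
# 1. Run `pip install foo`
# 2. Restart the **server**
```

The stripped version has lost the heading, the step order, and the boundary around the command. The translated version keeps all three at a cost of a few extra characters.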

Getting that translation right at scale — handling malformed HTML, inconsistent site structure, dynamic content, edge cases in table rendering — is more work than it looks. It’s the kind of thing that takes a few hours to get to 80% and months to get past 95%.

What We Built

UnWeb is our answer to this problem: an HTML-to-Markdown API and SaaS built specifically for LLM pipeline use cases.

You send us a URL or raw HTML. We send back clean, structured Markdown. The conversion handles:

  • Semantic HTML to Markdown mapping (headings, lists, tables, code blocks, blockquotes)
  • Noise removal (navigation, footers, sidebars, ads, cookie banners)
  • Content extraction heuristics that identify the main article body
  • Consistent output formatting that plays nicely with chunking strategies

The API is designed for pipeline integration, not one-off conversion: batch endpoints, webhook delivery, rate limits that work for high-volume crawlers, and a CLI for local workflows.
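
To make the noise-removal item in the list above concrete, here is a minimal sketch that drops everything inside a handful of tags. It is a deliberately simple heuristic for illustration and should not be read as how UnWeb works internally; the tag list and the `extract_text` helper are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as noise. This list is an illustrative
# assumption; real extraction needs more than tag names.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}


class NoiseFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # number of noise elements we are currently inside
        self.chunks = []  # text found outside any noise element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth == 0 and text:
            self.chunks.append(text)


def extract_text(html: str) -> list[str]:
    """Return text chunks that are not inside a noise element."""
    parser = NoiseFilter()
    parser.feed(html)
    parser.close()
    return parser.chunks


PAGE = (
    "<nav><a href='/'>Home</a><a href='/docs'>Docs</a></nav>"
    "<main><h1>Getting started</h1>"
    "<p>Install the package, then run the setup command.</p></main>"
    "<script>track('pageview')</script>"
    "<footer>Terms and privacy</footer>"
)

print(extract_text(PAGE))
# ['Getting started', 'Install the package, then run the setup command.']
```

Tag names alone break down quickly on real sites, where navigation often lives in anonymous `<div>` elements. That is where content extraction heuristics come in.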

The Numbers in Practice

We’ve been running UnWeb in production since March. On documentation sites, the most common use case for RAG pipelines, average compression is 12–15×. A 45KB documentation page becomes a 3KB Markdown file. Put differently, fifteen converted pages fit in the context that one raw page used to occupy.

For a pipeline retrieving 20 documents per query, that’s roughly 900KB of raw HTML versus 60KB of Markdown. At about four characters per token, the raw pages overflow even a 128K context window, while the converted pages fit in about 16K. It’s the difference between fitting every retrieved document in one call and having to make architectural compromises.
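
The context-budget comparison can be sketched the same way. This assumes about four characters per token and treats one byte as one character, both of which are rough approximations; the window sizes are written as 128 × 1024 and 16 × 1024 tokens for illustration.

```python
CHARS_PER_TOKEN = 4   # assumption: rough average, varies by tokenizer
DOCS_PER_QUERY = 20


def tokens_for(bytes_per_doc: int, docs: int = DOCS_PER_QUERY) -> int:
    """Total tokens for a batch of documents of the given size.

    Treats one byte as one character, which holds for mostly-ASCII pages.
    """
    return bytes_per_doc * docs // CHARS_PER_TOKEN


def fits(total_tokens: int, window: int) -> bool:
    """True if the batch fits in the context window at all."""
    return total_tokens <= window


raw_total = tokens_for(45_000)    # 45KB of raw HTML per page
clean_total = tokens_for(3_000)   # 3KB of Markdown per page

print(raw_total, fits(raw_total, 128 * 1024))     # 225000 False
print(clean_total, fits(clean_total, 16 * 1024))  # 15000 True
```

Note that `fits` ignores the prompt and the model’s response, so a real budget needs some headroom on top of these totals.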

The Practical Upshot

If you’re building anything that involves fetching web content for AI processing, clean Markdown at the input layer is one of the highest-leverage optimizations you can make. It’s upstream of everything: chunking strategy, embedding quality, retrieval accuracy, generation quality.

Getting the HTML-to-Markdown conversion wrong puts a ceiling on everything downstream. Getting it right removes a constraint you might not have realized you had.

UnWeb is live at unweb.mbsoftsystems.com. There’s a free tier for evaluation and an API built for production throughput. If you’re hitting token budget problems on web content, start there.