
Docpull

842 downloads in two weeks. Scrapes what traditional crawlers can't.

2 min read · November 13, 2025

Modern documentation sites are notoriously difficult to scrape. They're built with JavaScript frameworks, load content dynamically, and structure things in ways that break traditional crawlers. On top of that, most sites carry 10–20% duplicate content, and incremental updates on large sites can take hours.

Docpull solves these problems. It's fully open source, and in the two weeks since launch it has been downloaded 842 times.

8 Reasons Why Docpull is Useful

  1. Handles Modern Web Docs – Crawls JavaScript-heavy sites with sitemaps, link extraction, and full browser rendering.
  2. Streaming Architecture – Processes pages as they're found, ready for real-time pipelines.
  3. Security-First Design – Built-in protections and mandatory robots.txt compliance keep scraping safe.
  4. Smart Deduplication – Detects and skips duplicate content across URLs using streaming SHA-256 hashes.
  5. Cache-Forward – Persistent caching enables fast, incremental updates on large sites.
  6. Built-in Profiles – Pre-configured settings for AI training, full site mirrors, or quick sampling.
  7. Rich Metadata Extraction – Outputs Markdown with full YAML frontmatter, including Open Graph and JSON-LD.
  8. Async-Native Performance – Efficient, concurrent fetching with aiohttp and configurable concurrency.

1. Handles Modern Web Docs

Most scrapers fail on JavaScript-heavy sites. Docpull uses a three-layer discovery system:

  • Sitemap parsing – the fastest path to content
  • Enhanced link extraction – pulls URLs from data-* attributes, onclick handlers, JSON-LD, Next.js prefetch hints
  • Full browser rendering via Playwright for SPAs

It works out of the box with Docusaurus, GitBook, ReadTheDocs, Nextra, and custom documentation sites.
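As a rough sketch of the second layer, here's what mining URLs from non-standard spots can look like with BeautifulSoup. This is illustrative only, not Docpull's actual internals; the function name and heuristics are mine:

# Illustrative sketch of "enhanced link extraction": pull candidate URLs
# out of places traditional crawlers miss (not Docpull's actual code).
import json
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_hidden_links(html: str, base_url: str) -> set[str]:
    soup = BeautifulSoup(html, "html.parser")
    urls = set()

    # Standard anchors plus prefetch hints (<link rel="prefetch" href=...>)
    for tag in soup.find_all(["a", "link"], href=True):
        urls.add(urljoin(base_url, tag["href"]))

    # URLs stashed in data-* attributes, common in JS-driven navigation
    for tag in soup.find_all(True):
        for attr, value in tag.attrs.items():
            if attr.startswith("data-") and isinstance(value, str) and value.startswith("/"):
                urls.add(urljoin(base_url, value))

    # URLs embedded in JSON-LD blocks
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and isinstance(data.get("url"), str):
            urls.add(urljoin(base_url, data["url"]))

    return urls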

2. Streaming Architecture

Docpull streams results as they're found:

async for event in fetcher.fetch_docs():
    if event.type == "PAGE_SAVED":
        print(f"Got {event.path}")  # Process immediately

Pipe this directly into a RAG ingestion pipeline. No need to wait for the full crawl to finish.
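A minimal sketch of that wiring, reusing the event API from the snippet above; ingest_markdown stands in for whatever chunk-and-embed step your pipeline uses:

# Feed saved pages into an ingestion step while the crawl is still running.
# The fetcher/event API follows the snippet above; ingest_markdown is a
# placeholder for your own chunking/embedding function.
from pathlib import Path

async def crawl_and_ingest(fetcher, ingest_markdown):
    async for event in fetcher.fetch_docs():
        if event.type == "PAGE_SAVED":
            text = Path(event.path).read_text(encoding="utf-8")
            ingest_markdown(text)  # embed/index immediately, no waiting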

3. Security-First Design

Docpull comes with built-in protections against common scraper vulnerabilities.

  • SSRF, XXE, path traversal, and DoS protections included
  • Mandatory robots.txt compliance ensures safe and ethical crawling
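For flavor, here's a generic sketch of one such protection, an SSRF guard that refuses URLs resolving to private or loopback addresses. It's the standard technique, not necessarily Docpull's exact implementation:

# Generic SSRF guard sketch: resolve the hostname and reject anything that
# points at private, loopback, link-local, or reserved address space.
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = info[4][0].split("%")[0]  # drop IPv6 scope id if present
        ip = ipaddress.ip_address(addr)
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True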

4. Smart Deduplication

StreamingDeduplicator computes SHA-256 hashes on-the-fly:

  • Checks duplicates before writing to disk
  • O(1) hash lookups
  • Detects duplicate content across URLs
  • Typical sites carry 10–20% duplicate content, so skipping it saves real time and storage
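The core idea fits in a few lines. A minimal sketch of streaming deduplication (StreamingDeduplicator presumably does more, but the heart of it is an O(1) set membership test):

# Minimal streaming-dedup sketch: hash each page's content as it arrives
# and skip anything already seen, before it ever hits disk.
import hashlib

class Deduplicator:
    def __init__(self):
        self._seen: set[str] = set()

    def is_duplicate(self, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return True   # same content under a different URL: skip the write
        self._seen.add(digest)
        return False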

5. Cache-Forward

--cache                    # Enable persistent cache
--cache-ttl 30             # 30-day expiry
--no-skip-unchanged        # Override cache

CacheManager features:

  • O(1) set lookups for URLs
  • ETag/Last-Modified support
  • Batched writes for speed
  • TTL-based eviction

Incremental updates on a 10k-page site take seconds, not hours.
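Conditional requests are what make that possible. An illustrative sketch of ETag revalidation with aiohttp, not CacheManager's actual interface:

# ETag revalidation sketch: send the stored validator and let the server
# answer 304 Not Modified for unchanged pages, so only changed pages are
# re-downloaded. The cache dict here is a stand-in for persistent storage.
import aiohttp

async def fetch_if_changed(session: aiohttp.ClientSession, url: str, cache: dict):
    headers = {}
    if url in cache:
        headers["If-None-Match"] = cache[url]["etag"]
    async with session.get(url, headers=headers) as resp:
        if resp.status == 304:
            return None  # unchanged: reuse the cached copy
        body = await resp.text()
        if "ETag" in resp.headers:
            cache[url] = {"etag": resp.headers["ETag"]}
        return body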

6. Built-in Profiles

docpull https://docs.example.com --profile rag

Profiles are pre-configured bundles of settings optimized for different use cases. They save you the hassle of manually tuning concurrency, caching, deduplication, and depth for each scraping task.
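Conceptually, a profile is just a named bundle of tuned settings. The keys and values below are illustrative, not Docpull's actual option names:

# Hypothetical illustration of what a profile bundles together.
PROFILES = {
    "rag": {
        "concurrency": 10,   # parallel fetches
        "cache": True,       # persistent cache for incremental runs
        "dedupe": True,      # skip duplicate content
        "max_depth": 5,      # how deep to follow links
    },
    "mirror": {
        "concurrency": 4,
        "cache": True,
        "dedupe": False,     # keep every page for a faithful copy
        "max_depth": None,   # unbounded
    },
}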

7. Rich Metadata Extraction

Output isn't just HTML→Markdown. Every file includes Markdown + YAML frontmatter:

---
title: Getting Started
source: https://docs.example.com/getting-started
og_description: Learn how to...
json_ld:
  "@type": HowTo
---

Pulls Open Graph, JSON-LD, and microdata via extruct, giving your RAG pipeline rich context for embeddings.
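Downstream, that frontmatter is trivial to consume. A minimal sketch using the python-frontmatter package (the filename is hypothetical):

# Read one of Docpull's output files: metadata and body come apart cleanly.
import frontmatter

post = frontmatter.load("getting-started.md")
print(post["title"], post["source"])   # fields from the YAML block
chunks = post.content.split("\n\n")    # the Markdown body, ready for chunking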

8. Async-Native Performance

Built on aiohttp with:

  • Per-host semaphores (respects rate limits)
  • Connection pooling
  • Exponential backoff retries
  • Configurable concurrency

Fetching 1,000 pages typically takes 2–5 minutes, even with rate limits in place.
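The per-host semaphore is the key to staying fast without hammering any single site. A minimal sketch of the pattern (the limit of 5 is an arbitrary choice, not Docpull's default):

# Per-host concurrency sketch: one semaphore per hostname caps simultaneous
# requests to any single site, so global concurrency can stay high.
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

import aiohttp

host_locks: dict[str, asyncio.Semaphore] = defaultdict(lambda: asyncio.Semaphore(5))

async def polite_get(session: aiohttp.ClientSession, url: str) -> str:
    host = urlparse(url).hostname or ""
    async with host_locks[host]:          # at most 5 in-flight requests per host
        async with session.get(url) as resp:
            return await resp.text()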


Docpull makes scraping modern documentation simple, fast, and reliable: clean Markdown with rich metadata, ready for AI pipelines.

Learn More

  • GitHub
  • PyPI
