842 downloads in two weeks. Scrapes what traditional crawlers can't.
Modern documentation sites break traditional scrapers. JavaScript frameworks, dynamic content loading, client-side routing—the patterns that make docs sites nice to use make them hard to crawl.
Most sites carry 10-20% duplicate content across URLs. Incremental updates on large sites take hours. Existing tools either can't handle modern web architecture or dump messy HTML instead of clean markdown.
Docpull solves these problems. Fully open source, 842 downloads in the first two weeks.
Three-layer content discovery: Sitemap parsing (fastest), enhanced link extraction (data-* attributes, JSON-LD, Next.js prefetch hints), and full browser rendering via Playwright for SPAs. Works with Docusaurus, GitBook, ReadTheDocs, Nextra, and custom sites.
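To make the first layer concrete, here's a minimal sketch of sitemap-based discovery using only the standard library. It's illustrative rather than docpull's internals; the real pipeline adds the link-extraction and Playwright layers on top.

# Illustrative sketch of discovery layer one: collect every <loc> URL from
# a site's sitemap.xml. Not docpull's actual implementation.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(base_url: str) -> list[str]:
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/sitemap.xml") as resp:
        tree = ET.fromstring(resp.read())
    return [loc.text for loc in tree.findall(".//sm:loc", SITEMAP_NS) if loc.text]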
Streaming architecture: Process pages as they're found—pipe directly into RAG ingestion pipelines without waiting for full crawls to complete.
Smart deduplication: SHA-256 hashes computed on the fly detect duplicate content before it's written to disk. O(1) lookups catch the 10-20% duplication typical of most sites.
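The mechanism is simple to sketch: hash each page's content and keep a set of hashes already written, so membership checks stay constant-time. Illustrative only, not docpull's internals.

# Content-hash dedup sketch: skip a page when an identical body was already saved.
import hashlib

seen_hashes: set[str] = set()

def is_duplicate(content: str) -> bool:
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:   # O(1) average-case set lookup
        return True
    seen_hashes.add(digest)
    return False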
Cache-forward design: Persistent caching with ETag/Last-Modified support makes incremental updates on 10k-page sites take seconds, not hours.
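Conditional requests are what make the incremental case cheap: send the stored validator back and skip any page the server answers with 304 Not Modified. A rough standard-library sketch of the idea (docpull manages its cache for you; this is not its API):

# Conditional-GET sketch: re-download a page only if the server says it changed.
import urllib.request
from urllib.error import HTTPError

def fetch_if_changed(url: str, etag: str | None) -> bytes | None:
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()           # changed: new body to process
    except HTTPError as err:
        if err.code == 304:              # Not Modified: cached copy is still valid
            return None
        raise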
Rich metadata extraction: Outputs markdown with YAML frontmatter including Open Graph, JSON-LD, and microdata—giving RAG pipelines context for better embeddings.
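On the consuming side, that frontmatter is easy to peel off before chunking. A small sketch using PyYAML; the file path and the title field here are hypothetical, and docpull's exact frontmatter keys may differ.

# Split a saved markdown file into YAML frontmatter + body so a RAG pipeline
# can attach the metadata to each chunk. Requires PyYAML (pip install pyyaml).
import yaml

def split_frontmatter(path: str) -> tuple[dict, str]:
    text = open(path, encoding="utf-8").read()
    if not text.startswith("---"):
        return {}, text
    _, header, body = text.split("---", 2)   # leading marker, YAML block, markdown body
    return yaml.safe_load(header) or {}, body.lstrip()

meta, body = split_frontmatter("docs.example.com/getting-started.md")  # hypothetical path
print(meta.get("title"), len(body))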
pip install docpull
# Basic usage
docpull https://docs.example.com
# With RAG-optimized settings
docpull https://docs.example.com --profile rag
# Streaming into a pipeline
async for event in fetcher.fetch_docs():
    if event.type == "PAGE_SAVED":
        process(event.path)  # Handle immediately
Built-in profiles handle common configurations:
rag: Optimized for AI training pipelines
mirror: Full site archival
sample: Quick content sampling

Built-in protections against SSRF, XXE, path traversal, and DoS. Mandatory robots.txt compliance. The goal is scraping that's both effective and responsible.