842 downloads in two weeks. Scrapes what traditional crawlers can't.
Modern documentation sites are notoriously difficult to scrape. They're built with JavaScript frameworks, load content dynamically, and structure things in ways that break traditional crawlers. Most sites have 10-20% duplicate content. Incremental updates on large sites take hours.
Docpull solves these problems. Since launch two weeks ago: 842 downloads. Fully open source.
Most scrapers fail on JavaScript-heavy sites. Docpull uses a three-layer discovery system to find every page.
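The three layers aren't spelled out here, so purely as an illustration, imagine something like sitemap parsing, on-page link extraction, and a JS-rendering fallback. A generic layered loop might look like this (a sketch of the shape, not Docpull's internals):

# Illustrative only: try each discovery layer in turn until one finds pages.
# The layer names here are assumptions, not Docpull's actual implementation.
from typing import Callable, Iterable

def discover(base_url: str, layers: Iterable[Callable[[str], list[str]]]) -> list[str]:
    for layer in layers:        # e.g. sitemap parse, link crawl, JS rendering
        urls = layer(base_url)
        if urls:                # first layer that yields URLs wins
            return urls
    return []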
Works seamlessly with Docusaurus, GitBook, ReadTheDocs, Nextra, and custom documentation sites.
Docpull streams results as they're found:
async for event in fetcher.fetch_docs():
    if event.type == "PAGE_SAVED":
        print(f"Got {event.path}")  # Process immediately
Pipe this directly into a RAG ingestion pipeline. No need to wait for the full crawl to finish.
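For example, here's a minimal sketch that feeds saved pages straight into an ingestion step (ingest_markdown is a hypothetical stand-in for your RAG loader, not part of Docpull):

def ingest_markdown(text: str) -> None:
    """Hypothetical stand-in: chunk, embed, and index the text here."""
    ...

async def stream_into_rag(fetcher):
    # `fetcher` is the Docpull fetcher from the snippet above.
    async for event in fetcher.fetch_docs():
        if event.type == "PAGE_SAVED":
            with open(event.path, encoding="utf-8") as f:
                ingest_markdown(f.read())  # index immediately, no waiting for the full crawl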
Docpull comes with built-in protections against common scraping pitfalls.
StreamingDeduplicator computes SHA-256 content hashes on the fly, dropping duplicate pages as they stream in.
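As a minimal illustration of the idea (not Docpull's actual class), streaming dedup boils down to hashing each page as it arrives and skipping anything already seen:

import hashlib

class StreamingDedup:
    # Sketch of on-the-fly SHA-256 dedup; illustrative, not Docpull's code.
    def __init__(self):
        self.seen: set[str] = set()

    def is_duplicate(self, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.seen:
            return True          # identical content already processed
        self.seen.add(digest)
        return False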
Caching is controlled with CLI flags:
--cache                # Enable the persistent cache
--cache-ttl 30         # 30-day expiry
--no-skip-unchanged    # Override the cache (re-fetch unchanged pages)
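Putting those flags together, a cached incremental run might look like:

docpull https://docs.example.com --cache --cache-ttl 30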
Thanks to the built-in CacheManager, incremental updates on a 10k-page site take seconds, not hours.
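One common mechanism behind that kind of speedup is conditional HTTP requests; whether CacheManager uses exactly this is an assumption, but the pattern looks like:

import aiohttp

# Sketch: skip re-downloading pages whose ETag hasn't changed (assumed mechanism).
async def fetch_if_changed(session: aiohttp.ClientSession, url: str, etag: str | None):
    headers = {"If-None-Match": etag} if etag else {}
    async with session.get(url, headers=headers) as resp:
        if resp.status == 304:
            return None, etag                      # unchanged: cache hit
        return await resp.text(), resp.headers.get("ETag")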
docpull https://docs.example.com --profile rag
Profiles are pre-configured bundles of settings optimized for different use cases; they save you the hassle of manually tuning concurrency, caching, deduplication, and depth for your scraping tasks.
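As an illustration, a profile like rag might bundle settings along these lines (values here are hypothetical, not Docpull's actual defaults):

# Hypothetical contents of a "rag" profile; actual values may differ.
RAG_PROFILE = {
    "concurrency": 10,       # parallel fetches
    "cache": True,           # persistent cache on
    "cache_ttl_days": 30,    # expiry window
    "dedup": True,           # streaming SHA-256 dedup
    "max_depth": 5,          # crawl depth limit
}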
Output isn't just HTML→Markdown. Every file includes Markdown + YAML frontmatter:
---
title: Getting Started
source: https://docs.example.com/getting-started
og_description: Learn how to...
json_ld:
  "@type": HowTo
---
Pulls Open Graph, JSON-LD, and microdata via extruct, giving your RAG pipeline rich context for embeddings.
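For reference, that kind of extraction looks roughly like this with extruct (a sketch of standard extruct usage, not necessarily Docpull's exact call):

import extruct

# Extract Open Graph, JSON-LD, and microdata from a fetched page.
def extract_metadata(html: str, url: str) -> dict:
    return extruct.extract(
        html,
        base_url=url,
        syntaxes=["opengraph", "json-ld", "microdata"],
    )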
Docpull is built on aiohttp for fully asynchronous fetching. Fetching 1,000 pages typically takes 2–5 minutes, even with rate limits in place.
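That throughput comes from bounded concurrency; here's a minimal sketch of the pattern (the limit of 10 is an assumed example, not Docpull's setting):

import asyncio
import aiohttp

# Sketch: fetch many pages concurrently with a cap on in-flight requests.
async def fetch_all(urls: list[str], max_concurrent: int = 10) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    async with aiohttp.ClientSession() as session:
        async def fetch(url: str) -> str:
            async with sem:                      # rate-limit concurrent fetches
                async with session.get(url) as resp:
                    return await resp.text()
        return await asyncio.gather(*(fetch(u) for u in urls))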
Docpull makes scraping modern documentation simple, fast, and reliable: clean Markdown with metadata, ready for AI pipelines.