Document Processing Pipeline
by @pitchinnate · 🤖 Agents
Multi-step document ingestion agent. Extracts, cleans, chunks, embeds, and indexes documents for RAG pipelines.
# AGENTS.md — Document Processing Pipeline

## Pipeline Stages

### Stage 1: Extraction

- PDF: use pdfplumber for text, camelot for tables
- Word: python-docx for structured content
- HTML: BeautifulSoup with boilerplate removal (trafilatura)
- Preserve heading hierarchy — it's valuable metadata

### Stage 2: Cleaning

- Remove headers, footers, and page numbers
- Normalise whitespace and Unicode
- Detect and flag low-quality pages (< 50 characters, OCR artifacts)
- Language detection — route non-target languages to a separate queue

### Stage 3: Chunking Strategy

- Prefer semantic chunking over fixed-token chunking
- Chunk size: 512 tokens with 64-token overlap
- Preserve paragraph and section boundaries
- Include document metadata in each chunk: source, page, section title

### Stage 4: Embedding

- Model: text-embedding-3-small for cost, text-embedding-3-large for quality
- Batch size: 100 documents per API call
- Store embedding + metadata in pgvector or Pinecone
- Log embedding failures to a retry queue

### Stage 5: Index Update

- Upsert by document hash — idempotent
- Rebuild only changed documents on re-ingestion
- Maintain a document registry with version history
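The Stage 2 cleaning pass can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual implementation: the 50-character threshold comes from the spec above, while the letter-ratio heuristic for OCR noise is an assumption that would need tuning per corpus.

```python
import re
import unicodedata

MIN_PAGE_CHARS = 50  # pages shorter than this are flagged, per the spec


def clean_page(text: str) -> str:
    """Normalise Unicode and whitespace on one extracted page."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility chars, NBSP, etc.
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap blank-line runs at one blank line
    return text.strip()


def is_low_quality(text: str) -> bool:
    """Flag pages that are too short or look like OCR noise (illustrative heuristic)."""
    if len(text) < MIN_PAGE_CHARS:
        return True
    letters = sum(c.isalpha() for c in text)
    # Mostly non-letter characters often indicates OCR artifacts.
    return letters / max(len(text), 1) < 0.5
```

Flagged pages would be routed out of the main flow rather than silently dropped, so they can be inspected or re-OCRed.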
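The Stage 3 parameters (512-token chunks, 64-token overlap, metadata on every chunk) can be sketched as a sliding window. Note the stand-in: token counting below uses a whitespace split for self-containment, where a real pipeline would use the embedding model's tokenizer (e.g. tiktoken); the `Chunk` type and function names are illustrative.

```python
from dataclasses import dataclass

CHUNK_SIZE = 512  # tokens per chunk, per the spec
OVERLAP = 64      # tokens shared between consecutive chunks


@dataclass
class Chunk:
    text: str
    metadata: dict  # source, page, section title — carried on every chunk


def chunk_tokens(tokens: list[str], size: int = CHUNK_SIZE,
                 overlap: int = OVERLAP) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]


def chunk_document(text: str, metadata: dict) -> list[Chunk]:
    tokens = text.split()  # stand-in tokenizer; use the model's tokenizer in practice
    return [Chunk(" ".join(win), dict(metadata)) for win in chunk_tokens(tokens)]
```

A semantic chunker would additionally snap window boundaries to the nearest paragraph or section break instead of cutting mid-sentence.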
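The Stage 4 batching and retry-queue bookkeeping can be sketched as below. `embed_fn` stands in for the real embedding API call (for example an OpenAI client wrapper); the function and queue shapes are illustrative, not the pipeline's actual interface.

```python
BATCH_SIZE = 100  # documents per API call, per the spec


def embed_all(texts, embed_fn, batch_size=BATCH_SIZE):
    """Embed texts in batches; batches that raise go to a retry queue.

    Returns (vectors, retry_queue) where vectors maps the original text
    index to its embedding, and retry_queue holds (start_index, batch,
    error) tuples for a later retry pass.
    """
    vectors, retry_queue = {}, []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            for offset, vec in enumerate(embed_fn(batch)):
                vectors[start + offset] = vec
        except Exception as exc:
            # Log and queue the whole failed batch rather than losing it.
            retry_queue.append((start, batch, repr(exc)))
    return vectors, retry_queue
```

Failing the whole batch is the simple policy; a refinement is to bisect a failed batch to isolate the single bad document.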
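Stage 5's hash-keyed idempotent upsert with a version registry can be sketched in a few lines. The in-memory dicts below stand in for the real vector store and registry table, and the class and method names are assumptions for illustration.

```python
import hashlib


def doc_hash(text: str) -> str:
    """Content hash used as the upsert key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Registry:
    """In-memory stand-in for the document registry and index."""

    def __init__(self):
        self.index = {}    # doc_id -> current content hash
        self.history = {}  # doc_id -> all hashes seen (version history)

    def upsert(self, doc_id: str, text: str) -> bool:
        """Return True if the document changed and must be re-indexed."""
        h = doc_hash(text)
        if self.index.get(doc_id) == h:
            return False  # unchanged on re-ingestion — idempotent no-op
        self.index[doc_id] = h
        self.history.setdefault(doc_id, []).append(h)
        return True  # caller re-chunks/re-embeds only this document
```

Because the key is the content hash, re-running ingestion over an unchanged corpus touches nothing, and a changed document is rebuilt exactly once.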
submitted March 23, 2026