Document Processing Pipeline

by @pitchinnate · 🤖 Agents · 11d ago · 34 views

Multi-step document ingestion agent. Extracts, cleans, chunks, embeds, and indexes documents for RAG pipelines.

# AGENTS.md — Document Processing Pipeline

## Pipeline Stages

### Stage 1: Extraction
- PDF: use pdfplumber for text, camelot for tables
- Word: python-docx for structured content
- HTML: BeautifulSoup with boilerplate removal (trafilatura)
- Preserve heading hierarchy — it's valuable metadata
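The heading-hierarchy point above is easy to lose in extraction. A minimal stdlib sketch of capturing `h1`–`h6` levels during HTML parsing (production code would use BeautifulSoup/trafilatura as noted; the class and function names here are illustrative):

```python
from html.parser import HTMLParser

class HeadingExtractor(HTMLParser):
    """Collect (level, text) pairs for h1-h6 tags so the heading
    hierarchy survives extraction and can be attached to chunks later."""

    def __init__(self):
        super().__init__()
        self.headings = []   # list of (level, text)
        self._level = None   # level of the heading currently open, if any
        self._buf = []       # text fragments inside the open heading

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level, self._buf = int(tag[1]), []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buf).strip()))
            self._level = None

def extract_headings(html: str):
    parser = HeadingExtractor()
    parser.feed(html)
    return parser.headings
```

The `(level, text)` pairs can then be carried through cleaning and chunking as section metadata.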

### Stage 2: Cleaning
- Remove headers, footers, page numbers
- Normalise whitespace and Unicode
- Detect and flag low-quality pages (< 50 chars, OCR artifacts)
- Language detection — route non-target languages to a separate queue
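The cleaning rules above can be sketched with the standard library alone; the 50-character threshold comes from the spec, while the non-alphanumeric ratio used as an OCR-noise heuristic is an assumption:

```python
import re
import unicodedata

MIN_CHARS = 50  # threshold from the pipeline spec

def clean_page(text: str) -> str:
    """Normalise Unicode (NFKC) and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse horizontal whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap blank-line runs at one
    return text.strip()

def is_low_quality(text: str) -> bool:
    """Flag pages that are too short or look like OCR noise
    (assumed heuristic: < 70% alphanumeric-or-space characters)."""
    if len(text) < MIN_CHARS:
        return True
    alnum = sum(c.isalnum() or c.isspace() for c in text)
    return alnum / max(len(text), 1) < 0.7
```

Flagged pages should be routed to review rather than silently dropped, since short pages are sometimes legitimate (title pages, figure-only pages).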

### Stage 3: Chunking Strategy
- Semantic chunking preferred over fixed-token chunking
- Chunk size: 512 tokens with 64-token overlap
- Preserve paragraph and section boundaries
- Include document metadata in each chunk: source, page, section title
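A minimal sketch of the chunking strategy, packing whole paragraphs up to the size limit and carrying a token overlap into the next chunk. Tokens are approximated here as whitespace-split words; a real pipeline should count with the embedding model's tokenizer:

```python
def chunk_paragraphs(paragraphs, max_tokens=512, overlap=64, metadata=None):
    """Greedy paragraph-preserving chunker. Paragraphs are never split;
    each new chunk starts with the last `overlap` tokens of the previous
    one for context. `metadata` (source, page, section title) is copied
    into every chunk."""
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append({"text": " ".join(current), "meta": dict(metadata or {})})
            current = current[-overlap:]  # token overlap into the next chunk
        current.extend(words)
    if current:
        chunks.append({"text": " ".join(current), "meta": dict(metadata or {})})
    return chunks
```

Note this sketch assumes no single paragraph exceeds `max_tokens`; oversized paragraphs would need a sentence-level fallback split.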

### Stage 4: Embedding
- Model: text-embedding-3-small for cost-sensitive workloads; text-embedding-3-large when retrieval quality matters more
- Batch size: 100 documents per API call
- Store embedding + metadata in pgvector or Pinecone
- Log embedding failures for retry queue
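The batching and retry-queue behaviour can be sketched generically; `embed_fn` stands in for a wrapper around whatever embedding API is used (one call per batch), and is an assumption of this sketch:

```python
import logging

logger = logging.getLogger("embedding")

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed texts in fixed-size batches. A failed batch is logged and
    pushed onto a retry queue instead of aborting the whole run."""
    vectors, retry_queue = [], []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        try:
            vectors.extend(embed_fn(batch))
        except Exception as exc:
            logger.warning("batch %d failed: %s", i // batch_size, exc)
            retry_queue.append(batch)
    return vectors, retry_queue
```

Draining the retry queue with exponential backoff keeps transient API errors from losing documents.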

### Stage 5: Index Update
- Upsert by document hash — idempotent
- Rebuild only changed documents on re-ingestion
- Maintain a document registry with version history
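The hash-keyed upsert and version history can be sketched in a few lines; a real registry would live in Postgres alongside pgvector, and the class name here is illustrative:

```python
import hashlib

class DocumentRegistry:
    """Idempotent upsert keyed by content hash, with per-document
    version history so only changed documents are re-indexed."""

    def __init__(self):
        self.by_id = {}    # doc_id -> latest content hash
        self.history = {}  # doc_id -> list of all hashes seen

    def upsert(self, doc_id: str, content: str) -> bool:
        """Return True if the document changed and needs re-embedding."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self.by_id.get(doc_id) == digest:
            return False  # unchanged content: skip the rebuild
        self.history.setdefault(doc_id, []).append(digest)
        self.by_id[doc_id] = digest
        return True
```

Because the key is a content hash, re-running ingestion over an unchanged corpus is a no-op, which is what makes the pipeline safely re-runnable.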
submitted March 23, 2026