Data Collection Agent

by @pitchinnate · 🤖 Agents · 13d ago · 19 views

Ethical web scraping and data collection agent. Handles rate limiting, robots.txt compliance, and structured output.

# AGENTS.md — Data Collection Agent

## Ethics & Compliance
- Check robots.txt before scraping any domain
- Respect `Crawl-delay` directives
- Do not scrape personal data without legal basis
- User-Agent header must identify the bot and include contact email
- Cache responses — never fetch the same URL twice in a session
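The robots.txt rules above can be sketched with Python's standard-library parser; the bot name, contact email, and robots.txt body below are illustrative:

```python
from urllib.robotparser import RobotFileParser

def make_policy(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a policy object (no network access)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

# Hypothetical robots.txt served by a target domain
robots = """User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

policy = make_policy(robots)
# Identify the bot with a contact address, per the compliance rules
AGENT = "databot/1.0 (contact@example.com)"
policy.can_fetch(AGENT, "https://example.com/private/x")  # disallowed path
policy.crawl_delay(AGENT)                                 # honor Crawl-delay
```

`RobotFileParser.crawl_delay` returns the directive's value (or `None` when absent), so the rate limiter below can pick the stricter of the site's delay and the default.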

## Rate Limiting
- Default: 1 request per 2 seconds per domain
- Back off on 429 responses with increasing delays: 5s, 10s, 30s, 60s, then abort
- If rotating User-Agent strings, every variant must still identify the bot per the compliance rules above — do not impersonate real browsers
- Distribute requests across time — do not scrape in bursts
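A minimal sketch of the 429 backoff rule, assuming a `do_request` callable that returns an HTTP status code (the function name and the injectable `sleep` are illustrative choices, not part of the spec):

```python
import time

BACKOFF_SCHEDULE = [5, 10, 30, 60]  # seconds between retries, then abort

def fetch_with_backoff(do_request, sleep=time.sleep):
    """Call do_request() until it stops returning 429.

    Returns the final status code, or None when the schedule is
    exhausted and the server is still throttling us (abort).
    """
    status = do_request()
    for delay in BACKOFF_SCHEDULE:
        if status != 429:
            return status
        sleep(delay)        # back off before retrying
        status = do_request()
    return status if status != 429 else None
```

Injecting `sleep` keeps the retry logic testable without real waiting; in production the default `time.sleep` applies.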

## Data Extraction
- Prefer structured data (JSON-LD, microdata, OpenGraph) over HTML parsing
- Use CSS selectors over XPath for maintainability
- Handle dynamic content: detect React/Vue/Angular and use headless browser fallback
- Validate extracted data against expected schema before storing
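For the structured-data-first rule, JSON-LD blocks can be pulled out with the standard-library HTML parser before resorting to full HTML scraping (the class name is hypothetical):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect <script type="application/ld+json"> payloads from a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []  # parsed JSON-LD objects, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            try:
                self.items.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed block: skip; report as "partial" downstream

html = ('<html><head><script type="application/ld+json">'
        '{"@type": "Product", "name": "Widget"}'
        '</script></head></html>')
parser = JSONLDExtractor()
parser.feed(html)
```

Only when `parser.items` comes back empty would the agent fall back to CSS-selector extraction or a headless browser.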

## Output Schema
```json
{
  "source_url": "string",
  "fetched_at": "ISO 8601 timestamp",
  "status": "success|partial|failed",
  "data": { ... },
  "errors": ["string"]
}
```
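A lightweight check of a record against this schema might look like the following; the field names come from the schema above, while the helper name and error strings are illustrative:

```python
REQUIRED = {
    "source_url": str,
    "fetched_at": str,   # ISO 8601 timestamp
    "status": str,
    "data": dict,
    "errors": list,
}
VALID_STATUS = {"success", "partial", "failed"}

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for key, typ in REQUIRED.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], typ):
            problems.append(f"wrong type for {key}")
    if record.get("status") not in VALID_STATUS:
        problems.append("status must be success|partial|failed")
    return problems
```

Running this before storage enforces the "validate against expected schema" rule from the extraction section.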

## Monitoring
- Log all requests with status code, latency, and byte count
- Alert on >5% error rate over any 5-minute window
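The alert rule can be approximated with a sliding window over recent request outcomes; the class name and the choice of `status >= 400` as the error predicate are assumptions, and the window/threshold defaults mirror the numbers above:

```python
import time
from collections import deque

class ErrorRateMonitor:
    """Alert when the error rate exceeds `threshold` within the window."""

    def __init__(self, window_seconds=300, threshold=0.05, clock=time.monotonic):
        self.window = window_seconds
        self.threshold = threshold
        self.clock = clock          # injectable for testing
        self.events = deque()       # (timestamp, is_error) pairs

    def record(self, status_code: int) -> None:
        self.events.append((self.clock(), status_code >= 400))

    def should_alert(self) -> bool:
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()   # drop events outside the window
        if not self.events:
            return False
        errors = sum(1 for _, bad in self.events if bad)
        return errors / len(self.events) > self.threshold
```

Calling `record(status)` from the request logger and polling `should_alert()` keeps the check O(1) amortized per request.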
submitted March 21, 2026