Data Collection Agent
by @pitchinnate · 🤖 Agents
Ethical web scraping and data collection agent. Handles rate limiting, robots.txt compliance, and structured output.
# AGENTS.md — Data Collection Agent
## Ethics & Compliance
- Check robots.txt before scraping any domain
- Respect `Crawl-delay` directives
- Do not scrape personal data without legal basis
- User-Agent header must identify the bot and include contact email
- Cache responses — never fetch the same URL twice in a session
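The robots.txt checks above can be handled with Python's standard-library `urllib.robotparser`. A minimal sketch, assuming a hypothetical `example.com` robots file and bot name — in practice the file would be fetched from the target domain before the first request:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; fetch the real file
# from https://<domain>/robots.txt before scraping that domain.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable, cacheable parser."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rp = make_parser(ROBOTS_TXT)

# Identifying User-Agent with contact email, per the rules above (example value).
agent = "DataCollectionBot/1.0 (contact@example.com)"

allowed = rp.can_fetch(agent, "https://example.com/private/page")  # False
delay = rp.crawl_delay(agent)  # 2 — honor this as the per-request wait
```

Caching the parsed `RobotFileParser` per domain also satisfies the never-refetch rule for robots.txt itself.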
## Rate Limiting
- Default: 1 request per 2 seconds per domain
- Back off on 429 responses with increasing waits: 5s, 10s, 30s, 60s, then abort
- Keep the identifying User-Agent consistent, per the Ethics section — do not rotate spoofed browser agents
- Distribute requests across time — do not scrape in bursts
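The per-domain spacing and 429 backoff rules above could be sketched as follows. The `fetch` callable, domain keys, and injectable `now`/`sleep` hooks are illustrative assumptions, not a fixed API:

```python
import time

MIN_INTERVAL = 2.0               # 1 request per 2 seconds per domain
BACKOFF_STEPS = [5, 10, 30, 60]  # waits after successive 429s, then abort

_last_request: dict[str, float] = {}

def wait_for_slot(domain: str, now=time.monotonic, sleep=time.sleep) -> None:
    """Block until the per-domain minimum interval has elapsed."""
    last = _last_request.get(domain)
    if last is not None:
        elapsed = now() - last
        if elapsed < MIN_INTERVAL:
            sleep(MIN_INTERVAL - elapsed)
    _last_request[domain] = now()

def fetch_with_backoff(fetch, url: str, sleep=time.sleep) -> int:
    """Retry on 429 with the escalating waits above; abort after the last step.

    `fetch(url) -> status_code` is a hypothetical callable supplied by the agent.
    """
    for wait in BACKOFF_STEPS:
        status = fetch(url)
        if status != 429:
            return status
        sleep(wait)
    raise RuntimeError(f"aborting {url}: still rate-limited after backoff")
```

Injecting `now` and `sleep` keeps the limiter testable without real delays; spreading calls through `wait_for_slot` also avoids bursts.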
## Data Extraction
- Prefer structured data (JSON-LD, microdata, OpenGraph) over HTML parsing
- Use CSS selectors over XPath for maintainability
- Handle dynamic content: detect React/Vue/Angular and use headless browser fallback
- Validate extracted data against expected schema before storing
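Preferring structured data can be illustrated with a JSON-LD extractor built on the standard-library `html.parser` — a sketch, with the sample HTML and `Product` payload invented for the example:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect and parse <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []  # parsed JSON-LD objects, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

# Hypothetical page fragment for illustration.
HTML = """<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "9.99"}}
</script>
</head><body>...</body></html>"""

parser = JSONLDExtractor()
parser.feed(HTML)
```

When a page exposes JSON-LD like this, no CSS selectors are needed at all; fall back to selector-based parsing (and then to a headless browser) only when structured data is absent.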
## Output Schema
```json
{
  "source_url": "string",
  "fetched_at": "ISO 8601 timestamp",
  "status": "success|partial|failed",
  "data": { ... },
  "errors": ["string"]
}
```
## Monitoring
- Log all requests with status code, latency, and byte count
- Alert on >5% error rate over any 5-minute window

submitted March 21, 2026
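The 5-minute error-rate alert can be sketched with a sliding window of request outcomes — the class name and the choice to count any status ≥ 400 as an error are assumptions for illustration:

```python
import time
from collections import deque

WINDOW = 300.0    # 5-minute window, per the alert rule above
THRESHOLD = 0.05  # alert when more than 5% of windowed requests failed

class ErrorRateMonitor:
    """Track request outcomes and flag when the windowed error rate is too high."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._events = deque()  # (timestamp, is_error) pairs

    def record(self, status_code: int) -> None:
        # Assumption: any 4xx/5xx response counts as an error.
        self._events.append((self._now(), status_code >= 400))
        self._trim()

    def _trim(self) -> None:
        cutoff = self._now() - WINDOW
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        self._trim()
        if not self._events:
            return 0.0
        return sum(err for _, err in self._events) / len(self._events)

    def should_alert(self) -> bool:
        return self.error_rate() > THRESHOLD
```

The injectable clock makes the window testable; in production `record` would be called alongside the per-request log line.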