# FAEA: High-Fidelity Autonomous Extraction Agent (v1.0) ## Overview FAEA is a hybrid extraction system designed to defeat advanced bot mitigation (Cloudflare, Akamai, etc.) using a "Headless-Plus" architecture. It combines full-browser fidelity (Camoufox/Playwright) for authentication with high-speed clients (curl_cffi) for data extraction. **Status**: Released v1.0 **Docs**: [Architecture Definition v0.2](docs/ADD_v0.2.md) ## Features - **Bifurcated Execution**: Browser for Auth, Curl for Extraction. - **TLS Fingerprint Alignment**: Browser and Extractor both mimic `Chrome/124`. - **Evasion Layer**: - **GhostCursor**: Human-like mouse movements (Bezier curves, Fitts's Law). - **EntropyScheduler**: Jittered request timing (Gaussian + Phase Drift). - **Mobile Proxy Rotation**: Sticky session management. - **Production Ready**: - Docker Swarm/Compose scaling. - Redis-backed persistent task queues. - Prometheus/Grafana monitoring. ## Configuration ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `REDIS_URL` | Connection string for Redis | `redis://redis:6379` | | `BROWSERFORGE_SEED` | Seed for consistent canvas fingerprinting | (Optional) | | `PROXY_API_KEY` | API Key for mobile proxy provider | (Required for production) | ### Resource Requirements - **Camoufox**: Requires `shm_size: 2gb` to prevent Chrome crashing on complex pages. - **Memory**: Ensure host has at least 4GB RAM for a basic 5-browser cluster. ## Production Usage ### 1. Scaling the Cluster Start the stack with recommended production replicas: ```bash docker-compose up -d --scale camoufox-pool=5 --scale curl-pool=20 ``` ### 2. Monitoring Access the observability stack: - **Grafana**: `http://localhost:3000` (Default: `admin` / `admin`). - Dashboards: "FAEA Overview", "Extraction Health". - **Prometheus**: `http://localhost:9090`. - **Metrics**: - `auth_attempts_total`: Success/Failure counters. - `session_duration_seconds`: Histogram of session validity. ### 3. Task Dispatch Push tasks to the `task_queue` in Redis. **Python Example:** ```python import redis import json r = redis.from_url("redis://localhost:6379") payload = { "type": "auth", "url": "https://example.com/login", "session_id": "session_001" } r.rpush("task_queue", json.dumps(payload)) print("Task dispatched!") ``` **Curl Example:** Use `redis-cli`: ```bash redis-cli LPUSH task_queue '{"type": "extract", "url": "https://example.com/data", "session_id": "session_001"}' ``` ## Architecture - `src/browser/`: Camoufox (Firefox/Chrome) manager for auth. - `src/extractor/`: Curl Client for high-speed extraction. - `src/core/`: Shared logic (Session, Scheduler, Recovery, Monitoring). - `src/orchestrator/`: Worker loops and task management. ## Testing Run unit tests: ```bash ./venv/bin/pytest tests/unit/ ```