# FAEA: High-Fidelity Autonomous Extraction Agent (v1.0)

## Overview
FAEA is a hybrid extraction system designed to defeat advanced bot mitigation (Cloudflare, Akamai, etc.) using a "Headless-Plus" architecture. It combines full-browser fidelity (Camoufox/Playwright) for authentication with high-speed clients (curl_cffi) for data extraction.

**Status**: Released v1.0
**Docs**: [Architecture Definition v0.2](docs/ADD_v0.2.md)

## Features
- **Bifurcated Execution**: Browser for Auth, Curl for Extraction.
- **TLS Fingerprint Alignment**: Browser and Extractor both mimic `Chrome/124`.
- **Evasion Layer**:
  - **GhostCursor**: Human-like mouse movements (Bézier curves, Fitts's Law).
  - **EntropyScheduler**: Jittered request timing (Gaussian + Phase Drift).
  - **Mobile Proxy Rotation**: Sticky session management.
- **Production Ready**:
  - Docker Swarm/Compose scaling.
  - Redis-backed persistent task queues.
  - Prometheus/Grafana monitoring.

## Configuration

### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `REDIS_URL` | Connection string for Redis | `redis://redis:6379` |
| `BROWSERFORGE_SEED` | Seed for consistent canvas fingerprinting | (Optional) |
| `PROXY_API_KEY` | API Key for mobile proxy provider | (Required for production) |

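The variables in the table can be read in one place at startup. The following is a minimal sketch, not the project's actual loader: the `Settings` class, the `load_settings` function, and the `production` flag are hypothetical names introduced here for illustration. Only the variable names and the `REDIS_URL` default come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    redis_url: str
    browserforge_seed: Optional[str]
    proxy_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ,
                  production: bool = False) -> Settings:
    """Read the documented variables, applying the default from the table."""
    settings = Settings(
        redis_url=env.get("REDIS_URL", "redis://redis:6379"),
        # Empty strings are treated the same as unset variables.
        browserforge_seed=env.get("BROWSERFORGE_SEED") or None,
        proxy_api_key=env.get("PROXY_API_KEY") or None,
    )
    # The table marks PROXY_API_KEY as required for production only.
    if production and settings.proxy_api_key is None:
        raise RuntimeError("PROXY_API_KEY is required for production")
    return settings
```

Calling `load_settings(production=True)` at process start fails fast when the proxy key is missing, instead of failing later on the first proxied request.
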
### Resource Requirements
- **Camoufox**: Requires `shm_size: 2gb` to prevent the browser from crashing on complex pages.
- **Memory**: Ensure the host has at least 4 GB of RAM for a basic 5-browser cluster.

## Production Usage

### 1. Scaling the Cluster
Start the stack with recommended production replicas:
```bash
docker-compose up -d --scale camoufox-pool=5 --scale curl-pool=20
```

### 2. Monitoring
Access the observability stack:
- **Grafana**: `http://localhost:3000` (Default: `admin` / `admin`).
  - Dashboards: "FAEA Overview", "Extraction Health".
- **Prometheus**: `http://localhost:9090`.
- **Metrics**:
  - `auth_attempts_total`: Success/Failure counters.
  - `session_duration_seconds`: Histogram of session validity.

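The metrics listed above can also be queried outside Grafana through Prometheus's HTTP API. The sketch below only builds URLs for the instant-query endpoint (`/api/v1/query`). The helper name `instant_query_url`, the 5-minute window, and the 0.95 quantile are illustrative assumptions, and the `_bucket` suffix assumes `session_duration_seconds` is exported as a standard Prometheus histogram.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # address from the Monitoring section


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for Prometheus's instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Example PromQL expressions (window and quantile are illustrative choices).
AUTH_RATE = "rate(auth_attempts_total[5m])"
SESSION_P95 = (
    "histogram_quantile(0.95, rate(session_duration_seconds_bucket[5m]))"
)

if __name__ == "__main__":
    print(instant_query_url(AUTH_RATE))
    print(instant_query_url(SESSION_P95))
```

The printed URLs can be fetched with any HTTP client; Prometheus responds with JSON containing a `status` field and the query result under `data`.
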
### 3. Task Dispatch
Push tasks to the `task_queue` list in Redis.

**Python Example:**
```python
import json

import redis

# Connect to the Redis instance that backs the task queue.
r = redis.from_url("redis://localhost:6379")

payload = {
    "type": "auth",
    "url": "https://example.com/login",
    "session_id": "session_001",
}

# Tasks are stored as JSON strings in the `task_queue` list.
r.rpush("task_queue", json.dumps(payload))
print("Task dispatched!")
```

**redis-cli Example:**
```bash
redis-cli LPUSH task_queue '{"type": "extract", "url": "https://example.com/data", "session_id": "session_001"}'
```

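Because tasks are plain JSON strings, it can help to validate them before they reach the queue. The following is a sketch under stated assumptions: `build_task`, `dispatch`, and `ALLOWED_TYPES` are hypothetical helpers, and the set of valid types is inferred only from the two examples in this section (`auth` and `extract`); the workers may accept other types or fields.

```python
import json

# Assumption: only the two task types shown in this README are valid.
ALLOWED_TYPES = {"auth", "extract"}


def build_task(task_type: str, url: str, session_id: str) -> str:
    """Validate and serialize a task with the three documented fields."""
    if task_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"url must be absolute: {url!r}")
    if not session_id:
        raise ValueError("session_id must be non-empty")
    return json.dumps(
        {"type": task_type, "url": url, "session_id": session_id}
    )


def dispatch(client, task_json: str, queue: str = "task_queue") -> int:
    """Append a task to the queue.

    `client` is any object with a redis-py style `rpush(name, value)`
    method, which returns the list length after the push.
    """
    return client.rpush(queue, task_json)
```

With redis-py this would be called as `dispatch(redis.from_url("redis://localhost:6379"), build_task("extract", "https://example.com/data", "session_001"))`. `RPUSH` and `LPUSH` add to opposite ends of the list, so which one yields first-in-first-out ordering depends on which end the workers pop from, which this README does not specify.
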
## Architecture

- `src/browser/`: Camoufox (Firefox/Chrome) manager for auth.
- `src/extractor/`: Curl Client for high-speed extraction.
- `src/core/`: Shared logic (Session, Scheduler, Recovery, Monitoring).
- `src/orchestrator/`: Worker loops and task management.

## Testing
Run unit tests:
```bash
./venv/bin/pytest tests/unit/
```