# FAEA: High-Fidelity Autonomous Extraction Agent

## Overview

FAEA is a hybrid extraction system designed to defeat advanced bot mitigation (Cloudflare, Akamai, etc.) using a "Headless-Plus" architecture. It combines full-browser fidelity (Camoufox/Playwright) for authentication with high-speed clients (curl_cffi) for data extraction.

## Features

- **Bifurcated Execution**: browser for authentication, curl for extraction.
- **TLS Fingerprint Alignment**: browser and extractor both mimic `Chrome/124`.
- **Evasion**:
  - **GhostCursor**: human-like mouse movements (Bezier curves, Fitts's Law).
  - **EntropyScheduler**: jittered request timing (Gaussian + phase drift).
  - **Mobile Proxy Rotation**: sticky session management.
- **Production Ready**:
  - Docker Swarm/Compose scaling.
  - Redis-backed persistent task queues.
  - Prometheus/Grafana monitoring.
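The EntropyScheduler's timing strategy can be sketched with the standard library alone. The function name and parameters below are illustrative, not the project's actual API: a base delay is perturbed by Gaussian noise and a slow sinusoidal phase drift, then clamped to a small floor.

```python
import math
import random


def jittered_delay(base: float, step: int, sigma: float = 0.3,
                   drift_amp: float = 0.5, drift_period: float = 50.0) -> float:
    """Base delay plus Gaussian jitter plus a slow sinusoidal phase drift."""
    noise = random.gauss(0.0, sigma * base)                          # Gaussian jitter
    drift = drift_amp * math.sin(2 * math.pi * step / drift_period)  # phase drift
    return max(0.05, base + noise + drift)                           # never fire instantly


# Example: delays (in seconds) for the first five requests of a session
delays = [jittered_delay(2.0, i) for i in range(5)]
```

The drift term shifts the *mean* of the delay over time, so inter-request intervals do not cluster around a fixed value the way pure Gaussian jitter does.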

## Getting Started

### Prerequisites

- Docker & Docker Compose
- Redis (optional; included in the compose stack)

### Quick Start (Dev)

```bash
docker-compose up --build
```

## Production Usage

### 1. Scaling the Cluster

The infrastructure is designed to scale horizontally.

```bash
# Scale to 5 browsers and 20 extractors
docker-compose up -d --scale camoufox-pool=5 --scale curl-pool=20
```
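Replica counts can also be pinned in the compose file itself so the ratio survives restarts. The service names mirror the scale command above; the `deploy.replicas` key assumes a Compose v3 file (honored by `docker compose up` and Swarm):

```yaml
services:
  camoufox-pool:
    deploy:
      replicas: 5    # full-fidelity browser workers (auth)
  curl-pool:
    deploy:
      replicas: 20   # lightweight curl_cffi extractors
```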

### 2. Monitoring

Access the dashboards:

- **Grafana**: `http://localhost:3000` (default credentials: admin/admin)
- **Prometheus**: `http://localhost:9090`
- **Key metrics**: authentication success rate, session duration, extraction throughput.
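For Prometheus to collect those metrics, the workers must be listed as scrape targets. A minimal `prometheus.yml` job might look like the following; the job name, service hostname, and metrics port are assumptions, not values taken from the repository:

```yaml
scrape_configs:
  - job_name: faea-workers
    scrape_interval: 15s
    static_configs:
      - targets: ["orchestrator:8000"]  # assumed port exposed by MetricsCollector
```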

### 3. Task Dispatch Configuration

Tasks are dispatched via the Redis `task_queue` list. Payload format:

```json
{
  "type": "auth",
  "url": "https://example.com/login",
  "session_id": "sess_123"
}
```
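A task matching that payload can be pushed from Python. This is a sketch, not the project's dispatcher: the helper accepts any object with an `lpush(key, value)` method, so in production you would pass a `redis.Redis` client (redis-py) instead of the stand-in queue used here for demonstration.

```python
import json


def dispatch_task(client, url: str, session_id: str, task_type: str = "auth") -> dict:
    """Build the payload documented above and LPUSH it onto task_queue.

    `client` is anything with lpush(key, value), e.g. redis.Redis (redis-py).
    """
    payload = {"type": task_type, "url": url, "session_id": session_id}
    client.lpush("task_queue", json.dumps(payload))
    return payload


# Stand-in queue so the example runs without a Redis server.
class _FakeQueue:
    def __init__(self):
        self.items = []

    def lpush(self, key, value):
        self.items.append((key, value))


q = _FakeQueue()
sent = dispatch_task(q, "https://example.com/login", "sess_123")
```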

## Architecture

- `src/browser/`: Camoufox (Firefox/Chrome) manager for authentication.
- `src/extractor/`: curl client for high-speed extraction.
- `src/core/`: shared logic (Session, Scheduler, Recovery).
- `src/orchestrator/`: worker loops and task management.
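The orchestrator's consume side is the mirror of the dispatch payload above. The sketch below shows only the routing step; the handler table and messages are illustrative (the real worker in `src/orchestrator/worker.py` would wrap this in a blocking-pop loop over the Redis queue, e.g. `brpop("task_queue")`).

```python
import json


def handle_task(task: dict) -> str:
    """Route a dequeued payload by its "type" field (illustrative routing table)."""
    handlers = {
        "auth": lambda t: f"auth via browser pool: {t['url']}",
        "extract": lambda t: f"extract via curl pool: {t['url']}",
    }
    try:
        return handlers[task["type"]](task)
    except KeyError:  # unknown type, or a payload missing required fields
        return f"unknown task type: {task.get('type')}"


# A raw queue entry, exactly as dispatched in the JSON payload format above.
raw = '{"type": "auth", "url": "https://example.com/login", "session_id": "sess_123"}'
result = handle_task(json.loads(raw))
```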

## Testing

Run unit tests:

```bash
./venv/bin/pytest tests/unit/
```