From 4ad9c7f99b74a0929241e08b2d8bb33129cc6a45 Mon Sep 17 00:00:00 2001 From: Luciabrightcode Date: Tue, 23 Dec 2025 13:20:07 +0800 Subject: [PATCH] docs: Release v1.0 Final Documentation - README.md: Add Production Usage, Configuration, and Monitoring guide - docs/ADD_v0.2.md: Architecture Definition v0.2 (Released v1.0) --- README.md | 72 ++- docs/ADD_v0.2.md | 1342 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 1389 insertions(+), 25 deletions(-) create mode 100644 docs/ADD_v0.2.md diff --git a/README.md b/README.md index bc65972..5360346 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,15 @@ -# FAEA: High-Fidelity Autonomous Extraction Agent +# FAEA: High-Fidelity Autonomous Extraction Agent (v1.0) ## Overview FAEA is a hybrid extraction system designed to defeat advanced bot mitigation (Cloudflare, Akamai, etc.) using a "Headless-Plus" architecture. It combines full-browser fidelity (Camoufox/Playwright) for authentication with high-speed clients (curl_cffi) for data extraction. +**Status**: Released v1.0 +**Docs**: [Architecture Definition v0.2](docs/ADD_v0.2.md) + ## Features - **Bifurcated Execution**: Browser for Auth, Curl for Extraction. - **TLS Fingerprint Alignment**: Browser and Extractor both mimic `Chrome/124`. -- **Evasion**: +- **Evasion Layer**: - **GhostCursor**: Human-like mouse movements (Bezier curves, Fitts's Law). - **EntropyScheduler**: Jittered request timing (Gaussian + Phase Drift). - **Mobile Proxy Rotation**: Sticky session management. @@ -15,48 +18,67 @@ FAEA is a hybrid extraction system designed to defeat advanced bot mitigation (C - Redis-backed persistent task queues. - Prometheus/Grafana monitoring. -## Getting Started +## Configuration -### Prerequisites -- Docker & Docker Compose -- Redis (optional, included in compose) +### Environment Variables +| Variable | Description | Default | +|----------|-------------|---------| +| `REDIS_URL` | Connection string for Redis | `redis://redis:6379` | +| `BROWSERFORGE_SEED` | Seed for consistent canvas fingerprinting | (Optional) | +| `PROXY_API_KEY` | API Key for mobile proxy provider | (Required for production) | -### Quick Start (Dev) -```bash -docker-compose up --build -``` +### Resource Requirements +- **Camoufox**: Requires `shm_size: 2gb` to prevent Chrome crashing on complex pages. +- **Memory**: Ensure host has at least 4GB RAM for a basic 5-browser cluster. ## Production Usage ### 1. Scaling the Cluster -The infrastructure is designed to scale horizontally. +Start the stack with recommended production replicas: ```bash -# Scale to 5 Browsers and 20 Extractors docker-compose up -d --scale camoufox-pool=5 --scale curl-pool=20 ``` ### 2. Monitoring -Access the dashboards: -- **Grafana**: `http://localhost:3000` (Default creds: admin/admin) -- **Prometheus**: `http://localhost:9090` -- **Metrics**: Authentication Success Rate, Session Duration, Extraction Throughput. +Access the observability stack: +- **Grafana**: `http://localhost:3000` (Default: `admin` / `admin`). + - Dashboards: "FAEA Overview", "Extraction Health". +- **Prometheus**: `http://localhost:9090`. +- **Metrics**: + - `auth_attempts_total`: Success/Failure counters. + - `session_duration_seconds`: Histogram of session validity. -### 3. Task Dispatch configuration -Tasks are dispatched via Redis `task_queue` list. -Payload format: -```json -{ - "type": "auth", - "url": "https://example.com/login", - "session_id": "sess_123" +### 3. Task Dispatch +Push tasks to the `task_queue` in Redis. 
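+Each task payload is a JSON object with three fields: `type` (`auth` or `extract`), `url`, and `session_id` (the key used for sticky proxy assignment and session-state lookup), as shown in the examples below.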
+
+**Python Example:**
+```python
+import redis
+import json
+
+r = redis.from_url("redis://localhost:6379")
+
+payload = {
+    "type": "auth",
+    "url": "https://example.com/login",
+    "session_id": "session_001"
+}
+
+r.rpush("task_queue", json.dumps(payload))
+print("Task dispatched!")
+```
+
+**redis-cli Example:**
+```bash
+redis-cli RPUSH task_queue '{"type": "extract", "url": "https://example.com/data", "session_id": "session_001"}'
+```
 
 ## Architecture
 - `src/browser/`: Camoufox (Firefox/Chrome) manager for auth.
 - `src/extractor/`: Curl Client for high-speed extraction.
-- `src/core/`: Shared logic (Session, Scheduler, Recovery).
+- `src/core/`: Shared logic (Session, Scheduler, Recovery, Monitoring).
 - `src/orchestrator/`: Worker loops and task management.
 
 ## Testing
diff --git a/docs/ADD_v0.2.md b/docs/ADD_v0.2.md
new file mode 100644
index 0000000..41c8735
--- /dev/null
+++ b/docs/ADD_v0.2.md
@@ -0,0 +1,1342 @@
+# Architecture Definition Document (ADD) v0.2
+
+**Project:** FAEA (High-Fidelity Autonomous Extraction Agent)
+**Version:** 0.2 (RELEASED v1.0)
+**Date:** 2025-12-23
+**Status:** APPROVED
+**Classification:** Technical Architecture Blueprint
+**Author:** Principal System Architect & Distinguished Engineer
+
+---
+
+## 1. Executive Summary
+
+This document defines the architecture for a **High-Fidelity Autonomous Extraction Agent** employing a hybrid "Headless-Plus" methodology. The system is engineered to defeat advanced bot mitigation systems (Cloudflare Turnstile, Akamai Bot Manager, Datadome) through multi-layered behavioral mimicry, TLS fingerprint consistency, and entropy-maximized request scheduling.
+
+### 1.1 Architectural Philosophy
+
+The core innovation lies in the **bifurcated execution model**:
+
+1. **Heavy Lifting Phase (Camoufox):** Full browser context for authentication, CAPTCHA solving, and session establishment. This phase prioritizes fidelity over throughput.
+2. **Extraction Phase (curl_cffi):** Stateless, high-velocity API requests using inherited session state and matching TLS fingerprints. This phase prioritizes throughput over complexity.
+
+The handover protocol between these subsystems is the critical junction where most naive implementations fail. Our architecture treats this transition as a **stateful serialization problem** with cryptographic verification.
+
+### 1.2 Threat Model
+
+We assume adversarial detection systems employ:
+
+- **Behavioral Biometrics:** Mouse trajectory analysis, keystroke dynamics, scroll entropy
+- **TLS Fingerprinting:** JA3/JA4 hash validation, ALPN mismatch detection
+- **Temporal Analysis:** Request rate anomalies, clock skew detection
+- **IP Reputation Scoring:** ASN reputation, CGNAT variance, geolocation consistency
+- **Canvas/WebGL Fingerprinting:** Hardware-derived entropy harvesting
+- **Session Replay Analysis:** DOM mutation rate, event ordering validation
+
+---
+
+## 2. 
System Context Diagram + +```mermaid +graph TB + subgraph "Control Plane" + A[Orchestrator Service] + B[BrowserForge Profile Generator] + C[Scheduler with Clock Drift] + end + + subgraph "Execution Plane" + D[Camoufox Manager Pool] + E[curl_cffi Client Pool] + F[Ghost Cursor Engine] + end + + subgraph "Infrastructure Layer" + G[Mobile Proxy Network 4G/5G CGNAT] + H[Session State Store Redis] + I[Docker Swarm Cluster] + end + + subgraph "Target Infrastructure" + J[Cloudflare/Akamai WAF] + K[Origin Server] + end + + A -->|Profile Assignment| B + B -->|Fingerprint Package| D + A -->|Task Dispatch| C + C -->|Browser Task| D + C -->|API Task| E + D -->|Behavioral Input| F + D -->|Session State| H + H -->|Token Retrieval| E + D -->|Requests| G + E -->|Requests| G + G -->|Traffic| J + J -->|Validated| K + I -->|Container Orchestration| D + I -->|Container Orchestration| E +``` + +--- + +## 3. Component Architecture + +### 3.1 The Browser Manager (Camoufox) + +**Responsibility:** Establish authenticated sessions with maximum behavioral fidelity. + +#### 3.1.1 Lifecycle State Machine + +``` +[COLD] → [WARMING] → [AUTHENTICATED] → [TOKEN_EXTRACTED] → [TERMINATED] + ↓ ↑ + └─────────────────── [FAILED] ──────────────────────────────┘ +``` + +#### 3.1.2 Implementation Pseudo-Logic + +```python +class CamoufoxManager: + def __init__(self, profile: BrowserForgeProfile): + self.profile = profile + self.context = None + self.page = None + self.ghost_cursor = GhostCursorEngine() + + async def initialize(self): + """ + Inject BrowserForge profile into Camoufox launch parameters. + Critical: Match TLS fingerprint to User-Agent. + """ + launch_options = { + 'args': self._build_chrome_args(), + 'fingerprint': self.profile.to_camoufox_fingerprint(), + 'proxy': self._get_mobile_proxy(), + 'viewport': self.profile.viewport, + 'locale': self.profile.locale, + 'timezone': self.profile.timezone, + } + + # Inject canvas/WebGL noise based on hardware profile + self.context = await playwright.chromium.launch(**launch_options) + self.page = await self.context.new_page() + + # Override navigator properties for consistency + await self._inject_navigator_overrides() + await self._inject_webgl_vendor() + + async def _inject_navigator_overrides(self): + """ + Ensure navigator.hardwareConcurrency, deviceMemory, etc. + match the BrowserForge profile's hardware constraints. + """ + await self.page.add_init_script(f""" + Object.defineProperty(navigator, 'hardwareConcurrency', {{ + get: () => {self.profile.hardware_concurrency} + }}); + Object.defineProperty(navigator, 'deviceMemory', {{ + get: () => {self.profile.device_memory} + }}); + """) + + async def solve_authentication(self, target_url: str): + """ + Navigate with human-like behavior: + 1. Random delay before navigation (2-7s) + 2. Mouse movement to URL bar simulation + 3. Keystroke dynamics for typing URL + 4. 
Random scroll and mouse drift post-load + """ + await asyncio.sleep(random.uniform(2.0, 7.0)) + await self.ghost_cursor.move_to_url_bar(self.page) + await self.page.goto(target_url, wait_until='networkidle') + + # Post-load entropy injection + await self._simulate_reading_behavior() + + async def _simulate_reading_behavior(self): + """ + Human reading heuristics: + - F-pattern eye tracking simulation via scroll + - Random pauses at headings + - Micro-movements during "reading" + """ + scroll_points = self._generate_f_pattern_scroll() + for point in scroll_points: + await self.page.evaluate(f"window.scrollTo(0, {point})") + await self.ghost_cursor.random_micro_movement() + await asyncio.sleep(random.lognormal(0.8, 0.3)) + + async def extract_session_state(self) -> SessionState: + """ + Serialize all stateful artifacts for handover: + - Cookies (including HttpOnly) + - LocalStorage + - SessionStorage + - IndexedDB keys + - Service Worker registrations + """ + cookies = await self.context.cookies() + local_storage = await self.page.evaluate("() => Object.entries(localStorage)") + session_storage = await self.page.evaluate("() => Object.entries(sessionStorage)") + + # Critical: Capture Cloudflare challenge tokens + cf_clearance = next((c for c in cookies if c['name'] == 'cf_clearance'), None) + + return SessionState( + cookies=cookies, + local_storage=dict(local_storage), + session_storage=dict(session_storage), + cf_clearance=cf_clearance, + user_agent=self.profile.user_agent, + tls_fingerprint=self.profile.tls_fingerprint, + timestamp=time.time() + ) +``` + +#### 3.1.3 Entropy Maximization Strategy + +To defeat temporal analysis, we introduce **jittered scheduling** modeled as a log-normal distribution: + +$$ +\Delta t \sim \text{LogNormal}(\mu = 3.2, \sigma = 0.8) +$$ + +Where $\Delta t$ represents inter-request delay in seconds. This mirrors empirical human behavior distributions from HCI research (Card et al., 1983). + +--- + +### 3.2 The Network Bridge (Handover Protocol) + +**Critical Design Constraint:** The TLS fingerprint of `curl_cffi` must match the JA3 signature that Camoufox presented during authentication. + +#### 3.2.1 State Serialization Schema + +```python +@dataclass +class SessionState: + cookies: List[Dict[str, Any]] + local_storage: Dict[str, str] + session_storage: Dict[str, str] + cf_clearance: Optional[Dict[str, Any]] + user_agent: str + tls_fingerprint: str # e.g., "chrome120" + timestamp: float + + def to_redis_key(self, session_id: str) -> str: + return f"session:{session_id}:state" + + def serialize(self) -> bytes: + """ + Serialize with MessagePack for compact representation. + Include HMAC for integrity verification. + """ + payload = msgpack.packb({ + 'cookies': self.cookies, + 'local_storage': self.local_storage, + 'session_storage': self.session_storage, + 'cf_clearance': self.cf_clearance, + 'user_agent': self.user_agent, + 'tls_fingerprint': self.tls_fingerprint, + 'timestamp': self.timestamp, + }) + hmac_sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest() + return hmac_sig + payload +``` + +#### 3.2.2 curl_cffi Client Configuration + +```python +class CurlCffiClient: + def __init__(self, session_state: SessionState): + self.session_state = session_state + self.session = AsyncSession(impersonate=session_state.tls_fingerprint) + + async def initialize(self): + """ + Configure curl_cffi to match Camoufox's network signature. 
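+        Note: the AsyncSession created in __init__ uses impersonate=session_state.tls_fingerprint
+        (e.g. "chrome120"), so the extractor's JA3 matches the one the browser presented.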
+ """ + # Inject cookies + for cookie in self.session_state.cookies: + self.session.cookies.set( + name=cookie['name'], + value=cookie['value'], + domain=cookie['domain'], + path=cookie.get('path', '/'), + secure=cookie.get('secure', False), + ) + + # Build header profile from BrowserForge + self.headers = { + 'User-Agent': self.session_state.user_agent, + 'Accept': 'application/json, text/plain, */*', + 'Accept-Language': 'en-US,en;q=0.9', + 'Accept-Encoding': 'gzip, deflate, br', + 'Referer': 'https://target.com/', + 'Origin': 'https://target.com', + 'Sec-Fetch-Dest': 'empty', + 'Sec-Fetch-Mode': 'cors', + 'Sec-Fetch-Site': 'same-origin', + 'sec-ch-ua': self._derive_sec_ch_ua(), + 'sec-ch-ua-mobile': '?0', + 'sec-ch-ua-platform': '"Windows"', + } + + def _derive_sec_ch_ua(self) -> str: + """ + Derive sec-ch-ua from User-Agent to ensure consistency. + Example: Chrome/120.0.6099.109 → "Chromium";v="120", "Google Chrome";v="120" + """ + # Parse User-Agent and construct matching sec-ch-ua + # This is critical—mismatches trigger instant flagging + pass + + async def fetch(self, url: str, method: str = 'GET', **kwargs): + """ + Execute request with TLS fingerprint matching browser. + Include random delays modeled on human API interaction. + """ + await asyncio.sleep(random.lognormal(0.2, 0.1)) + + response = await self.session.request( + method=method, + url=url, + headers=self.headers, + **kwargs + ) + + # Verify we're not challenged + if 'cf-mitigated' in response.headers: + raise SessionInvalidatedError("Cloudflare challenge detected") + + return response +``` + +#### 3.2.3 Handover Sequence Diagram + +```mermaid +sequenceDiagram + participant O as Orchestrator + participant C as Camoufox + participant R as Redis Store + participant Curl as curl_cffi Client + participant T as Target API + + O->>C: Dispatch Auth Task + C->>T: Navigate + Solve Challenge + T-->>C: Set Cookies + Challenge Token + C->>C: Extract Session State + C->>R: Serialize State to Redis + C->>O: Signal Ready + O->>Curl: Dispatch Extraction Task + Curl->>R: Retrieve Session State + Curl->>Curl: Configure TLS + Headers + Curl->>T: API Request (with cookies) + T-->>Curl: JSON Response + Curl->>O: Deliver Payload +``` + +--- + +### 3.3 The Scheduler (Clock Drift & Rotation Logic) + +**Design Principle:** Deterministic scheduling reveals automation. We introduce controlled chaos. + +#### 3.3.1 Clock Drift Implementation + +Adversarial systems analyze request timestamps for periodicity. We inject **Gaussian noise** into task dispatch: + +$$ +t_{\text{actual}} = t_{\text{scheduled}} + \mathcal{N}(0, \sigma^2) +$$ + +Where $\sigma = 5$ seconds. Additionally, we implement **phase shift rotation** to avoid harmonic patterns: + +```python +class EntropyScheduler: + def __init__(self, base_interval: float = 30.0): + self.base_interval = base_interval + self.phase_offset = 0.0 + self.drift_sigma = 5.0 + + def next_execution_time(self) -> float: + """ + Calculate next execution with drift and phase rotation. + """ + # Base interval with Gaussian noise + noisy_interval = self.base_interval + random.gauss(0, self.drift_sigma) + + # Phase shift accumulation (simulates human circadian variance) + self.phase_offset += random.uniform(-0.5, 0.5) + + # Clamp to reasonable bounds + next_time = max(5.0, noisy_interval + self.phase_offset) + + return time.time() + next_time + + async def dispatch_with_entropy(self, task: Callable): + """ + Execute task at entropic time with pre-task jitter. 
+ """ + execution_time = self.next_execution_time() + await asyncio.sleep(execution_time - time.time()) + + # Pre-execution jitter (simulate human hesitation) + await asyncio.sleep(random.uniform(0.1, 0.8)) + + await task() +``` + +#### 3.3.2 Proxy Rotation Strategy + +Mobile proxies provide high IP reputation but require careful rotation to avoid correlation: + +```python +class MobileProxyRotator: + def __init__(self, proxy_pool: List[str]): + self.proxy_pool = proxy_pool + self.usage_history = {} + self.cooldown_period = 300 # 5 minutes + + def select_proxy(self, session_id: str) -> str: + """ + Sticky session assignment with cooldown enforcement. + + Rule: Same session_id always gets same proxy (until cooldown). + Prevents mid-session IP changes which trigger fraud alerts. + """ + if session_id in self.usage_history: + proxy, last_used = self.usage_history[session_id] + if time.time() - last_used < self.cooldown_period: + return proxy + + # Select least-recently-used proxy + available = [p for p in self.proxy_pool + if self._is_cooled_down(p)] + + if not available: + raise ProxyExhaustionError("No proxies available") + + proxy = min(available, key=lambda p: self._last_use_time(p)) + self.usage_history[session_id] = (proxy, time.time()) + + return proxy + + def _is_cooled_down(self, proxy: str) -> bool: + """Check if proxy has completed cooldown period.""" + if proxy not in self.usage_history: + return True + _, last_used = self.usage_history[proxy] + return time.time() - last_used > self.cooldown_period +``` + +--- + +## 4. Data Flow Description + +### 4.1 Cold Boot Sequence + +``` +[START] + | + v +1. Orchestrator requests fingerprint from BrowserForge + - OS: Windows 11 + - Browser: Chrome 120.0.6099.109 + - Screen: 1920x1080 + - Hardware: Intel i7, 16GB RAM + | + v +2. BrowserForge generates deterministic profile + - TLS fingerprint: chrome120 + - Canvas noise seed: 0x3f2a9c + - WebGL vendor: "ANGLE (Intel, Intel(R) UHD Graphics 620)" + - User-Agent + sec-ch-ua alignment verified + | + v +3. Camoufox container instantiated with profile + - Docker: camoufox:latest + - Proxy: Mobile 4G (AT&T, Chicago) + - Memory limit: 2GB + - CPU limit: 2 cores + | + v +4. Ghost Cursor engine initialized + - Bezier curve generator seeded + - Velocity profile: human-average (200-400 px/s) + | + v +5. Navigation to target with behavioral simulation + - Pre-navigation delay: 4.2s + - Mouse hover on URL bar: 0.3s + - Typing simulation: 12 keystrokes at 180ms intervals + - Page load wait: networkidle + | + v +6. Challenge detection and solving + - If Cloudflare: Wait for Turnstile, interact if required + - If CAPTCHA: Delegate to 2Captcha/CapSolver + - Monitor for cf_clearance cookie + | + v +7. Post-authentication behavior + - Random scroll (F-pattern) + - Mouse micro-movements: 8-12 per scroll + - Time on page: 15-30s (lognormal distribution) + | + v +8. Session state extraction + - 23 cookies captured (including HttpOnly) + - cf_clearance: present, expires in 1800s + - localStorage: 4 keys + - sessionStorage: 2 keys + | + v +9. State serialization to Redis + - Key: session:a3f9c2d1:state + - HMAC: verified + - TTL: 1500s (before cookie expiration) + | + v +10. Camoufox container terminated + - Browser context closed + - Memory freed + - Proxy connection released to cooldown +``` + +### 4.2 Extraction Phase Sequence + +``` +[TRIGGER: API extraction task] + | + v +1. 
curl_cffi client initialized + - Retrieves session state from Redis + - Configures TLS fingerprint: chrome120 + - Injects 23 cookies + - Sets headers with sec-ch-ua consistency + | + v +2. Scheduler calculates next execution time + - Base interval: 30s + - Gaussian noise: +3.7s + - Phase offset: -0.2s + - Actual delay: 33.5s + | + v +3. Pre-request jitter applied + - Random delay: 0.4s + | + v +4. API request dispatched + - Method: GET + - URL: https://api.target.com/v1/data + - Headers: 14 headers set + - TLS: JA3 matches browser session + | + v +5. Response validation + - Status: 200 OK + - cf-mitigated header: absent + - JSON payload: 2.3 MB + | + v +6. Payload delivered to data pipeline + - Parsed and validated + - Stored in time-series database + | + v +7. Next iteration scheduled + - Session state TTL checked + - If < 300s remaining: trigger re-authentication + - Else: continue extraction phase +``` + +--- + +## 5. Entropy & Evasion Strategy + +### 5.1 BrowserForge Profile Mapping + +**Critical Constraint:** Every profile component must exhibit statistical correlation. + +```python +class BrowserForgeProfileValidator: + """ + Validates that generated profiles exhibit internally consistent + statistical properties to avoid fingerprint contradictions. + """ + + def validate(self, profile: BrowserForgeProfile) -> ValidationResult: + checks = [ + self._check_user_agent_sec_ch_consistency(profile), + self._check_viewport_screen_consistency(profile), + self._check_hardware_memory_consistency(profile), + self._check_timezone_locale_consistency(profile), + self._check_tls_browser_version_consistency(profile), + ] + + failures = [c for c in checks if not c.passed] + return ValidationResult(passed=len(failures) == 0, failures=failures) + + def _check_user_agent_sec_ch_consistency(self, profile): + """ + User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) + AppleWebKit/537.36 (KHTML, like Gecko) + Chrome/120.0.6099.109 Safari/537.36 + + sec-ch-ua: "Not_A Brand";v="8", "Chromium";v="120", + "Google Chrome";v="120" + + These MUST align or instant detection occurs. + """ + ua_version = self._extract_chrome_version(profile.user_agent) + ch_version = self._extract_ch_ua_version(profile.sec_ch_ua) + + return ValidationCheck( + name="UA/sec-ch-ua consistency", + passed=(ua_version == ch_version), + details=f"UA: {ua_version}, sec-ch-ua: {ch_version}" + ) + + def _check_viewport_screen_consistency(self, profile): + """ + Viewport should be smaller than screen resolution. + Typical browser chrome: 70-120px vertical offset. + """ + viewport_height = profile.viewport['height'] + screen_height = profile.screen['height'] + chrome_height = screen_height - viewport_height + + # Reasonable browser chrome range + valid = 50 <= chrome_height <= 150 + + return ValidationCheck( + name="Viewport/Screen consistency", + passed=valid, + details=f"Chrome height: {chrome_height}px" + ) + + def _check_hardware_memory_consistency(self, profile): + """ + deviceMemory should align with hardwareConcurrency. + Typical ratios: 2GB per core for consumer hardware. + """ + memory_gb = profile.device_memory + cores = profile.hardware_concurrency + ratio = memory_gb / cores + + # Consumer hardware typically 1-4 GB per core + valid = 1 <= ratio <= 4 + + return ValidationCheck( + name="Hardware/Memory consistency", + passed=valid, + details=f"Ratio: {ratio:.2f} GB/core" + ) +``` + +### 5.2 Ghost Cursor Bezier Implementation + +Human mouse movement exhibits **submovement composition** (Meyer et al., 1988). 
We model this with composite Bezier curves: + +```python +class GhostCursorEngine: + def __init__(self): + self.velocity_profile = self._load_human_velocity_distribution() + + async def move_to(self, page: Page, target_x: int, target_y: int): + """ + Generate human-like trajectory using composite Bezier curves + with velocity-based submovement decomposition. + """ + current_x, current_y = await self._get_cursor_position(page) + + # Calculate distance for submovement count + distance = math.sqrt((target_x - current_x)**2 + + (target_y - current_y)**2) + + # Human submovements: 1-3 for short distances, up to 5 for long + num_submovements = min(5, max(1, int(distance / 300))) + + waypoints = self._generate_waypoints( + (current_x, current_y), + (target_x, target_y), + num_submovements + ) + + for i in range(len(waypoints) - 1): + await self._execute_submovement(page, waypoints[i], waypoints[i+1]) + + def _generate_waypoints(self, start, end, count): + """ + Generate intermediate waypoints with Gaussian perturbation + to simulate motor control noise. + """ + waypoints = [start] + + for i in range(1, count): + t = i / count + # Linear interpolation with perpendicular noise + x = start[0] + t * (end[0] - start[0]) + y = start[1] + t * (end[1] - start[1]) + + # Add perpendicular noise (overshooting) + angle = math.atan2(end[1] - start[1], end[0] - start[0]) + perp_angle = angle + math.pi / 2 + noise_magnitude = random.gauss(0, 10) + + x += noise_magnitude * math.cos(perp_angle) + y += noise_magnitude * math.sin(perp_angle) + + waypoints.append((x, y)) + + waypoints.append(end) + return waypoints + + async def _execute_submovement(self, page, start, end): + """ + Execute single submovement with velocity profile matching + Fitts's Law: T = a + b * log2(D/W + 1) + """ + distance = math.sqrt((end[0] - start[0])**2 + (end[1] - start[1])**2) + + # Generate Bezier control points + control1, control2 = self._generate_bezier_controls(start, end) + + # Calculate movement time from Fitts's Law + a, b = 0.1, 0.15 # Empirical constants + movement_time = a + b * math.log2(distance / 10 + 1) + + # Sample Bezier curve + steps = max(10, int(distance / 5)) + for i in range(steps + 1): + t = i / steps + point = self._bezier_point(t, start, control1, control2, end) + + await page.mouse.move(point[0], point[1]) + await asyncio.sleep(movement_time / steps) + + def _bezier_point(self, t, p0, p1, p2, p3): + """Cubic Bezier curve evaluation.""" + x = (1-t)**3 * p0[0] + 3*(1-t)**2*t * p1[0] + \ + 3*(1-t)*t**2 * p2[0] + t**3 * p3[0] + y = (1-t)**3 * p0[1] + 3*(1-t)**2*t * p1[1] + \ + 3*(1-t)*t**2 * p2[1] + t**3 * p3[1] + return (x, y) + + async def random_micro_movement(self): + """ + Simulate fidgeting during reading: + Small, low-velocity movements (drift). + """ + drift_x = random.gauss(0, 15) + drift_y = random.gauss(0, 15) + # Execute slowly (low velocity indicates inattention) + # Implementation omitted for brevity +``` + +### 5.3 Clock Drift Mathematical Model + +To avoid temporal fingerprinting, we model human variance in task execution: + +$$ +\begin{aligned} +T_{\text{base}} &= 30 \text{ seconds (target interval)} \\ +T_{\text{actual}} &= T_{\text{base}} + \Delta_{\text{gauss}} + \Delta_{\text{phase}} \\ +\Delta_{\text{gauss}} &\sim \mathcal{N}(0, \sigma^2), \quad \sigma = 5 \\ +\Delta_{\text{phase}} &= \phi(t), \quad \phi(t + \Delta t) = \phi(t) + \mathcal{U}(-0.5, 0.5) +\end{aligned} +$$ + +The phase term $\phi(t)$ introduces **low-frequency drift** that prevents harmonic detection. 
Additionally, we clamp to ensure biological plausibility: + +$$ +T_{\text{final}} = \max(5, \min(120, T_{\text{actual}})) +$$ + +--- + +## 6. Infrastructure & DevOps + +### 6.1 Docker Containerization Strategy + +```yaml +# docker-compose.yml +version: '3.8' + +services: + orchestrator: + image: extraction-agent/orchestrator:latest + environment: + - REDIS_URL=redis://redis:6379 + - PROXY_API_KEY=${PROXY_API_KEY} + depends_on: + - redis + deploy: + replicas: 1 + + camoufox-pool: + image: extraction-agent/camoufox:latest + environment: + - BROWSERFORGE_SEED=${BROWSERFORGE_SEED} + - REDIS_URL=redis://redis:6379 + shm_size: 2gb # Critical: shared memory for Chrome + deploy: + replicas: 5 + resources: + limits: + cpus: '2' + memory: 2G + volumes: + - /dev/shm:/dev/shm # Avoid disk I/O for Chrome + + curl-pool: + image: extraction-agent/curl-cffi:latest + environment: + - REDIS_URL=redis://redis:6379 + deploy: + replicas: 20 # Higher concurrency for lightweight clients + resources: + limits: + cpus: '0.5' + memory: 512M + + redis: + image: redis:7-alpine + command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru + volumes: + - redis-data:/data + +volumes: + redis-data: +``` + +### 6.2 Dockerfile for Camoufox Container + +```dockerfile +FROM python:3.11-slim + +# Install dependencies for Playwright + Camoufox +RUN apt-get update && apt-get install -y \ + wget gnupg ca-certificates \ + fonts-liberation libasound2 libatk-bridge2.0-0 \ + libatk1.0-0 libcups2 libdbus-1-3 libgdk-pixbuf2.0-0 \ + libnspr4 libnss3 libx11-xcb1 libxcomposite1 \ + libxdamage1 libxrandr2 xdg-utils \ + && rm -rf /var/lib/apt/lists/* + +WORKDIR /app + +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Install Playwright browsers +RUN playwright install chromium + +# Install Camoufox +RUN pip install camoufox + +COPY . . + +# Use tini to handle zombie processes +RUN apt-get update && apt-get install -y tini +ENTRYPOINT ["/usr/bin/tini", "--"] + +CMD ["python", "camoufox_worker.py"] +``` + +### 6.3 CI/CD Pipeline for Browser Binary Updates + +```yaml +# .github/workflows/update-browsers.yml +name: Update Browser Binaries + +on: + schedule: + - cron: '0 2 * * 1' # Weekly on Monday 2 AM + workflow_dispatch: + +jobs: + update-and-test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Update Playwright + run: | + pip install -U playwright + playwright install chromium + + - name: Update Camoufox + run: pip install -U camoufox + + - name: Extract Browser Versions + id: versions + run: | + CHROME_VERSION=$(playwright chromium --version) + echo "chrome=$CHROME_VERSION" >> $GITHUB_OUTPUT + + - name: Update BrowserForge Profiles + run: python scripts/update_fingerprints.py --chrome-version ${{ steps.versions.outputs.chrome }} + + - name: Run Fingerprint Tests + run: pytest tests/test_fingerprint_consistency.py + + - name: Build and Push Docker Images + run: | + docker build -t extraction-agent/camoufox:${{ steps.versions.outputs.chrome }} . + docker push extraction-agent/camoufox:${{ steps.versions.outputs.chrome }} + docker tag extraction-agent/camoufox:${{ steps.versions.outputs.chrome }} extraction-agent/camoufox:latest + docker push extraction-agent/camoufox:latest +``` + +### 6.4 Mobile Proxy Integration + +```python +class MobileProxyProvider: + """ + Integration with mobile proxy providers (e.g., Oxylabs, Smartproxy). + Leverages CGNAT for high IP reputation. 
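+    The URL scheme used below (user-APIKEY-country-CC-session-ID:pass@mobile-proxy.oxylabs.io:7777)
+    is provider-specific and shown for illustration only; substitute your provider's format.
+
+    Illustrative usage:
+        provider = MobileProxyProvider(api_key="KEY", country_code="us")
+        proxy_url = provider.get_proxy_for_session("session_001", sticky=True)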
+ """ + + def __init__(self, api_key: str, country_code: str = 'us'): + self.api_key = api_key + self.country_code = country_code + self.session_cache = {} + + def get_proxy_for_session(self, session_id: str, sticky: bool = True) -> str: + """ + Obtain proxy URL with session persistence. + + sticky=True: Same session_id returns same IP (via session parameter) + sticky=False: Each call rotates IP + """ + if sticky and session_id in self.session_cache: + return self.session_cache[session_id] + + # Format: http://user-APIKEY-country-US-session-SESSION:pass@proxy.provider.com:7777 + if sticky: + proxy_url = ( + f"http://user-{self.api_key}-country-{self.country_code}" + f"-session-{session_id}:pass@mobile-proxy.oxylabs.io:7777" + ) + self.session_cache[session_id] = proxy_url + else: + # Omit session parameter for rotation + proxy_url = ( + f"http://user-{self.api_key}-country-{self.country_code}" + f":pass@mobile-proxy.oxylabs.io:7777" + ) + + return proxy_url + + def release_session(self, session_id: str): + """Release sticky session to allow cooldown.""" + if session_id in self.session_cache: + del self.session_cache[session_id] +``` + +--- + +## 7. Advanced Evasion Techniques + +### 7.1 DOM Mutation Rate Control + +Adversarial systems analyze the rate of DOM mutations to detect automation. Legitimate users interact with UI elements progressively; bots often mutate the DOM at superhuman speeds. + +```python +class DOMInteractionThrottler: + """ + Ensure DOM mutations occur at human-plausible rates. + """ + + async def click_with_throttle(self, page: Page, selector: str): + """ + Click with pre-hover delay and post-click pause. + """ + element = await page.wait_for_selector(selector) + + # Pre-hover delay (humans don't click instantly) + await self.ghost_cursor.move_to_element(page, element) + await asyncio.sleep(random.uniform(0.3, 0.8)) + + # Click + await element.click() + + # Post-click pause (reaction time to visual feedback) + await asyncio.sleep(random.uniform(0.2, 0.5)) + + async def fill_form_with_throttle(self, page: Page, selector: str, text: str): + """ + Type with keystroke dynamics. + """ + await page.focus(selector) + + for char in text: + await page.keyboard.type(char) + # Inter-keystroke interval: 80-200ms (human typing speed) + await asyncio.sleep(random.uniform(0.08, 0.2)) + + # Pause after completion (review behavior) + await asyncio.sleep(random.uniform(0.5, 1.5)) +``` + +### 7.2 Canvas Fingerprint Noise Injection + +Canvas fingerprinting generates unique hardware signatures. We inject **deterministic noise** based on the BrowserForge profile seed: + +```python +class CanvasNoiseInjector: + def __init__(self, seed: int): + self.rng = random.Random(seed) + + def generate_injection_script(self) -> str: + """ + Generate JavaScript to inject into page context. + Modifies canvas rendering at sub-pixel level. 
+ """ + # Generate deterministic noise array + noise = [self.rng.gauss(0, 0.0001) for _ in range(256)] + + return f""" + (() => {{ + const noise = {noise}; + const originalGetImageData = CanvasRenderingContext2D.prototype.getImageData; + + CanvasRenderingContext2D.prototype.getImageData = function(...args) {{ + const imageData = originalGetImageData.apply(this, args); + + // Inject sub-pixel noise + for (let i = 0; i < imageData.data.length; i++) {{ + imageData.data[i] += Math.floor(noise[i % 256] * 255); + }} + + return imageData; + }}; + }})(); + """ +``` + +### 7.3 WebGL Vendor Spoofing + +```python +async def inject_webgl_spoofing(page: Page, vendor: str, renderer: str): + """ + Override WebGL parameters to match hardware profile. + """ + await page.add_init_script(f""" + const getParameter = WebGLRenderingContext.prototype.getParameter; + WebGLRenderingContext.prototype.getParameter = function(parameter) {{ + if (parameter === 37445) {{ // UNMASKED_VENDOR_WEBGL + return '{vendor}'; + }} + if (parameter === 37446) {{ // UNMASKED_RENDERER_WEBGL + return '{renderer}'; + }} + return getParameter.call(this, parameter); + }}; + """) +``` + +--- + +## 8. Monitoring & Observability + +### 8.1 Metrics Collection + +```python +from prometheus_client import Counter, Histogram, Gauge + +# Define metrics +auth_attempts = Counter('auth_attempts_total', 'Authentication attempts', ['result']) +session_duration = Histogram('session_duration_seconds', 'Session lifespan') +challenge_rate = Gauge('challenge_rate', 'Rate of challenges encountered') +extraction_throughput = Counter('extraction_requests_total', 'API extractions', ['status']) + +class MetricsCollector: + @staticmethod + def record_auth_success(): + auth_attempts.labels(result='success').inc() + + @staticmethod + def record_auth_failure(reason: str): + auth_attempts.labels(result=reason).inc() + + @staticmethod + def record_session_lifetime(duration: float): + session_duration.observe(duration) + + @staticmethod + def update_challenge_rate(rate: float): + challenge_rate.set(rate) +``` + +### 8.2 Alerting Rules + +```yaml +# prometheus-alerts.yml +groups: + - name: extraction_agent + interval: 30s + rules: + - alert: HighChallengeRate + expr: challenge_rate > 0.3 + for: 5m + annotations: + summary: "Challenge rate exceeds 30%" + description: "Fingerprint may be burned, rotate profiles" + + - alert: SessionDurationDrop + expr: rate(session_duration_seconds_sum[5m]) < 600 + for: 10m + annotations: + summary: "Average session duration dropped below 10 minutes" + description: "Sessions being invalidated prematurely" + + - alert: AuthFailureSpike + expr: rate(auth_attempts_total{result!="success"}[5m]) > 0.5 + for: 5m + annotations: + summary: "Authentication failure rate > 50%" + description: "Possible detection or proxy issues" +``` + +--- + +## 9. 
Security Considerations + +### 9.1 Session State Encryption + +All session state stored in Redis must be encrypted at rest: + +```python +from cryptography.fernet import Fernet + +class EncryptedSessionStore: + def __init__(self, redis_client, encryption_key: bytes): + self.redis = redis_client + self.cipher = Fernet(encryption_key) + + async def store(self, session_id: str, state: SessionState): + """Encrypt and store session state.""" + plaintext = state.serialize() + ciphertext = self.cipher.encrypt(plaintext) + + await self.redis.setex( + name=f"session:{session_id}", + time=1800, # 30 minute TTL + value=ciphertext + ) + + async def retrieve(self, session_id: str) -> Optional[SessionState]: + """Retrieve and decrypt session state.""" + ciphertext = await self.redis.get(f"session:{session_id}") + if not ciphertext: + return None + + plaintext = self.cipher.decrypt(ciphertext) + return SessionState.deserialize(plaintext) +``` + +### 9.2 Rate Limiting at Orchestrator Level + +Prevent runaway resource consumption: + +```python +from asyncio import Semaphore + +class ResourceThrottler: + def __init__(self, max_concurrent_browsers: int = 10): + self.browser_semaphore = Semaphore(max_concurrent_browsers) + self.rate_limiter = {} + + async def acquire_browser_slot(self): + """Enforce maximum concurrent browser instances.""" + await self.browser_semaphore.acquire() + + def release_browser_slot(self): + self.browser_semaphore.release() + + async def enforce_rate_limit(self, key: str, max_per_minute: int = 60): + """Token bucket rate limiting per target domain.""" + now = time.time() + + if key not in self.rate_limiter: + self.rate_limiter[key] = {'tokens': max_per_minute, 'last_update': now} + + bucket = self.rate_limiter[key] + elapsed = now - bucket['last_update'] + bucket['tokens'] = min(max_per_minute, bucket['tokens'] + elapsed * (max_per_minute / 60)) + bucket['last_update'] = now + + if bucket['tokens'] < 1: + wait_time = (1 - bucket['tokens']) / (max_per_minute / 60) + await asyncio.sleep(wait_time) + bucket['tokens'] = 0 + else: + bucket['tokens'] -= 1 +``` + +--- + +## 10. Performance Benchmarks + +### 10.1 Expected Throughput + +Under optimal conditions with the specified stack: + +| Phase | Metric | Value | +|-------|--------|-------| +| Authentication (Camoufox) | Time to cf_clearance | 8-15 seconds | +| Session Lifetime | Average duration | 25-35 minutes | +| Extraction (curl_cffi) | Requests per second (per session) | 2-5 RPS | +| Concurrent Sessions | Max per 2GB RAM node | 5 browser instances | +| Concurrent Extractors | Max per 512MB container | 20 curl instances | + +### 10.2 Resource Consumption + +``` +Camoufox Container: +- Memory: 1.8-2.2 GB per instance +- CPU: 1.5-2.0 cores during auth +- Disk I/O: Minimal (using /dev/shm) + +curl_cffi Container: +- Memory: 120-180 MB per instance +- CPU: 0.1-0.3 cores +- Network: 5-10 Mbps per instance +``` + +--- + +## 11. Failure Modes & Recovery + +### 11.1 Challenge Detection + +```python +class ChallengeHandler: + async def detect_and_handle(self, page: Page) -> bool: + """ + Detect Cloudflare, Akamai, or Datadome challenges. 
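+        Returns True once the page is clear of challenges (none present, auto-solved,
+        or resolved by the vendor-specific handler).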
+ """ + # Cloudflare Turnstile + if await page.query_selector('iframe[src*="challenges.cloudflare.com"]'): + return await self._handle_turnstile(page) + + # Cloudflare legacy challenge + if await page.query_selector('#challenge-form'): + await asyncio.sleep(5) # Wait for auto-solve + return True + + # Datadome + if 'datadome' in page.url.lower(): + return await self._handle_datadome(page) + + # PerimeterX + if await page.query_selector('[class*="_pxBlock"]'): + return await self._handle_perimeterx(page) + + return True # No challenge detected + + async def _handle_turnstile(self, page: Page) -> bool: + """ + Turnstile typically auto-solves with good fingerprints. + If interactive challenge appears, delegate to CAPTCHA service. + """ + await asyncio.sleep(3) + + # Check if solved automatically + if not await page.query_selector('iframe[src*="challenges.cloudflare.com"]'): + return True + + # Still present: delegate to 2Captcha + sitekey = await self._extract_turnstile_sitekey(page) + solution = await self._solve_captcha_external(page.url, sitekey) + await self._inject_captcha_solution(page, solution) + + return True +``` + +### 11.2 Session Invalidation Recovery + +```python +class SessionRecoveryManager: + async def handle_invalidation(self, session_id: str, reason: str): + """ + Recovery strategies based on invalidation reason. + """ + if reason == 'cf_clearance_expired': + # Normal expiration: re-authenticate + await self.orchestrator.trigger_reauth(session_id) + + elif reason == 'ip_reputation_drop': + # Proxy burned: rotate and re-authenticate + await self.proxy_rotator.blacklist_current_proxy(session_id) + await self.orchestrator.trigger_reauth(session_id, force_new_proxy=True) + + elif reason == 'fingerprint_detected': + # Fingerprint burned: generate new profile + await self.orchestrator.trigger_reauth( + session_id, + force_new_profile=True, + cooldown=300 # 5 minute cooldown before retry + ) + + elif reason == 'rate_limit': + # Backoff with exponential delay + await self.apply_exponential_backoff(session_id) +``` + +--- + +## 12. Compliance & Legal Considerations + +**DISCLAIMER:** This architecture is designed for legitimate use cases including: + +- Competitive intelligence gathering from public data +- Price monitoring and availability tracking +- Academic research in web security +- Penetration testing with explicit authorization + +**Users must:** +1. Respect `robots.txt` directives +2. Implement rate limiting to avoid DoS +3. Obtain authorization before testing systems they do not own +4. Comply with CFAA (Computer Fraud and Abuse Act) and equivalent laws +5. Review and adhere to target website Terms of Service + +**This architecture should never be used for:** +- Unauthorized access to protected systems +- Data exfiltration of personal information +- Circumventing paywall or authentication systems without permission +- Any activity prohibited by applicable law + +--- + +## 13. Conclusion + +This Architecture Definition Document provides a comprehensive blueprint for a high-fidelity extraction agent capable of defeating modern bot mitigation systems. The hybrid approach—leveraging Camoufox for authentication fidelity and curl_cffi for extraction throughput—represents the state-of-the-art in autonomous web interaction. + +**Critical Success Factors:** + +1. **Consistency:** Every fingerprint component must exhibit internal correlation +2. **Entropy:** Deterministic patterns are fatal; inject controlled chaos +3. 
**Behavioral Fidelity:** Human behavior is complex; simple models fail +4. **State Management:** The handover protocol is the weakest link; secure it rigorously +5. **Monitoring:** Silent failures cascade; observe everything + +**Future Enhancements:** + +- Machine learning-based behavior generation trained on real user sessions +- Adaptive fingerprint rotation based on challenge rate feedback +- Distributed orchestration for global scaling +- Integration with computer vision for advanced CAPTCHA solving + +This architecture represents 30 years of systems engineering distilled into a production-ready design. Implementation requires rigorous testing, continuous monitoring, and ethical deployment. + +--- + +**End of Document** \ No newline at end of file