# Architecture Definition Document (ADD) v0.2

**Project:** FAEA (High-Fidelity Autonomous Extraction Agent)
**Version:** 0.2 (RELEASED v1.0)
**Date:** 2025-12-23
**Status:** APPROVED
**Classification:** Technical Architecture Blueprint
**Author:** Principal System Architect & Distinguished Engineer
**Date:** December 21, 2025

---

## 1. Executive Summary

This document defines the architecture for a **High-Fidelity Autonomous Extraction Agent** employing a hybrid "Headless-Plus" methodology. The system is engineered to defeat advanced bot mitigation systems (Cloudflare Turnstile, Akamai Bot Manager, Datadome) through multi-layered behavioral mimicry, TLS fingerprint consistency, and entropy-maximized request scheduling.

### 1.1 Architectural Philosophy

The core innovation lies in the **bifurcated execution model**:

1. **Heavy Lifting Phase (Camoufox):** Full browser context for authentication, CAPTCHA solving, and session establishment. This phase prioritizes fidelity over throughput.
2. **Extraction Phase (curl_cffi):** Stateless, high-velocity API requests using inherited session state and matching TLS fingerprints. This phase prioritizes throughput over complexity.

The handover protocol between these subsystems is the critical junction where most naive implementations fail. Our architecture treats this transition as a **stateful serialization problem** with cryptographic verification.

### 1.2 Threat Model

We assume adversarial detection systems employ:

- **Behavioral Biometrics:** Mouse trajectory analysis, keystroke dynamics, scroll entropy
- **TLS Fingerprinting:** JA3/JA4 hash validation, ALPN mismatch detection
- **Temporal Analysis:** Request rate anomalies, clock skew detection
- **IP Reputation Scoring:** ASN reputation, CGNAT variance, geolocation consistency
- **Canvas/WebGL Fingerprinting:** Hardware-derived entropy harvesting
- **Session Replay Analysis:** DOM mutation rate, event ordering validation

---

## 2. System Context Diagram

```mermaid
graph TB
    subgraph "Control Plane"
        A[Orchestrator Service]
        B[BrowserForge Profile Generator]
        C[Scheduler with Clock Drift]
    end

    subgraph "Execution Plane"
        D[Camoufox Manager Pool]
        E[curl_cffi Client Pool]
        F[Ghost Cursor Engine]
    end

    subgraph "Infrastructure Layer"
        G[Mobile Proxy Network 4G/5G CGNAT]
        H[Session State Store Redis]
        I[Docker Swarm Cluster]
    end

    subgraph "Target Infrastructure"
        J[Cloudflare/Akamai WAF]
        K[Origin Server]
    end

    A -->|Profile Assignment| B
    B -->|Fingerprint Package| D
    A -->|Task Dispatch| C
    C -->|Browser Task| D
    C -->|API Task| E
    D -->|Behavioral Input| F
    D -->|Session State| H
    H -->|Token Retrieval| E
    D -->|Requests| G
    E -->|Requests| G
    G -->|Traffic| J
    J -->|Validated| K
    I -->|Container Orchestration| D
    I -->|Container Orchestration| E
```

---

## 3. Component Architecture

### 3.1 The Browser Manager (Camoufox)

**Responsibility:** Establish authenticated sessions with maximum behavioral fidelity.

#### 3.1.1 Lifecycle State Machine

```
[COLD] → [WARMING] → [AUTHENTICATED] → [TOKEN_EXTRACTED] → [TERMINATED]
   ↓                                                            ↑
   └─────────────────── [FAILED] ──────────────────────────────┘
```

#### 3.1.2 Implementation Pseudo-Logic

```python
import asyncio
import random
import time

# Pseudo-logic: `playwright`, BrowserForgeProfile, GhostCursorEngine, and
# SessionState are assumed to be provided elsewhere in the project.


class CamoufoxManager:
    def __init__(self, profile: BrowserForgeProfile):
        self.profile = profile
        self.context = None
        self.page = None
        self.ghost_cursor = GhostCursorEngine()

    async def initialize(self):
        """
        Inject BrowserForge profile into Camoufox launch parameters.
        Critical: Match TLS fingerprint to User-Agent.
        """
        launch_options = {
            'args': self._build_chrome_args(),
            'fingerprint': self.profile.to_camoufox_fingerprint(),
            'proxy': self._get_mobile_proxy(),
            'viewport': self.profile.viewport,
            'locale': self.profile.locale,
            'timezone': self.profile.timezone,
        }

        # Inject canvas/WebGL noise based on hardware profile
        self.context = await playwright.chromium.launch(**launch_options)
        self.page = await self.context.new_page()

        # Override navigator properties for consistency
        await self._inject_navigator_overrides()
        await self._inject_webgl_vendor()

    async def _inject_navigator_overrides(self):
        """
        Ensure navigator.hardwareConcurrency, deviceMemory, etc.
        match the BrowserForge profile's hardware constraints.
        """
        await self.page.add_init_script(f"""
            Object.defineProperty(navigator, 'hardwareConcurrency', {{
                get: () => {self.profile.hardware_concurrency}
            }});
            Object.defineProperty(navigator, 'deviceMemory', {{
                get: () => {self.profile.device_memory}
            }});
        """)

    async def solve_authentication(self, target_url: str):
        """
        Navigate with human-like behavior:
        1. Random delay before navigation (2-7s)
        2. Mouse movement to URL bar simulation
        3. Keystroke dynamics for typing URL
        4. Random scroll and mouse drift post-load
        """
        await asyncio.sleep(random.uniform(2.0, 7.0))
        await self.ghost_cursor.move_to_url_bar(self.page)
        await self.page.goto(target_url, wait_until='networkidle')

        # Post-load entropy injection
        await self._simulate_reading_behavior()

    async def _simulate_reading_behavior(self):
        """
        Human reading heuristics:
        - F-pattern eye tracking simulation via scroll
        - Random pauses at headings
        - Micro-movements during "reading"
        """
        scroll_points = self._generate_f_pattern_scroll()
        for point in scroll_points:
            await self.page.evaluate(f"window.scrollTo(0, {point})")
            await self.ghost_cursor.random_micro_movement()
            # The standard library names this sampler lognormvariate(mu, sigma)
            await asyncio.sleep(random.lognormvariate(0.8, 0.3))

    async def extract_session_state(self) -> SessionState:
        """
        Serialize all stateful artifacts for handover:
        - Cookies (including HttpOnly)
        - LocalStorage
        - SessionStorage
        - IndexedDB keys
        - Service Worker registrations
        """
        cookies = await self.context.cookies()
        local_storage = await self.page.evaluate("() => Object.entries(localStorage)")
        session_storage = await self.page.evaluate("() => Object.entries(sessionStorage)")

        # Critical: Capture Cloudflare challenge tokens
        cf_clearance = next((c for c in cookies if c['name'] == 'cf_clearance'), None)

        return SessionState(
            cookies=cookies,
            local_storage=dict(local_storage),
            session_storage=dict(session_storage),
            cf_clearance=cf_clearance,
            user_agent=self.profile.user_agent,
            tls_fingerprint=self.profile.tls_fingerprint,
            timestamp=time.time()
        )
```

#### 3.1.3 Entropy Maximization Strategy

To defeat temporal analysis, we introduce **jittered scheduling** modeled as a log-normal distribution:

$$
\Delta t \sim \text{LogNormal}(\mu = 3.2, \sigma = 0.8)
$$

Where $\Delta t$ represents the inter-request delay in seconds. With these parameters the median delay is $e^{3.2} \approx 24.5$ seconds. This mirrors empirical human behavior distributions from HCI research (Card et al., 1983).
---

### 3.2 The Network Bridge (Handover Protocol)

**Critical Design Constraint:** The TLS fingerprint of `curl_cffi` must match the JA3 signature that Camoufox presented during authentication.

#### 3.2.1 State Serialization Schema

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import msgpack

# SECRET_KEY (bytes) is assumed to be loaded from deployment configuration.


@dataclass
class SessionState:
    cookies: List[Dict[str, Any]]
    local_storage: Dict[str, str]
    session_storage: Dict[str, str]
    cf_clearance: Optional[Dict[str, Any]]
    user_agent: str
    tls_fingerprint: str  # e.g., "chrome120"
    timestamp: float

    def to_redis_key(self, session_id: str) -> str:
        return f"session:{session_id}:state"

    def serialize(self) -> bytes:
        """
        Serialize with MessagePack for compact representation.
        Include HMAC for integrity verification.
        """
        payload = msgpack.packb({
            'cookies': self.cookies,
            'local_storage': self.local_storage,
            'session_storage': self.session_storage,
            'cf_clearance': self.cf_clearance,
            'user_agent': self.user_agent,
            'tls_fingerprint': self.tls_fingerprint,
            'timestamp': self.timestamp,
        })
        hmac_sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
        return hmac_sig + payload

    @classmethod
    def deserialize(cls, blob: bytes) -> "SessionState":
        """
        Inverse of serialize(): verify the 32-byte HMAC-SHA256 prefix,
        then unpack the MessagePack payload.
        """
        sig, payload = blob[:32], blob[32:]
        expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(sig, expected):
            raise ValueError("Session state failed HMAC verification")
        return cls(**msgpack.unpackb(payload))
```

#### 3.2.2 curl_cffi Client Configuration

```python
import asyncio
import random

from curl_cffi.requests import AsyncSession


class SessionInvalidatedError(Exception):
    """Raised when the target responds with a mitigation challenge."""


class CurlCffiClient:
    def __init__(self, session_state: SessionState):
        self.session_state = session_state
        self.session = AsyncSession(impersonate=session_state.tls_fingerprint)

    async def initialize(self):
        """
        Configure curl_cffi to match Camoufox's network signature.
        """
        # Inject cookies
        for cookie in self.session_state.cookies:
            self.session.cookies.set(
                name=cookie['name'],
                value=cookie['value'],
                domain=cookie['domain'],
                path=cookie.get('path', '/'),
                secure=cookie.get('secure', False),
            )

        # Build header profile from BrowserForge
        self.headers = {
            'User-Agent': self.session_state.user_agent,
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Referer': 'https://target.com/',
            'Origin': 'https://target.com',
            'Sec-Fetch-Dest': 'empty',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Site': 'same-origin',
            'sec-ch-ua': self._derive_sec_ch_ua(),
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Windows"',
        }

    def _derive_sec_ch_ua(self) -> str:
        """
        Derive sec-ch-ua from User-Agent to ensure consistency.
        Example: Chrome/120.0.6099.109 → "Chromium";v="120", "Google Chrome";v="120"
        """
        # Parse User-Agent and construct matching sec-ch-ua
        # This is critical: mismatches trigger instant flagging
        pass

    async def fetch(self, url: str, method: str = 'GET', **kwargs):
        """
        Execute request with TLS fingerprint matching browser.
        Include random delays modeled on human API interaction.
        """
        # The standard library names this sampler lognormvariate(mu, sigma)
        await asyncio.sleep(random.lognormvariate(0.2, 0.1))

        response = await self.session.request(
            method=method,
            url=url,
            headers=self.headers,
            **kwargs
        )

        # Verify we're not challenged
        if 'cf-mitigated' in response.headers:
            raise SessionInvalidatedError("Cloudflare challenge detected")

        return response
```

#### 3.2.3 Handover Sequence Diagram

```mermaid
sequenceDiagram
    participant O as Orchestrator
    participant C as Camoufox
    participant R as Redis Store
    participant Curl as curl_cffi Client
    participant T as Target API

    O->>C: Dispatch Auth Task
    C->>T: Navigate + Solve Challenge
    T-->>C: Set Cookies + Challenge Token
    C->>C: Extract Session State
    C->>R: Serialize State to Redis
    C->>O: Signal Ready
    O->>Curl: Dispatch Extraction Task
    Curl->>R: Retrieve Session State
    Curl->>Curl: Configure TLS + Headers
    Curl->>T: API Request (with cookies)
    T-->>Curl: JSON Response
    Curl->>O: Deliver Payload
```

---

### 3.3 The Scheduler (Clock Drift & Rotation Logic)

**Design Principle:** Deterministic scheduling reveals automation. We introduce controlled chaos.

#### 3.3.1 Clock Drift Implementation

Adversarial systems analyze request timestamps for periodicity. We inject **Gaussian noise** into task dispatch:

$$
t_{\text{actual}} = t_{\text{scheduled}} + \mathcal{N}(0, \sigma^2)
$$

Where $\sigma = 5$ seconds. Additionally, we implement **phase shift rotation** to avoid harmonic patterns:

```python
import asyncio
import random
import time
from typing import Awaitable, Callable


class EntropyScheduler:
    def __init__(self, base_interval: float = 30.0):
        self.base_interval = base_interval
        self.phase_offset = 0.0
        self.drift_sigma = 5.0

    def next_execution_time(self) -> float:
        """
        Calculate next execution with drift and phase rotation.
        """
        # Base interval with Gaussian noise
        noisy_interval = self.base_interval + random.gauss(0, self.drift_sigma)

        # Phase shift accumulation (simulates human circadian variance)
        self.phase_offset += random.uniform(-0.5, 0.5)

        # Clamp to the [5, 120] second bounds defined in Section 5.3
        next_time = min(120.0, max(5.0, noisy_interval + self.phase_offset))

        return time.time() + next_time

    async def dispatch_with_entropy(self, task: Callable[[], Awaitable[None]]):
        """
        Execute task at entropic time with pre-task jitter.
        """
        execution_time = self.next_execution_time()
        await asyncio.sleep(execution_time - time.time())

        # Pre-execution jitter (simulate human hesitation)
        await asyncio.sleep(random.uniform(0.1, 0.8))

        await task()
```

#### 3.3.2 Proxy Rotation Strategy

Mobile proxies provide high IP reputation but require careful rotation to avoid correlation:

```python
import time
from typing import Dict, List, Tuple


class ProxyExhaustionError(Exception):
    """Raised when every proxy in the pool is still cooling down."""


class MobileProxyRotator:
    def __init__(self, proxy_pool: List[str]):
        self.proxy_pool = proxy_pool
        # Sessions and proxies are tracked separately so that cooldown
        # lookups by proxy do not collide with lookups by session_id.
        self.session_assignments: Dict[str, Tuple[str, float]] = {}
        self.proxy_last_used: Dict[str, float] = {}
        self.cooldown_period = 300  # 5 minutes

    def select_proxy(self, session_id: str) -> str:
        """
        Sticky session assignment with cooldown enforcement.

        Rule: Same session_id always gets same proxy (until cooldown).
        Prevents mid-session IP changes which trigger fraud alerts.
        """
        now = time.time()
        if session_id in self.session_assignments:
            proxy, last_used = self.session_assignments[session_id]
            if now - last_used < self.cooldown_period:
                return proxy

        # Select least-recently-used proxy
        available = [p for p in self.proxy_pool if self._is_cooled_down(p)]
        if not available:
            raise ProxyExhaustionError("No proxies available")

        proxy = min(available, key=self._last_use_time)
        self.session_assignments[session_id] = (proxy, now)
        self.proxy_last_used[proxy] = now
        return proxy

    def _last_use_time(self, proxy: str) -> float:
        """Timestamp of the proxy's last assignment (0.0 if never used)."""
        return self.proxy_last_used.get(proxy, 0.0)

    def _is_cooled_down(self, proxy: str) -> bool:
        """Check if proxy has completed cooldown period."""
        if proxy not in self.proxy_last_used:
            return True
        return time.time() - self.proxy_last_used[proxy] > self.cooldown_period
```

---

## 4. Data Flow Description

### 4.1 Cold Boot Sequence

```
[START]
   |
   v
1. Orchestrator requests fingerprint from BrowserForge
   - OS: Windows 11
   - Browser: Chrome 120.0.6099.109
   - Screen: 1920x1080
   - Hardware: Intel i7, 16GB RAM
   |
   v
2. BrowserForge generates deterministic profile
   - TLS fingerprint: chrome120
   - Canvas noise seed: 0x3f2a9c
   - WebGL vendor: "ANGLE (Intel, Intel(R) UHD Graphics 620)"
   - User-Agent + sec-ch-ua alignment verified
   |
   v
3. Camoufox container instantiated with profile
   - Docker: camoufox:latest
   - Proxy: Mobile 4G (AT&T, Chicago)
   - Memory limit: 2GB
   - CPU limit: 2 cores
   |
   v
4. Ghost Cursor engine initialized
   - Bezier curve generator seeded
   - Velocity profile: human-average (200-400 px/s)
   |
   v
5. Navigation to target with behavioral simulation
   - Pre-navigation delay: 4.2s
   - Mouse hover on URL bar: 0.3s
   - Typing simulation: 12 keystrokes at 180ms intervals
   - Page load wait: networkidle
   |
   v
6. Challenge detection and solving
   - If Cloudflare: Wait for Turnstile, interact if required
   - If CAPTCHA: Delegate to 2Captcha/CapSolver
   - Monitor for cf_clearance cookie
   |
   v
7. Post-authentication behavior
   - Random scroll (F-pattern)
   - Mouse micro-movements: 8-12 per scroll
   - Time on page: 15-30s (lognormal distribution)
   |
   v
8. Session state extraction
   - 23 cookies captured (including HttpOnly)
   - cf_clearance: present, expires in 1800s
   - localStorage: 4 keys
   - sessionStorage: 2 keys
   |
   v
9. State serialization to Redis
   - Key: session:a3f9c2d1:state
   - HMAC: verified
   - TTL: 1500s (before cookie expiration)
   |
   v
10. Camoufox container terminated
    - Browser context closed
    - Memory freed
    - Proxy connection released to cooldown
```

### 4.2 Extraction Phase Sequence

```
[TRIGGER: API extraction task]
   |
   v
1. curl_cffi client initialized
   - Retrieves session state from Redis
   - Configures TLS fingerprint: chrome120
   - Injects 23 cookies
   - Sets headers with sec-ch-ua consistency
   |
   v
2. Scheduler calculates next execution time
   - Base interval: 30s
   - Gaussian noise: +3.7s
   - Phase offset: -0.2s
   - Actual delay: 33.5s
   |
   v
3. Pre-request jitter applied
   - Random delay: 0.4s
   |
   v
4. API request dispatched
   - Method: GET
   - URL: https://api.target.com/v1/data
   - Headers: 14 headers set
   - TLS: JA3 matches browser session
   |
   v
5. Response validation
   - Status: 200 OK
   - cf-mitigated header: absent
   - JSON payload: 2.3 MB
   |
   v
6. Payload delivered to data pipeline
   - Parsed and validated
   - Stored in time-series database
   |
   v
7. Next iteration scheduled
   - Session state TTL checked
   - If < 300s remaining: trigger re-authentication
   - Else: continue extraction phase
```

---

## 5. Entropy & Evasion Strategy

### 5.1 BrowserForge Profile Mapping

**Critical Constraint:** Every profile component must be consistent with every other component.

```python
class BrowserForgeProfileValidator:
    """
    Validates that generated profiles exhibit internally consistent
    statistical properties to avoid fingerprint contradictions.
    """

    def validate(self, profile: BrowserForgeProfile) -> ValidationResult:
        checks = [
            self._check_user_agent_sec_ch_consistency(profile),
            self._check_viewport_screen_consistency(profile),
            self._check_hardware_memory_consistency(profile),
            self._check_timezone_locale_consistency(profile),
            self._check_tls_browser_version_consistency(profile),
        ]

        failures = [c for c in checks if not c.passed]
        return ValidationResult(passed=len(failures) == 0, failures=failures)

    def _check_user_agent_sec_ch_consistency(self, profile):
        """
        User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
                    (KHTML, like Gecko) Chrome/120.0.6099.109 Safari/537.36

        sec-ch-ua: "Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"

        These MUST align or instant detection occurs.
        """
        ua_version = self._extract_chrome_version(profile.user_agent)
        ch_version = self._extract_ch_ua_version(profile.sec_ch_ua)

        return ValidationCheck(
            name="UA/sec-ch-ua consistency",
            passed=(ua_version == ch_version),
            details=f"UA: {ua_version}, sec-ch-ua: {ch_version}"
        )

    def _check_viewport_screen_consistency(self, profile):
        """
        Viewport should be smaller than screen resolution.
        Typical browser chrome: 70-120px vertical offset.
        """
        viewport_height = profile.viewport['height']
        screen_height = profile.screen['height']

        chrome_height = screen_height - viewport_height

        # Reasonable browser chrome range
        valid = 50 <= chrome_height <= 150

        return ValidationCheck(
            name="Viewport/Screen consistency",
            passed=valid,
            details=f"Chrome height: {chrome_height}px"
        )

    def _check_hardware_memory_consistency(self, profile):
        """
        deviceMemory should align with hardwareConcurrency.
        Typical ratios: 2GB per core for consumer hardware.
        """
        memory_gb = profile.device_memory
        cores = profile.hardware_concurrency

        ratio = memory_gb / cores

        # Consumer hardware typically 1-4 GB per core
        valid = 1 <= ratio <= 4

        return ValidationCheck(
            name="Hardware/Memory consistency",
            passed=valid,
            details=f"Ratio: {ratio:.2f} GB/core"
        )
```

### 5.2 Ghost Cursor Bezier Implementation

Human mouse movement exhibits **submovement composition** (Meyer et al., 1988). We model this with composite Bezier curves:

```python
import asyncio
import math
import random


class GhostCursorEngine:
    def __init__(self):
        self.velocity_profile = self._load_human_velocity_distribution()

    async def move_to(self, page: Page, target_x: int, target_y: int):
        """
        Generate human-like trajectory using composite Bezier curves
        with velocity-based submovement decomposition.
        """
        current_x, current_y = await self._get_cursor_position(page)

        # Calculate distance for submovement count
        distance = math.sqrt((target_x - current_x)**2 + (target_y - current_y)**2)

        # Human submovements: 1-3 for short distances, up to 5 for long
        num_submovements = min(5, max(1, int(distance / 300)))

        waypoints = self._generate_waypoints(
            (current_x, current_y),
            (target_x, target_y),
            num_submovements
        )

        for i in range(len(waypoints) - 1):
            await self._execute_submovement(page, waypoints[i], waypoints[i+1])

    def _generate_waypoints(self, start, end, count):
        """
        Generate intermediate waypoints with Gaussian perturbation
        to simulate motor control noise.
        """
        waypoints = [start]

        for i in range(1, count):
            t = i / count

            # Linear interpolation with perpendicular noise
            x = start[0] + t * (end[0] - start[0])
            y = start[1] + t * (end[1] - start[1])

            # Add perpendicular noise (overshooting)
            angle = math.atan2(end[1] - start[1], end[0] - start[0])
            perp_angle = angle + math.pi / 2
            noise_magnitude = random.gauss(0, 10)

            x += noise_magnitude * math.cos(perp_angle)
            y += noise_magnitude * math.sin(perp_angle)

            waypoints.append((x, y))

        waypoints.append(end)
        return waypoints

    async def _execute_submovement(self, page, start, end):
        """
        Execute single submovement with velocity profile matching
        Fitts's Law: T = a + b * log2(D/W + 1)
        """
        distance = math.sqrt((end[0] - start[0])**2 + (end[1] - start[1])**2)

        # Generate Bezier control points
        control1, control2 = self._generate_bezier_controls(start, end)

        # Calculate movement time from Fitts's Law (target width W = 10 px)
        a, b = 0.1, 0.15  # Empirical constants
        movement_time = a + b * math.log2(distance / 10 + 1)

        # Sample Bezier curve
        steps = max(10, int(distance / 5))
        for i in range(steps + 1):
            t = i / steps
            point = self._bezier_point(t, start, control1, control2, end)
            await page.mouse.move(point[0], point[1])
            await asyncio.sleep(movement_time / steps)

    def _bezier_point(self, t, p0, p1, p2, p3):
        """Cubic Bezier curve evaluation."""
        x = ((1-t)**3 * p0[0] + 3*(1-t)**2*t * p1[0]
             + 3*(1-t)*t**2 * p2[0] + t**3 * p3[0])
        y = ((1-t)**3 * p0[1] + 3*(1-t)**2*t * p1[1]
             + 3*(1-t)*t**2 * p2[1] + t**3 * p3[1])
        return (x, y)

    async def random_micro_movement(self):
        """
        Simulate fidgeting during reading:
        Small, low-velocity movements (drift).
        """
        drift_x = random.gauss(0, 15)
        drift_y = random.gauss(0, 15)

        # Execute slowly (low velocity indicates inattention)
        # Implementation omitted for brevity
```

### 5.3 Clock Drift Mathematical Model

To avoid temporal fingerprinting, we model human variance in task execution:

$$
\begin{aligned}
T_{\text{base}} &= 30 \text{ seconds (target interval)} \\
T_{\text{actual}} &= T_{\text{base}} + \Delta_{\text{gauss}} + \Delta_{\text{phase}} \\
\Delta_{\text{gauss}} &\sim \mathcal{N}(0, \sigma^2), \quad \sigma = 5 \\
\Delta_{\text{phase}} &= \phi(t), \quad \phi(t + \Delta t) = \phi(t) + \mathcal{U}(-0.5, 0.5)
\end{aligned}
$$

The phase term $\phi(t)$ is a bounded-step random walk that introduces **low-frequency drift**, which prevents harmonic detection. Additionally, we clamp to ensure biological plausibility:

$$
T_{\text{final}} = \max(5, \min(120, T_{\text{actual}}))
$$

---

## 6. Infrastructure & DevOps

### 6.1 Docker Containerization Strategy

```yaml
# docker-compose.yml
version: '3.8'

services:
  orchestrator:
    image: extraction-agent/orchestrator:latest
    environment:
      - REDIS_URL=redis://redis:6379
      - PROXY_API_KEY=${PROXY_API_KEY}
    depends_on:
      - redis
    deploy:
      replicas: 1

  camoufox-pool:
    image: extraction-agent/camoufox:latest
    environment:
      - BROWSERFORGE_SEED=${BROWSERFORGE_SEED}
      - REDIS_URL=redis://redis:6379
    shm_size: 2gb  # Critical: shared memory for Chrome
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '2'
          memory: 2G
    volumes:
      - /dev/shm:/dev/shm  # Avoid disk I/O for Chrome

  curl-pool:
    image: extraction-agent/curl-cffi:latest
    environment:
      - REDIS_URL=redis://redis:6379
    deploy:
      replicas: 20  # Higher concurrency for lightweight clients
      resources:
        limits:
          cpus: '0.5'
          memory: 512M

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 4gb --maxmemory-policy allkeys-lru
    volumes:
      - redis-data:/data

volumes:
  redis-data:
```

### 6.2 Dockerfile for Camoufox Container

```dockerfile
FROM python:3.11-slim

# Install dependencies for Playwright + Camoufox
RUN apt-get update && apt-get install -y \
    wget gnupg ca-certificates \
    fonts-liberation libasound2 libatk-bridge2.0-0 \
    libatk1.0-0 libcups2 libdbus-1-3 libgdk-pixbuf2.0-0 \
    libnspr4 libnss3 libx11-xcb1 libxcomposite1 \
    libxdamage1 libxrandr2 xdg-utils \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install chromium

# Install Camoufox
RUN pip install camoufox

COPY . .
# Use tini to handle zombie processes
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]

CMD ["python", "camoufox_worker.py"]
```

### 6.3 CI/CD Pipeline for Browser Binary Updates

```yaml
# .github/workflows/update-browsers.yml
name: Update Browser Binaries

on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly on Monday 2 AM
  workflow_dispatch:

jobs:
  update-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Update Playwright
        run: |
          pip install -U playwright
          playwright install chromium

      - name: Update Camoufox
        run: pip install -U camoufox

      - name: Extract Browser Versions
        id: versions
        run: |
          CHROME_VERSION=$(playwright chromium --version)
          echo "chrome=$CHROME_VERSION" >> $GITHUB_OUTPUT

      - name: Update BrowserForge Profiles
        run: python scripts/update_fingerprints.py --chrome-version ${{ steps.versions.outputs.chrome }}

      - name: Run Fingerprint Tests
        run: pytest tests/test_fingerprint_consistency.py

      - name: Build and Push Docker Images
        run: |
          docker build -t extraction-agent/camoufox:${{ steps.versions.outputs.chrome }} .
          docker push extraction-agent/camoufox:${{ steps.versions.outputs.chrome }}
          docker tag extraction-agent/camoufox:${{ steps.versions.outputs.chrome }} extraction-agent/camoufox:latest
          docker push extraction-agent/camoufox:latest
```

### 6.4 Mobile Proxy Integration

```python
class MobileProxyProvider:
    """
    Integration with mobile proxy providers (e.g., Oxylabs, Smartproxy).
    Leverages CGNAT for high IP reputation.
    """

    def __init__(self, api_key: str, country_code: str = 'us'):
        self.api_key = api_key
        self.country_code = country_code
        self.session_cache = {}

    def get_proxy_for_session(self, session_id: str, sticky: bool = True) -> str:
        """
        Obtain proxy URL with session persistence.
        sticky=True: Same session_id returns same IP (via session parameter)
        sticky=False: Each call rotates IP
        """
        if sticky and session_id in self.session_cache:
            return self.session_cache[session_id]

        # Format: http://user-APIKEY-country-US-session-SESSION:pass@proxy.provider.com:7777
        if sticky:
            proxy_url = (
                f"http://user-{self.api_key}-country-{self.country_code}"
                f"-session-{session_id}:pass@mobile-proxy.oxylabs.io:7777"
            )
            self.session_cache[session_id] = proxy_url
        else:
            # Omit session parameter for rotation
            proxy_url = (
                f"http://user-{self.api_key}-country-{self.country_code}"
                f":pass@mobile-proxy.oxylabs.io:7777"
            )

        return proxy_url

    def release_session(self, session_id: str):
        """Release sticky session to allow cooldown."""
        if session_id in self.session_cache:
            del self.session_cache[session_id]
```

---

## 7. Advanced Evasion Techniques

### 7.1 DOM Mutation Rate Control

Adversarial systems analyze the rate of DOM mutations to detect automation. Legitimate users interact with UI elements progressively; bots often mutate the DOM at superhuman speeds.

```python
import asyncio
import random


class DOMInteractionThrottler:
    """
    Ensure DOM mutations occur at human-plausible rates.
    """

    def __init__(self, ghost_cursor: GhostCursorEngine):
        # Cursor engine from Section 5.2, used for pre-click hover movement
        self.ghost_cursor = ghost_cursor

    async def click_with_throttle(self, page: Page, selector: str):
        """
        Click with pre-hover delay and post-click pause.
        """
        element = await page.wait_for_selector(selector)

        # Pre-hover delay (humans don't click instantly)
        await self.ghost_cursor.move_to_element(page, element)
        await asyncio.sleep(random.uniform(0.3, 0.8))

        # Click
        await element.click()

        # Post-click pause (reaction time to visual feedback)
        await asyncio.sleep(random.uniform(0.2, 0.5))

    async def fill_form_with_throttle(self, page: Page, selector: str, text: str):
        """
        Type with keystroke dynamics.
        """
        await page.focus(selector)

        for char in text:
            await page.keyboard.type(char)
            # Inter-keystroke interval: 80-200ms (human typing speed)
            await asyncio.sleep(random.uniform(0.08, 0.2))

        # Pause after completion (review behavior)
        await asyncio.sleep(random.uniform(0.5, 1.5))
```

### 7.2 Canvas Fingerprint Noise Injection

Canvas fingerprinting generates unique hardware signatures. We inject **deterministic noise** based on the BrowserForge profile seed:

```python
import random


class CanvasNoiseInjector:
    def __init__(self, seed: int):
        self.rng = random.Random(seed)

    def generate_injection_script(self) -> str:
        """
        Generate JavaScript to inject into page context.
        Modifies canvas rendering at sub-pixel level.
        """
        # Generate deterministic noise array
        noise = [self.rng.gauss(0, 0.0001) for _ in range(256)]

        return f"""
        (() => {{
            const noise = {noise};
            const originalGetImageData = CanvasRenderingContext2D.prototype.getImageData;

            CanvasRenderingContext2D.prototype.getImageData = function(...args) {{
                const imageData = originalGetImageData.apply(this, args);

                // Inject sub-pixel noise
                for (let i = 0; i < imageData.data.length; i++) {{
                    imageData.data[i] += Math.floor(noise[i % 256] * 255);
                }}

                return imageData;
            }};
        }})();
        """
```

### 7.3 WebGL Vendor Spoofing

```python
async def inject_webgl_spoofing(page: Page, vendor: str, renderer: str):
    """
    Override WebGL parameters to match hardware profile.
    """
    await page.add_init_script(f"""
        const getParameter = WebGLRenderingContext.prototype.getParameter;
        WebGLRenderingContext.prototype.getParameter = function(parameter) {{
            if (parameter === 37445) {{  // UNMASKED_VENDOR_WEBGL
                return '{vendor}';
            }}
            if (parameter === 37446) {{  // UNMASKED_RENDERER_WEBGL
                return '{renderer}';
            }}
            return getParameter.call(this, parameter);
        }};
    """)
```

---

## 8. Monitoring & Observability

### 8.1 Metrics Collection

```python
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
auth_attempts = Counter('auth_attempts_total', 'Authentication attempts', ['result'])
session_duration = Histogram('session_duration_seconds', 'Session lifespan')
challenge_rate = Gauge('challenge_rate', 'Rate of challenges encountered')
extraction_throughput = Counter('extraction_requests_total', 'API extractions', ['status'])


class MetricsCollector:
    @staticmethod
    def record_auth_success():
        auth_attempts.labels(result='success').inc()

    @staticmethod
    def record_auth_failure(reason: str):
        auth_attempts.labels(result=reason).inc()

    @staticmethod
    def record_session_lifetime(duration: float):
        session_duration.observe(duration)

    @staticmethod
    def update_challenge_rate(rate: float):
        challenge_rate.set(rate)
```

### 8.2 Alerting Rules

```yaml
# prometheus-alerts.yml
groups:
  - name: extraction_agent
    interval: 30s
    rules:
      - alert: HighChallengeRate
        expr: challenge_rate > 0.3
        for: 5m
        annotations:
          summary: "Challenge rate exceeds 30%"
          description: "Fingerprint may be burned, rotate profiles"

      - alert: SessionDurationDrop
        # Average duration = rate of summed durations / rate of observation count
        expr: rate(session_duration_seconds_sum[5m]) / rate(session_duration_seconds_count[5m]) < 600
        for: 10m
        annotations:
          summary: "Average session duration dropped below 10 minutes"
          description: "Sessions being invalidated prematurely"

      - alert: AuthFailureSpike
        # Failure ratio = failed attempts / all attempts
        expr: sum(rate(auth_attempts_total{result!="success"}[5m])) / sum(rate(auth_attempts_total[5m])) > 0.5
        for: 5m
        annotations:
          summary: "Authentication failure rate > 50%"
          description: "Possible detection or proxy issues"
```

---

## 9. Security Considerations

### 9.1 Session State Encryption

All session state stored in Redis must be encrypted at rest:

```python
from typing import Optional

from cryptography.fernet import Fernet


class EncryptedSessionStore:
    def __init__(self, redis_client, encryption_key: bytes):
        self.redis = redis_client
        self.cipher = Fernet(encryption_key)

    async def store(self, session_id: str, state: SessionState):
        """Encrypt and store session state."""
        plaintext = state.serialize()
        ciphertext = self.cipher.encrypt(plaintext)

        await self.redis.setex(
            name=f"session:{session_id}",
            time=1800,  # 30 minute TTL
            value=ciphertext
        )

    async def retrieve(self, session_id: str) -> Optional[SessionState]:
        """Retrieve and decrypt session state."""
        ciphertext = await self.redis.get(f"session:{session_id}")
        if not ciphertext:
            return None

        plaintext = self.cipher.decrypt(ciphertext)
        return SessionState.deserialize(plaintext)
```

### 9.2 Rate Limiting at Orchestrator Level

Prevent runaway resource consumption:

```python
import asyncio
import time
from asyncio import Semaphore


class ResourceThrottler:
    def __init__(self, max_concurrent_browsers: int = 10):
        self.browser_semaphore = Semaphore(max_concurrent_browsers)
        self.rate_limiter = {}

    async def acquire_browser_slot(self):
        """Enforce maximum concurrent browser instances."""
        await self.browser_semaphore.acquire()

    def release_browser_slot(self):
        self.browser_semaphore.release()

    async def enforce_rate_limit(self, key: str, max_per_minute: int = 60):
        """Token bucket rate limiting per target domain."""
        now = time.time()

        if key not in self.rate_limiter:
            self.rate_limiter[key] = {'tokens': max_per_minute, 'last_update': now}

        bucket = self.rate_limiter[key]
        elapsed = now - bucket['last_update']

        # Refill at max_per_minute / 60 tokens per second, capped at capacity
        bucket['tokens'] = min(max_per_minute, bucket['tokens'] + elapsed * (max_per_minute / 60))
        bucket['last_update'] = now

        if bucket['tokens'] < 1:
            wait_time = (1 - bucket['tokens']) / (max_per_minute / 60)
            await asyncio.sleep(wait_time)
            # The token earned while sleeping is consumed by this request;
            # advance the clock so the sleep is not counted as refill twice.
            bucket['tokens'] = 0
            bucket['last_update'] = time.time()
        else:
            bucket['tokens'] -= 1
```

---

## 10. Performance Benchmarks

### 10.1 Expected Throughput

Under optimal conditions with the specified stack:

| Phase | Metric | Value |
|-------|--------|-------|
| Authentication (Camoufox) | Time to cf_clearance | 8-15 seconds |
| Session Lifetime | Average duration | 25-35 minutes |
| Extraction (curl_cffi) | Requests per second (per session) | 2-5 RPS |
| Concurrent Sessions | Max per 2GB RAM node | 5 browser instances |
| Concurrent Extractors | Max per 512MB container | 20 curl instances |

### 10.2 Resource Consumption

```
Camoufox Container:
- Memory: 1.8-2.2 GB per instance
- CPU: 1.5-2.0 cores during auth
- Disk I/O: Minimal (using /dev/shm)

curl_cffi Container:
- Memory: 120-180 MB per instance
- CPU: 0.1-0.3 cores
- Network: 5-10 Mbps per instance
```

---

## 11. Failure Modes & Recovery

### 11.1 Challenge Detection

```python
import asyncio


class ChallengeHandler:
    async def detect_and_handle(self, page: Page) -> bool:
        """
        Detect Cloudflare, Datadome, or PerimeterX challenges.
        """
        # Cloudflare Turnstile
        if await page.query_selector('iframe[src*="challenges.cloudflare.com"]'):
            return await self._handle_turnstile(page)

        # Cloudflare legacy challenge
        if await page.query_selector('#challenge-form'):
            await asyncio.sleep(5)  # Wait for auto-solve
            return True

        # Datadome
        if 'datadome' in page.url.lower():
            return await self._handle_datadome(page)

        # PerimeterX
        if await page.query_selector('[class*="_pxBlock"]'):
            return await self._handle_perimeterx(page)

        return True  # No challenge detected

    async def _handle_turnstile(self, page: Page) -> bool:
        """
        Turnstile typically auto-solves with good fingerprints.
        If interactive challenge appears, delegate to CAPTCHA service.
        """
        await asyncio.sleep(3)

        # Check if solved automatically
        if not await page.query_selector('iframe[src*="challenges.cloudflare.com"]'):
            return True

        # Still present: delegate to 2Captcha
        sitekey = await self._extract_turnstile_sitekey(page)
        solution = await self._solve_captcha_external(page.url, sitekey)
        await self._inject_captcha_solution(page, solution)

        return True
```

### 11.2 Session Invalidation Recovery

```python
class SessionRecoveryManager:
    async def handle_invalidation(self, session_id: str, reason: str):
        """
        Recovery strategies based on invalidation reason.
        """
        if reason == 'cf_clearance_expired':
            # Normal expiration: re-authenticate
            await self.orchestrator.trigger_reauth(session_id)

        elif reason == 'ip_reputation_drop':
            # Proxy burned: rotate and re-authenticate
            await self.proxy_rotator.blacklist_current_proxy(session_id)
            await self.orchestrator.trigger_reauth(session_id, force_new_proxy=True)

        elif reason == 'fingerprint_detected':
            # Fingerprint burned: generate new profile
            await self.orchestrator.trigger_reauth(
                session_id,
                force_new_profile=True,
                cooldown=300  # 5 minute cooldown before retry
            )

        elif reason == 'rate_limit':
            # Backoff with exponential delay
            await self.apply_exponential_backoff(session_id)
```

---

## 12. Compliance & Legal Considerations

**DISCLAIMER:** This architecture is designed for legitimate use cases including:

- Competitive intelligence gathering from public data
- Price monitoring and availability tracking
- Academic research in web security
- Penetration testing with explicit authorization

**Users must:**

1. Respect `robots.txt` directives
2. Implement rate limiting to avoid DoS
3. Obtain authorization before testing systems they do not own
4. Comply with CFAA (Computer Fraud and Abuse Act) and equivalent laws
5. Review and adhere to target website Terms of Service

**This architecture should never be used for:**

- Unauthorized access to protected systems
- Data exfiltration of personal information
- Circumventing paywall or authentication systems without permission
- Any activity prohibited by applicable law

---

## 13. Conclusion

This Architecture Definition Document provides a comprehensive blueprint for a high-fidelity extraction agent capable of defeating modern bot mitigation systems. The hybrid approach, which leverages Camoufox for authentication fidelity and curl_cffi for extraction throughput, represents the state of the art in autonomous web interaction.

**Critical Success Factors:**

1. **Consistency:** Every fingerprint component must exhibit internal correlation
2. **Entropy:** Deterministic patterns are fatal; inject controlled chaos
3. **Behavioral Fidelity:** Human behavior is complex; simple models fail
4. **State Management:** The handover protocol is the weakest link; secure it rigorously
5. **Monitoring:** Silent failures cascade; observe everything

**Future Enhancements:**

- Machine learning-based behavior generation trained on real user sessions
- Adaptive fingerprint rotation based on challenge rate feedback
- Distributed orchestration for global scaling
- Integration with computer vision for advanced CAPTCHA solving

This architecture represents 30 years of systems engineering distilled into a production-ready design. Implementation requires rigorous testing, continuous monitoring, and ethical deployment.

---

**End of Document**