Architecture

This document is the engineering reference for the Vyrox platform. It is written for the person who is about to read or modify the code, or who is evaluating Vyrox for a regulated workload and needs to know exactly what the system does. It does not describe what the product will become. It describes what runs in CI today.

If you are looking for setup steps, see QUICKSTART.md. If you are looking for the threat model, see THREAT_MODEL.md. If you want the on-disk audit format, see AUDIT_CHAIN.md.

Pipeline at a glance

   EDR vendors                                   Vyrox platform
                                                                                                   
  CrowdStrike Falcon  ─┐                                                                           
  SentinelOne          ├─▶  POST /webhook/{vendor}  ─▶  Ingestion (FastAPI)                        
  Defender Graph       │      HMAC or bearer auth          │                                       
  Generic JSON         ─┘      per-tenant secret           ▼                                       
                                                  NormalizedAlert  ─▶ Redis  LPUSH/RPOP            
                                                                          │   vyrox:alerts:{tid}   
                                                                          ▼                        
                                              Worker (asyncio)                                     
                                                                                                   
                                              1. Cache lookup     (24h TTL by alert fingerprint)   
                                              2. Heuristics       (Noisy OR, <5ms)                 
                                                                  ├─ confidence ≥ 0.75 ▶ accept  
                                                                  ├─ confidence ≤ 0.25 ▶ BENIGN  
                                                                  └─ otherwise ▶ LLM             
                                              3. LLM fallback     (primary + 2 fallback models)    
                                                                  + Pydantic schema validation     
                                                                  + per-tenant daily token budget   
                                              4. Persist          (SQLite, tenant-scoped tables)   
                                              5. Notify           (signed HTTP to bot)             
                                                                                                   
                                              Discord bot (FastAPI)                                
                                              ├─ /interactions  (Ed25519 verified)                 
                                              ├─ /webhook       (HMAC verified)                    
                                              └─ approval flow  ▶ Rust proxy                       
                                                                                                   
                                              Rust proxy                                           
                                              ├─ HMAC verify       (constant time)                 
                                              ├─ replay window     (±30s)                          
                                              ├─ nonce dedup       (DashMap, 10min retention)      
                                              ├─ audit append      (hash-chained JSONL)            
                                              └─ EDR API call      (or DRY_RUN short-circuit)      

The four services (ingestion, worker, Discord bot, and containment proxy) are independent processes; the heuristics engine is a library imported by the worker. The services communicate over HTTP and Redis only. There is no shared in-process state across services. The SQLite database is shared between the worker and the Discord bot in the current pilot deployment. A Postgres migration is tracked in todo.md (private) and is planned before the tenant count reaches twenty-five.

Components

| Component | Language | Process | What it owns |
| --- | --- | --- | --- |
| Ingestion | Python, FastAPI | uvicorn ingestion.main:app | Webhook auth, vendor payload normalisation, Redis enqueue |
| Worker | Python, asyncio | python -m worker.main | Triage pipeline, persistence, Discord notification |
| Discord bot | Python, FastAPI | uvicorn discord_bot.main:app | Interaction handling, approval flow, signing toward the proxy |
| Containment proxy | Rust, Axum | vyrox-proxy | HMAC verify, replay window, nonce dedup, audit, EDR API call |
| Heuristics engine | Python | imported by the worker | Pattern matching, Noisy OR aggregation |

The heuristics engine is private. The shape of its API (HeuristicsEngine.score(alert: dict) -> HeuristicResult) is documented here because callers depend on it. The pattern weights and the MITRE technique mapping are not.
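Noisy OR itself is a standard aggregation, so its shape can be sketched without revealing anything private. The following is a minimal illustration, not the engine: the HeuristicResult fields, the substring matching, and the pattern weights are all made up for the example.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class HeuristicResult:
    # Hypothetical shape; the real HeuristicResult fields are not documented here.
    confidence: float
    matched: tuple


def noisy_or(weights: list) -> float:
    """Combine independent pattern weights as 1 - prod(1 - w_i)."""
    return 1.0 - prod(1.0 - w for w in weights)


def score(alert: dict, patterns: dict) -> HeuristicResult:
    # Placeholder matching: a pattern fires if its name appears in the command line.
    # The real matching logic and weights are private.
    cmdline = alert.get("cmdline", "")
    matched = tuple(name for name in patterns if name in cmdline)
    return HeuristicResult(noisy_or([patterns[m] for m in matched]), matched)
```

With no matching pattern the product is empty and the confidence is 0.0. Each additional match can only push the confidence up, never down.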

Critical rules

These six rules are enforced by tests and reviewed in every PR. Violating one is a blocking issue, not a stylistic choice.

Rule 1: Tenant isolation

Every database query carries a tenant_id filter. Every Redis key is namespaced vyrox:alerts:{tenant_id}. There is no shared bucket and no fallback tenant.

The previous default-tenant fallback was removed on 2026-05-21 after the first audit caught it. The replacement contract: if a payload arrives without the vendor's tenant identifier (customer_id, accountId, tenantId), the ingestion route returns HTTP 400 and the EDR retries. The function is resolve_tenant_id in ingestion/main.py. It raises MissingTenantIdentifier on a missing or empty value.
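A minimal sketch of that contract, assuming the payload has already been parsed into a dict. The function and exception names come from the text above; the lookup order and the whitespace handling are assumptions.

```python
class MissingTenantIdentifier(ValueError):
    """Raised when a vendor payload carries no usable tenant identifier."""


# Field names as listed above; the lookup order is an assumption.
_TENANT_FIELDS = ("customer_id", "accountId", "tenantId")


def resolve_tenant_id(payload: dict) -> str:
    for field in _TENANT_FIELDS:
        value = payload.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip()
    # No default tenant: the route maps this exception to HTTP 400.
    raise MissingTenantIdentifier("payload has no tenant identifier")
```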

The schema invariant is checked at boot. shared/db.py:_assert_tenant_id_present walks every table in _TENANT_SCOPED_TABLES (alerts, actions, verdict_cache, token_usage) and refuses to start the service if any of them is missing the tenant_id column. The check uses PRAGMA table_info, runs once at startup, and raises SchemaIntegrityError loudly enough that the deploy fails.

Rule 2: Audit before response

Every state-changing operation writes an audit entry before the response goes back to the caller. The audit log is append-only JSONL. Each entry carries previous_hash (the hash of the prior entry) and hash (the SHA-256 of previous_hash || canonical_json(entry)). The first entry of the very first log file links to a sentinel genesis hash of sixty-four zeros.
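The chaining rule can be illustrated in a few lines. This is a sketch only: the authoritative byte layout lives in AUDIT_CHAIN.md, and both the canonical JSON form (sorted keys, compact separators) and the concatenation of the hex string with the JSON text are assumptions made here.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def canonical_json(entry: dict) -> str:
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(entry, sort_keys=True, separators=(",", ":"))


def link(previous_hash: str, entry: dict) -> dict:
    """Return the entry with previous_hash and hash fields attached."""
    material = (previous_hash + canonical_json(entry)).encode("utf-8")
    return {**entry, "previous_hash": previous_hash,
            "hash": hashlib.sha256(material).hexdigest()}


def verify_chain(records: list) -> bool:
    expected_prev = GENESIS_HASH
    for record in records:
        body = {k: v for k, v in record.items() if k not in ("previous_hash", "hash")}
        if record["previous_hash"] != expected_prev:
            return False  # a record was removed, reordered, or inserted
        if link(expected_prev, body)["hash"] != record["hash"]:
            return False  # the record's content was altered
        expected_prev = record["hash"]
    return True
```

Editing any field of any record changes its hash, which breaks the previous_hash link of every record after it.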

The chain survives process restarts. AuditWriter.__init__ in shared/audit.py reads the last hash from today's log file before accepting the first write. The Rust proxy uses the same approach in audit::ChainState::from_file. Both implementations agree on the wire format. The independent specification is in AUDIT_CHAIN.md.

Audit writes are durable. _sync_write in shared/audit.py flushes and calls os.fsync after every entry. The Rust side calls flush followed by sync_data. A power cut between the write and the OS writeback does not lose entries.
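The restart and durability behaviour on the Python side might look roughly like the following. The helper names are hypothetical, and the sketch simplifies one thing: a missing file falls back to the genesis hash, whereas the real writer has to carry the chain across daily files.

```python
import json
import os

GENESIS_HASH = "0" * 64


def read_last_hash(path: str) -> str:
    """Return the hash of the final entry in the log, or the genesis hash."""
    if not os.path.exists(path):
        return GENESIS_HASH  # simplification: no carry-over from earlier files
    last = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                last = line
    return json.loads(last)["hash"] if last else GENESIS_HASH


def durable_append(path: str, line: str) -> None:
    """Append one JSONL line and force it to stable storage before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()             # Python buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> disk
```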

Rule 3: HMAC before processing

Every webhook payload is verified before any parser touches its bytes. The verification uses hmac.compare_digest on the Python side and the subtle::ConstantTimeEq trait on the Rust side. Both run in time proportional to the MAC length, not to where the first byte mismatch appears.

The wire format on the Python side: sign(payload: str, secret: str) returns f"sha256={hex_digest}". The Rust verifier strips the "sha256=" prefix before comparing. The round-trip is locked by tests/test_p0_regressions.py::test_hmac_python_sign_uses_sha256_prefix.

For requests carrying JSON bodies that travel between Vyrox services (the worker calling the bot, the bot calling the proxy), the body is serialised with separators=(",", ":") and sort_keys=True. Without that pinning, Python's default json.dumps and Rust's serde_json disagree on whitespace and key order, which produces a different MAC on the verifier side and a silent 401.
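The signing convention and the JSON pinning fit in a short sketch. The sign signature and the sha256= prefix are as described above; canonical_body and verify are hypothetical helper names.

```python
import hashlib
import hmac
import json


def canonical_body(obj: dict) -> str:
    # Pinned serialisation so signer and verifier hash identical bytes.
    return json.dumps(obj, separators=(",", ":"), sort_keys=True)


def sign(payload: str, secret: str) -> str:
    digest = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(payload: str, secret: str, signature: str) -> bool:
    # compare_digest does not short-circuit on the first mismatching byte.
    return hmac.compare_digest(sign(payload, secret).encode(), signature.encode())
```

Signing the output of canonical_body and verifying a default json.dumps rendering of the same object fails, because the two strings differ in whitespace and possibly key order. That is the silent 401 the pinning prevents.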

Rule 4: No autonomous containment

The LLM cannot trigger a containment action. The heuristics engine cannot trigger a containment action. The worker cannot trigger a containment action.

The only code path that calls the Rust proxy is the Discord bot's approval handler in discord_bot/handlers/approvals.py, which runs in response to a Discord button click. The button click itself is authenticated end to end: Discord signs the interaction with Ed25519, the bot verifies the signature against the application's public key in discord_bot/security.py, the handler then signs an ActionRequest with the shared HMAC secret, and the proxy verifies that signature before doing anything else.

The static invariant is enforced by a test: tests/test_p0_regressions.py::test_worker_triage_never_invokes_proxy greps the worker modules at import time and at source level for any reference to discord_bot.proxy_client.execute_action. If the worker ever imports that symbol, the test fails. The check covers both eager imports and lazy imports inside functions.
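The source-level half of such a check can be as simple as a text scan. This is an illustration of the idea, not the test in test_p0_regressions.py; the helper name and the exact needles are assumptions.

```python
from pathlib import Path

# Strings that must never appear in worker source (assumed needles).
FORBIDDEN = ("discord_bot.proxy_client", "execute_action")


def find_forbidden_references(package_dir: str) -> list:
    """Return 'file: needle' for every forbidden string found under package_dir."""
    hits = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        for needle in FORBIDDEN:
            if needle in text:
                hits.append(f"{path.name}: {needle}")
    return hits
```

Because it scans text rather than loaded modules, a lazy import inside a function body is caught the same way as a top-level import.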

Rule 5: DRY_RUN by default

The Rust proxy's dry_run flag is true by default. Production has to opt in to real execution by setting DRY_RUN=false in the environment. The check happens before the EDR client is even constructed, so a misconfiguration cannot accidentally call the vendor's API.

// vyrox-proxy/src/main.rs
let response = if state.dry_run {
    info!(/* ... */, "DRY_RUN: skipping EDR call");
    ExecuteResponse { status: "dry_run".to_string(), dry_run: true }
} else {
    state.edr.dispatch(payload.action_type, &payload.host).await
};

The audit entry written on a DRY_RUN action looks identical to a real action except for the dry_run: true field. That is intentional. An operator looking at the audit log can tell the difference, and a compliance review on the JSONL stream sees the same chain integrity either way.

Rule 6: LLM output never directly executed

The LLM returns a JSON object with five fixed fields: verdict, confidence, reasoning, mitre_techniques, suggested_action. The triage_with_llm function in worker/llm.py runs the parsed object through _parse_triage_json which checks every field against a fixed allow-list (verdict in {CRITICAL, HIGH, MEDIUM, LOW, BENIGN}, confidence clamped to [0, 1], suggested_action in the action allow-list). A response that fails validation produces a conservative MEDIUM verdict at 0.5 confidence, not a partial commit.

The validated object never touches exec, eval, subprocess, the filesystem, or SQL. It only sets fields on a TriageResult. The Pydantic model itself is frozen so even a downstream caller cannot mutate fields after the fact.
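A stdlib-only sketch of the validation step follows. It uses a frozen dataclass as a stand-in for the frozen Pydantic model, and the action names are placeholders because the real allow-list is not given in this document.

```python
import json
from dataclasses import dataclass

VERDICTS = {"CRITICAL", "HIGH", "MEDIUM", "LOW", "BENIGN"}
ACTIONS = {"isolate_host", "kill_process", "none"}  # placeholder allow-list


@dataclass(frozen=True)
class TriageResult:
    verdict: str
    confidence: float
    reasoning: str
    mitre_techniques: tuple
    suggested_action: str


FALLBACK = TriageResult("MEDIUM", 0.5, "LLM output failed validation", (), "none")


def parse_triage_json(raw: str) -> TriageResult:
    try:
        data = json.loads(raw)
        verdict, action = data["verdict"], data["suggested_action"]
        techniques = data["mitre_techniques"]
        if verdict not in VERDICTS or action not in ACTIONS:
            return FALLBACK
        if not isinstance(techniques, list):
            return FALLBACK
        confidence = min(1.0, max(0.0, float(data["confidence"])))  # clamp to [0, 1]
        return TriageResult(verdict, confidence, str(data["reasoning"]),
                            tuple(str(t) for t in techniques), action)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing field, or wrong type: conservative default.
        return FALLBACK
```

The function either returns a fully validated result or the fallback. There is no branch that keeps some fields from a response that failed validation.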

Multi-tenancy

Tenant isolation is a property of the data layer, not a runtime check in business logic.

| Surface | How tenants are separated |
| --- | --- |
| Redis queue | Key namespace: vyrox:alerts:{tenant_id} |
| SQLite tables | Every row carries tenant_id; queries filter on it |
| Discord channels | DiscordGuild.tenant_id maps Discord server to tenant |
| Webhook secrets | Looked up per tenant in tenant_credentials.webhook_secret_encrypted |
| Audit log | Each entry carries tenant_id; export endpoints filter server-side |
| Token budget | Daily ledger keyed on tenant_id and date |
| Verdict cache | Cache key (tenant_id, fingerprint) |

The webhook routes resolve the tenant from the vendor payload's own identifier field (customer_id, accountId, tenantId), then look up that tenant's secret in tenant_credentials before verifying the signature. A payload that authenticates with the wrong tenant's secret fails the HMAC check and returns 401. A payload with no identifier returns 400. There is no path where an unmatched payload lands on a shared queue.
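The status-code mapping described above can be sketched as one function. The authenticate name, the in-memory secrets dict standing in for tenant_credentials, and the choice to answer 401 for an unknown tenant are assumptions.

```python
import hashlib
import hmac
import json
from typing import Optional, Tuple

TENANT_FIELDS = ("customer_id", "accountId", "tenantId")


def authenticate(raw_body: bytes, signature: str, secrets: dict) -> Tuple[int, Optional[str]]:
    """Return (http_status, tenant_id). secrets maps tenant_id -> webhook secret."""
    try:
        payload = json.loads(raw_body)
    except ValueError:
        return 400, None
    if not isinstance(payload, dict):
        return 400, None
    tenant_id = next(
        (payload[f] for f in TENANT_FIELDS
         if isinstance(payload.get(f), str) and payload[f]),
        None,
    )
    if tenant_id is None:
        return 400, None  # no identifier: reject, never route to a default
    secret = secrets.get(tenant_id)
    expected = "sha256=" + hmac.new(
        (secret or "").encode(), raw_body, hashlib.sha256).hexdigest()
    if secret is None or not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401, None  # unknown tenant or wrong secret, same generic answer
    return 200, tenant_id
```

The MAC is computed over the raw request bytes, not over a re-serialised payload, so the verifier hashes exactly what the sender signed.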

Cross-tenant access from inside the Discord bot is blocked by a check at discord_bot/main.py:312. The custom_id of every approval button embeds the alert's tenant ID. Before calling the approval handler, the bot checks that the alert tenant matches the Discord guild's tenant. A mismatch returns "This action is not valid for this server" without contacting the proxy.
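As an illustration of that guard, here is a sketch with an invented custom_id layout. The real encoding is not documented here; the "action:tenant_id:alert_id" form and the function name are assumptions.

```python
from typing import Optional, Tuple

REJECTION = "This action is not valid for this server"


def check_button_tenant(custom_id: str, guild_tenant_id: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (alert_id, None) on a match or (None, rejection message) otherwise."""
    # Assumed layout: "<action>:<tenant_id>:<alert_id>".
    parts = custom_id.split(":")
    if len(parts) != 3 or parts[1] != guild_tenant_id:
        return None, REJECTION  # the proxy is never contacted on this path
    return parts[2], None
```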

Two-stage triage

Triage runs in worker/triage.py::triage. Five stages, three early returns.

                                                              ┌────────────────────────┐
       NormalizedAlert ──▶ verdict cache ──▶ cache hit ──▶ │ return cached verdict │
                                  │                            └────────────────────────┘
                                  │ cache miss
                                  ▼
                          heuristics engine
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        ▼                          ▼                          ▼
  confidence ≥ 0.75         confidence ≤ 0.25       0.25 < confidence < 0.75
  accept heuristic         return BENIGN               LLM fallback
       verdict                                          │
                                                            ▼
                                                  token budget check
                                                            │
                                ┌───────────────────────────┼───────────────────────────┐
                                ▼                            ▼                            ▼
                          budget exhausted        primary model            primary 429/5xx
                          MEDIUM / 0.5            parse + return          ▼
                                                                              fallback model 1
                                                                              parse + return
                                                                                      │
                                                                                      ▼
                                                                              fallback model 2
                                                                              parse + return
                                                                                      │
                                                                                      ▼
                                                                              all rate limited
                                                                              MEDIUM / 0.5

The two-stage design solves three problems at once: determinism and explainability for the eighty percent of alerts that are obvious; low cost, because the LLM is reserved for the ambiguous middle band; and a conservative default verdict for any failure mode, so the queue never jams on a provider outage. The LLM provider is not named in this doc because the choice is operational. The model chain is configured in environment variables (LLM_PRIMARY_MODEL, LLM_FALLBACK_MODEL_1, LLM_FALLBACK_MODEL_2).
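The routing in the diagram reduces to a small decision function. The thresholds and the MEDIUM / 0.5 default are from the diagram; the function name, the callable-per-model interface, the ProviderUnavailable exception, and the confidence attached to a BENIGN return are assumptions.

```python
from typing import Callable, Sequence, Tuple

ACCEPT_AT = 0.75
BENIGN_AT = 0.25
CONSERVATIVE = ("MEDIUM", 0.5)


class ProviderUnavailable(Exception):
    """Stands in for a 429 or 5xx from the LLM provider."""


def route(heuristic_verdict: str, confidence: float,
          models: Sequence[Callable[[], Tuple[str, float]]],
          budget_left: int) -> Tuple[str, float]:
    if confidence >= ACCEPT_AT:
        return heuristic_verdict, confidence   # early return: heuristics decide
    if confidence <= BENIGN_AT:
        return "BENIGN", confidence            # early return: confidently benign
    if budget_left <= 0:
        return CONSERVATIVE                    # token budget exhausted
    for call_model in models:                  # primary, then fallbacks in order
        try:
            return call_model()
        except ProviderUnavailable:
            continue
    return CONSERVATIVE                        # every model was unavailable
```

Every path returns a verdict, which is what keeps the queue moving during a provider outage.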

Approval flow

   Discord button click
            │
            ▼
   bot /interactions   ◀──── Ed25519 verify against settings.discord_public_key
            │
            ▼
   custom_id parse  ──▶ approve / deny / investigate
            │
            ▼  (approve only)
   AlertRecord lookup by alert_id + tenant_id
            │
            ▼
   Idempotency check  ──▶ if status already executed/executing/approved → no-op
            │
            ▼
   Mark alert "executing"
   Persist ActionRecord "approved"
   Audit "approve.requested"      ◀──── written before any outbound call
            │
            ▼
   proxy_client.execute_action()
   body signed with vyrox_hmac_secret (deterministic JSON)
            │
            ▼
   Rust proxy /execute
   ├─ HMAC verify
   ├─ replay window check (±30s)
   ├─ nonce.claim_or_replay(request_id)
   ├─ audit::append_audit  ◀──── written before EDR call
   └─ edr.dispatch (or DRY_RUN short-circuit)
            │
            ▼
   ActionRecord.status = "executed" or "dry_run"
   Alert.status        = "executed"
   Audit "approve.executed"

The flow's idempotency story has three layers. The bot checks the AlertRecord.status before generating a request ID, so a double-click returns "already approved". The proxy keeps a per-request-ID nonce store with ten-minute retention, so a network retry replays the cached response instead of calling the EDR twice. The audit entry is written once per state transition; replayed requests do not double-log.
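The proxy-side layer can be modelled in a few lines. The real implementation is Rust on a concurrent DashMap; this is a single-threaded Python sketch of the same idea, and the class and parameter names are assumptions.

```python
import time
from typing import Callable, Optional, Tuple


class NonceStore:
    """Remembers the response for each request ID for a fixed retention period."""

    def __init__(self, retention_s: float = 600.0):
        self._retention_s = retention_s
        self._seen = {}  # request_id -> (claimed_at, response)

    def claim_or_replay(self, request_id: str, execute: Callable[[], dict],
                        now: Optional[float] = None) -> Tuple[dict, bool]:
        """Return (response, replayed). execute runs at most once per live request ID."""
        now = time.time() if now is None else now
        # Drop entries older than the retention period.
        self._seen = {k: v for k, v in self._seen.items()
                      if now - v[0] < self._retention_s}
        if request_id in self._seen:
            return self._seen[request_id][1], True
        response = execute()
        self._seen[request_id] = (now, response)
        return response, False


def within_replay_window(request_ts: float, now: float, window_s: float = 30.0) -> bool:
    # Accept timestamps up to window_s in the past or the future.
    return abs(now - request_ts) <= window_s
```

Retention (ten minutes) is much longer than the replay window (thirty seconds either way), so a request ID is still remembered for as long as its timestamp could be accepted.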

Configuration

All configuration is read at startup from environment variables through shared/config.py::Settings. The settings class uses pydantic_settings so a missing required field raises a ValidationError before the service serves traffic.

The full env contract is in .env.example in the private monorepo. The fields that an OSS contributor needs to know about:

| Variable | Component | Purpose |
| --- | --- | --- |
| VYROX_HMAC_SECRET | all | Sixty-four hex characters. Signs Python ↔ Python and Python ↔ Rust traffic. |
| REDIS_URL | ingestion, worker | redis:// or rediss:// URL. The legacy Upstash REST variables are still accepted for backward compatibility but new deployments should set this. |
| OPENCODE_ZEN_API_KEY | worker | LLM provider key. Empty falls back to the legacy OPENROUTER_API_KEY during the migration window. |
| DISCORD_BOT_TOKEN | bot | Discord application token. |
| DISCORD_PUBLIC_KEY | bot | Application public key for interaction Ed25519 verification. Empty skips verification (local dev only). |
| CROWDSTRIKE_WEBHOOK_SECRET | ingestion | Vendor-default HMAC secret. Per-tenant secrets stored in tenant_credentials override this. |
| SENTINELONE_WEBHOOK_SECRET | ingestion | Vendor-default bearer token. |
| DEFENDER_WEBHOOK_SECRET | ingestion | Defender Graph clientState value used as bearer. |
| AUDIT_LOG_PATH | all writers | Directory for daily JSONL files. The hash chain depends on this surviving restart. |
| VYROX_PROXY_URL | bot | Base URL of the Rust proxy. |
| DRY_RUN | proxy | true by default. Production opts in to real EDR calls. |
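Two properties of this contract are worth illustrating: required values fail at startup, and DRY_RUN fails safe. The sketch below uses only the standard library rather than pydantic_settings, and DRY_RUN is actually read by the Rust proxy, so treat load_settings, ConfigError, and the exact parsing rule as assumptions.

```python
import os
import re
from typing import Mapping


class ConfigError(RuntimeError):
    """Raised before the service starts serving traffic."""


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    secret = env.get("VYROX_HMAC_SECRET", "")
    if not re.fullmatch(r"[0-9a-fA-F]{64}", secret):
        raise ConfigError("VYROX_HMAC_SECRET must be 64 hex characters")
    # Fail safe: only an explicit "false" enables real EDR calls (assumed rule).
    dry_run = env.get("DRY_RUN", "true").strip().lower() != "false"
    return {"hmac_secret": secret, "dry_run": dry_run}
```

Under this rule a typo such as DRY_RUN=flase leaves the proxy in dry-run mode instead of enabling real execution.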

What is in the private side

The public docs are written to be read without access to the private code. That is intentional. The boundary makes it clear what is open to contribution.

The private monorepo holds the implementation of the pipeline above. File names mirror the layout described here (ingestion/, worker/, discord_bot/, shared/, playbook/, migrations/, tests/). The Python tests covering the public contracts have public-safe names (test_p0_regressions.py, test_p05_blockers.py). Anyone with access can map a private fix to a public contract in seconds.

The detection patterns, the LLM prompts, and the operational configs stay private. Those are the layer that creates the business; the proxy and the contracts are the layer that creates the trust. The split is deliberate.

Operating commitments

We do not publish hard SLA percentages in this repo. The reason is simple: numbers we cannot defend across all pilots today belong in negotiated contracts, not in OSS docs.

What we can commit publicly:

  • The audit log is customer-owned. We do not lose it, we do not modify it, and we provide export at any time. The format is the contract, not our retention policy.
  • Containment proceeds only after a human in Discord clicks Approve. There is no autonomous containment path.
  • Webhook authentication failures and proxy signature failures both return generic 401 responses. We never tell a caller which part of the credential was wrong.

Per-customer SLAs that involve uptime targets and triage latency live in signed contracts.

Decisions worth knowing

A short list, written for the reader who is asking "why this and not that".

Rust for the proxy. The proxy is the only Vyrox process that can cause customer-side side effects. The set of properties we wanted in one binary: memory safety without a garbage collector, a small static binary, a constant-time HMAC implementation in the ecosystem, no runtime dependency on a vendor library. The Rust choice gave us all of them. The proxy is intentionally small: about a thousand lines of code including tests, split across the main, hmac, audit, nonce, edr, and actions modules.

SQLite for the pilot. SQLite with WAL mode and a single writer process handles the pilot scale (ten tenants, low hundreds of alerts per day per tenant). Write contention bites somewhere around twenty-five tenants, which is the trigger for the Postgres migration. The schema is already SQLModel-compatible, so the migration is a SQL dump plus a connection string change, not a rewrite.

Discord as the operator UI. The first ten pilots use Discord exclusively. The bot handles onboarding, alert review, approval, and slash commands for stats and audit export. The cost is one extra infrastructure provider; the benefit is that a customer's first five-minute experience is "I added your bot to my server and a synthetic alert appeared." A web dashboard ships when a prospect refuses Discord or when customer count reaches eleven, whichever comes first.

Two-stage triage. A pure LLM design is slow, expensive at scale, and not auditable without careful prompt engineering. A pure rules design misses anything novel. The split lets us run the heuristics for free, run the LLM only on the ambiguous middle band, fall back to a conservative MEDIUM on any failure, and keep the LLM output strictly inside a Pydantic schema before it touches anything else.

Human in the loop for execution. Auto-isolating hosts on false positives is the kind of incident that loses you the customer. Until we have a year of per-tenant false-positive data, every CRITICAL and HIGH containment is gated on a human Approve click. LOW auto-approval is opt-in per tenant and logged identically to manual approvals.

Cross-references