Architecture

This document is the engineering reference for the Vyrox platform. It is written for the person who is about to read or modify the code, or who is evaluating Vyrox for a regulated workload and needs to know exactly what the system does. It does not describe what the product will become. It describes what runs in CI today.

If you are looking for setup steps, see QUICKSTART.md. If you are looking for the threat model, see THREAT_MODEL.md. If you want the on-disk audit format, see AUDIT_CHAIN.md.

Pipeline at a glance

   EDR vendors                                   Vyrox platform

  CrowdStrike Falcon  ─┐
  SentinelOne          ├─▶  POST /webhook/{vendor}  ─▶  Ingestion (FastAPI)
  Defender Graph       │      HMAC or bearer auth          │
  Generic JSON         ─┘      per-tenant secret           ▼
                                                  NormalizedAlert  ─▶ Redis  LPUSH/RPOP
                                                                          │   vyrox:alerts:{tid}
                                                                          ▼
                                              Worker (asyncio)

                                              1. Cache lookup     (24h TTL by alert fingerprint)
                                              2. Heuristics       (Noisy OR, <5ms)
                                                                  ├─ confidence ≥ 0.75 ▶ accept
                                                                  ├─ confidence ≤ 0.25 ▶ BENIGN
                                                                  └─ otherwise ▶ LLM
                                              3. LLM fallback     (primary + 2 fallback models)
                                                                  + Pydantic schema validation
                                                                  + per-tenant daily token budget
                                              4. Persist          (tenant-scoped tables)
                                              5. Surface          (operational console; optional notifier)

                                              Operational console (FastAPI + web apps)
                                              ├─ operator console  (Supabase auth)
                                              ├─ super-admin app   (founder)
                                              └─ approval flow  ▶ shared.approval_service ▶ Rust proxy

                                              Optional notifier (e.g. Discord bot)
                                              └─ approval flow  ▶ shared.approval_service ▶ Rust proxy

                                              Rust proxy
                                              ├─ HMAC verify       (constant time)
                                              ├─ replay window     (±30s)
                                              ├─ nonce dedup       (DashMap, 10min retention)
                                              ├─ audit append      (hash-chained JSONL)
                                              └─ EDR connector     (real vendor, or demo mock; simulated)

The services are independent processes. They communicate over HTTP and Redis only. There is no shared in-process state across services. The worker, the console API, and any optional notifier share one database; the production deployment is Postgres, addressed by DATABASE_URL, with the console API in front.

Components

Component	Language	Process	What it owns
Ingestion	Python, FastAPI	`uvicorn ingestion.main:app`	Webhook auth, vendor payload normalisation, Redis enqueue
Worker	Python, asyncio	`python -m worker.main`	Triage pipeline, persistence, surfacing alerts
Console API	Python, FastAPI	`uvicorn console.main:app`	The product surface: operator + super-admin endpoints, approval flow via `shared.approval_service`
Optional notifier	Python, FastAPI	`uvicorn discord_bot.main:app`	Retired surface kept as a notifier; approval flow via the same `shared.approval_service`
Containment proxy	Rust, Axum	`vyrox-proxy`	HMAC verify, replay window, nonce dedup, audit, EDR API call
Heuristics engine	Python	imported by the worker	Pattern matching, Noisy OR aggregation

The heuristics engine is private. The shape of its API (HeuristicsEngine.score(alert: dict) -> HeuristicResult) is documented here because callers depend on it. The pattern weights and the MITRE technique mapping are not.
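For readers who have not met Noisy OR aggregation, the sketch below shows the standard formula: the combined confidence is one minus the product of (1 - weight) over the patterns that matched. The `noisy_or` helper, the pattern names, and the weights are hypothetical and exist only for illustration; the real weights are private.

```python
from math import prod


def noisy_or(weights: list[float]) -> float:
    """Combine independent pattern weights as 1 - prod(1 - w_i)."""
    for w in weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"weight out of range: {w}")
    return 1.0 - prod(1.0 - w for w in weights)


# Hypothetical matched-pattern weights, for illustration only.
matched = {"encoded_powershell": 0.6, "lsass_access": 0.5}
confidence = noisy_or(list(matched.values()))
```

With no matched patterns the product is empty, so the confidence is 0.0. Each extra match can only raise the score, which is why several weak signals can together cross the accept threshold.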

Critical rules

These six rules are enforced by tests and reviewed in every PR. Violating one is a blocking issue, not a stylistic choice.

Rule 1: Tenant isolation

Every database query carries a tenant_id filter. Every Redis key is namespaced vyrox:alerts:{tenant_id}. There is no shared bucket and no fallback tenant.

The previous default-tenant fallback was removed on 2026-05-21 after the first audit caught it. The replacement contract: if a payload arrives without the vendor's tenant identifier (customer_id, accountId, tenantId), the ingestion route returns HTTP 400 and the EDR retries. The function is resolve_tenant_id in ingestion/main.py. It raises MissingTenantIdentifier on a missing or empty value.
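A minimal sketch of that contract, assuming the payload is already a parsed dict and that the three identifier fields are checked in a fixed order. The real signature of resolve_tenant_id and its per-vendor field mapping may differ; treating a whitespace-only value as empty is also an assumption.

```python
class MissingTenantIdentifier(ValueError):
    """Raised when a vendor payload carries no usable tenant identifier."""


# Field names come from the text above; checking them in this order is an assumption.
TENANT_FIELDS = ("customer_id", "accountId", "tenantId")


def resolve_tenant_id(payload: dict) -> str:
    for field in TENANT_FIELDS:
        value = payload.get(field)
        # Missing, non-string, empty, and whitespace-only values all count as absent.
        if isinstance(value, str) and value.strip():
            return value
    # There is deliberately no default tenant: the caller maps this to HTTP 400.
    raise MissingTenantIdentifier("payload has no tenant identifier")
```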

The schema invariant is checked at boot. shared/db.py:_assert_tenant_id_present walks every table in _TENANT_SCOPED_TABLES (alerts, actions, verdict_cache, token_usage) and refuses to start the service if any of them is missing the tenant_id column. The check uses PRAGMA table_info, runs once at startup, and raises SchemaIntegrityError loudly enough that the deploy fails.
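The boot check can be sketched as follows, assuming a SQLite connection from the standard library. The function and exception names mirror the ones above, but the body is an illustration, not the code in shared/db.py; how the check behaves against Postgres is not covered here.

```python
import sqlite3


class SchemaIntegrityError(RuntimeError):
    """Raised at startup when a tenant-scoped table lacks tenant_id."""


TENANT_SCOPED_TABLES = ("alerts", "actions", "verdict_cache", "token_usage")


def assert_tenant_id_present(conn: sqlite3.Connection) -> None:
    for table in TENANT_SCOPED_TABLES:
        # Table names come from the fixed tuple above, never from user input.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        columns = {row[1] for row in rows}  # row[1] is the column name
        # A table that does not exist yields no rows and fails the same way.
        if "tenant_id" not in columns:
            raise SchemaIntegrityError(f"{table} is missing tenant_id")
```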

Rule 2: Audit before response

Every state-changing operation writes an audit entry before the response goes back to the caller. The audit log is append-only JSONL. Each entry carries previous_hash (the SHA-256 hash of the prior entry) and hash (the SHA-256 of previous_hash || canonical_json(entry)). The first entry of the very first log file links to a sentinel genesis hash of sixty-four zeros.
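The chaining rule can be sketched in a few lines. Several details here are assumptions: that previous_hash is the prior entry's hash field, that the two values are concatenated as UTF-8 text, and that canonical JSON means sorted keys with compact separators. AUDIT_CHAIN.md is the authoritative spec; the helper names are illustrative.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # sentinel for the first entry of the first log file


def canonical_json(entry: dict) -> str:
    # Assumption: sorted keys and compact separators.
    return json.dumps(entry, sort_keys=True, separators=(",", ":"))


def chain_entry(previous_hash: str, entry: dict) -> dict:
    material = (previous_hash + canonical_json(entry)).encode("utf-8")
    digest = hashlib.sha256(material).hexdigest()
    return {**entry, "previous_hash": previous_hash, "hash": digest}


def verify_chain(records: list[dict]) -> bool:
    expected_prev = GENESIS_HASH
    for record in records:
        body = {k: v for k, v in record.items() if k not in ("previous_hash", "hash")}
        if record["previous_hash"] != expected_prev:
            return False  # link to the prior entry is broken
        if chain_entry(expected_prev, body)["hash"] != record["hash"]:
            return False  # entry content was altered after it was written
        expected_prev = record["hash"]
    return True
```

Because each hash covers the previous one, editing any entry invalidates every entry after it.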

The chain survives process restarts. AuditWriter.__init__ in shared/audit.py reads the last hash from today's log file before accepting the first write. The Rust proxy uses the same approach in audit::ChainState::from_file. Both implementations agree on the wire format. The independent specification is in AUDIT_CHAIN.md.

Audit writes are durable. _sync_write in shared/audit.py flushes and then calls os.fsync after every entry. The Rust side calls flush followed by sync_data. A power cut between the write and the OS writeback does not lose entries.
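A minimal sketch of restart recovery and the durable append, using only the standard library. The function names are hypothetical, and returning the genesis hash for a missing file is a simplification: how a new day's file links back to the previous day's is not modelled here.

```python
import json
import os

GENESIS_HASH = "0" * 64


def last_hash(path: str) -> str:
    """Return the hash of the final entry, or the genesis hash for a new file."""
    if not os.path.exists(path):
        return GENESIS_HASH
    last = None
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                last = line
    return json.loads(last)["hash"] if last else GENESIS_HASH


def durable_append(path: str, record: dict) -> None:
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True, separators=(",", ":")) + "\n")
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to commit the data to stable storage
```

Calling flush alone only moves bytes into the kernel's page cache. The fsync call is what makes the entry survive a power cut.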

Rule 3: HMAC before processing

Every webhook payload is verified before any parser touches its bytes. The verification uses hmac.compare_digest on the Python side and the subtle::ConstantTimeEq trait on the Rust side. Both run in time proportional to the MAC length, not to where the first byte mismatch appears.

The wire format on the Python side: sign(payload: str, secret: str) returns f"sha256={hex_digest}". The Rust verifier strips the "sha256=" prefix before comparing. The round-trip is locked by tests/test_p0_regressions.py::test_hmac_python_sign_uses_sha256_prefix.

For requests carrying JSON bodies that travel between Vyrox services (the worker calling the bot, the bot calling the proxy), the body is serialised with separators=(",", ":") and sort_keys=True. Without that pinning, Python's default json.dumps and Rust's serde_json disagree on whitespace and key order, which produces a different MAC on the verifier side and a silent 401.
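The signing rules above fit in a short sketch. The sign signature and the "sha256=" prefix come from the text; the verify and canonical_body helpers are illustrative names, and encoding the secret as UTF-8 text (rather than decoding it from hex) is an assumption.

```python
import hashlib
import hmac
import json


def canonical_body(obj: dict) -> str:
    # Pinned separators and key order so every service serialises the same bytes.
    return json.dumps(obj, separators=(",", ":"), sort_keys=True)


def sign(payload: str, secret: str) -> str:
    digest = hmac.new(secret.encode("utf-8"), payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(payload: str, secret: str, header: str) -> bool:
    expected = sign(payload, secret).encode("utf-8")
    # compare_digest does not short-circuit on the first mismatching byte.
    return hmac.compare_digest(expected, header.encode("utf-8"))
```

Two dicts with the same content but different insertion order produce the same canonical body, and so the same MAC.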

Rule 4: No autonomous containment

The LLM cannot trigger a containment action. The heuristics engine cannot trigger a containment action. The worker cannot trigger a containment action.

The only code path that calls the Rust proxy is shared.approval_service, reached after a human approves. Both surfaces go through it: the operational console (an operator clicks Approve in the web app, authenticated by a Supabase session) and the optional notifier (a Discord button click, verified end to end with Ed25519). Whichever surface is used, approval_service writes the audit entry, signs an ActionRequest with the shared HMAC secret, and only then calls the proxy, which verifies that signature before doing anything else. The console and the notifier share this path byte for byte, so the approval contract is identical no matter where the click came from.

The static invariant is enforced by a test: tests/test_p0_regressions.py::test_worker_triage_never_invokes_proxy greps the worker modules at import time and at source level for any reference to the proxy execute path. If the worker ever imports that symbol, the test fails. The check covers both eager imports and lazy imports inside functions.
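One way to express a source-level check of this kind is with the ast module instead of a text grep. This is a sketch, not the test in tests/test_p0_regressions.py: the forbidden names are taken from the approval flow diagram in this document (proxy_client, execute_action), and where those symbols actually live is an assumption.

```python
import ast

# Assumed names for the proxy execute path; the real module layout may differ.
FORBIDDEN_MODULES = {"proxy_client"}
FORBIDDEN_NAMES = {"execute_action"}


def references_proxy(source: str) -> bool:
    """True if the source imports or names the proxy execute path anywhere."""
    # ast.walk visits nested function bodies, so lazy imports are covered too.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[-1] in FORBIDDEN_MODULES for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[-1] in FORBIDDEN_MODULES:
                return True
            if any(a.name in FORBIDDEN_NAMES | FORBIDDEN_MODULES for a in node.names):
                return True
        elif isinstance(node, ast.Attribute) and node.attr in FORBIDDEN_NAMES:
            return True
        elif isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            return True
    return False
```

A test would read each worker module's source, pass it to references_proxy, and fail on the first True.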

Rule 5: Per-tenant EDR connector (no global DRY_RUN)

There is no global DRY_RUN kill-switch. The Rust proxy always dispatches to the tenant's configured EDR connector. Safety comes from two places instead: human approval (Rule 4), which gates every containment action, and connector configuration. A demo tenant (is_demo=true) points its edr_base_url at a bundled mock EDR and runs the real execute/rollback path against a simulated fleet; the proxy tags such actions simulated=true, an honesty label recorded in the audit and the evidence pack. The flag does not change behavior, so a demo action genuinely executes and stays rollback-able. A tenant configured with a real vendor but missing or invalid credentials fails closed at the connector.

// vyrox-proxy/src/main.rs: the proxy always dispatches; `simulated` is an
// honesty label carried from the tenant's is_demo, never an execution switch.
let outcome = dispatch_edr(&creds, &state.fallback_edr, action, direction, host).await;
// the response echoes the signed `simulated` flag: { status, simulated }

The simulated flag is recorded in the audit entry, so an operator (or a compliance review on the JSONL stream) can tell a demo-tenant action from a real one while the hash-chain integrity stays identical either way.
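The connector rules in Rule 5 can be modelled in Python for illustration. The proxy itself is Rust, and every type and field name below other than is_demo, edr_base_url, and simulated is hypothetical. The sketch assumes "fails closed" means the request is rejected before any outbound call.

```python
from dataclasses import dataclass
from typing import Optional


class ConnectorError(RuntimeError):
    """Raised when a tenant's connector cannot be used; nothing is dispatched."""


@dataclass(frozen=True)
class TenantEdrConfig:
    is_demo: bool
    edr_base_url: str
    api_token: Optional[str] = None  # hypothetical credential field


@dataclass(frozen=True)
class DispatchPlan:
    base_url: str
    simulated: bool


def plan_dispatch(cfg: TenantEdrConfig) -> DispatchPlan:
    if not cfg.edr_base_url:
        raise ConnectorError("no EDR endpoint configured")
    if not cfg.is_demo and not cfg.api_token:
        # Fail closed: a real-vendor tenant without credentials never dispatches.
        raise ConnectorError("missing EDR credentials")
    # simulated mirrors is_demo; it labels the action and changes no behaviour.
    return DispatchPlan(base_url=cfg.edr_base_url, simulated=cfg.is_demo)
```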

Rule 6: LLM output never directly executed

The LLM returns a JSON object with five fixed fields: verdict, confidence, reasoning, mitre_techniques, suggested_action. The triage_with_llm function in worker/llm.py runs the parsed object through _parse_triage_json, which validates every field: verdict must be in {CRITICAL, HIGH, MEDIUM, LOW, BENIGN}, confidence is clamped to [0, 1], and suggested_action must be in the action allow-list. A response that fails validation produces a conservative MEDIUM verdict at 0.5 confidence, not a partial commit.

The validated object never touches exec, eval, subprocess, the filesystem, or SQL. It only sets fields on a TriageResult. The Pydantic model itself is frozen so even a downstream caller cannot mutate fields after the fact.
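A sketch of that validation, using a frozen dataclass as a stand-in for the frozen Pydantic model so the example needs only the standard library. The action allow-list and the fallback's reasoning text are hypothetical; the real _parse_triage_json may differ in detail.

```python
import json
from dataclasses import dataclass

VERDICTS = {"CRITICAL", "HIGH", "MEDIUM", "LOW", "BENIGN"}
# Hypothetical action allow-list; the real one is not given in this document.
ACTIONS = {"isolate_host", "kill_process", "none"}


@dataclass(frozen=True)
class TriageResult:
    verdict: str
    confidence: float
    reasoning: str
    mitre_techniques: tuple
    suggested_action: str


FALLBACK = TriageResult("MEDIUM", 0.5, "LLM output failed validation", (), "none")


def parse_triage_json(raw: str) -> TriageResult:
    try:
        data = json.loads(raw)
        verdict = data["verdict"]
        action = data["suggested_action"]
        techniques = data["mitre_techniques"]
        if verdict not in VERDICTS or action not in ACTIONS:
            return FALLBACK
        if not isinstance(techniques, list) or not all(isinstance(t, str) for t in techniques):
            return FALLBACK
        confidence = min(1.0, max(0.0, float(data["confidence"])))  # clamp to [0, 1]
        return TriageResult(verdict, confidence, str(data["reasoning"]), tuple(techniques), action)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a wrong type: never a partial result.
        return FALLBACK
```

The parser only ever constructs a TriageResult. Nothing from the model's text is interpreted as code, a path, or a query.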

Multi-tenancy

Tenant isolation is a property of the data layer, not a runtime check in business logic.

Surface	How tenants are separated
Redis queue	Key namespace: `vyrox:alerts:{tenant_id}`
Database tables	Every row carries `tenant_id`; queries filter on it
Console API	Every route runs `assert_tenant_allowed` before any query
Optional notifier (Discord)	`DiscordGuild.tenant_id` maps a Discord server to a tenant
Webhook secrets	Looked up per tenant in `tenant_credentials.webhook_secret_encrypted`
Audit log	Each entry carries `tenant_id`; export endpoints filter server-side
Token budget	Daily ledger keyed on `tenant_id` and date
Verdict cache	Cache key `(tenant_id, fingerprint)`

The webhook routes resolve the tenant from the vendor payload's own identifier field (customer_id, accountId, tenantId), then look up that tenant's secret in tenant_credentials before verifying the signature. A payload that authenticates with the wrong tenant's secret fails the HMAC check and returns 401. A payload with no identifier returns 400. There is no path where an unmatched payload lands on a shared queue.

Cross-tenant access from inside the Discord bot is blocked by discord_bot/main.py:312. The custom_id of every approval button embeds the alert's tenant ID. Before calling the approval handler, the bot checks that the alert tenant matches the Discord guild's tenant. A mismatch returns "This action is not valid for this server" without contacting the proxy.

Two-stage triage

Triage runs in worker/triage.py::triage. Five stages, three early returns.

                                                              ┌────────────────────────┐
       NormalizedAlert ──▶ verdict cache ──▶ cache hit ──▶ │ return cached verdict │
                                  │                            └────────────────────────┘
                                  │ cache miss
                                  ▼
                          heuristics engine
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        ▼                          ▼                          ▼
  confidence ≥ 0.75         confidence ≤ 0.25       0.25 < confidence < 0.75
  accept heuristic         return BENIGN               LLM fallback
       verdict                                          │
                                                            ▼
                                                  token budget check
                                                            │
                                ┌───────────────────────────┼───────────────────────────┐
                                ▼                            ▼                            ▼
                          budget exhausted        primary model            primary 429/5xx
                          MEDIUM / 0.5            parse + return          ▼
                                                                              fallback model 1
                                                                              parse + return
                                                                                      │
                                                                                      ▼
                                                                              fallback model 2
                                                                              parse + return
                                                                                      │
                                                                                      ▼
                                                                              all rate limited
                                                                              MEDIUM / 0.5

The two-stage design solves three problems at once. It gives determinism and explainability for the eighty percent of alerts that are obvious. It keeps cost low because the LLM is reserved for the ambiguous middle band. It yields a conservative default verdict for any failure mode, so the queue never jams on a provider outage. The LLM provider is not named in this doc because the choice is operational. The model chain is configured in environment variables (LLM_PRIMARY_MODEL, LLM_FALLBACK_MODEL_1, LLM_FALLBACK_MODEL_2).
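The routing in the diagram can be sketched as a single function. The thresholds and the MEDIUM / 0.5 default come from this document; the Verdict type, the RateLimited exception, and the callable-based model chain are hypothetical stand-ins for whatever worker/triage.py actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

ACCEPT_AT = 0.75
BENIGN_AT = 0.25


@dataclass(frozen=True)
class Verdict:
    label: str
    confidence: float
    source: str


class RateLimited(Exception):
    """Stand-in for a 429 or 5xx from a model provider."""


CONSERVATIVE = Verdict("MEDIUM", 0.5, "default")


def triage(cached: Optional[Verdict], score: Callable[[], Verdict],
           budget_left: bool, models: Sequence[Callable[[], Verdict]]) -> Verdict:
    if cached is not None:
        return cached                                   # early return 1: cache hit
    heuristic = score()
    if heuristic.confidence >= ACCEPT_AT:
        return heuristic                                # early return 2: accept
    if heuristic.confidence <= BENIGN_AT:
        return Verdict("BENIGN", heuristic.confidence, "heuristics")  # early return 3
    if not budget_left:
        return CONSERVATIVE                             # token budget exhausted
    for call_model in models:                           # primary, then the fallbacks
        try:
            return call_model()
        except RateLimited:
            continue
    return CONSERVATIVE                                 # every model was rate limited
```

Every branch returns a verdict, so no input can leave an alert stuck in the queue.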

Approval flow

   Human clicks Approve
   ├─ operator console (Supabase session)         the product surface
   └─ optional notifier, e.g. Discord button      Ed25519 verified
            │
            ▼
   shared.approval_service.approve_alert     ◀──── one path for both surfaces
            │
            ▼  (approve only)
   AlertRecord lookup by alert_id + tenant_id
            │
            ▼
   Idempotency check  ──▶ if status already executed/executing/approved → no-op
            │
            ▼
   Mark alert "executing"
   Persist ActionRecord "approved"
   Audit "approve.requested"      ◀──── written before any outbound call
            │
            ▼
   proxy_client.execute_action()
   body signed with vyrox_hmac_secret (deterministic JSON)
            │
            ▼
   Rust proxy /execute
   ├─ HMAC verify
   ├─ replay window check (±30s)
   ├─ nonce.claim_or_replay(request_id)
   ├─ audit::append_audit  ◀──── written before EDR call
   └─ edr.dispatch (real vendor, or demo mock EDR)
            │
            ▼
   ActionRecord.status = "executed"
   Alert.status        = "executed"
   Audit "approve.executed"

The flow's idempotency story has three layers. The approval service checks AlertRecord.status before generating a request ID, so a double-click (from either surface) returns "already approved". The proxy keeps a per-request-ID nonce store with ten-minute retention, so a network retry replays the cached response instead of calling the EDR twice. The audit entry is written once per state transition; replayed requests do not double-log.
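The proxy-side layer can be illustrated in Python, although the real implementation is Rust and backed by a DashMap. The class and method names below are hypothetical, and this single-threaded sketch ignores the concurrency the real store must handle. Only the ±30s window and the ten-minute retention come from this document.

```python
import time
from typing import Callable, Optional

REPLAY_WINDOW_S = 30      # accepted clock skew, in either direction
NONCE_RETENTION_S = 600   # ten minutes


class NonceStore:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._seen: dict = {}  # request_id -> (stored_at, response)

    def within_window(self, request_ts: float) -> bool:
        return abs(self._clock() - request_ts) <= REPLAY_WINDOW_S

    def cached_response(self, request_id: str) -> Optional[dict]:
        """Return the earlier response for a retried request, if still retained."""
        self._evict()
        hit = self._seen.get(request_id)
        return hit[1] if hit else None

    def record(self, request_id: str, response: dict) -> None:
        self._seen[request_id] = (self._clock(), response)

    def _evict(self) -> None:
        cutoff = self._clock() - NONCE_RETENTION_S
        self._seen = {k: v for k, v in self._seen.items() if v[0] >= cutoff}
```

A handler would reject a request outside the window, return cached_response when it is not None, and otherwise dispatch once and call record.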

Configuration

All configuration is read at startup from environment variables through shared/config.py::Settings. The settings class uses pydantic_settings so a missing required field raises a ValidationError before the service serves traffic.

The full env contract is in .env.example in the private monorepo. The fields that an OSS contributor needs to know about:

Variable	Component	Purpose
`VYROX_HMAC_SECRET`	all	Sixty-four hex characters. Signs Python ↔ Python and Python ↔ Rust traffic.
`REDIS_URL`	ingestion, worker	`redis://` or `rediss://` URL. The legacy REST-style Redis variables are still accepted for backward compatibility but new deployments should set this.
`OPENCODE_ZEN_API_KEY`	worker	LLM provider key. Empty falls back to the legacy `OPENROUTER_API_KEY` during the migration window.
`DISCORD_BOT_TOKEN`	bot	Discord application token.
`DISCORD_PUBLIC_KEY`	bot	Application public key for interaction Ed25519 verification. Empty skips verification (local dev only).
`CROWDSTRIKE_WEBHOOK_SECRET`	ingestion	Vendor-default HMAC secret. Per-tenant secrets stored in `tenant_credentials` override this.
`SENTINELONE_WEBHOOK_SECRET`	ingestion	Vendor-default bearer token.
`DEFENDER_WEBHOOK_SECRET`	ingestion	Defender Graph `clientState` value used as bearer.
`AUDIT_LOG_PATH`	all writers	Directory for daily JSONL files. The hash chain depends on this surviving restart.
`VYROX_PROXY_URL`	bot	Base URL of the Rust proxy.
`VYROX_PROXY_SECRET`	proxy	Dedicated signing secret for proxy requests; falls back to `VYROX_HMAC_SECRET`.
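The two fallbacks in the table, and the length rule for the HMAC secret, can be sketched with the standard library. This is not the pydantic_settings-based Settings class: resolve_secrets and ConfigError are hypothetical names, and accepting upper-case hex is an assumption.

```python
import re
from typing import Mapping


class ConfigError(ValueError):
    """Raised at startup, before the service accepts traffic."""


HEX64 = re.compile(r"[0-9a-fA-F]{64}")


def resolve_secrets(env: Mapping[str, str]) -> dict:
    hmac_secret = env.get("VYROX_HMAC_SECRET", "")
    if not HEX64.fullmatch(hmac_secret):
        raise ConfigError("VYROX_HMAC_SECRET must be sixty-four hex characters")
    return {
        "hmac_secret": hmac_secret,
        # A dedicated proxy secret wins; otherwise reuse the shared HMAC secret.
        "proxy_secret": env.get("VYROX_PROXY_SECRET") or hmac_secret,
        # An empty primary key falls back to the legacy key during the migration window.
        "llm_api_key": env.get("OPENCODE_ZEN_API_KEY") or env.get("OPENROUTER_API_KEY", ""),
    }
```

In a service this would be called once with os.environ, so a bad secret stops the process at startup.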

What is in the private side

The public docs are meant to be readable without access to the private code. That is intentional: the boundary makes clear what is open to contribution.

The private monorepo holds the implementation of the pipeline above. File names mirror the layout described here (ingestion/, worker/, discord_bot/, shared/, playbook/, migrations/, tests/). The Python tests covering the public contracts have public-safe names (test_p0_regressions.py, test_p05_blockers.py). Anyone with access can map a private fix to a public contract in seconds.

The detection patterns, the LLM prompts, and the operational configs stay private. Those are the layer that creates the business; the proxy and the contracts are the layer that creates the trust. The split is deliberate.

Operating commitments

We do not publish hard SLA percentages in this repo. The reason is simple: numbers we cannot defend across all pilots today belong in negotiated contracts, not in OSS docs.

What we can commit publicly:

The audit log is customer-owned. We do not lose it, we do not modify it, and we provide export at any time. The format is the contract, not our retention policy.
Containment proceeds only after a human approves the action (in the operational console, or via an optional notifier such as the Discord bot). There is no autonomous containment path.
Webhook authentication failures and proxy signature failures both return generic 401 responses. We never tell a caller which part of the credential was wrong.

Per-customer SLAs that involve uptime targets and triage latency live in signed contracts.

Decisions worth knowing

A short list, written for the reader who is asking "why this and not that".

Rust for the proxy. The proxy is the only Vyrox process that can cause customer-side side effects. We wanted four properties in one binary: memory safety without a garbage collector, a small static binary, a constant-time HMAC implementation in the ecosystem, and no runtime dependency on a vendor library. Rust gave us all of them. The proxy is intentionally small: about a thousand lines of code including tests, split across main, hmac, audit, nonce, edr, and actions.

Postgres for the deployment; SQLite only for a local checkout. The deployed stack runs Postgres, addressed by DATABASE_URL, with the console API in front. A local checkout with no DATABASE_URL set falls back to SQLite with WAL mode, which is fine for a single-developer run but is not the deploy target. The schema is SQLModel-compatible across both, so the only difference is the connection string and the migration runner.

The operational console is the surface; Discord is a notifier. The product surface is the web operational console: cross-tenant work queue, decision view, per-tenant autonomy controls, evidence export, audit search. It is two authenticated web apps (the operator console and a founder super-admin app) in front of a FastAPI console API, and it is the path an operator approves through. Discord is a retired surface, kept only as one optional notifier (Slack and email follow) that mirrors the decision card. The important property is that whichever surface a human clicks Approve in, the action runs through the same shared.approval_service, so the audit trail and the approval contract do not depend on the surface.

Two-stage triage. A pure LLM design is slow, expensive at scale, and not auditable without careful prompt engineering. A pure rules design misses anything novel. The split lets us run the heuristics for free, run the LLM only on the ambiguous middle band, fall back to a conservative MEDIUM on any failure, and keep the LLM output strictly inside a Pydantic schema before it touches anything else.

Human in the loop for execution. Auto-isolating hosts on false positives is the kind of incident that loses you the customer. Until we have a year of per-tenant false-positive data, every CRITICAL and HIGH containment is gated on a human Approve click. LOW auto-approval is opt-in per tenant and logged identically to manual approvals.

Cross-references

THREAT_MODEL.md: assets, threats, mitigations, out of scope.
API_REFERENCE.md: every public endpoint with schemas and error codes.
AUDIT_CHAIN.md: on-disk format spec for the hash-chained audit log.
ADAPTERS.md: contributor guide for adding a new EDR vendor.
QUICKSTART.md: from git clone to a signed alert in about ten minutes.
Rust proxy source: https://github.com/vyrox-security/vyrox-proxy.
Simulator: https://github.com/vyrox-security/vyrox-simulator.

Vyrox Security