What this is

BSS-CLI is a Business Support System for a small mobile operator, built to run entirely from a terminal. It covers the real work of running a subscriber base — CRM with case and ticket management, product catalog, commercial and service order management, a provisioning simulator, eSIM profile management, subscription and bundle balance, mediation, rating, and payment — with nine small services behind real TMF Open API surfaces.

It is opinionated. Bundled prepaid only, card-on-file mandatory, block-on-exhaust enforced at the policy layer. Every write travels through that policy layer so domain invariants hold even when an agent is driving. The payloads on the wire match the spec — camelCase, conformant, pointable at a real TMF client — not naming theater over a bespoke schema.

It is also deliberately small. The full stack cold-starts in 25 seconds, runs in under 4 GB of RAM, and costs about $0.005 per agent dev session. That is the measurable version of "lightweight."

BSS-CLI exists as three things at once:

  1. A reference implementation for engineers learning telco BSS/OSS.
  2. A deployable MVP for a small mobile operator — one make up and you are running.
  3. A substrate for agentic experiments — a realistic telco backend an LLM can drive end-to-end, behind a policy layer that will not let it corrupt state.

The interactive REPL

The v0.6 REPL with renderer dispatch — the agent calls typed tools, results render as ASCII cards inline, no JSON dumps.

This is what "LLM-native" actually feels like in practice. The agent is not a separate surface grafted on top of the system — it goes through the same HTTP endpoints, the same typed tool signatures, the same write-through policy layer, and the same audit trail a human CLI user does. The only difference is the front door: a conversation instead of a verb. When the agent tries to do something illegal, the policy layer hands back a structured PolicyViolation with a rule code and machine-readable context, and the agent self-corrects from that.
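For illustration, a violation might look like this on the wire (the field names here are hypothetical; the contract is the shape: a rule code plus machine-readable context):

HTTP/1.1 422 Unprocessable Entity

{
  "code": "PolicyViolation",
  "rule": "BLOCK_ON_EXHAUST",
  "message": "Bundle has zero remaining quota; service is blocked until renewal or an explicit top-up.",
  "context": {"subscriptionId": "SUB-…", "remainingMb": 0}
}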


The seven principles

These are the soul of the project. They live in a CLAUDE.md at the repo root that reads like a contract, and they are enforced — there is a make doctrine-check target that fails the build if anyone calls datetime.now() in business logic instead of the injected clock (sketched just after the list below). Rules only matter if something refuses to merge when you break them.

  1. Bundled-prepaid only. Pay upfront; a bundle either has quota or it does not. No proration, no dunning, no collections, no credit-risk modelling to reason about.
  2. Card-on-file is mandatory. Every customer has a payment method before activation. A failed charge means no service — no exceptions, no grace period.
  3. Block-on-exhaust. Service stops the same instant a bundle hits zero. Only two paths back: the scheduled renewal on the period boundary, or an explicit VAS top-up the customer initiates.
  4. CLI-first, LLM-native. Every capability is a typed tool the agent can call. ASCII art is the visualization language — the terminal is the product surface, not a compromise.
  5. TMF-compliant where it counts. Real TMF620, 621, 622, 629, 635, 638, 640, 641, 676, 678, and 683 payloads. CamelCase on the wire, conformant to the spec. You can point a real TMF client at it.
  6. Lightweight is measurable. Full stack under 4 GB RAM, cold start in 25 seconds, p99 internal API under 50 ms. Numbers on the tin, verified in CI.
  7. Write through policy, read freely. Every write goes through a policy layer that enforces domain invariants. The LLM cannot corrupt state even when asked to. That is the whole point.
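A minimal sketch of the injected-clock seam that doctrine-check guards (names illustrative, not the repo's actual modules):

from datetime import datetime, timezone
from typing import Protocol


class Clock(Protocol):
    def now(self) -> datetime: ...


class SystemClock:
    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FrozenClock:
    # what a test, a scenario, or an accelerated soak drives
    def __init__(self, at: datetime) -> None:
        self._at = at

    def now(self) -> datetime:
        return self._at


def renewal_due(expires_at: datetime, clock: Clock) -> bool:
    # business logic takes the clock as a dependency; a bare
    # datetime.now() here is exactly what refuses to merge
    return clock.now() >= expires_at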

Architecture

Three callers reach nine services through one of two paths. The CLI (terminal, REPL, scenario runner) and the self-serve customer portal — the latter production-shape as of v0.12 — go direct via bss-clients for everything deterministic: every read, every signup step (since v0.11), every post-login self-serve write. The CSR console on port 9002 is still a work in progress — the v0.5 customer-360 + ask-the-agent surface ships, but the operator workbench has not yet seen the login wall, RBAC, or step-up auth that landed on the customer side. Flows that need LLM judgment go through the orchestrator: the CSR ask agent surface, and the chat route on the self-serve portal (with the v0.12 customer_self_serve profile).

Inside, two planes connect the services: synchronous HTTP over the typed bss-clients layer (TMF APIs, X-BSS-API-Token) for calls that need an immediate answer, and asynchronous events over a RabbitMQ topic exchange bss.events for reactions. Every service writes directly to its own schema in one shared PostgreSQL 16; audit.domain_event is written in the same transaction as the domain write, with the RabbitMQ publish happening after commit (simplified outbox). Every service exports OpenTelemetry spans to Jaeger; bss trace renders the same spans as an ASCII swimlane in the terminal.

┌──────────────────┐  ┌──────────────────┐  ┌──────────────────────────┐
│ Self-serve UI    │  │ CSR console UI   │  │  bss (CLI + REPL)        │
│ port 9001 (v0.4) │  │ port 9002 (v0.5) │  │  + LangGraph orchestrator│
│   ✓ ready        │  │   ⚠ WIP          │  │                          │
└────────┬─────────┘  └─────────┬────────┘  └────────────┬─────────────┘
         │                      │                        │
┌────────┴────────┐   ┌─────────┴────────┐               │
│ direct          │   │ ask agent surface│               │
│  bss-clients    │   │  agent_bridge.*  │               │
│ (signup, post-  │   │  → astream_once  │               │
│  login, reads)  │   └─────────┬────────┘               │
│ chat → astream  │             │                        │
│ (customer_self_ │             │                        │
│  serve profile, │             │                        │
│  v0.12)         │             │                        │
└────────┬────────┘             │                        │
         │                      ▼                        ▼
         │      ┌──────────────────────────────────────────────────┐
         │      │  bss_orchestrator.session.astream_once(channel,  │
         │      │   actor=…) · ReAct over 95-tool registry · pin   │
         │      │   allow_destructive=False                        │
         │      └──────────────────────────┬───────────────────────┘
         │                                 │
         └────────────────────►────────────┴──────────────────────►
                                  │ HTTP (TMF APIs) + bss-clients
   ┌──────┬────────┬──────────────┼───────┬────────┐
   ▼      ▼        ▼              ▼       ▼
┌─────┐┌─────┐ ┌─────┐         ┌─────┐┌─────┐
│CRM* ││Pay  │ │Cat  │         │COM  ││Subs │
│8002 ││8003 │ │8001 │         │8004 ││8006 │
└──┬──┘└──┬──┘ └──┬──┘         └──┬──┘└──┬──┘
   │      │       │               │      │
   │      └─ HTTP (e.g. Pay→CRM "customer exists?")
   │                              │
   │      ┌───────────────────────┼──────────────────┐
   │      │       ┌─────┐ ┌─────┐ ┌─────┐ ┌──────┐
   │      │       │SOM  │ │Med  │ │Rate │ │Prov  │
   │      │       │8005 │ │8007 │ │8008 │ │ Sim  │
   │      │       └──┬──┘ └──┬──┘ └──┬──┘ │ 8010 │
   │      │          │       │       │   └──┬───┘
   ▼      ▼          ▼       ▼       ▼      ▼
═══════════════════════════════════════════════════════════
║         RabbitMQ — topic exchange: bss.events            ║
║  order.* · service_order.* · service.* · provisioning.*  ║
║  subscription.* · usage.* · crm.* · payment.*            ║
═══════════════════════════════════════════════════════════

Each service writes directly to its own schema in ONE shared
Postgres instance. audit.domain_event is written in the same
transaction as the domain write; RabbitMQ publish happens
after commit (simplified outbox). Every service exports OTel
spans to Jaeger (v0.2+).

┌────────────────────────────────────────────────┐  ┌──────────────┐
│        PostgreSQL 16 (single instance)         │  │   Jaeger     │
│                                                │  │  (v0.2+)     │
│  crm · catalog · inventory · payment ·         │  │  OTLP/HTTP   │
│  order_mgmt · service_inventory · provisioning │  │  → traces UI │
│  subscription · mediation · billing · audit ·  │  └──────────────┘
│  knowledge                                     │
└──────────────────────┬─────────────────────────┘
                       │ read-only
                       ▼
                ┌──────────────┐
                │   Metabase   │
                └──────────────┘

* CRM hosts the Inventory sub-domain (MSISDN + eSIM pools) on port 8002
  under /inventory-api/v1/...; not a separate container in v0.x.
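A sketch of that write path: domain write and audit row in one transaction, publish after commit. It assumes SQLAlchemy async sessions and aio-pika; the table columns and helper names are illustrative:

import json

import aio_pika
from sqlalchemy import text


async def complete_order(session, exchange, order_id: str) -> None:
    async with session.begin():
        # 1. the domain write
        await session.execute(
            text("UPDATE order_mgmt.product_order "
                 "SET state = 'completed' WHERE id = :id"),
            {"id": order_id},
        )
        # 2. audit.domain_event in the SAME transaction
        await session.execute(
            text("INSERT INTO audit.domain_event (event_type, payload) "
                 "VALUES (:t, CAST(:p AS jsonb))"),
            {"t": "order.completed", "p": json.dumps({"orderId": order_id})},
        )
    # 3. publish only after commit; if the process dies between commit
    #    and publish, the audit row still records the event
    await exchange.publish(
        aio_pika.Message(body=json.dumps({"orderId": order_id}).encode()),
        routing_key="order.completed",
    )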

Trace anything end-to-end

bss trace for-order ORD-… — the swimlane, rendered in the terminal.
The same trace in Jaeger — one click away when you want to drill in.

OpenTelemetry is the source of truth. The ASCII view is a terminal-native lens on the same spans, not a parallel tracing system; no second pipeline to maintain, nothing to drift. If something is happening in the real traces, it is happening in the swimlane — and if you need timing breakdowns, span tags, or log correlation, Jaeger is right there.
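Per-service wiring is the stock OpenTelemetry SDK setup, roughly this with only the service name varying (endpoint and names illustrative):

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = FastAPI()

provider = TracerProvider(resource=Resource.create({"service.name": "crm"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

# every inbound request becomes a span, exported over OTLP/HTTP
FastAPIInstrumentor.instrument_app(app)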

trace 4825e0bb25ae0870  766ms  125 spans  8 services  0 errors

POST /tmf-api/productOrderingManagement/v4/productOrder              [com         ]  ████████████████████████  766ms
  └─ POST /tmf-api/customerManagement/v4/customer/CUST-022           [crm         ]    ██                        18ms
  └─ POST /tmf-api/paymentMethod/v1/charge                           [payment     ]      ████                    34ms
  └─ POST /tmf-api/serviceOrderingManagement/v4/serviceOrder         [som         ]        ████████████████      512ms
        └─ INSERT INTO service_inventory.cfs                         [postgres    ]          ▌                    2ms
        └─ POST /tmf-api/resourceInventoryManagement/v4/resource     [inventory   ]          ███                  31ms
        └─ POST /provisioning/task                                   [provisioning]              ████████████     381ms
              └─ AMQP publish bss.events provisioning.task.completed [rabbitmq    ]                          ▌    3ms
        └─ POST /tmf-api/subscription/v1/activate                    [subscription]                          ███  47ms
  └─ AMQP publish bss.events order.completed                         [rabbitmq    ]                            ▌  3ms

The portals

Demo-grade chrome, production-shape platform. No design system, no i18n, no accessibility audit — the portals are built to make the backend visible, not to win a UX award. The platform underneath is the real story: a magic-link login wall (v0.8), a named-token perimeter that flows service_identity into every audit row (v0.9), direct-API post-login self-serve with step-up auth on every sensitive write (v0.10), and a chat surface scoped to the logged-in customer with a hard ownership trip-wire (v0.12).

Two small server-rendered portals sit next to the CLI: a self-serve customer portal and a CSR workbench. Both are HTMX plus Jinja with a vendored htmx.min.js and no SPA framework, which keeps the dependency surface tiny and lets the chat widget stream tool calls into the page over SSE without fighting a client-side router.
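The streaming side is plain SSE from a FastAPI route. A sketch, assuming astream_once is an async generator of orchestrator events (route path, channel name, and event shape are hypothetical):

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from bss_orchestrator.session import astream_once

app = FastAPI()


@app.get("/chat/stream")  # hypothetical route; the real widget also carries the user message
async def chat_stream():
    async def events():
        # each orchestrator event (token, tool call, tool result)
        # becomes one SSE frame the widget renders in place
        # actor would be bound from the verified session in reality
        async for event in astream_once("customer_chat", actor="customer:demo"):
            yield f"data: {json.dumps(event)}\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")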

Since v0.11, the signup funnel writes directly through bss-clients from the route handlers — one route, one BSS write, no orchestrator hop. Wall time dropped from about 85 seconds to under five. The chat widget is the only orchestrator-mediated surface that remains, and v0.12 narrowed its reach (see the next section).
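What "one route, one BSS write" looks like in practice, sketched (the client call and route are hypothetical; the pattern is the point):

from fastapi import APIRouter, Request

from bss_clients import crm  # typed client layer; module path assumed

router = APIRouter()


@router.post("/signup/details")  # a hypothetical step in the funnel
async def submit_details(request: Request, name: str, phone: str):
    # one deterministic write through bss-clients, no orchestrator hop
    customer = await crm.create_customer(  # hypothetical client call
        given_name=name,
        phone=phone,
        email=request.state.verified_email,  # carried by the magic-link session
    )
    return {"customerId": customer["id"]}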

Self-serve signup — direct-write since v0.11. Minimal form (name + phone + card) because the verified-email identity already carries the rest.
Confirmation with eSIM activation QR + LPA code.
v0.12 chat widget mid-conversation — pops up on every post-login page; multi-turn memory within a session; markdown rendering; auto-scroll on stream.
CSR customer 360 — read-only, single page.
CSR asks the agent mid-stream: "why is their data not working?"

The chat surface (v0.12)

The chat widget is one modality of access, not a privileged path. It calls the same typed tools through the same policy layer as the CLI — just through a tighter prompt-visible window. The architectural point is the chokepoint: the LLM cannot reach a tool the customer's direct UI doesn't already expose. If the chat ever has a capability the dashboard doesn't, that's a doctrine bug, not a feature.

The window is a tool profile, customer_self_serve — sixteen curated tools: eight read wrappers (subscription.list_mine, get_balance_mine, usage.history_mine, etc.), four write wrappers (vas.purchase_for_me, schedule_plan_change_mine, cancel_pending_plan_change_mine, terminate_mine), three public catalog reads, and one escalation tool (case.open_for_me). No *.mine tool accepts a customer_id parameter; the actor is bound from auth_context.current() and a startup self-check refuses to boot if any signature drifts.
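The seam and the self-check, sketched (registry shape and module paths assumed):

import inspect

import auth_context  # the auth_context.py seam; import path assumed
from bss_clients import subscription  # module path assumed


async def get_balance_mine():
    # no customer_id parameter by construction; the actor comes from
    # the verified session, never from the model's arguments
    customer_id = auth_context.current().customer_id
    return await subscription.get_balance(customer_id)


def assert_mine_signatures(registry: dict) -> None:
    # boot-time self-check: refuse to start if any scoped tool has
    # grown a customer_id parameter
    for name, tool in registry.items():
        if name.endswith(("_mine", "_for_me")):
            if "customer_id" in inspect.signature(tool).parameters:
                raise RuntimeError(f"{name} accepts customer_id; refusing to boot")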

  customer chat  ──▶  ┌─────────────────────────────────────────┐
                      │  Layer 1 — server-side policies         │
                      │  (the primary boundary, unchanged)      │
                      ├─────────────────────────────────────────┤
                      │  Layer 2 — *.mine wrapper pre-check     │
                      │  customer_id bound from auth_context;   │
                      │  cross-customer ids refused at the seam │
                      ├─────────────────────────────────────────┤
                      │  Layer 3 — output ownership trip-wire   │
                      │  every emitted tool result scanned for  │
                      │  cross-customer fields; mismatch → P0   │
                      └─────────────────────────────────────────┘
                                      │
                              generic safety reply
                              on trip — no leaked tool name

Caps bound the blast radius of a runaway customer (or a runaway prompt). The defaults are twenty requests per hour and two dollars per month, configurable via BSS_CHAT_*_PER_* environment variables and accounted from OpenRouter usage metadata into an audit.chat_usage row per customer per period. Cap checks fail closed: a database error means the chat refuses to call astream_once, not the other way round.
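The fail-closed check, sketched; the column names and exact env-var names are assumptions:

import os

from sqlalchemy import text

HOURLY_CAP = int(os.getenv("BSS_CHAT_REQUESTS_PER_HOUR", "20"))    # name assumed
MONTHLY_USD = float(os.getenv("BSS_CHAT_USD_PER_MONTH", "2.00"))   # name assumed


async def chat_allowed(session, customer_id: str) -> bool:
    try:
        row = (await session.execute(
            text("SELECT requests_this_hour, usd_this_month "
                 "FROM audit.chat_usage WHERE customer_id = :cid"),
            {"cid": customer_id},
        )).one_or_none()
    except Exception:
        # fail closed: a DB error means we never reach astream_once
        return False
    if row is None:
        return True  # no usage recorded yet this period
    return row.requests_this_hour < HOURLY_CAP and row.usd_this_month < MONTHLY_USD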

Five categories the chat is not allowed to resolve on its own — fraud, billing_dispute, regulator_complaint, identity_recovery, bereavement. The system prompt names them with examples; when the agent recognises one, it calls case.open_for_me, which writes a CRM case linked to a SHA-256 hash of the conversation. The transcript itself lives in audit.chat_transcript, addressed by hash, append-only. A CSR opens /case/{id} in the v0.5 console and reads the conversation in the new "Chat transcript" panel.
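The hash link is plain content addressing, something like this assuming a canonical JSON encoding of the transcript:

import hashlib
import json


def transcript_hash(messages: list[dict]) -> str:
    # canonical encoding so the same conversation always yields the
    # same address in audit.chat_transcript
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()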

Pre-signup browse mode is the same widget with a different system prompt. Any verified-email visitor without a customer record yet sees the chat on /welcome and /plans; the *.mine wrappers refuse cleanly because there is no actor to bind. The visitor can ask "what plans do you have?" and get an answer from the public catalog reads without being forced through signup first.


The 14-day soak

v0.12 ships with a soak runner (scenarios/soak/run_soak.py) that drives 30 synthetic customers through 14 simulated days under a frozen accelerated clock — chat queries, dashboard hits, escalation triggers, and deliberate cross-customer probes. It is an internal-beta soak under accelerated time, not real user traffic; the report (soak/report-v0.12.md) is what the v0.12 release was tagged on.

  Ownership-check trips    0 / target 0
  Cross-customer leaks     0 / target 0
  Chat-usage drift         0.0% / tolerance 5%
  p99 chat latency         8.35 s · alarm tier (fail at 15 s)

What's shipped

  Services                  9 + 2 portals
  Typed LLM tools           95
  Tests passing             ~1,129
  Hero scenarios            12 end-to-end
  Cold start                25 seconds
  Runtime RAM               < 4 GB
  Image footprint           ~1.1 GB BYOI · ~2.65 GB bundled
  Agent dev session         ~$0.005 (Gemma 4 26B A4B)
  Architecture decisions    ~64 numbered DECISIONS.md entries

Tech stack

Python 3.12 · FastAPI · SQLAlchemy 2.0 + asyncpg · Alembic · Pydantic v2 · transitions · aio-pika · Typer + Rich · LangGraph · OpenRouter (Gemma 4 26B A4B) · bss-portal-auth (v0.8+) · PostgreSQL 16 · RabbitMQ · OpenTelemetry · Jaeger · HTMX + Jinja · uv workspace · Apache-2.0


Versions

  1. v0.12 Chat scoping + escalation + 14-day soak. customer_self_serve tool profile, output ownership trip-wire, per-customer rate + cost caps, five non-negotiable escalation categories, popup chat widget on every post-login page, pre-signup browse mode. v1.0 is what swaps in real Singpass + Stripe + SM-DP+.
  2. v0.11 Signup funnel migrates to direct API; the v0.4 LLM-mediated signup demo retires. Each step is one direct write from a route handler. Wall time: ~85s → <5s. Chat is the only orchestrator-mediated route that remains.
  3. v0.10 Post-login self-serve goes direct — top-up, plan-change scheduler, COF management, eSIM redownload, line cancel, contact updates, billing history. One route, one bss-clients write, no orchestrator hop. Step-up auth gates every sensitive write.
  4. v0.9 Named-token perimeter. Each external surface gets its own BSS_*_API_TOKEN; service_identity flows into audit.domain_event so audit answers "which surface initiated this write?" not just "which actor".
  5. v0.8 Self-serve portal login wall — email + magic-link OTP + step-up auth for sensitive writes. New portal_auth schema; tokens HMAC-SHA-256 with a server pepper; PortalSessionMiddleware binds request.state.customer_id from the verified session.
  6. v0.7 Catalog versioning + plan-change snapshot doctrine. No proration; price snapshotted at order time so renewal locks in the price even when the catalog re-prices. Operator-initiated migrate_to_new_price with regulatory notice.
  7. v0.6 Docs sweep, REPL renderer dispatch (rendered ASCII cards in the interactive REPL), tech-debt sweep, snapshot test framework.
  8. v0.5 CSR portal (customer 360 + ask-the-agent), case threading, scenario engine extensions.
  9. v0.4 Self-serve portal (HTMX + SSE agent log).
  10. v0.3 BSS_API_TOKEN middleware — the smallest possible auth story, behind the auth_context.py seam that has been in every service since day one.
  11. v0.2 OpenTelemetry + Jaeger + bss trace ASCII swimlane.
  12. v0.1 Nine services, full TMF surface, write-through-policy, hero scenarios.