Prose in
Every conversation with an AI agent today is four translations per turn: my intent into prose, prose into the model’s graph, graph into the model’s prose, prose back into my head. Three chances for drift, every turn. This is the prototype of a different interface — one where the human and the model edit the same typed structure, and the model’s response is a diff of that structure, approved row by row, with each row citing the exact constraint it came from.
Every conversation with an AI agent is a small act of translation. I have a structured intent in my head — a decision about my portfolio, a constraint on a disclosure, a sort order for a list of bonds — and I flatten it into English. The model reads the English, rebuilds a structured representation of its own, acts on that representation, and re-flattens the result back into English for me to read. Four translations per turn. Three chances for drift.
In a casual context that drift is the running joke of AI products: the model confidently gets something slightly wrong, you laugh, you retype the prompt. In finance it’s a compliance finding. A disclosure that reads as “margin requirements may change” instead of “margin requirements will change above 25% concentration” is the difference between an audit note and a headline.
The fix people reach for is better prompts. Longer prompts. System prompts with twenty bullet points. I spent two years writing those prompts for my own team at an ASIC-regulated broker. The prompts got longer. The drift stayed.
The problem isn’t that our prose is bad. It’s that prose was never supposed to be the protocol.
An Intent Canvas is a structured representation of what the user is trying to do, rendered as a graph the user can see and edit. The model doesn’t read the user’s prose and guess at the graph. The model reads the graph. Every node is typed: constraint, entity, target, time-window, exclusion, preference. Every edge is labeled: applies-to, limits, requires, blocks.
When the user types a sentence — “reduce my tech concentration to under 30% by quarter-end” — the system extracts four nodes (Target: concentration, Entity: sector=tech, Constraint: ≤30%, Time-window: Q2 close) and shows them on a canvas. The user can correct any of them before the model proposes an action. Got the sector wrong? Click. Fix. Done. No re-prompting. No arguing with a chat bubble about what you meant.
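As a sketch, the four extracted nodes for that sentence might look like the following. The field names and the parsed date are illustrative, not the demo's actual schema:

```typescript
// Hypothetical shape for extracted intent nodes -- names are
// illustrative, not the demo's internal schema.
type IntentNode = {
  id: number;
  kind: "target" | "entity" | "constraint" | "time-window";
  label: string;
  value: string;
};

// "reduce my tech concentration to under 30% by quarter-end"
const extracted: IntentNode[] = [
  { id: 1, kind: "target",      label: "concentration", value: "portfolio concentration" },
  { id: 2, kind: "entity",      label: "sector=tech",   value: "tech" },
  { id: 3, kind: "constraint",  label: "≤30%",          value: "<=0.30" },
  { id: 4, kind: "time-window", label: "Q2 close",      value: "2025-06-30" },
];

// The user can correct any node before the model acts on it,
// e.g. narrowing the entity without re-prompting:
extracted[1].value = "large-cap tech excluding NVDA";
```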
The model’s response is not prose either. It’s a Semantic Diff: a set of proposed changes to the canvas, each with its own rationale citation. “I’m adding a node: Rebalance trade on AAPL, −2.1% position. Rationale: current AAPL weight 7.8%, tech sector weight 34.2%, largest contributor to over-limit.”
The human approves each proposed change with one click. No prompt engineering. No drift. Every approval is a signed row in an audit log.
How a portfolio rebalance actually flows through the canvas — prose appears once, at the start, and nowhere else.
1. User types one sentence, or picks a preset. This is the only place prose touches the system.
2. A parser — deterministic rules plus a small model — turns the sentence into typed nodes on the canvas. The user sees them immediately, rendered in plain view.
3. User adjusts any extracted node. This replaces prompt engineering: if the model got “tech” wrong and the user meant “large-cap tech excluding NVDA”, the user clicks the node and edits.
4. The model reads the canvas (not the prose) and proposes a set of changes. Each change carries its own citation — which node in the canvas triggered it, which portfolio position it affects, what the dollar impact is, what the confidence level is.
5. User approves, rejects, or edits each proposed change. The approved diff is applied. The rejected and edited rows are logged as training signal for the next turn.
The prose step happens once. The canvas state is what persists across turns. Ten turns in, the canvas is a rich structured object; the prose log is a footnote.
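The turn loop above can be sketched as functions over canvas state. This is a deliberately simplified stand-in (names and shapes are hypothetical, not the demo's code):

```typescript
type Canvas = { nodes: string[] };                    // simplified stand-in for the typed graph
type DiffRow = { change: string; approved: boolean };

// Turn 1: prose -> canvas. The only prose step in the session.
function extract(sentence: string): Canvas {
  return { nodes: sentence.split(",").map(s => s.trim()) };
}

// Every later turn: canvas -> proposed diff -> approved diff -> canvas.
function applyApproved(canvas: Canvas, diff: DiffRow[]): Canvas {
  const adds = diff.filter(r => r.approved).map(r => r.change);
  return { nodes: [...canvas.nodes, ...adds] };
}

let canvas = extract("Target: concentration, Constraint: <=30%");
canvas = applyApproved(canvas, [
  { change: "Action: trim AAPL", approved: true },
  { change: "Action: trim NVDA", approved: false },  // rejected rows are logged, not applied
]);
// canvas now carries three nodes; no prose was re-read after turn 1
```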
Below is a working Intent Canvas on a mock 12-holding portfolio. Type one of the three preset prompts (or your own sentence), watch the graph extraction happen, edit the nodes if the extractor got anything wrong, and review the proposed Semantic Diff. This is deterministic — no real LLM call, no API key. The mock mirrors the structural behaviour described above. The demo isn’t the point; the shape of the interaction is.
Pick a preset, or type your own sentence.
Click any node to edit its label. Drag to reposition. This is the shared artifact the model will read.
Pick a preset above, or type a sentence and press Extract to canvas. Nodes will appear here.
Each row carries a rationale citation, an evidence block, and a confidence score. Approve, reject, or edit row by row.
Once the canvas is populated, press Propose Semantic Diff above to see the model’s proposed changes.
12 mock holdings. Applying the approved diff changes the sector weight totals. No real orders placed — pressing Apply logs to console.
The canvas is deliberately small. The six node types below cover roughly 90% of the product-rebalance and disclosure-edit intents I saw in four years at ACY. The remaining 10% falls back to a free-text Note node that the model is told to treat as informational, never authoritative.
Target — the thing being changed. “Portfolio concentration”, “Disclosure wording”, “Sort order”. There is exactly one target per active intent.
Entity — the subject. “Sector=tech”, “Order book=equities”, “Client=institutional”. Typed, not free text.
Constraint — the rule. “≤30%”, “Reg T compliant”, “No hedge funds as counterparty”. Machine-readable predicate.
Time-window — the horizon. “Quarter-end”, “Before market open”, “By Friday”. Parsed to an absolute timestamp at extraction.
Exclusion — a negation. “Exclude NVDA”, “Not including options”, “Minus Treasury positions”.
Preference — a soft hint. “Prefer lower tax-lot impact”, “Favour liquid names”. Influences but does not force.
And four edge types:
applies-to — an entity is the subject of a target.
limits — a constraint bounds a target.
requires — one node is a precondition for another.
blocks — an exclusion prevents a target.

Every row has the same shape, whether it adds a node, removes a node, or mutates an existing one.
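The six node kinds and four edge labels form a closed schema. A sketch of how they might be encoded (field names are illustrative, not the demo's internals):

```typescript
// Closed vocabulary: the model can only emit these kinds and labels.
type NodeKind =
  | "target" | "entity" | "constraint"
  | "time-window" | "exclusion" | "preference";

type EdgeLabel = "applies-to" | "limits" | "requires" | "blocks";

interface CanvasNode { id: number; kind: NodeKind; label: string; }
interface CanvasEdge { from: number; to: number; label: EdgeLabel; }

// "sector=tech applies-to concentration; <=30% limits concentration"
const canvasNodes: CanvasNode[] = [
  { id: 1, kind: "target",     label: "portfolio concentration" },
  { id: 2, kind: "entity",     label: "sector=tech" },
  { id: 3, kind: "constraint", label: "<=30%" },
];
const canvasEdges: CanvasEdge[] = [
  { from: 2, to: 1, label: "applies-to" },
  { from: 3, to: 1, label: "limits" },
];
```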
The format borrows from git diff — with every change annotated by why it was proposed, not just what it does.
+ Add node: Rebalance trade
kind: action
target: AAPL
delta: −2.1% position (−$64,300)
rationale-source: node#3 (Constraint ≤30%), node#1 (Target: concentration)
evidence: current AAPL 7.8% of portfolio;
tech sector 34.2%; over-limit by 4.2pp
confidence: 0.91
status: [approve] [reject] [edit]
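The row above maps one-to-one onto a small typed record. A sketch, with hypothetical field names mirroring the example:

```typescript
// One Semantic Diff row. Every row has the same shape, whether it
// adds, removes, or mutates a node. Field names are illustrative.
interface SemanticDiffRow {
  op: "add" | "remove" | "mutate";
  node: { kind: string; label: string };
  delta?: string;
  rationaleSource: number[];   // back-pointers into the canvas graph
  evidence: string[];
  confidence: number;          // 0..1
  status: "pending" | "approved" | "rejected" | "edited";
}

const row: SemanticDiffRow = {
  op: "add",
  node: { kind: "action", label: "Rebalance trade: AAPL" },
  delta: "-2.1% position (-$64,300)",
  rationaleSource: [3, 1],     // Constraint <=30%, Target: concentration
  evidence: ["current AAPL 7.8% of portfolio", "tech sector 34.2%; over-limit by 4.2pp"],
  confidence: 0.91,
  status: "pending",
};
```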
This format matters because it makes the model’s reasoning auditable at the granularity of a row. When a compliance officer asks “why did the model propose this trade”, the answer isn’t a paragraph. It’s two pointers into a typed graph plus the specific dollar numbers that triggered the action. No prose paraphrase. No re-interpretation.
Rejected rows don’t disappear. They’re logged with the reject reason, which becomes input to the next session’s extractor tuning. A human who says “no, don’t touch NVDA” is teaching the next extractor to add an exclusion node automatically.
The audit log is a flat table of approved rows. Each row traceable to a node. Each node traceable to the prose sentence that created it. The chain is complete in either direction.
None of these regulations mention AI. They mention records, rationales, and approvals. The canvas format maps directly; a prose chat transcript doesn’t.
A suitability decision needs a documented match between a customer profile and the recommended action. The canvas stores the customer profile as nodes; the Semantic Diff stores the recommended action with citations back to those nodes. The canvas is the suitability audit trail by construction, not by retrofit.
MiFID II Article 24(1) requires firms to act in the client’s best interest. In UI terms that means the system must show the client the rationale for every recommended action in a form they can verify. The Semantic Diff row — with its evidence block and rationale-source pointers — is exactly that form.
The rule requires trading recommendations and their basis to be retained in an immutable form for three to six years. The canvas state plus the approved diff log is a structured, replayable record. A prose chat transcript satisfies retention; it doesn’t give you replayability.
Every material financial decision needs a signed chain of approvals. The line-by-line approval model on the Semantic Diff is a signed chain of approvals — each approved row is who-approved-what-when, in a form a SOX auditor can walk without asking follow-up questions.
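As a sketch of that chain, each approved diff row could become one audit record carrying who-approved-what-when plus the back-pointers an auditor walks (names are illustrative, not a production schema):

```typescript
// One approved diff row as an audit record. Each entry answers
// who-approved-what-when without decoding any prose.
interface AuditRow {
  rowId: string;
  action: string;
  approvedBy: string;
  approvedAt: string;      // ISO timestamp
  sourceNodeIds: number[]; // back-pointers into the canvas graph
}

const auditLog: AuditRow[] = [];

function approve(rowId: string, action: string, user: string, sources: number[]): void {
  auditLog.push({
    rowId,
    action,
    approvedBy: user,
    approvedAt: new Date().toISOString(),
    sourceNodeIds: sources,
  });
}

approve("row-1", "Trim AAPL -2.1%", "j.smith", [3, 1]);
```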
These aren’t marketing numbers. They’re the three cost lines I’d defend in a technical review, with the method note attached.
The prose-only protocol has to carry the full chat history into each turn to maintain context. With a canvas, only the canvas state (typed nodes, usually under 500 tokens for a complex rebalance) is carried. For a 10-turn session the prose baseline accumulates ~8,000 tokens of rolling history; the canvas baseline carries ~500 tokens of state plus ~200 tokens of new input per turn.
Method: Rough estimate against a typical 10-turn rebalance session with commercial model pricing at the time of writing. Production variance will be higher; this is the directional claim.
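The directional arithmetic behind that claim, under the stated assumptions (~800 tokens of new prose history per turn versus a flat ~500-token canvas state plus ~200 tokens of new input):

```typescript
// Directional token arithmetic for a 10-turn session.
// Assumption: prose history grows ~800 tokens/turn and is re-sent
// in full each turn; the canvas carries a fixed ~500-token state
// plus ~200 tokens of new input per turn.
const turns = 10;
const proseGrowthPerTurn = 800;
const canvasState = 500;
const newInputPerTurn = 200;

let proseTotal = 0;   // tokens sent across the whole session, prose-only protocol
let canvasTotal = 0;  // same, canvas protocol
for (let t = 1; t <= turns; t++) {
  proseTotal += t * proseGrowthPerTurn;          // rolling history keeps growing
  canvasTotal += canvasState + newInputPerTurn;  // flat per turn
}
// proseTotal = 44,000; canvasTotal = 7,000 -- roughly a 6x difference
```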
The model is never asked to produce free-text rationales — only to mutate a typed graph with a constrained schema. Structured-output prompting (OpenAI response_format: json_schema, Anthropic tool use, Google controlled generation) ships with schema validation; malformed responses are retried at the API layer before they reach the user. The remaining error surface is semantic (a wrong sector classification), not structural (a fabricated company name).
Method: Observed pattern from structured-output API documentation; the structural error class is eliminated by validation at the transport layer, leaving only semantic misclassification as the residual.
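A minimal sketch of the structural check that paragraph describes. Production would rely on the provider's schema enforcement; this hand-rolled validator just illustrates why the structural error class disappears while the semantic one survives:

```typescript
// Closed vocabulary of node kinds the model is allowed to emit.
const NODE_KINDS = ["target", "entity", "constraint", "time-window", "exclusion", "preference"];

type Proposed = { kind?: unknown; label?: unknown; confidence?: unknown };

// Structural validation: malformed rows are rejected (and retried
// upstream) before the user ever sees them.
function isValidRow(row: Proposed): boolean {
  return (
    typeof row.kind === "string" && NODE_KINDS.includes(row.kind) &&
    typeof row.label === "string" && row.label.length > 0 &&
    typeof row.confidence === "number" && row.confidence >= 0 && row.confidence <= 1
  );
}

// A fabricated kind fails structurally ...
isValidRow({ kind: "company", label: "Acme Corp", confidence: 0.9 });        // false
// ... while a wrong-but-well-formed classification passes; that
// residual semantic error is what row-by-row human review catches.
isValidRow({ kind: "entity", label: "sector=utilities", confidence: 0.8 });  // true
```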
At ACY I watched compliance review a chat-log-style disclosure session. The reviewer read the prose, reconstructed the decision tree, compared it to the source rule. ~45 minutes per session. With a Semantic Diff log, the reviewer scans the approved-row table — each row already citing the rule — in about 10 minutes.
Method: Internal observation across eight disclosure reviews between 2023 and 2025. Not benchmarked against production yet. The claim is directional; the mechanism (row-level citations vs prose decode) is the point.
Chat is fine for exploration, for reading, for loose conversation. The canvas is for action. Most products will have both surfaces.
The six node types and four edge types were derived from portfolio-management and disclosure-editing intents. A triage medical workflow has a different vocabulary. The pattern generalizes; the specific types don’t.
This demo is deterministic. No real LLM call, no API key, no retraining. The mock is calibrated to behave the way the thesis predicts. A production version wires the Semantic Diff format into a real structured-output API.
No real orders are placed. The portfolio is 12 mock holdings with plausible numbers.
The demo stops at the approval step — Apply logs to the browser console, not to a brokerage.
This is a working thesis demonstration. Six node types cover maybe 90% of my target domain; the remaining 10% lives in a free-text escape hatch that a production version would formalize. If you want to collaborate on extending the type system, reach out.
OpenAI’s response_format: json_schema documentation. Anthropic’s tool_result patterns.

If you’re working on any of these primitives — canvas UI for AI agents, structured-output prompting patterns, auditable AI-human workflows in regulated domains — I’d like to compare notes. Email is the fastest path.
Every problem we solve for clients has multiple valid approaches — different costs, different ROI, different risk profiles. These threads show how the approach on this page compares to others in the portfolio.
Portfolio-level math primitives — HHI, beta, VaR, regime — rendered into UI defaults and AI-assisted decision surfaces.
How upstream regulation and macro prints become downstream product defaults and Legal-safe disclosure.
How we prove design claims with data — A/B, pooled-SD, cohort analysis, and the rigor behind every number quoted on this site.