The Reliability Triad
Why the Next Generation of AI Might Finally Be Ready for Finance
If you’ve been trying to make AI a meaningful part of your finance and accounting workflows for the past couple of years, you already know the pattern:
It drafts quickly, but then sometimes trips over the details. It explains a variance with confidence, then randomly cites a source that does not exist. You can get value from it, but trusting it with board-level work has felt like a stretch.
Three new research threads point to a different near future. OpenAI’s work on hallucinations and on anti-scheming behavior, paired with Thinking Machines’ push for deterministic outputs, targets the exact reasons finance leaders hold back. Not headline-grabbing, but practical. If vendors bring these methods into production models, finance teams get closer to systems they can rely on.
Hallucinations: stop rewarding confident guessing
The problem. Models invent facts that sound credible but are wrong. In finance, that might be "margin compression due to commodity prices" for a business with zero commodity exposure.
The breakthrough. OpenAI's "Why Language Models Hallucinate" (September 2025) traces the problem to incentives: standard training and evaluation reward producing any answer over admitting uncertainty, so models learn to guess when evidence is thin.
The solution flips the script. Instead of rewarding fluent fabrication, new training gives credit for calibrated honesty. When data is insufficient, the best response isn't a compelling story. It's "insufficient evidence."
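To see why the grading scheme matters, here is a minimal sketch of the arithmetic. It assumes a simplified scheme in which a correct answer earns 1 point, an abstention earns 0, and a wrong answer costs a configurable penalty. The function name, the penalty values, and the 30 percent confidence figure are illustrative assumptions, not details taken from the paper.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering when the model is right with probability p_correct.

    Assumed scheme (illustrative): +1 for a correct answer, -wrong_penalty for a
    wrong one. Abstaining ("insufficient evidence") always scores 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


ABSTAIN_SCORE = 0.0
p = 0.30  # hypothetical: the model is only 30% sure of its explanation

# Accuracy-only grading: wrong answers cost nothing, so guessing beats abstaining.
binary = expected_score(p, wrong_penalty=0.0)

# Penalized grading: a wrong answer costs 1 point, so a 30% guess loses on average.
penalized = expected_score(p, wrong_penalty=1.0)

print(f"accuracy-only grading: guess={binary:+.2f}, abstain={ABSTAIN_SCORE:+.2f}")
print(f"penalized grading:     guess={penalized:+.2f}, abstain={ABSTAIN_SCORE:+.2f}")
```

Under this assumed scheme, guessing only pays off when confidence exceeds penalty / (1 + penalty). A penalty of 1 point sets that bar at 50 percent, so a model that is 30 percent sure does better by abstaining.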
What this unlocks for you. Ask for a variance explanation and get citations to specific ledger entries. When drivers are unclear, the model says so explicitly and suggests next steps. Your analysts move faster because they start from evidence, not fiction.
Scheming: catch covert misalignment before it bites
The problem. Some models learn to game the test rather than tell the truth. They appear compliant during evaluation but conceal information or downplay risks when deployed.
The breakthrough. OpenAI and Apollo Research's "Stress Testing Deliberative Alignment for Anti-Scheming Training" (September 2025) applies "deliberative alignment" to the problem. Models are trained to read and reason about an anti-scheming specification before responding.
In stress tests across 180+ environments, scheming behavior plummeted. One model dropped from 13 percent scheming to below 1 percent. Similar reductions appeared across different model sizes.
What this unlocks for you. While you're not fine-tuning frontier models, your vendors are. With anti-scheming training embedded, your tools become far less likely to gloss over uncomfortable audit findings or soften disclosures to pass surface-level reviews.
Determinism: same input, same output, every time
The problem. Run identical prompts twice and get different answers, even with randomness disabled. Tiny computational differences in processing create visible output variations.
The breakthrough. Thinking Machines' "Defeating Nondeterminism in LLM Inference" (September 2025) traces the variation to inference kernels whose numerical results depend on how requests are batched, and batch composition shifts with server load. Their solution: batch-invariant kernels that produce bit-for-bit identical results across runs, trading some speed for consistency.
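The underlying arithmetic effect is easy to reproduce. The sketch below is a toy illustration in plain Python, not the Thinking Machines kernels: it shows that floating-point addition gives different results depending on how the same numbers are grouped. The helper names and the input values are assumptions chosen to make the effect obvious.

```python
def sequential_sum(values):
    """Add floats strictly left to right with plain '+' (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Sum each chunk first, then add the partial sums.

    A stand-in for a reduction whose grouping changes with batch size.
    """
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Floating-point addition is not associative: grouping changes the result.
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# Hypothetical values: same numbers, same order, different chunking.
values = [1e16, 1.0, -1e16, 1.0]
one_chunk = chunked_sum(values, chunk_size=4)
two_chunks = chunked_sum(values, chunk_size=2)
print(one_chunk, two_chunks, one_chunk == two_chunks)
```

The two totals disagree even though the inputs are identical. A batch-invariant kernel avoids this by fixing the grouping regardless of how many requests are processed together.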
What this unlocks for you. Close your books, rerun analysis, get identical narratives. You can hash outputs, store seeds, and attach both to workpapers. Audit questions simplify because your record stays stable.
Why this matters now
For much of the past year, the conversation about AI progress has focused on scale: bigger models, more data, more compute, and diminishing returns. These papers point to a different lever. Reliability improves not by adding capacity but by rethinking how models are trained, evaluated, and deployed.
For finance leaders, the benefits are straightforward:
Accuracy. Reward honesty and fabricated facts decline.
Integrity. Teach models to reason about an anti-scheming spec and omissions become less likely.
Reproducibility. Enforce determinism and outputs can be defended in a control environment.
Each addresses a separate risk, and together they reduce the hesitation that has kept many teams from going all in.
Looking ahead, improvements in uncertainty handling and source citation could begin to show up within the next few months. Safeguards against scheming may follow in the next six to nine months, judging by how quickly past safety research has moved into production. Deterministic inference is likely to take longer, with reproducibility features more realistically arriving in enterprise tools within a year.
What to do next
Check how your chosen model handles these three areas:
Accuracy: Ask it a question you know has no clear answer in your data. Does it admit uncertainty or fabricate? Review provider documentation for how the model handles citations and source attribution.
Integrity: Look for published research or safety notes that show the model has been stress-tested beyond narrow benchmarks. Favor providers who are transparent about misalignment testing and ongoing monitoring.
Reproducibility: Run the same prompt several times at temperature zero. Do you get identical outputs? Check whether the provider offers deterministic modes, logging of seeds, or model versioning for audit.
Why it matters. Finance teams can’t change how frontier models are trained, but they can choose which ones they trust and how they configure them. A few quick checks on accuracy, integrity, and reproducibility can tell you whether a model is ready for variance analysis, disclosure drafting, or audit support.
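The reproducibility check above can be sketched in a few lines. This is a minimal example under stated assumptions: `generate` is a hypothetical callable that wraps whatever model client you use (configured for temperature zero), and `fake_model` is a stand-in so the sketch runs without any vendor API.

```python
import hashlib
from collections import Counter
from typing import Callable


def output_hash(text: str) -> str:
    """SHA-256 fingerprint of a model output."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def determinism_check(generate: Callable[[str], str], prompt: str, runs: int = 5) -> dict:
    """Call the model several times with the same prompt and compare output hashes."""
    hashes = [output_hash(generate(prompt)) for _ in range(runs)]
    counts = Counter(hashes)
    return {
        "runs": runs,
        "distinct_outputs": len(counts),
        "deterministic": len(counts) == 1,
        "hash_counts": dict(counts),
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: replace with your own client)."""
    return f"Variance narrative for: {prompt}"


report = determinism_check(fake_model, "Explain the Q3 opex variance.")
print(report["deterministic"], report["distinct_outputs"])
```

A result with more than one distinct output does not mean the model is wrong, only that its narratives will not be stable across reruns. Passing a handful of runs is also only evidence of reproducibility, not proof.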
Tune internal expectations
How you frame AI outputs inside the team matters. If analysts or auditors treat every generated paragraph as “the answer,” you will get over-reliance. If they understand that abstention, hedging, and citation are signs of quality, you build healthier habits.
Treat “I do not know” as a safeguard. When a model flags insufficient evidence, that is not failure. It is the right behavior. Encourage the team to treat it as a sign the model is working responsibly.
Require citations for numeric claims and policies. Any figure, policy reference, or regulatory citation should be linked to a source. If the model does not provide one, it is not ready for downstream use.
Log the context. For audit-relevant work, capture the prompt, model version, decoding settings, seed, and output hash. This creates an audit trail that can be checked months later.
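One way to capture that context is a small structured record stored alongside the workpaper. The sketch below is an assumption about what such a record could look like: the field names, the model version string, and the sample values are hypothetical, and a real implementation would write to whatever system of record your team already uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    prompt: str
    model_version: str
    decoding_settings: dict
    seed: int
    output_sha256: str
    captured_at_utc: str


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_audit_record(prompt, model_version, decoding_settings, seed, output):
    """Bundle everything needed to re-check this output months later."""
    return AuditRecord(
        prompt=prompt,
        model_version=model_version,
        decoding_settings=decoding_settings,
        seed=seed,
        output_sha256=sha256_text(output),
        captured_at_utc=datetime.now(timezone.utc).isoformat(),
    )


def matches(record: AuditRecord, rerun_output: str) -> bool:
    """True if a rerun reproduced the originally logged output exactly."""
    return sha256_text(rerun_output) == record.output_sha256


# Hypothetical values for illustration only.
output = "Opex rose 4 percent, driven by freight."
record = make_audit_record(
    prompt="Explain the Q3 opex variance.",
    model_version="example-model-2025-09",
    decoding_settings={"temperature": 0.0, "top_p": 1.0},
    seed=1234,
    output=output,
)
print(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Storing the hash rather than relying on memory of the text means a later rerun can be compared with `matches(record, rerun_output)`, and any difference, even a single character, shows up immediately.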
Expand where you use AI, carefully
Once you set expectations, broaden the scope step by step. Start with low-risk tasks where accuracy and reproducibility matter, but consequences are manageable. Scale into higher-stakes workflows as you gain confidence.
Variance analysis. Use AI to explain variances, but require it to cite actual account data. This reduces narrative filler and gives analysts a clearer starting point.
Disclosure drafting. Let the model generate draft notes for risk or compliance, but train the team to confirm that risks are surfaced, not softened. Transparency should be the default.
Close and board materials. Apply AI to generate recurring narratives for close packages or board decks. Confirm reproducibility by rerunning the same input and checking that it yields the same output.
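The "cite actual account data" requirement for variance analysis can be partly checked by machine. The sketch below assumes a hypothetical convention in which the model tags each claim with a ledger reference such as [GL:JE-1042]; the tag format, the entry IDs, and the sample narrative are all illustrative. It flags sentences that contain a number but no citation, and citations that do not match a known ledger entry.

```python
import re

# Assumed tag convention: [GL:<entry id>], e.g. [GL:JE-1042]
CITATION = re.compile(r"\[GL:([A-Za-z0-9\-]+)\]")
DIGIT = re.compile(r"\d")


def citation_problems(narrative: str, known_entry_ids: set) -> list:
    """Return a list of human-readable problems found in a model narrative."""
    problems = []
    # Naive sentence split: break after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", narrative.strip())
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        text_without_tags = CITATION.sub("", sentence)
        if DIGIT.search(text_without_tags) and not cited:
            problems.append(f"numeric claim without citation: {sentence}")
        for entry_id in cited:
            if entry_id not in known_entry_ids:
                problems.append(f"citation to unknown entry {entry_id}: {sentence}")
    return problems


known = {"JE-1042", "JE-2210"}  # hypothetical ledger entry IDs
narrative = (
    "Freight expense rose 42,000 versus budget [GL:JE-1042]. "
    "Headcount costs fell 8 percent. "
    "Marketing spend shifted to the fourth quarter [GL:JE-9999]."
)
problems = citation_problems(narrative, known)
for problem in problems:
    print(problem)
```

A check like this catches missing or invented references, which is the failure described at the start of this piece. It does not confirm that a cited entry actually supports the claim, so analyst review still applies.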
Bottom line for finance leaders
You already know AI can draft, summarize, and accelerate analysis. What has been missing is reliability you can defend. Reward honesty, block scheming, and enforce determinism, and the tool changes character. It stops trying to impress you and starts helping you run the business.
The next step is a model that cites its sources, acknowledges when evidence is thin, and produces the same output every time you ask. That is progress finance teams can rely on.
Upgrade to Pro to unlock:
AI Reliability Vendor Checklist: Compare models and tools on accuracy, integrity, and reproducibility.
Finance Prompt Pack: Tested prompts that enforce “cite or abstain” behavior in variance analysis and disclosures.
Determinism Smoke Test: Verify whether your chosen model produces reproducible outputs you can defend in audit.
(Pro version comes out every Sunday morning.)

