
How UpToDate Expert AI Builds Trust at the Point of Care
Yesterday on rounds, I watched someone use a clinical AI tool in the middle of a discussion and I felt proud. Part of me thought, look how far this has come. Then I caught myself: are we sure we should be doing this?
Clinical AI is everywhere now, and the default behavior is to treat the output like a consult. The problem is that these tools can sound certain even when they’re guessing, omitting something important, or applying the right guideline to the wrong patient—like reflexively recommending antibiotics for asymptomatic bacteriuria—and we’re the ones responsible for what happens next.
In this Deep Dive, I’ll lay out what “trust” should mean at the point of care, break down how UpToDate stress-tests Expert AI, and show how I’m using it in practice.
What physicians actually mean by “trust”
Trust, for me as a clinician, means two things when it comes to clinical AI tools:
I can verify what I’m reading quickly.
The system fails safely, or at least transparently, when it’s uncertain.
Trust doesn’t necessarily mean the tool is always correct. It means I can audit it fast enough to use it responsibly at the point of care.
When I try a clinical AI tool, I’m looking for basic behaviors: does it surface the assumptions it’s making, does it flag where the evidence is thin, and does it point me back to the source quickly enough that I can sanity-check it before I act?
Here’s a quick example. I’m taking care of a kidney transplant patient on immunosuppression who has asymptomatic bacteriuria on a routine urinalysis. I ask: Do I treat this? What I’m looking for is whether the tool surfaces the key assumptions (time since transplant, symptoms, upcoming urologic procedure), links me back to the source, and flags uncertainty instead of sounding falsely definitive. I’m not looking for a yes/no.
Tools that sound confident when they’re guessing train clinicians into complacency. That’s why I care less about benchmark scores and more about how a tool behaves in real clinical context. UpToDate’s latest white paper is an attempt to measure that directly.
How Wolters Kluwer Stress-Tests and Validates UpToDate Expert AI
Wolters Kluwer’s generative AI tool, UpToDate Expert AI, lets physicians interact with UpToDate’s curated clinical content through natural-language queries. As I wrote in my last piece on UpToDate Expert AI:
Think of it as having a clinical colleague who has memorized all of UpToDate and can discuss any topic in real-time.
It’s fast. But can I trust it?
UpToDate is well aware that the actual output UpToDate Expert AI produces—the words—can greatly influence clinical decisions. So, they want to ensure UpToDate Expert AI is reliable. And to do that, they stress-test the heck out of it.
In their latest white paper, A Measured Approach to Evaluating AI at the Point of Care, instead of asking, “Can Expert AI pass a test like the USMLE?”, they ask, “Is the answer UpToDate Expert AI produces clinically useful, grounded in trusted knowledge, and safe when things get messy?” They organize the evaluation of Expert AI’s answers into three buckets:
Clinical intent: is the answer faithful to their point-of-care standards?
Knowledge integrity: is the answer grounded in trusted clinical knowledge?
Potential risks: how does the system behave under stress, uncertainty, and adversarial use?
This framework is what I wish more companies would lead with. It’s certainly closer to how we actually practice. I break down each bucket below.
1) Clinical intent: rubrics, not vibes
UpToDate Expert AI answers were evaluated and validated against physician-authored rubrics. These rubrics, built by UpToDate physician editors across 25 specialties, spell out what an “ideal” answer should contain for a given question. Take the example question “when to anticoagulate for atrial fibrillation?” Below is the physician-authored rubric:
The rubric breaks a free-text answer into independently scored elements: what’s present, what’s missing, what’s incorrect, and what details matter at the point of care. The more interesting part is omissions. Rubric testing forces you to look at what didn’t show up in the answer. And omissions are where clinical tools quietly harm people. A model can look polished and still miss the one sentence that changes the plan. Omissions are what get you paged at 2 AM. Check out my video below to see what this actually looks like:
Three results stand out in the white paper:
UpToDate Expert AI provided clinically aligned information for 99.9% of assessed criteria (1,669 queries; 15,000+ criteria).
It met 13–15% more Essential criteria than two general-purpose LLM comparators.
Those general-purpose comparators had omission rates about 15% higher, translating to one additional omission error for every seven queries.
This is also how you monitor drift. If a model starts getting worse as it iterates, rubrics catch it.
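To make the rubric idea concrete, here is a minimal Python sketch of how independently scored criteria could be tallied and how a drop between releases could be flagged. Everything in it is an assumption for illustration: the `Criterion` class, the `score_answer` and `drifted` functions, the 0.02 tolerance, and the three sample criteria are hypothetical and are not taken from the white paper or from UpToDate’s actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str        # one element an ideal answer should contain
    essential: bool  # True if leaving it out could change the plan


def score_answer(criteria, met):
    """Score one answer, given the criterion texts a grader marked as met."""
    omissions = [c.text for c in criteria if c.text not in met]
    essential = [c for c in criteria if c.essential]
    essential_met = [c for c in essential if c.text in met]
    return {
        "met_fraction": (len(criteria) - len(omissions)) / len(criteria),
        "essential_met_fraction": (
            len(essential_met) / len(essential) if essential else 1.0
        ),
        "omissions": omissions,
    }


def drifted(baseline, current, tolerance=0.02):
    """True if the essential-criteria rate fell by more than `tolerance`."""
    return (baseline - current) > tolerance


# Hypothetical rubric for the atrial fibrillation question (illustrative only).
rubric = [
    Criterion("stroke risk stratification", essential=True),
    Criterion("bleeding risk assessment", essential=True),
    Criterion("shared decision-making", essential=False),
]

# Suppose a grader found two of the three elements in the answer.
result = score_answer(
    rubric, met={"stroke risk stratification", "shared decision-making"}
)
print(result["omissions"])
print(drifted(baseline=0.95, current=0.90))
```

The point of listing omissions explicitly is that a polished answer missing one essential element shows up as a named gap, not just a slightly lower score.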
2) Knowledge integrity: “model knowledge” leakage (this is the x-factor)
UpToDate Expert AI is designed to derive answers from UpToDate content. The system prompt specifies that only UpToDate content can be used, it uses a multi-step retrieval process, and it’s designed not to respond if relevant UpToDate content can’t be found. If content in the answer is not derived from UpToDate’s trusted content, this is called model knowledge leakage. Sometimes that leakage can be correct (like interpreting a weird abbreviation), which is exactly why it’s dangerous. Plausible isn’t the same as grounded.
In UpToDate’s analysis, estimated model knowledge use was similar to the test’s background noise of 1–4%. That suggests Expert AI’s answers draw almost entirely on the trusted, curated content within UpToDate.
The through line is verifiability. The system is built so you can trace an output back to the underlying source material. That’s what makes the tool usable at the point of care: I can click, confirm, and move. This is the “verify quickly” part of trust.
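The “answer only from curated content, otherwise don’t respond” behavior described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how UpToDate Expert AI is implemented: the keyword-overlap retriever, the `Passage` class, the `min_overlap` threshold, and the `topic-001` style identifiers are all hypothetical stand-ins for a real multi-step retrieval process.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier for a curated topic
    text: str


def tokens(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, corpus, min_overlap=2):
    """Toy retriever: keep passages sharing at least `min_overlap` words."""
    q = tokens(query)
    return [p for p in corpus if len(q & tokens(p.text)) >= min_overlap]


def answer(query, corpus):
    """Answer only from retrieved passages; abstain when nothing is found."""
    hits = retrieve(query, corpus)
    if not hits:
        return {"text": None, "sources": []}
    return {
        "text": " ".join(p.text for p in hits),
        "sources": [p.source_id for p in hits],  # lets the reader trace back
    }


corpus = [
    Passage("topic-001", "Asymptomatic bacteriuria in kidney transplant recipients"),
    Passage("topic-002", "Anticoagulation for atrial fibrillation"),
]

grounded = answer("asymptomatic bacteriuria after kidney transplant?", corpus)
abstained = answer("best hiking boots", corpus)
print(grounded["sources"])
print(abstained["text"])
```

Two design choices matter here: every answer carries the source identifiers it was built from, and an empty retrieval produces no answer at all instead of a plausible guess.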
3) Potential risks: red teaming the failure modes we worry about
A dedicated team of clinical AI specialists and domain experts, the Red Team, was assigned to stress-test UpToDate Expert AI by throwing at it:
Complex clinical scenarios
Questions with purposely jumbled structure
Multi-turn conversations
Adversarial attempts to provoke an undesirable answer
This team has put in 200+ hours and made 1,000+ attempts to get UpToDate Expert AI to produce undesirable answers. When it does produce one, they codify the risk into rubrics for ongoing surveillance testing.
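One of the tactics listed above, jumbling how a question is structured, lends itself to a small Python sketch. The idea is to reorder the clauses of a question and record which reorderings cause a required element to drop out of the answer, so the failure can be turned into a regression check. The function names, the clause separator, and the deliberately brittle `stub_ask` stand-in are all assumptions for illustration; a real harness would call the system under test and use expert-written rubrics instead of substring matching.

```python
import itertools


def jumbled_variants(clauses, limit=10):
    """Reorder a question's clauses; the original order is skipped."""
    variants = []
    for perm in itertools.permutations(clauses):
        if list(perm) == list(clauses):
            continue
        variants.append("; ".join(perm))
        if len(variants) == limit:
            break
    return variants


def failing_variants(ask, variants, required_phrases):
    """Return the variants whose reply drops any required phrase."""
    failures = []
    for variant in variants:
        reply = ask(variant).lower()
        if not all(phrase.lower() in reply for phrase in required_phrases):
            failures.append(variant)
    return failures


clauses = ["kidney transplant patient", "asymptomatic bacteriuria", "do I treat"]


def stub_ask(question):
    # Hypothetical, deliberately brittle stand-in for the system under test:
    # it only surfaces the transplant assumption when that context comes first.
    if question.startswith("kidney transplant patient"):
        return "Depends on time since transplant and symptoms."
    return "Depends on symptoms."


variants = jumbled_variants(clauses)
failures = failing_variants(stub_ask, variants, ["time since transplant"])
print(len(variants), len(failures))
```

Each entry in `failures` is a concrete prompt that can be saved and rerun after every model update, which is the surveillance step in miniature.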
They concluded that harm is best assessed by experts, and they’re developing an expert audit process to catch meaningful errors that automated scoring (e.g., testing against multiple-choice questions) can miss.
Benchmarks and ratings, though, aren’t worthless. They’re just incomplete. Standardized exams, vignettes, and user ratings can give signals, but they weren’t designed to determine whether AI-generated content is appropriate at the point of care. Third-party validation, too, is essential. No one should be grading their own homework, and UpToDate agrees.
With that framework in mind, here’s how I think about UpToDate Expert AI on the wards—and what earns trust fast enough to use at the point of care.
Dashevsky Dissection
UpToDate Expert AI stands out as a system built around a model:
Curated knowledge base
Outputs you can trace back to source material
Evaluation that matches how clinicians use answers in the moment

At the bedside, that combination matters more than raw cleverness. I’m making decisions under time pressure, and I need an answer I can audit fast enough to safely act on.
I treat clinical AI the way I treat peers on my team in the hospital. The most trusted team members move quickly and surface uncertainty early. They show their work. They name what they’re assuming. They make it easy for you to check them before you sign the order. That’s the bar I want these tools held to, and it’s essentially what UpToDate is operationalizing with rubrics, knowledge-integrity checks, red teaming, and a feedback loop back to the underlying content.
Here’s my simple rounds test:
Does it broaden the differential instead of anchoring me?
Does it surface uncertainty instead of smoothing it over?
Can I click straight to the source and verify in seconds?
When a tool consistently passes those checks, it earns a place in my workflow.
In summary, clinical AI earns trust at the bedside when we can verify the reasoning in seconds, when the tool shows its sources and assumptions, and when it communicates uncertainty instead of smoothing it over. At the point of care, “good enough” answers that sound definitive are exactly how omissions and misapplied guidelines slip into real decisions.






