Lesson 3 · Measurement ~11 min

Can You Trust the Detector?

You have the right domains and an individual baseline. Now the hard part: knowing that a low score means impaired — not noise, not a bad night, not "still learning the game." This lesson is the measurement science that separates a real detector from a plausible-looking one, worked through your three tasks.

Your setup, and why this lesson is aimed at it

You track reaction time on every task and compare each person to their own ever-changing baseline. Both are excellent choices. But an ever-changing baseline hides specific failure modes, and your goal ("detect any impairment") makes one metric, sensitivity, matter more than any other. Let's make your detector trustworthy on purpose.

Recall first

Which domain is closest to a signature of cannabis?

The core question

Every impairment decision is really one question: is today's score far enough below normal that it can't just be noise? "Normal" is the baseline. "Far enough" is a decision rule. Get either wrong and you either cry wolf (false alarms) or sleep through the fire (misses). Three things make the answer trustworthy — a good baseline, a principled threshold, and a task with room to show a drop.

① The baseline — and the "ever-changing" trap

You chose an individual baseline over population norms. Right call: norms falsely flag naturally slow people and falsely clear naturally fast ones; a personal baseline cancels the between-person differences you named — ability and task familiarity.[1] But "ever-changing" introduces two ways the baseline can quietly betray you:

Trap A — Practice absorption

Early sessions are the learning curve. Your tasks are strategic and learnable, so a user keeps improving for many sessions. If those still-improving sessions feed the baseline, it stays anchored to slower, pre-plateau performance. Later, real impairment only brings the user back to that old level, so it reads as "normal."

Trap B — Impairment absorption

If the baseline updates too fast and includes recent impaired sessions, it chases the user downward. Someone impaired every Friday night slowly teaches the baseline that slow-and-erratic is their normal — and the detector goes blind to a recurring problem.

Fix A

Let the baseline mature — don't trust it until performance has plateaued past the learning curve (watch each user's curve flatten), or explicitly model/subtract the practice trend.
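One way to turn "watch each user's curve flatten" into a check is sketched below. It assumes each session is summarized by a single reaction time in milliseconds. The function name `has_plateaued`, the 5-session window, and the 1% tolerance are illustrative assumptions, not values from this lesson.

```python
from statistics import mean

def has_plateaued(session_rts, window=5, tolerance=0.01):
    """Return True when the last `window` sessions show no meaningful trend.

    session_rts: per-session reaction times in ms, oldest first.
    tolerance: largest allowed |slope| per session, as a fraction of the
               mean RT in the window (an assumed, tunable value).
    """
    if len(session_rts) < window:
        return False  # not enough sessions to judge the trend
    recent = session_rts[-window:]
    xs = list(range(window))
    x_bar, y_bar = mean(xs), mean(recent)
    # Ordinary least-squares slope of RT against session index.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, recent))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return abs(slope) <= tolerance * y_bar

still_learning = [520, 480, 450, 430, 415]  # RT still falling each session
settled = [402, 398, 401, 399, 400]         # RT hovering around 400 ms
print(has_plateaued(still_learning), has_plateaued(settled))
```

Until the check passes, the baseline would be treated as provisional rather than used for flagging.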

Fix B

Update robustly and slowly: use a median or trimmed mean over a rolling window, and exclude sessions you already flagged as impaired from the baseline. The baseline should represent the person un-impaired.
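A minimal sketch of Fix B, assuming each stored session carries its reaction time and a flag saying whether it was already judged impaired. The name `robust_baseline`, the tuple layout, and the 10-session window are hypothetical choices for illustration.

```python
from statistics import median

def robust_baseline(sessions, window=10):
    """Median RT over the most recent `window` sessions not flagged as impaired.

    sessions: list of (rt_ms, flagged_impaired) tuples, oldest first.
    Returns None when no unflagged session is available.
    """
    clean = [rt for rt, flagged in sessions if not flagged]
    recent = clean[-window:]
    return median(recent) if recent else None

history = [(400, False), (410, False), (520, True), (405, False), (530, True)]
print(robust_baseline(history))  # the two flagged sessions are ignored
```

Because flagged sessions never enter the window, a recurring impaired night cannot drag the baseline toward itself, and the median limits the pull of any single odd session that slipped through unflagged.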

② The decision rule — Reliable Change Index

A baseline is a center point; you also need its spread. The Reliable Change Index (RCI) asks whether today's change exceeds normal test-retest noise, by measuring the change in units of the person's own variability.[2] Roughly:

[Figure: a band of baseline ± normal noise around your usual score; a score today that falls beyond the band is flagged.]

change score = (today − baseline) ÷ (baseline's own standard deviation)

If that number exceeds a chosen cutoff in the impaired direction (slower for RT, lower for accuracy), the change is unlikely to be noise. Two payoffs for you: (1) it turns "seems slow today" into a defensible statistical call, and (2) it needs each person's variability, which means the RT variability you're already able to compute does double duty: it's both a fatigue signal and the denominator of your decision rule. Track and store it deliberately, not just mean RT.
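The rough formula above can be sketched as follows. This follows the lesson's simplified form, dividing by the person's own session-to-session standard deviation. The formal RCI in the literature usually divides by a standard error of the difference derived from test-retest reliability, so treat this as an approximation. The function names and the 1.645 cutoff (a one-sided 5% z-value) are assumptions for illustration.

```python
from statistics import mean, stdev

def change_score(today, baseline_scores):
    """Today's change from baseline, in units of the person's own SD."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline sessions to estimate spread")
    spread = stdev(baseline_scores)
    if spread == 0:
        raise ValueError("baseline has zero variability; cannot scale the change")
    return (today - mean(baseline_scores)) / spread

def is_flagged(today, baseline_scores, cutoff=1.645, higher_is_worse=True):
    """Flag when the change exceeds the cutoff in the impaired direction.

    higher_is_worse=True suits reaction time; use False for accuracy-like scores.
    """
    z = change_score(today, baseline_scores)
    return (z if higher_is_worse else -z) > cutoff

baseline_rts = [400, 410, 390, 405, 395]  # hypothetical sessions, in ms
print(is_flagged(420, baseline_rts))  # well outside the usual spread
print(is_flagged(405, baseline_rts))  # within normal noise
```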

Don't average away your best signal

Mean reaction time is the obvious metric, but impairment, especially fatigue, often shows up as increased variability and lapses before the mean moves.[3] A user can keep a normal average while their responses become erratic. Compute intra-individual variability and a lapse count (e.g., responses beyond a threshold) on every task, not just the average.
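To see why the mean alone can mislead, here is a small sketch that summarizes one session's trials three ways. The name `rt_summary` and the 500 ms lapse threshold are assumptions. 500 ms is a common convention for the PVT, but the right threshold for your tasks needs its own validation.

```python
from statistics import mean, stdev

def rt_summary(trial_rts_ms, lapse_threshold_ms=500):
    """Mean RT, intra-individual variability (SD), and lapse count for one session."""
    if len(trial_rts_ms) < 2:
        raise ValueError("need at least two trials to compute variability")
    return {
        "mean_rt": mean(trial_rts_ms),
        "rt_sd": stdev(trial_rts_ms),
        "lapses": sum(1 for rt in trial_rts_ms if rt >= lapse_threshold_ms),
    }

steady = [300, 310, 290, 305, 295, 300]   # hypothetical consistent session
erratic = [220, 230, 240, 250, 260, 600]  # same mean, one long lapse

print(rt_summary(steady))
print(rt_summary(erratic))
```

Both hypothetical sessions average 300 ms, yet only the second shows a large spread and a lapse. That is the pattern a mean-only detector would miss.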

③ Sensitivity vs. specificity — pick your dial

Every threshold trades two errors: sensitivity (catching truly impaired people) against specificity (not falsely flagging sober ones). You can't max both — moving the cutoff to catch more impaired people also flags more sober ones.
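The trade-off is easy to see on a small labeled set. The sketch below assumes you have change scores (higher means worse) from sessions where the true state is known. The scores and labels are made up for illustration, and `sensitivity_specificity` is a hypothetical helper name.

```python
def sensitivity_specificity(scores, impaired, cutoff):
    """Return (sensitivity, specificity) for a 'flag if score > cutoff' rule.

    scores: change scores, higher = worse.
    impaired: matching ground-truth booleans; must contain both classes.
    """
    pairs = list(zip(scores, impaired))
    tp = sum(1 for s, y in pairs if y and s > cutoff)
    fn = sum(1 for s, y in pairs if y and s <= cutoff)
    tn = sum(1 for s, y in pairs if not y and s <= cutoff)
    fp = sum(1 for s, y in pairs if not y and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up validation data: four sober sessions, then four impaired ones.
scores = [0.2, 0.8, 1.2, 1.7, 1.4, 2.1, 2.6, 0.9]
impaired = [False, False, False, False, True, True, True, True]

for cutoff in (1.5, 1.0):
    sens, spec = sensitivity_specificity(scores, impaired, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, loosening the cutoff from 1.5 to 1.0 raises sensitivity from 0.50 to 0.75 while specificity falls from 0.75 to 0.50.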

Your stated goal — detect impairment of any kind — is a decision to prioritize sensitivity: a missed impaired user is worse than a second-look on a sober one. So set the RCI cutoff looser, and consider an "any task flags it" (OR) rule across your three tasks rather than requiring all three. The cost is more false positives — which you manage with a cheap confirmatory retest, not by tightening until you start missing real impairment.
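A sketch of the OR rule and its cost, under stated assumptions: the task names are placeholders, the cutoff is arbitrary, and the false-positive calculation assumes the three tasks err independently, which real tasks sharing a reaction-time component likely will not do exactly.

```python
def any_task_flags(change_scores, cutoff=1.0):
    """OR rule: flag the session if any single task exceeds the cutoff.

    change_scores: dict of task name -> change score (higher = worse).
    """
    return any(z > cutoff for z in change_scores.values())

def combined_false_positive_rate(per_task_rate, n_tasks):
    """False-positive rate of an OR rule, assuming independent tasks."""
    return 1 - (1 - per_task_rate) ** n_tasks

session = {"task_a": 0.2, "task_b": 1.3, "task_c": 0.1}  # hypothetical scores
print(any_task_flags(session))  # one task over the cutoff is enough

rate = combined_false_positive_rate(0.05, 3)
print(f"{rate:.3f}")  # noticeably higher than the 0.05 per-task rate
```

With a 5% false-positive rate per task, the independent-task estimate for three tasks is roughly 14%. That gap is the extra load the confirmatory retest has to absorb.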

④ Ceiling & practice — the task-design guardrails

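Two effects can silently zero out sensitivity, and both bite game-like tasks hardest: ceiling effects, where a task is easy enough that accuracy is always perfect and leaves no room to show a drop, and practice effects, where ongoing improvement masks a real decline. Scoring RT, not just accuracy, keeps a task off the ceiling.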

On your time budget

You want the whole thing under ~2 minutes on a phone. Good news: brief tasks can work. A 3-minute PVT-B retains much of the full PVT's sensitivity to sleep loss.[4] The honest caveat: shortening is a real trade-off. At least one study found a 3-minute version diverging from the 10-minute reference under some conditions.[5] So if you add a vigilance task, validate your short version against a longer one rather than assuming the sensitivity carries over.

Check yourself

An ever-changing baseline that updates too fast mainly risks...

For a "detect any impairment" goal, you should tune the threshold toward...

A memory task's accuracy is always perfect. You can still detect impairment by...

Your single win

You can now state what makes your detector trustworthy: a baseline that has matured past the learning curve and excludes impaired sessions; a decision rule (RCI) that judges today's drop in units of the person's own variability; a threshold tuned toward sensitivity; and tasks kept off the ceiling by scoring RT, not just accuracy. Every one of those hangs on the RT-variability you're already positioned to capture — so the biggest immediate lever is to store and use variability and lapse counts, not just mean RT.

Primary source: read this next

Evidence for Added Value of Baseline Testing in Computer-Based Cognitive Assessment (2013). This is the direct evidence for the design choice you've already made. Read it to see exactly why an individual baseline beats norms, and what a mature baseline requires.

References

[1] Evidence for Added Value of Baseline Testing in Computer-Based Cognitive Assessment, PMC (2013).
[2] Reliable Change on Neuropsychological Tests in the Uniform Data Set, PMC (2016).
[3] Basner & Dinges, Maximizing Sensitivity of the PVT to Sleep Loss, SLEEP (2011); Van Dongen et al. (2003).
[4] Basner, Mollicone & Dinges, Validity and Sensitivity of a Brief PVT (PVT-B) (2011).
[5] The 3-Minute PVT Demonstrates Inadequate Convergent Validity…, Frontiers in Neuroscience (2022).