A sleep tracking device on your wrist or finger is not watching sleep the way a sleep lab watches sleep. It is not reading brain waves. It is usually combining movement from an accelerometer, optical heart-rate signals from photoplethysmography (PPG), and sometimes temperature, then using software to infer what probably happened after you closed your eyes. Polysomnography (PSG), the sleep-lab standard, uses signals such as EEG, EOG, and EMG to classify sleep directly from brain, eye, and muscle activity; consumer wearables do not collect those signals. [1][2]

[Figure: Clinical PSG brain-wave measurement compared with wearable pulse and movement signal inference]

That distinction explains most of the accuracy debate. When a device says you fell asleep around 11:20 p.m. and woke around 6:40 a.m., it may be giving you a useful approximation. When it says you got exactly 47 minutes of deep sleep and should rethink your recovery, the confidence level should drop. The first claim is closer to what the sensors can observe. The second is a much harder inference.

The best short answer is this: trust a sleep tracker most for heart rate, HRV, and broad sleep/wake timing trends. Be much more cautious with single-night stage graphs, especially deep sleep, wake after sleep onset, and any readiness or recovery score built on top of those estimates.

The evidence is strongest for sleep vs. wake, not sleep stages

A useful validation study should compare a tracker against PSG, not just against another wearable. In a 2024 study in Sensors, Robbins and colleagues compared Apple Watch Series 8, Fitbit Sense 2, and Oura Ring Gen3 with PSG in 35 healthy adults at Brigham and Women’s Hospital / Harvard. All three devices detected sleep versus wake with at least 95% sensitivity, which put them in the same broad range as research-grade actigraphy for that task. [3]

That is the good news, and it matters. If you are trying to learn whether your bedtime has drifted later, whether weekend sleep is compensating for weekday sleep, or whether caffeine after dinner is pushing your sleep onset later, a wearable can be genuinely helpful. You do not need perfect stage scoring to notice that your sleep window is shrinking.

The same Robbins study becomes less flattering when the task changes from “asleep or awake?” to four-stage classification. Agreement there was reported as Cohen’s kappa, a chance-corrected statistic in which 1 means perfect agreement with PSG and 0 means no better than chance. Oura had the best four-stage agreement, with a kappa of 0.65 and sensitivity of 76% to 80%. Apple Watch had a kappa of 0.43 and sensitivity of 51% to 86%, while Fitbit had a kappa of 0.47 and sensitivity of 62% to 78%. Apple Watch also underestimated deep sleep by an average of 43 minutes per night. [3]
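To make the kappa figures less abstract, here is a minimal sketch of how Cohen’s kappa can be computed from epoch-by-epoch stage labels. The function name and the eight made-up epochs are illustrative assumptions, not data or code from the Robbins study.

```python
from collections import Counter


def cohens_kappa(reference, device):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(reference) != len(device) or not reference:
        raise ValueError("sequences must be non-empty and the same length")
    n = len(reference)
    # Share of epochs where the device label matches the reference label.
    observed = sum(r == d for r, d in zip(reference, device)) / n
    # Agreement expected by chance, from how often each rater uses each label.
    ref_counts = Counter(reference)
    dev_counts = Counter(device)
    expected = sum(ref_counts[s] * dev_counts[s] for s in ref_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences contain one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical epochs: PSG scoring versus a tracker's guess.
psg = ["wake", "light", "light", "deep", "deep", "rem", "light", "wake"]
tracker = ["light", "light", "light", "light", "deep", "rem", "light", "wake"]

print(round(cohens_kappa(psg, tracker), 2))  # 0.64
```

In this toy night the tracker matches PSG on 6 of 8 epochs (75%), yet kappa is only about 0.64, because a device that labels most epochs “light” would agree with PSG fairly often by chance alone.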

Metric or claim | How much confidence it deserves | Why
Sleep/wake timing trends | Moderate to high | Major devices show high sleep-detection sensitivity against PSG, but can miss quiet wakefulness.
Resting heart rate during sleep | High | Wearable heart rate has agreed closely with ECG during sleep.
HRV during sleep | High for trends | Wearable HRV has also agreed closely with ECG, though interpretation still depends on context.
Deep sleep minutes | Low to moderate | Four-stage classification is much weaker than sleep/wake detection, and deep sleep can be substantially misestimated.
Single-night readiness or recovery score | Low | The score often blends accurate signals with noisier stage estimates and proprietary weighting.
Diagnosis of insomnia, apnea, or another sleep disorder | Not appropriate | Consumer wearables are not clinical diagnostic instruments.

The funding detail is also worth keeping in view. The Robbins study was funded by Oura Ring Inc., and Oura performed best in that study. That does not make the results useless; it does mean the narrow claim should stay narrow. The study supports that these devices can be strong at sleep/wake detection in healthy adults under lab validation conditions, and that Oura Gen3 performed better than the tested Apple and Fitbit models on four-stage classification in that sample. It does not prove that every newer ring, watch, or algorithm update can stage sleep accurately in every bedroom. [3]

Heart rate and HRV are the cleanest wearable sleep signals

If there is a part of the dashboard I would take seriously before the sleep score, it is the overnight heart-rate and HRV trend. Those signals are much closer to what the hardware is actually measuring. A wearable optical sensor is not perfect, but during sleep the conditions are friendlier: less motion, more stable contact, and fewer abrupt wrist movements than during exercise.

Stone and colleagues validated six consumer wearables against ECG in a 2022 Australian Institute of Sport study. Across the tested devices, heart-rate ICC was 0.97 to 0.99 and HRV ICC was 0.96 to 0.98 during sleep. [4]

That does not mean HRV is a magic readiness meter. It means the measurement itself is on firmer ground than the interpretation layered above it. A lower-than-usual overnight HRV after alcohol, illness, hard training, or stress is worth noticing. A single bad recovery badge, especially if you slept well and feel fine, does not deserve the same authority.
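One way to make “lower than usual” concrete is to compare last night’s value with your own recent baseline rather than with a population benchmark. The sketch below is an illustration under assumptions: the function name, the 14-night baseline, and the HRV values in milliseconds are all invented, and no vendor is known to compute its flags this way.

```python
from statistics import mean, stdev


def deviation_from_baseline(history, latest):
    """Standard deviations between the latest value and the personal mean.

    history needs at least two values so a spread can be estimated.
    """
    spread = stdev(history)
    if spread == 0:
        return 0.0  # no night-to-night variation to compare against
    return (latest - mean(history)) / spread


# Hypothetical overnight HRV readings (ms) from the previous 14 nights.
baseline_hrv_ms = [52, 55, 49, 58, 53, 51, 56, 54, 50, 57, 52, 55, 53, 54]

# A night after, say, alcohol or illness.
z = deviation_from_baseline(baseline_hrv_ms, 41)
print(f"{z:.1f} standard deviations from baseline")
```

A large negative value says the night was unusual for you. It says nothing about why, which is exactly the part the app’s recovery language tends to fill in.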

The same distinction applies to resting heart rate. A rising overnight heart-rate trend can be a useful flag that something is different. It cannot, by itself, tell you whether the cause is infection, anxiety, late food, heat, alcohol, overtraining, medication, or a poor sensor fit. The number may be reliable while the story the app tells about it remains speculative.

The quiet-awake problem is where trackers annoy real users

The everyday failure mode is not dramatic. It is lying still in bed, awake, while the device quietly counts the time as sleep. You know you were awake. The tracker sees low movement, a resting heart rate, and a body that looks sleep-like from the outside. If the algorithm has to guess, it often guesses sleep.

This is not just a cranky user complaint. Chinoy and colleagues found poor wake-detection specificity across tested consumer sleep trackers, in the 50% to 57% range, and Robbins 2024 reported the same broad problem: devices routinely score quiet wakefulness as sleep. [5][3]

Specificity is the important word here. High sleep sensitivity means the device is good at recognizing sleep when you are asleep. Poor wake specificity means it is much worse at recognizing wake when you are awake but still. That combination can make total sleep time look better than it was, sleep efficiency look cleaner than it felt, and wake after sleep onset look artificially low.
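The arithmetic behind that combination is simple enough to sketch. The example below assumes 30-second epochs and an invented eight-hour night in which the device scores half of the truly awake epochs as sleep; the function name and all numbers are hypothetical, not taken from the cited studies.

```python
def sleep_wake_accuracy(reference, device):
    """Compare per-epoch sleep flags (True = asleep) against a reference."""
    if len(reference) != len(device):
        raise ValueError("sequences must be the same length")
    pairs = list(zip(reference, device))
    true_sleep = sum(1 for r, d in pairs if r and d)          # asleep, scored asleep
    true_wake = sum(1 for r, d in pairs if not r and not d)   # awake, scored awake
    sleep_epochs = sum(1 for r in reference if r)
    wake_epochs = len(reference) - sleep_epochs
    if sleep_epochs == 0 or wake_epochs == 0:
        raise ValueError("reference needs both sleep and wake epochs")
    return {
        "sleep_sensitivity": true_sleep / sleep_epochs,
        "wake_specificity": true_wake / wake_epochs,
    }


EPOCH_MINUTES = 0.5  # assumed 30-second scoring epochs

# Hypothetical night: 420 minutes asleep, 60 minutes awake (960 epochs in bed).
psg = [True] * 840 + [False] * 120
# The device misses 20 sleep epochs and scores half of the wake epochs as sleep.
device = [True] * 820 + [False] * 20 + [True] * 60 + [False] * 60

result = sleep_wake_accuracy(psg, device)
psg_sleep_minutes = sum(psg) * EPOCH_MINUTES
device_sleep_minutes = sum(device) * EPOCH_MINUTES

print(result)
print(psg_sleep_minutes, device_sleep_minutes)  # 420.0 440.0
```

Sensitivity here is about 98% and specificity is 50%, and the device still reports 20 more minutes of sleep than the reference. Half of the hour spent awake has simply disappeared from the graph.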

This is also why two people can have opposite frustrations. One person wakes up feeling fine and sees a terrible deep-sleep number. Another spends an hour awake at 3 a.m. and sees a tidy sleep graph with barely any interruption. Both reactions are reasonable. The device is trying to reconstruct an internal state from external signals.

[Figure: PSG measurement of brain, eye, and muscle signals compared with wearable measurement of heart rate, movement, and temperature]

Deep sleep minutes look precise because apps make them look precise

Deep sleep is the stage most likely to get treated like a nightly grade. The graph gives a number. The number has a unit. The app may compare it with a benchmark and imply that more would be better. That presentation makes the estimate feel more solid than the evidence allows.

Four-stage classification is where consumer sleep trackers give up much of the accuracy they show on sleep/wake detection. In the Robbins PSG comparison, Apple Watch underestimated deep sleep by an average of 43 minutes per night, and the tested devices’ four-stage agreement was meaningfully weaker than their sleep/wake detection. [3]

That does not make every stage graph worthless. If your tracker shows a consistent change after a schedule shift, a new medication, heavy alcohol use, or repeated short nights, the pattern may be worth comparing with how you feel. The problem is treating one night’s stage split as a fact. “You got 38 minutes of deep sleep” should be read as “the device’s model classified roughly this much of the night as deep sleep from proxy signals.” That is less catchy, but it is closer to the truth.

What to do with a sleep score

A sleep score is not a measurement. It is a product decision. The company decides which ingredients matter, how much to weight them, how to penalize wake time, how to reward consistency, and how to turn all of that into a number that feels understandable before coffee.

Some ingredients may be solid. Overnight heart rate and HRV have credible validation against ECG. Sleep timing can be useful over time. But once the score blends those signals with estimated stages, proprietary thresholds, and recovery language, it becomes harder to know whether a low score reflects your body, the algorithm, or a quiet hour in bed that the device misread. [4][5]
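A toy version of such a score shows how a noisy ingredient moves the total. Everything in this sketch is hypothetical: the component names, the 0 to 100 sub-scores, and the weights are invented for illustration, since vendors do not publish their formulas.

```python
# Hypothetical weights; real products keep theirs proprietary.
WEIGHTS = {"duration": 0.35, "deep_sleep": 0.25, "wake_time": 0.20, "hrv": 0.20}


def sleep_score(components):
    """Weighted sum of 0-100 sub-scores, rounded to a whole number."""
    return round(sum(WEIGHTS[name] * components[name] for name in WEIGHTS))


# One night as the app might score it.
night = {"duration": 85, "deep_sleep": 70, "wake_time": 80, "hrv": 75}

# The same night, except the stage model underestimated deep sleep.
same_night_low_deep = {**night, "deep_sleep": 40}

print(sleep_score(night))                # 78
print(sleep_score(same_night_low_deep))  # 71
```

Nothing about the body changed between the two calls. Only the least reliable ingredient did, and the headline number dropped by seven points.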

The World Sleep Society’s 2025 recommendations land in the right lane: use consumer sleep technologies for consistency and trends, not single-night interpretation; keep subjective sleep quality in the picture; and do not use consumer wearables to diagnose sleep disorders. [2]

That subjective piece matters. If the app says your night was poor but you feel alert, functional, and normal, the app does not automatically win. If the app says your night was excellent but you were awake for long stretches, your memory of being awake is not invalidated by a clean graph. The tracker is useful evidence. It is not the judge.

A practical trust map for wearable sleep data

For most people, the cleanest way to use a sleep tracker is to separate measurement from interpretation. The app will not do this for you. It wants one dashboard. You need a hierarchy.

  • Use bedtime, wake time, and total sleep duration as trend data, especially across weeks rather than single nights.
  • Take overnight resting heart rate and HRV seriously as physiological trends, while staying cautious about the app’s explanation for why they changed.
  • Treat deep sleep, REM sleep, and wake-after-sleep-onset as estimates, not direct observations.
  • Be skeptical when a device reports very precise stage percentages from one night.
  • Give extra doubt to nights when you lay still while awake, read in bed, meditated, or woke early and did not move much.
  • Do not use a consumer wearable to diagnose insomnia, sleep apnea, periodic limb movement disorder, or any other sleep disorder.
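Reading durations as trends can be as simple as a trailing weekly average. This is a minimal sketch with invented numbers; the function name, the seven-night window, and the 14 nights of data are assumptions, not a method from any of the cited sources.

```python
from statistics import mean


def rolling_mean(values, window=7):
    """Mean of each full trailing window over the input sequence."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# Hypothetical total sleep time in minutes for 14 consecutive nights.
sleep_minutes = [450, 432, 444, 366, 456, 438, 444,
                 420, 408, 414, 402, 396, 414, 390]

weekly = rolling_mean(sleep_minutes)
print(f"first 7-night average: {weekly[0]:.0f} min")
print(f"latest 7-night average: {weekly[-1]:.0f} min")
```

The single short night in the first week (366 minutes) barely dents that week’s average, while the steady slide in the second week shows up as roughly 25 fewer minutes per night. That is the kind of change a wearable is reasonably placed to show.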

If you are comparing devices before buying, PSG validation data matters more than app polish. A more detailed device-level comparison belongs in a head-to-head review such as “Apple Watch vs. Oura Ring vs. Fitbit,” while broader context on connected sleep products is covered in “Smart Sleep Devices: What the Research Says About Accuracy and Efficacy.” The main rule does not change much by brand: the closer the metric is to something the sensors directly capture, the more respect it deserves.

Where the line should be

A consumer sleep tracking device can be useful without being clinically authoritative. It can show that your sleep schedule is inconsistent, that your heart rate runs higher after alcohol, that your HRV drops during illness, or that your average sleep window is shorter than you thought. Those are legitimate uses.

It should not be treated as a sleep lab on your wrist. It does not measure brain waves, it struggles with quiet wakefulness, and its stage graphs are model outputs dressed in clean colors. Read the trends. Respect the heart-rate and HRV data. Keep single-night deep-sleep numbers at arm’s length.

References

  1. Do Sleep Trackers Really Work? — Johns Hopkins Medicine.
  2. Consumer sleep technology: an American Academy of Sleep Medicine clinical practice guideline — World Sleep Society / Sleep Medicine, 2025.
  3. Evaluation of Consumer Sleep Technologies for Sleep Stage Classification and Sleep Metrics in Healthy Adults — Sensors, 2024.
  4. Validation of Wearable Devices for Sleep Measurement in Healthy Adults — Australian Institute of Sport, 2022.
  5. Performance of seven consumer sleep-tracking devices compared with polysomnography — Sleep, 2021.