
Every night, millions of smartwatches produce a tidy summary: 7 hours 22 minutes of sleep, 1 hour 48 minutes of deep sleep, 22 minutes of REM. The numbers look clinical. They are not.
The fundamental limitation is straightforward: smartwatches lack an electroencephalogram (EEG). They cannot detect the brain's electrical activity — the only direct measure of whether you are awake, in light sleep, deep sleep, or REM. Instead, they rely on proxy signals: movement and heart rate. These are correlated with sleep stages, but correlation is not measurement.
As the American Academy of Sleep Medicine notes, none of these devices replace polysomnography or home sleep apnea testing for diagnosis. Johns Hopkins Medicine puts it even more plainly: trackers measure inactivity as a surrogate for sleep, not sleep itself. For exact sleep data, a medical sleep study measuring brain waves is required.
Understanding this distinction is essential before you trust the numbers on your wrist. The rest of this article walks through exactly how the proxy pipeline works: what the sensors capture, how algorithms turn that raw data into sleep stages, and what the validation studies actually reveal about accuracy.
Sensor Primer: Accelerometers and Photoplethysmography (PPG)
Two sensors do the heavy lifting in every sleep-tracking smartwatch: an accelerometer and a photoplethysmography (PPG) sensor. Each captures a different physiological signal, and each has well-documented limitations.
Accelerometry: The Movement Proxy
The accelerometer detects motion along three axes. When you are still for extended periods, the device logs that as probable sleep. When you shift position, it may log a brief arousal or a transition between stages.
This is the same principle behind actigraphy — a validated clinical tool used in sleep research for decades. But there is a critical gap: lying still does not equal being asleep. You can lie motionless while awake (a phenomenon called "quiet wakefulness"), and the accelerometer cannot tell the difference. Conversely, you can move during sleep without waking, and the device may misclassify that movement as wake.
The 2023 multicenter validation study published in JMIR confirmed this pattern across 11 consumer devices: wearables consistently misclassified wake as light sleep. The accelerometer alone cannot resolve the ambiguity.
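The movement proxy can be illustrated with a minimal sketch. It assumes acceleration samples in units of g, a 30-second epoch, and a made-up stillness threshold; the function names and the threshold value are hypothetical and do not describe any manufacturer's algorithm.

```python
import math

EPOCH_SECONDS = 30  # epoch length commonly used in sleep scoring


def epoch_activity(samples, sample_rate_hz):
    """Mean absolute deviation of acceleration magnitude from 1 g, per epoch."""
    per_epoch = EPOCH_SECONDS * sample_rate_hz
    activity = []
    for start in range(0, len(samples) - per_epoch + 1, per_epoch):
        window = samples[start:start + per_epoch]
        # At rest the magnitude is about 1 g (gravity only); movement shifts it.
        deviation = [abs(math.sqrt(x * x + y * y + z * z) - 1.0) for x, y, z in window]
        activity.append(sum(deviation) / per_epoch)
    return activity


def label_epochs(activity, threshold=0.02):
    """Label an epoch 'sleep' when movement is below a hypothetical threshold."""
    return ["sleep" if a < threshold else "wake" for a in activity]


# One still epoch followed by one restless epoch, sampled at 1 Hz.
still = [(0.0, 0.0, 1.0)] * 30
restless = [(0.5, 0.0, 1.0), (0.0, 0.0, 1.0)] * 15
labels = label_epochs(epoch_activity(still + restless, sample_rate_hz=1))
print(labels)  # ['sleep', 'wake']
```

The still epoch is labeled as sleep whether or not the wearer was actually asleep, which is exactly the quiet-wakefulness ambiguity described above.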
PPG: The Heart Rate and HRV Proxy
The PPG sensor uses green or red LEDs to measure blood volume changes under the skin — the same optical technique used in hospital pulse oximeters. From this signal, the device calculates heart rate and, with sufficient signal quality, heart rate variability (HRV).
Heart rate and HRV follow predictable patterns across sleep stages. During deep sleep (N3), heart rate typically drops and HRV increases as the parasympathetic nervous system dominates. During REM, heart rate becomes more variable and can approach waking levels. These patterns are real, but they are not unique to sleep. Exercise, caffeine, stress, and illness all shift heart rate and HRV in ways that can mimic or mask sleep-stage patterns.
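As a rough illustration of how heart rate and one common HRV metric can be derived once beats have been detected, the sketch below assumes a list of inter-beat intervals in milliseconds. The beat detection step itself is omitted, and the function names and sample values are hypothetical.

```python
import math


def mean_heart_rate_bpm(ibi_ms):
    """Average heart rate from inter-beat intervals given in milliseconds."""
    return 60000.0 / (sum(ibi_ms) / len(ibi_ms))


def rmssd_ms(ibi_ms):
    """Root mean square of successive differences between adjacent intervals."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Made-up intervals averaging one beat per second.
ibis = [1000, 1040, 980, 1020, 960]
print(mean_heart_rate_bpm(ibis))  # 60.0
print(round(rmssd_ms(ibis), 1))   # 51.0
```

Both values depend entirely on how cleanly the individual beats were detected, so a noisy optical signal degrades them directly.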
PPG also has a practical limitation on smartwatches: optical noise. Movement during sleep can dislodge the sensor from the skin, introducing artifacts. Poor fit, tattoos, and darker skin pigmentation can reduce signal quality. The device may still report a sleep stage during these periods, but the underlying data is degraded.

How Algorithms Turn Movement and Heart Rate into Sleep Stages
Raw sensor data is not useful on its own. The accelerometer outputs a stream of acceleration vectors; the PPG sensor outputs a photoplethysmogram waveform. Neither directly says "this is deep sleep." That translation is the job of the algorithm.
The pipeline works in three stages:
- Feature extraction. The algorithm calculates summary statistics from the raw data — movement intensity over 30-second epochs, heart rate trend, HRV metrics like RMSSD (root mean square of successive differences), and signal quality indicators.
- Classification. These features are fed into a machine learning model, typically a random forest, gradient-boosted tree, or neural network, that was trained on nights scored with polysomnography (PSG). The model assigns each 30-second epoch to one of four categories: wake, light sleep (N1/N2), deep sleep (N3), or REM.
- Post-processing. The raw classifications are smoothed to remove implausible transitions (e.g., a single REM epoch surrounded by deep sleep) and to produce the hypnogram you see in the morning.
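The post-processing step can be sketched as a simple majority vote over neighbouring epochs. This is one possible smoothing rule chosen for illustration, not a description of any vendor's method; the function name and window size are assumptions.

```python
from collections import Counter


def smooth_hypnogram(stages, window=3):
    """Replace an epoch's label when a strict majority of its neighbourhood disagrees."""
    half = window // 2
    smoothed = []
    for i in range(len(stages)):
        # Always read from the raw labels so earlier corrections do not cascade.
        neighbours = stages[max(0, i - half): i + half + 1]
        top_label, top_count = Counter(neighbours).most_common(1)[0]
        keep_majority = top_count > len(neighbours) / 2
        smoothed.append(top_label if keep_majority else stages[i])
    return smoothed


raw = ["deep", "deep", "rem", "deep", "deep", "light", "light"]
print(smooth_hypnogram(raw))
# ['deep', 'deep', 'deep', 'deep', 'deep', 'light', 'light']
```

The isolated REM epoch is absorbed into the surrounding deep sleep, while the genuine transition to light sleep survives. The same rule would also erase a real but brief awakening, which is one way smoothing can hide wake.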
Each manufacturer uses a proprietary algorithm trained on its own dataset. Fitbit, for example, has published that its algorithm uses a combination of accelerometry and heart rate variability to estimate sleep stages. The Fitbit Sleep Score is a composite of these stage estimates plus additional factors like sleep duration and restoration quality. Apple's approach, as described by the AASM, relies primarily on accelerometer patterns for sleep staging on the Apple Watch Series 9 through 11.
The key insight is that the algorithm is only as good as its training data. Most training datasets are collected from healthy adults in controlled lab conditions. Real-world sleep, disrupted by noise, children, pets, alcohol, or anxiety, looks different from lab sleep. When the model has seen few such nights during training, its output for them is closer to a guess than an estimate.
What the Validation Studies Actually Found
Two major peer-reviewed studies provide the clearest picture of how well this proxy pipeline works — and where it falls short.
The JMIR 2023 Multicenter Study
Published in JMIR mHealth and uHealth, this study tested 11 consumer sleep trackers, including the Google Pixel Watch, Galaxy Watch 5, Fitbit Sense 2, and Apple Watch 8, against in-lab polysomnography in 75 participants. The results paint a consistent picture across devices.
For four-stage sleep classification (wake, light, deep, REM), the top wearables achieved macro F1 scores of 0.57 to 0.58. Cohen's kappa values ranged from 0.4 to 0.6, indicating moderate agreement with PSG. The Apple Watch 8 showed lower agreement, with a macro F1 of 0.49 and a kappa of 0.30. Sleep efficiency bias ranged from −3.49 to +12.80 percentage points across devices, meaning some watches systematically overestimated, and others underestimated, the share of time in bed you actually spent asleep.
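To make the two agreement metrics concrete, here is a minimal sketch that computes macro F1 and Cohen's kappa from epoch-by-epoch labels. The eight epochs are invented for illustration and are not taken from the study; the function names are hypothetical.

```python
from collections import Counter

STAGES = ["wake", "light", "deep", "rem"]


def macro_f1(truth, predicted, labels=STAGES):
    """Unweighted mean of the per-stage F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(truth, predicted))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohens_kappa(truth, predicted):
    """Agreement beyond what the two label distributions would produce by chance."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_counts, pred_counts = Counter(truth), Counter(predicted)
    expected = sum(truth_counts[k] * pred_counts[k] for k in truth_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented labels: PSG reference versus a hypothetical watch.
psg_stages = ["wake", "wake", "light", "light", "light", "deep", "deep", "rem"]
watch_stages = ["light", "wake", "light", "light", "deep", "deep", "light", "rem"]
print(round(macro_f1(psg_stages, watch_stages), 2))      # 0.68
print(round(cohens_kappa(psg_stages, watch_stages), 2))  # 0.47
```

Macro F1 weights every stage equally, so a device cannot score well by being good only at the stage that fills most of the night. Kappa subtracts the agreement expected by chance, which is why it sits well below raw percent agreement (62.5% in this toy example).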
The Brigham and Women's 2024 Study
A 2024 study from Brigham and Women's Hospital and Harvard Medical School, published in Sensors, compared the Oura Ring Gen3, Fitbit Sense 2, and Apple Watch Series 8 against PSG in 35 healthy adults aged 20 to 50. The results are instructive at the device level.
| Metric | Apple Watch Series 8 vs PSG | Fitbit Sense 2 vs PSG | Oura Ring Gen3 vs PSG |
|---|---|---|---|
| Sleep/wake sensitivity | ≥95% | ≥95% | ≥95% |
| Light sleep bias | Overestimated by 45 min | Overestimated by 18 min | No significant difference |
| Deep sleep bias | Underestimated by 43 min | Underestimated by 15 min | No significant difference |
| Wake bias | Underestimated by 7 min | No significant difference | No significant difference |
| Deep sleep ICC | 0.13–0.36 (poor) | 0.13–0.36 (poor) | 0.13–0.36 (poor) |
| Total sleep time ICC | 0.85 (best) | Not reported | Not reported |
| Parameters not significantly different from PSG | None of the above | Wake only (of the above) | 7 of 8 parameters |
All three devices achieved ≥95% sensitivity for detecting sleep versus wake, meaning they labeled at least 95% of PSG-scored sleep epochs as sleep. But when it comes to assigning specific sleep stages, the picture changes dramatically. The Apple Watch underestimated deep sleep by 43 minutes per night on average. Fitbit overestimated light sleep by 18 minutes and underestimated deep sleep by 15 minutes. Oura showed no statistically significant difference from PSG on 7 of 8 parameters, though deep sleep concordance remained poor across all devices (ICC 0.13–0.36).
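A figure such as "underestimated by 43 minutes" is a mean difference between device and PSG across nights. The sketch below shows that calculation together with Bland-Altman style limits of agreement, using invented nightly values rather than the study's data; the function name is hypothetical.

```python
from statistics import mean, stdev


def bias_and_limits(device_minutes, psg_minutes):
    """Mean device-minus-PSG difference and approximate 95% limits of agreement."""
    diffs = [d - p for d, p in zip(device_minutes, psg_minutes)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, (bias - spread, bias + spread)


# Invented deep sleep minutes for five nights.
psg_deep = [90, 75, 110, 60, 95]
watch_deep = [50, 40, 55, 35, 45]
bias, (lower, upper) = bias_and_limits(watch_deep, psg_deep)
print(bias)                        # -41
print(round(lower), round(upper))  # -64 -18
```

The limits matter as much as the bias: a device can have a small average error and still be far off on any individual night.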
For a full breakdown of device-specific accuracy findings from these and other recent studies, see our detailed evidence review. For Apple Watch-specific data, see our Apple Watch accuracy analysis.
Accuracy Ranges: What the Numbers Mean for You
When you look at your smartwatch's sleep report in the morning, it helps to know what the numbers actually represent. Based on the available validation evidence, here are the typical accuracy ranges you can expect:
- Sleep/wake detection: >90% sensitivity. Your watch is very good at recognizing sleep epochs as sleep, and this is the most reliable metric across all devices. Sensitivity says nothing about how well wake is recognized, which is covered under wake detection below.
- Sleep-stage classification: 50–80% accuracy depending on the device and the specific stage. This is where the proxy approach shows its limits. The device is making an educated guess, not a measurement.
- Deep sleep estimation: The hardest stage to estimate accurately. Low movement combined with subtle HRV changes creates a signal that is easily confused with light sleep or quiet wakefulness. The Brigham study found poor concordance (ICC 0.13–0.36) for deep sleep across all three tested devices.
- Wake detection: The weakest metric. The JMIR study confirmed that wearables consistently misclassify wake as light sleep. If you lie still while awake, your watch will likely count that as sleep.
- Day-to-day variability is high. Even if the average bias over a week is small, individual nights can be off by 30 minutes or more for a specific stage. A single night's data should not be interpreted as diagnostic.
Sleep scientist Dean J. Miller (CQUniversity Australia) summarizes the situation concisely: most devices correctly identify >90% of sleep epochs but struggle with wake (26–73% correct) and sleep-stage assignment (53–60% correct for four-stage classification). These are estimates, not measurements.
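The gap between sleep detection and wake detection is easier to see when both are computed from the same night. The following sketch uses a made-up sequence of 20 epochs; the function and variable names are hypothetical.

```python
def sleep_wake_agreement(psg, device):
    """Return (sensitivity, specificity) with sleep as the positive class."""
    pairs = list(zip(psg, device))
    sleep_total = sum(t == "sleep" for t in psg)
    wake_total = len(psg) - sleep_total
    sleep_hits = sum(t == "sleep" and d == "sleep" for t, d in pairs)
    wake_hits = sum(t == "wake" and d == "wake" for t, d in pairs)
    return sleep_hits / sleep_total, wake_hits / wake_total


# Invented night: 4 wake epochs followed by 16 sleep epochs.
psg_epochs = ["wake"] * 4 + ["sleep"] * 16
watch_epochs = ["wake", "sleep", "sleep", "sleep"] + ["sleep"] * 15 + ["wake"]
sensitivity, specificity = sleep_wake_agreement(psg_epochs, watch_epochs)
print(sensitivity)  # 0.9375
print(specificity)  # 0.25
```

Because most epochs in a night are sleep, the watch in this toy example is right on 16 of 20 epochs while catching only one of the four wake epochs. A high headline figure can coexist with poor wake detection.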
Practical Takeaways: Use Trends, Not Diagnosis
The proxy-based methodology of smartwatches does not make them useless — it makes them useful for the right things and dangerous for the wrong ones.
What smartwatches are good for:
- Tracking long-term trends. Weekly or monthly averages smooth out the night-to-night noise and can reveal meaningful patterns — how your sleep changes with exercise, caffeine timing, or stress.
- Detecting large shifts. If your sleep duration drops by two hours for several nights in a row, that signal is likely real, even if the exact stage breakdown is not.
- Motivating consistency. The act of tracking itself can reinforce sleep hygiene behaviors — going to bed at a consistent time, reducing light exposure before bed — even if the specific numbers are imperfect.
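Trend tracking of this kind can be approximated with a rolling average over exported nightly totals. The sketch below assumes a plain list of total sleep minutes per night; the values are invented and the function name is hypothetical.

```python
def rolling_mean(values, window=7):
    """Average of each consecutive run of `window` nights."""
    return [
        sum(values[i - window + 1: i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Invented data: a normal week followed by a week of short nights.
nightly_minutes = [430, 455, 410, 470, 440, 425, 450,
                   330, 320, 335, 325, 340, 330, 315]
trend = rolling_mean(nightly_minutes)
print(trend[0])             # 440.0
print(round(trend[-1], 1))  # 327.9
```

A drop of nearly two hours in the weekly average is far larger than the night-to-night error discussed above, so it is the kind of shift worth acting on.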
What smartwatches are not good for:
- Diagnosing sleep disorders. The AASM is clear: wearables do not replace polysomnography or home sleep apnea testing. If you suspect sleep apnea, restless legs, or another clinical condition, a medical sleep study is required.
- Interpreting single-night data. A single night's deep sleep percentage is as likely to reflect algorithm noise as actual physiology. Do not change your behavior based on one bad score.
- Creating anxiety about sleep. There is a documented phenomenon called orthosomnia — the anxiety that arises from obsessing over imperfect tracker data. If checking your sleep score makes you stressed about sleep, the tracker is harming rather than helping.
The bottom line: your smartwatch is a useful pattern-recognition tool, not a clinical instrument. It can tell you whether your sleep is generally improving or declining over weeks and months. It cannot tell you whether you had 22 minutes or 38 minutes of deep sleep last night. Understanding that distinction is the difference between using the data wisely and being misled by it.


