The useful question is not whether a smartwatch can “track sleep” at all. It is which part of sleep it is actually measuring. On the strongest peer-reviewed comparisons, modern devices are surprisingly good at the basic sleep-versus-wake call, but much less trustworthy once they start naming stages like deep sleep, REM, and light sleep. In a 2024 head-to-head study from Brigham and Women’s / Harvard, Oura Ring Gen3, Fitbit Sense 2, and Apple Watch Series 8 all reached at least 95% sensitivity for sleep/wake detection against polysomnography (PSG), but their stage-specific performance varied much more widely.[1]

What the best head-to-head study actually found
That 2024 study is worth more attention than the usual smartwatch roundup because it compared three consumer devices side by side against polysomnography, the lab standard that reads brain activity rather than guessing from movement alone. It was funded by Oura Ring Inc., but it was independently conducted and peer-reviewed, which matters more than the sponsorship line by itself.[1] The result was not that one device “works” and the others do not. It was that all three were solid on sleep-versus-wake detection, while stage classification stayed uneven.

| Device | Sleep/wake sensitivity vs PSG | Stage-specific sensitivity vs PSG | Notable finding |
|---|---|---|---|
| Oura Ring Gen3 | ≥95% | 76–80% | Strong sleep/wake detection; stage labels still imperfect.[1] |
| Fitbit Sense 2 | ≥95% | 62–78% | More variation once the device had to name stages.[1] |
| Apple Watch Series 8 | ≥95% | 51–86% | Underestimated deep sleep by a mean of 43 minutes per night.[1] |

That spread is the whole story in miniature. A watch can be good enough to notice that you were asleep, yet still be shaky when it tries to sort one asleep state from another. The labels look precise because the app gives them a polished interface. The underlying signal is still coming from accelerometry and photoplethysmography, not EEG, so the device is inferring sleep biology from proxies.[3]
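To make that gap concrete, here is a minimal Python sketch of how sensitivity can be computed epoch by epoch. It assumes each hypnogram is a list of labels, one per scoring epoch. The function names and the ten-epoch night are invented for illustration; they are not the study's data or its analysis code.

```python
from collections import Counter

def stage_sensitivity(psg, device):
    """Per stage: of the epochs PSG gave that label, the share the device matched."""
    if len(psg) != len(device):
        raise ValueError("hypnograms must cover the same epochs")
    totals = Counter(psg)
    hits = Counter(p for p, d in zip(psg, device) if p == d)
    return {stage: hits[stage] / totals[stage] for stage in totals}

def sleep_sensitivity(psg, device):
    """Of the epochs PSG scored as any sleep stage, the share the device also called sleep."""
    calls = [d for p, d in zip(psg, device) if p != "wake"]
    return sum(d != "wake" for d in calls) / len(calls)

# Hypothetical night, one label per epoch (made-up values)
psg    = ["wake", "light", "light", "deep", "deep", "deep", "light", "rem", "rem", "wake"]
device = ["wake", "light", "light", "light", "deep", "light", "light", "rem", "light", "light"]

print(sleep_sensitivity(psg, device))   # share of sleep epochs detected as sleep
print(stage_sensitivity(psg, device))   # per-stage agreement with PSG
```

On this made-up night the device catches every sleep epoch, yet it matches only one of the three deep-sleep epochs. That is the same pattern as the table, at a much smaller scale.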

Why stage scores wobble even when sleep/wake detection looks good
A useful older baseline comes from Oxford’s 2021 synthesis of prior work: consumer sleep trackers were about 78% accurate for sleep versus wake, but only about 38% accurate for estimating sleep onset latency.[2] That is a huge drop, and it explains why a device may feel trustworthy on the broad question of “Did I sleep?” while becoming much less reliable the moment it tries to estimate when sleep started. The newer head-to-head data improve the picture, but they do not erase the basic hierarchy: telling sleep from wake is easier than scoring stages, and both are easier than timing sleep onset precisely.[1][2]
The failure modes are not subtle. Quiet wakefulness can look like sleep when you are still in bed. Someone with insomnia who lies motionless can be scored as asleep for much of the time they actually spent awake. Photoplethysmography (PPG) sensors also have known accuracy limitations in people with darker skin tones. And because most validation work is done in controlled settings with healthy volunteers, the clean numbers from the lab should not be treated as a guarantee for every real-world night.[1][3] That is also why device reputation is a weak shortcut; the better question is whether a specific model has been validated against PSG, and in whom.
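One way to see why quiet wakefulness is the weak spot is to score the wake epochs separately. The sketch below computes sensitivity (sleep epochs the device called sleep) alongside specificity (wake epochs the device called wake), using the standard definitions. The function name and the twelve-epoch example are hypothetical and do not come from any of the cited studies.

```python
def sleep_wake_metrics(psg_asleep, device_asleep):
    """Return sensitivity and specificity for a binary asleep/awake comparison."""
    pairs = list(zip(psg_asleep, device_asleep))
    on_sleep = [d for p, d in pairs if p]       # device calls during true sleep
    on_wake = [d for p, d in pairs if not p]    # device calls during true wake
    if not on_sleep or not on_wake:
        raise ValueError("need at least one sleep epoch and one wake epoch")
    return {
        "sensitivity": sum(on_sleep) / len(on_sleep),
        "specificity": sum(not d for d in on_wake) / len(on_wake),
    }

# Hypothetical night: 4 epochs lying awake but still, then 8 epochs asleep.
psg_asleep = [False] * 4 + [True] * 8
# The device mistakes 3 of the 4 motionless wake epochs for sleep.
device_asleep = [True, True, True, False] + [True] * 8

metrics = sleep_wake_metrics(psg_asleep, device_asleep)
print(metrics)
```

In this invented case sensitivity is perfect while specificity is only 0.25, so a headline sensitivity figure can look excellent even when most of the time spent lying awake is logged as sleep.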
A small Samsung Galaxy Watch 3 validation study makes the same point from another angle. In 2023, the Journal of Sleep Medicine reported sleep-wake validation in 32 participants, 87.5% of them male.[4] That does not make the study worthless. It does make it a reminder to avoid treating one model, one sample, or one generation as timeless proof for the whole brand.

What to trust in practice
The cleanest way to use a smartwatch that tracks sleep is to separate the sturdy numbers from the decorative ones. Total sleep, sleep-versus-wake patterns, and longer-term trends are usually the most useful outputs. Deep sleep, REM, and sleep onset latency deserve more caution, especially if the number changes your mood more than your habits. A single night’s stage breakdown is better treated as an estimate than as a readout of what your brain definitively did minute by minute.[1][2]
- Trust sleep-versus-wake detection more than stage labels.
- Use total sleep and week-to-week trends as directionally useful, not clinically exact.
- Treat deep sleep, REM, and sleep onset latency as estimates with wider error bars.
- Check whether the exact device generation you own has been validated against PSG, not just whether the brand has a sleep feature.
- Do not use a consumer smartwatch to diagnose a sleep disorder.
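For the week-to-week trend in that list, a rough sketch is to average total sleep over whole weeks instead of reacting to single nights. The code assumes you have exported nightly total sleep in minutes as a plain list; the helper name and the fourteen values are made up for illustration.

```python
from statistics import mean

def weekly_means(nightly_minutes, nights_per_week=7):
    """Average total sleep over consecutive, non-overlapping full weeks."""
    last_start = len(nightly_minutes) - nights_per_week
    return [
        mean(nightly_minutes[i:i + nights_per_week])
        for i in range(0, last_start + 1, nights_per_week)
    ]

# Hypothetical export: total sleep per night, in minutes
nights = [420, 390, 450, 400, 380, 470, 430,   # week 1
          400, 370, 410, 380, 360, 440, 400]   # week 2

weeks = weekly_means(nights)
print(weeks)
```

Averaging smooths out single-night errors, but it does not remove a consistent bias in the device, so the result is still a direction of change and not a clinical measurement.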
That is the practical limit that keeps the category honest. Smartwatches are good enough to help you observe sleep patterns, and often good enough to notice when something is off. They are not good enough to tell you with clinical confidence what stage you were in every minute of the night, and the newer the model, the more important it is to ask which validation data actually applies to it.

References
- [1] Accuracy of Three Commercial Wearable Devices for Sleep Tracking in Healthy Adults. Sensors, 2024.
- [2] Are sleep trackers accurate? Oxford Neuroscience, 2021.
- [3] Comparing sleep features of popular smartwatches. American Academy of Sleep Medicine.
- [4] Validation of the Samsung Smartwatch for Sleep–Wake Determination. Journal of Sleep Medicine, 2023.


