Depression Screener — The 9-Item Kroenke 2001 Instrument

Definition

A depression screener is a brief self-report instrument used to identify probable cases of major depressive disorder (MDD) in primary care, occupational health, and general practice contexts. A screening instrument is not a diagnostic tool: it identifies people likely to benefit from clinical evaluation, and a positive screen always requires clinical confirmation. The most widely used brief depression screener is the 9-item instrument developed by Kroenke, Spitzer & Williams in 2001 and published as the PHQ-9.

The 9-item structure

The 9-item depression screener developed by Kroenke et al. 2001 asks how often the user has been bothered by 9 specific depression symptoms over the past 2 weeks. Each item is rated on a 4-point frequency scale:

0 — Not at all
1 — Several days
2 — More than half the days
3 — Nearly every day

The 9 items map directly onto the 9 DSM major depressive episode criteria — making the screen unusually transparent. The total score ranges from 0 to 27.

Item	Symptom captured	DSM criterion
1	Anhedonia (loss of interest/pleasure)	A2
2	Depressed mood	A1
3	Sleep disturbance	A4
4	Fatigue / low energy	A6
5	Appetite change	A3
6	Worthlessness / guilt	A7
7	Concentration difficulty	A8
8	Psychomotor changes	A5
9	Self-harm ideation	A9
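The scoring rule described above (nine ratings of 0 to 3, summed to a 0-27 total) can be sketched in a few lines. This is a minimal illustration, not an official implementation; the function name total_score and the input checks are assumptions made for the example.

```python
from typing import Sequence

# Allowed ratings: 0 = Not at all ... 3 = Nearly every day.
VALID_RATINGS = (0, 1, 2, 3)


def total_score(responses: Sequence[int]) -> int:
    """Sum the nine item ratings into a 0-27 total (hypothetical helper)."""
    if len(responses) != 9:
        raise ValueError(f"expected 9 item ratings, got {len(responses)}")
    for position, rating in enumerate(responses, start=1):
        if rating not in VALID_RATINGS:
            raise ValueError(f"item {position}: rating must be 0-3, got {rating!r}")
    return sum(responses)


print(total_score([1, 2, 0, 3, 1, 0, 2, 0, 0]))  # 9
```

Rejecting incomplete or out-of-range input, rather than silently summing it, keeps a partial questionnaire from being read as a low score.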

The 10th functional impairment item

The instrument also includes a 10th item asking how difficult these problems have been to live with (4-point scale: not at all / somewhat / very / extremely difficult). This item is captured but not added to the score. It serves as a separate clinical signal — a moderate-band score with "extremely difficult" functional impairment is interpreted as warranting more urgent follow-up than the same symptom score with minimal functional difficulty.
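One way to keep the 10th item separate from the symptom total is to store it as its own field. The sketch below assumes a hypothetical ScreenRecord type and a hypothetical triage rule (total of 10 or more combined with "extremely" difficult); neither comes from the instrument itself, and the threshold is only one possible reading of the paragraph above.

```python
from dataclasses import dataclass

# Response options for the 10th item, in the order listed in the text.
DIFFICULTY_LEVELS = ("not at all", "somewhat", "very", "extremely")


@dataclass(frozen=True)
class ScreenRecord:
    symptom_total: int  # sum of items 1-9 only (0-27)
    difficulty: str     # 10th item: stored next to the total, never added to it

    def __post_init__(self) -> None:
        if not 0 <= self.symptom_total <= 27:
            raise ValueError("symptom_total must be between 0 and 27")
        if self.difficulty not in DIFFICULTY_LEVELS:
            raise ValueError(f"unknown difficulty level: {self.difficulty!r}")


def more_urgent_follow_up(record: ScreenRecord) -> bool:
    """Hypothetical rule: total of 10+ together with 'extremely' difficult."""
    return record.symptom_total >= 10 and record.difficulty == "extremely"


a = ScreenRecord(symptom_total=12, difficulty="extremely")
b = ScreenRecord(symptom_total=12, difficulty="not at all")
print(more_urgent_follow_up(a), more_urgent_follow_up(b))  # True False
```

Both records carry the same symptom total; only the separately stored difficulty rating distinguishes them.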

The five severity bands

Standard cutpoints from Kroenke 2001, used by NHS IAPT, VA/DoD clinical practice guidelines, and APA major depressive disorder treatment guidelines:

Score	Band	Interpretation
0-4	Minimal	Symptoms unlikely clinically significant
5-9	Mild	Subthreshold for major depressive episode
10-14	Moderate	Probable MDE per standard ≥10 cutoff (sens 88%, spec 88%)
15-19	Moderately severe	Active treatment usually appropriate
20-27	Severe	High symptom burden; combined treatment often warranted
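The band table above translates directly into a lookup. The following is a minimal sketch; the name severity_band and the tuple layout are illustrative choices, while the ranges are copied from the table.

```python
# (lowest score, highest score, band name), copied from the table above.
SEVERITY_BANDS = [
    (0, 4, "Minimal"),
    (5, 9, "Mild"),
    (10, 14, "Moderate"),
    (15, 19, "Moderately severe"),
    (20, 27, "Severe"),
]


def severity_band(total: int) -> str:
    """Return the band name for a 0-27 total (hypothetical helper)."""
    for low, high, name in SEVERITY_BANDS:
        if low <= total <= high:
            return name
    raise ValueError(f"total must be an integer from 0 to 27, got {total!r}")


for total in (4, 5, 10, 19, 20):
    print(total, severity_band(total))
```

The example loop exercises the band boundaries, which is where off-by-one mistakes usually appear.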

Validation evidence

Property	Value	Source
Internal consistency (Cronbach's α)	0.89	Kroenke 2001 (n=580); confirmed 0.88 in Hinz 2017 (n=5,018)
Test-retest reliability	r = 0.84	Kroenke 2001 (n=300, 48 hours)
Sensitivity at ≥10	88%	Kroenke 2001 (primary care, n=580)
Specificity at ≥10	88%	Kroenke 2001 (primary care, n=580)
Negative predictive value (NPV)	≈ 0.99	At ≥10 cutoff, primary care prevalence ~7%
Convergent validity (vs BDI)	r = 0.73	Kroenke 2001
Convergent validity (vs HAM-D)	r = 0.79	Cameron 2008
Manea meta-analysis range	Sens 0.78-0.88, Spec 0.85-0.94	Manea 2012, 18 studies

The instrument has been validated in dozens of cross-cultural studies (English, German, Spanish, Mandarin Chinese, Japanese, Arabic, Brazilian Portuguese, French, others). Its psychometric properties are robust across translations.
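The NPV figure in the table follows from sensitivity, specificity, and prevalence by Bayes' rule. The sketch below reproduces that arithmetic with the table's inputs (88%, 88%, and about 7% prevalence); the function name predictive_values is an assumption, and the PPV it prints is a derived value that the table itself does not report.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


ppv, npv = predictive_values(sensitivity=0.88, specificity=0.88, prevalence=0.07)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")  # PPV = 0.356, NPV = 0.990
```

Under these assumptions a negative screen is very rarely wrong, while only about a third of positive screens are true cases, which is why a positive screen requires clinical confirmation.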

Population norms

Population	Sample	Mean	SD	Source
General population (Germany)	n = 5,018	2.91	3.52	Kocalevent et al. 2013
Female (Germany)	n ≈ 2,608	3.13	3.61	Kocalevent et al. 2013
Male (Germany)	n ≈ 2,410	2.66	3.41	Kocalevent et al. 2013
Primary care and obstetrics-gynecology clinics (US)	n = 6,000	3.3	3.8	Kroenke et al. 2001
Psychiatric outpatient	n = 502	13.8	6.5	Beard 2016
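A score can be placed relative to one of these norm groups as a distance from the group mean in SD units. The sketch below uses two rows of the table; the function name is hypothetical. Because scores in general-population samples cluster near zero, the result is better read as a rough distance than converted into a normal-curve percentile.

```python
def standardized_distance(score: float, norm_mean: float, norm_sd: float) -> float:
    """Distance of a score from a norm-group mean, in SD units."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (score - norm_mean) / norm_sd


# (mean, SD) pairs taken from the population norms table above.
NORMS = {
    "general population (Germany)": (2.91, 3.52),
    "psychiatric outpatient": (13.8, 6.5),
}

for group, (mean, sd) in NORMS.items():
    print(f"{group}: {standardized_distance(10, mean, sd):+.2f} SD")
# general population (Germany): +2.01 SD
# psychiatric outpatient: -0.58 SD
```

The same total of 10 sits about two SDs above the general-population mean but below the mean of a psychiatric outpatient sample.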

Six limitations

Snapshot, not trajectory. The screener captures last-2-weeks symptoms. Depression naturally fluctuates with life events; one administration is a single moment in time.
Self-report dependent. Honest self-report is the foundation. Self-criticism, alexithymia, recall bias, and self-presentation effects can all distort scores.
Not diagnostic. The screen is sensitive (88% per Kroenke 2001) but does not establish MDE diagnosis. Clinical interview against DSM-5 criteria is required for diagnosis.
Cultural variation. Validated primarily in US/European populations. Cultural expression of depression varies, and somatic-vs-cognitive emphasis differs across cultures.
Differential diagnosis ignored. A high score may reflect bipolar depression, persistent depressive disorder, adjustment disorder, or grief — the screen flags depression symptoms, not their kind or source.
Adolescent validation differs. The instrument was validated in adults. The PHQ-A version has separate validation (Richardson 2010) with different cutoffs; this is the appropriate instrument for users under 18.

Frequently asked questions

What does a 'screener' do that a diagnostic instrument doesn't?

A screener is designed to be brief, sensitive, and easy to administer. It identifies people likely to have a particular condition — flagging them for further clinical evaluation — but it does not establish a diagnosis. Screeners trade specificity for sensitivity; they are designed to catch most true cases at the cost of some false positives. A diagnostic instrument is more comprehensive, more time-intensive, and is administered by a trained clinician using structured criteria like DSM-5. The 9-item depression screener takes 2-3 minutes; a diagnostic clinical interview takes 30-90 minutes.

Who developed this instrument and when?

Kurt Kroenke, Robert Spitzer, and Janet Williams developed the 9-item depression screener in 2001, publishing the validation study in the Journal of General Internal Medicine. It was created as part of the PRIME-MD project funded by Pfizer to develop brief mental health screening tools for primary care. The instrument has been cited over 100,000 times across psychology, psychiatry, primary care, occupational health, and sleep medicine, making it the most-cited brief depression screener in the world.

How well does it work?

Validation evidence is strong. The original Kroenke 2001 study (n=580 with structured diagnostic interview) reported Cronbach's α = 0.89 (internal consistency), test-retest reliability r = 0.84, and at the standard cutoff of ≥10: sensitivity 88% and specificity 88% for major depressive disorder. Manea, Gilbody & McMillan's 2012 meta-analysis of 18 studies confirmed the ≥10 cutoff has the best sensitivity/specificity balance across primary care populations. Convergent validity with the Beck Depression Inventory: r = 0.73.
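For readers unfamiliar with the internal-consistency statistic, Cronbach's α compares the sum of the per-item variances with the variance of the total score: α = k/(k-1) × (1 - Σ item variances / total-score variance), where k is the number of items. The sketch below computes it with the standard library; the response matrix is made-up illustration data, not data from any study cited here.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[int]]) -> float:
    """Cronbach's alpha; rows[i][j] is respondent i's rating on item j."""
    if len(rows) < 2:
        raise ValueError("need at least 2 respondents")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("every respondent needs the same number (>= 2) of items")
    # Sample variance of each item (column) and of the per-respondent totals.
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings from four respondents on nine items.
example = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 2, 2, 1, 1, 1, 0, 0],
    [2, 3, 2, 3, 2, 2, 2, 1, 1],
    [3, 3, 3, 3, 2, 3, 2, 2, 1],
]
print(round(cronbach_alpha(example), 2))
```

Values near 1 mean the items rise and fall together across respondents; a real estimate needs a sample far larger than four people.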

What does a score in each band mean?

Standard cutpoints from Kroenke 2001 (used by NHS IAPT, VA/DoD, and APA): 0-4 minimal depression (most adults score here); 5-9 mild (subthreshold for MDE); 10-14 moderate (probable MDE per the standard ≥10 cutoff); 15-19 moderately severe; 20-27 severe. The cutoff of 10 has 88% sensitivity and 88% specificity per the original validation. A positive screen (≥10) does not establish MDE diagnosis — it identifies probable cases that warrant clinical confirmation.

Is the cutoff really 10? What about other cutoffs?

The standard cutoff of ≥10 is the most widely used and most cited. Some research contexts use ≥8 to maximize sensitivity (catching more true cases at the cost of more false positives); this is reasonable for screening protocols in high-risk populations such as perinatal patients. Some contexts use ≥12 to maximize specificity (fewer false positives at the cost of more missed cases); this is more conservative. The Manea 2012 meta-analysis found that ≥10 has the best balance for primary care populations. The NICE 2022 guidelines, the US Preventive Services Task Force, and the APA all use ≥10.
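The effect of moving the cutoff can be seen by treating it as a parameter. In this sketch the function name screen_positive and the list of totals are hypothetical; it only shows how many of the same scores screen positive at ≥8, ≥10, and ≥12, and says nothing about which of them are true cases.

```python
def screen_positive(total: int, cutoff: int = 10) -> bool:
    """True when a 0-27 total meets or exceeds the chosen cutoff."""
    if not 0 <= total <= 27:
        raise ValueError("total must be between 0 and 27")
    return total >= cutoff


# Hypothetical totals from seven respondents.
totals = [3, 8, 9, 10, 11, 12, 15]

for cutoff in (8, 10, 12):
    positives = sum(screen_positive(t, cutoff) for t in totals)
    print(f"cutoff >= {cutoff}: {positives} of {len(totals)} screen positive")
# cutoff >= 8: 6 of 7 screen positive
# cutoff >= 10: 4 of 7 screen positive
# cutoff >= 12: 2 of 7 screen positive
```

Lowering the cutoff flags more people for follow-up, and raising it flags fewer; this is the sensitivity and specificity trade-off in operational terms.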

How is item 9 special?

Item 9 asks about thoughts of being better off dead or hurting yourself. Unlike other items, item 9 has standalone clinical significance — any non-zero response indicates self-harm ideation that warrants specific clinical attention regardless of total score. The LBL Depression Test surfaces a crisis modal with crisis resources (988 Suicide & Crisis Lifeline, Crisis Text Line, Samaritans, Talk Suicide Canada, findahelpline.com) immediately when item 9 receives any non-zero response. Mann et al. 2005's systematic review of suicide prevention identifies direct connection to crisis resources as one of the most evidence-supported interventions.
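The rule that item 9 is checked independently of the total can be sketched as follows. The function name item9_flag is hypothetical, and the code is not the site's actual implementation; it only illustrates that the flag depends on one rating rather than on the sum.

```python
from typing import Sequence

ITEM_9_INDEX = 8  # the ninth item, using zero-based indexing


def item9_flag(responses: Sequence[int]) -> bool:
    """True when item 9 has any non-zero rating, whatever the total is."""
    if len(responses) != 9:
        raise ValueError(f"expected 9 item ratings, got {len(responses)}")
    return responses[ITEM_9_INDEX] > 0


responses = [0, 1, 0, 0, 0, 0, 0, 0, 1]
print(sum(responses), item9_flag(responses))  # 2 True
```

In this example the total of 2 falls in the minimal band, yet the flag is still raised, so the check should run before any band-based logic and not be gated by it.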

What are the limitations?

Six main limitations: (1) Snapshot, not trajectory — measures last-2-weeks symptoms only. (2) Self-report dependent — subject to recall bias, alexithymia, and self-presentation effects. (3) Not diagnostic — identifies probable cases but does not establish MDE diagnosis. (4) Cultural variation — validated primarily in US/European populations; cultural expression of depression varies. (5) Differential diagnosis ignored — does not distinguish unipolar MDD from bipolar depression, persistent depressive disorder, adjustment disorder, or grief. (6) Adolescent validation differs — the PHQ-A version has different cutoffs and is the appropriate instrument for users under 18.

How does this differ from the Beck Depression Inventory or HAM-D?

The Beck Depression Inventory (BDI-II) has 21 items and takes 5-10 minutes; it captures more depression dimensions (cognitive distortions, behavioral symptoms, somatic complaints) but is less efficient for primary-care screening. The Hamilton Depression Rating Scale (HAM-D) is clinician-administered, 17-21 items, takes 15-30 minutes, and is the standard outcome measure in clinical trials but not appropriate as a self-screen. The 9-item Kroenke 2001 instrument trades depth for brevity and self-administration — the right trade-off for screening contexts but not for severity tracking in research.

References

Kroenke, K., Spitzer, R. L., & Williams, J. B. W. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613. doi.org/10.1046/j.1525-1497.2001.016009606.x
Manea, L., Gilbody, S., & McMillan, D. (2012). Optimal cut-off score for diagnosing depression with the Patient Health Questionnaire (PHQ-9): a meta-analysis. CMAJ, 184(3), E191–E196. doi.org/10.1503/cmaj.110829
Kocalevent, R. D., Hinz, A., & Brähler, E. (2013). Standardization of the depression screener Patient Health Questionnaire (PHQ-9) in the general population. General Hospital Psychiatry, 35(5), 551–555. doi.org/10.1016/j.genhosppsych.2013.04.006
Beard, C., Hsu, K. J., Rifkin, L. S., Busch, A. B., & Björgvinsson, T. (2016). Validation of the PHQ-9 in a psychiatric sample. Journal of Affective Disorders, 193, 267–273. doi.org/10.1016/j.jad.2015.12.075
Hinz, A., Klein, A. M., Brähler, E., et al. (2017). Psychometric evaluation of the Generalized Anxiety Disorder screener and the Patient Health Questionnaire. Journal of Affective Disorders, 210, 338–344.
Cameron, I. M., Crawford, J. R., Lawton, K., & Reid, I. C. (2008). Psychometric comparison of PHQ-9 and HADS for measuring depression severity in primary care. British Journal of General Practice, 58(546), 32–36.
Mann, J. J., Apter, A., Bertolote, J., et al. (2005). Suicide prevention strategies: a systematic review. JAMA, 294(16), 2064–2074. doi.org/10.1001/jama.294.16.2064