Effect size
What is effect size?
Effect size is a quantitative measure of the magnitude of a relationship or difference observed in research. It is distinct from statistical significance, which captures only whether the effect is unlikely to be zero. Effect size is the bridge between statistics and substantive meaning: it tells you whether a finding actually matters in the real world. The concept was formalized by Jacob Cohen in 1969, and contemporary statistics increasingly treats effect size estimation, rather than significance testing, as the primary inferential goal, particularly in the wake of the replication crisis that began in earnest with the Open Science Collaboration's 2015 reproducibility study.
Without effect size, statistics cannot answer the most important question. Statistical significance tells you the effect is unlikely to be zero. Effect size tells you whether being non-zero is worth caring about.
Why effect size matters
Effect size matters because statistical significance is a poor guide to importance on its own. A treatment that reduces dementia risk by 0.5% can be highly statistically significant in a large study and clinically negligible. A treatment that reduces risk by 20% can be statistically inconclusive in a small study and clinically transformative. Headlines that report "X significantly reduces Y" without effect size information are reporting only half the story — and often the less informative half.
The replication crisis brought this distinction into sharp focus. The Open Science Collaboration's 2015 Reproducibility Project attempted to replicate 100 psychology studies; while 97% of originals reported significant results, only 36% of replications did, and replication effect sizes averaged about half the original magnitude. A 2024 Institute for Replication paper coordinated by Brodeur and over 350 coauthors examined 110 articles from leading economics and political science journals and continued to find systematic effect size inflation. The lesson is consistent across fields: original effect sizes are systematically inflated, and effect size estimation rather than significance testing is the more reliable inferential goal.
Contemporary statistical reform — advocated by Geoff Cumming, Daniel Lakens, and the Society for the Improvement of Psychological Science, among others — has emphasized reporting effect sizes with confidence intervals, registering studies in advance, and treating estimation as primary. The 2025 PMC paper "Improving statistical reporting in psychology" introduced the Transparent Statistical Reporting in Psychology (TSRP) Checklist as a practical implementation guide for these reforms.
Where the concept comes from and how it works
The technical concept of effect size was substantially formalized by Jacob Cohen in his 1969 textbook Statistical Power Analysis for the Behavioral Sciences, which proposed standardized effect size measures (Cohen's d, r, f) and benchmarks for interpreting them as small (d = 0.2), medium (d = 0.5), or large (d = 0.8). Cohen's framework was foundational. He himself noted, however, that "small" and "large" depend heavily on context: an effect size of d = 0.2 in a public health intervention applied to billions of people may be enormously consequential, while d = 0.5 in a tightly controlled lab study of a supposedly mechanistic process may be unremarkable.
The mechanics depend on research design. Cohen's d divides the difference between two group means by a standard deviation (typically the pooled standard deviation of the two groups), producing a unit-free measure comparable across studies. The correlation coefficient r is itself an effect size on a -1 to +1 scale. Odds ratios and risk ratios serve a similar function for categorical outcomes. Each captures a different aspect of magnitude.
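As a minimal sketch of the standardized mean difference described above, the following Python computes Cohen's d with a pooled standard deviation. The function name `cohens_d` and the sample data are illustrative assumptions, and the pooled standard deviation is one common choice of standardizer rather than the only one.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance() is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scores: means of 4 and 3, each group with sample variance 4.
treated = [2, 4, 6]
control = [1, 3, 5]
print(cohens_d(treated, control))  # 0.5
```

Because the result is expressed in standard deviation units, the same function can be applied to outcomes measured on different scales.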
Modern best practice reports effect sizes with confidence intervals, which capture both the central estimate and the precision of that estimate. A report of "d = 0.3, 95% CI [0.05, 0.55]" tells the reader that the central estimate is small-to-medium but that the data are also consistent with effect sizes ranging from very small to medium-large. The width of the interval reflects sample size and within-group variability; narrow intervals signal precision, wide intervals signal uncertainty.
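To make the link between sample size and interval width concrete, here is a rough sketch that builds an approximate 95% confidence interval for Cohen's d. It assumes the common large-sample approximation to the standard error of d and a normal critical value of 1.96; the function name and the group sizes are hypothetical, chosen only so that the first interval matches the example above.

```python
from math import sqrt


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate CI for Cohen's d using a large-sample standard error."""
    # Assumed approximation: SE(d) ~ sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# About 124 participants per group reproduces the interval quoted in the text.
low, high = d_confidence_interval(0.3, 124, 124)
print(f"n = 124 per group: [{low:.2f}, {high:.2f}]")  # [0.05, 0.55]

# The same point estimate from a much smaller study gives a far wider interval.
low_small, high_small = d_confidence_interval(0.3, 30, 30)
print(f"n = 30 per group:  [{low_small:.2f}, {high_small:.2f}]")
```

With 30 participants per group the interval spans zero, so the same d = 0.3 would be statistically inconclusive. Exact intervals based on the noncentral t distribution differ slightly from this approximation.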
The major effect size measures
Different effect size measures are appropriate for different research designs. Recognizing which is being reported matters for interpretation.
- Cohen's d. The standardized mean difference between two groups. Useful for comparing the magnitude of effects across studies that used different scales. Cohen's benchmarks: 0.2 small, 0.5 medium, 0.8 large — but always interpreted with context.
- Pearson's r. The correlation coefficient between two continuous variables, ranging from -1 to +1. Cohen's benchmarks: 0.1 small, 0.3 medium, 0.5 large. The square (r²) gives the proportion of variance explained, which is often more interpretable.
- Odds ratio (OR). The ratio of the odds of an outcome in one group to the odds in a comparison group. Common in epidemiology and case-control studies. An OR of 1 indicates no difference; OR > 1 indicates increased odds in the exposed group, OR < 1 indicates decreased odds.
- Risk ratio (RR) and risk difference. Used in cohort studies and randomized trials. Risk ratio is the ratio of event probabilities; risk difference is the absolute difference. Risk difference is usually more interpretable for clinical decision-making because it answers "how much does this treatment help in absolute terms?"
- Eta-squared (η²) and partial eta-squared. The proportion of variance in an outcome explained by a predictor or experimental factor. Common in ANOVA contexts. Cohen's benchmarks: 0.01 small, 0.06 medium, 0.14 large.
- Population attributable fraction (PAF). The proportion of cases of an outcome attributable to a given risk factor at population level. The 2024 Lancet Commission on dementia uses PAFs throughout — for instance, low education is reported as a 5% PAF for global dementia.
Different measures are not interchangeable, and different fields favor different defaults. Comparisons across studies require attention to which measure is being reported and how it was calculated.
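The categorical measures in the list above can all be derived from the same 2x2 table, which makes their differences easier to see. The sketch below uses hypothetical counts and function names. The population attributable fraction is computed with Levin's textbook formula for a single risk factor, which is an assumption made for illustration and not necessarily the method used by the Lancet Commission.

```python
def two_by_two_measures(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Risk ratio, risk difference, and odds ratio from 2x2 counts."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
        "odds_ratio": odds_exposed / odds_unexposed,
    }


def levin_paf(exposure_prevalence, risk_ratio):
    """Levin's formula: PAF = p(RR - 1) / (1 + p(RR - 1))."""
    excess = exposure_prevalence * (risk_ratio - 1)
    return excess / (1 + excess)


# Hypothetical cohort: 30 of 100 exposed and 20 of 100 unexposed develop the outcome.
measures = two_by_two_measures(30, 100, 20, 100)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")

# Hypothetical: 25% of the population is exposed to this risk factor.
print(f"PAF: {levin_paf(0.25, measures['risk_ratio']):.3f}")
```

In this made-up table the risk ratio is 1.5 while the odds ratio is about 1.71, which is one reason the two should not be read as if they were the same quantity.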
What effect size can — and can't — tell you
What it can do. Effect size provides a quantitative basis for evaluating practical importance. Combined with cost, risk, and feasibility information, effect size lets decision-makers compare interventions on a common scale. Reporting effect sizes with confidence intervals also surfaces the precision of evidence — small studies with large effects produce wide confidence intervals that signal substantial uncertainty even when the central estimate looks impressive.
What it can't do. Effect size alone is not a complete evaluation. A small effect size applied at population scale can be enormously consequential; a large effect size in a tightly controlled lab study of a single intervention may not generalize. The "small/medium/large" benchmarks Cohen proposed are heuristic, not normative — what counts as a meaningful effect depends on the research context, the cost of the intervention, and the stakes of the decision. Effect size also does not address the validity of the underlying measurement: a precisely estimated effect on a poorly validated instrument can be precisely wrong.
Common misconceptions
"Statistical significance means the effect is large." No. A p-value of 0.001 does not mean a large effect; it means the effect is unlikely to be zero given a large enough sample. Statistical significance and effect size are independent concepts. Large samples can produce highly significant tiny effects; small samples can produce nonsignificant large effects.
"Cohen's small/medium/large benchmarks are universal standards." Cohen himself warned against universalizing them. The 2024 Lancet Commission on dementia identifies modifiable risk factors with population-attributable fractions of 1-7% — all "small" by Cohen's benchmarks — yet their cumulative effect accounts for ~45% of global dementia cases. Small effect sizes at population scale routinely matter more than large effects in narrow lab studies.
"A single study's effect size is the truth about the effect." Largely false. Effect sizes from single studies — particularly small or unreplicated studies — are systematically inflated. The Open Science Collaboration's 2015 Reproducibility Project found replication effect sizes averaging half the magnitude of the originals. Reliable effect size estimation typically requires meta-analysis across many studies, with appropriate corrections for publication bias.
"All effect size measures are interchangeable." No. Cohen's d, Pearson's r, odds ratios, risk ratios, and population-attributable fractions all measure different aspects of magnitude and apply to different research designs. Comparing studies requires attention to which measure is being reported. Some inter-conversions exist (e.g., d can be converted to r under specific assumptions), but the conversions can mislead when assumptions don't hold.
A practical example
Consider two news headlines, both technically accurate. Headline A: "New drug significantly reduces heart attack risk." Headline B: "Daily exercise significantly reduces heart attack risk." Both report statistically significant findings. Without effect sizes, the two headlines appear to carry equal weight.
The effect sizes tell a different story. Suppose the drug reduces heart attack risk by 0.3 percentage points (RR = 0.97): significant in a large trial, but a small absolute effect. The exercise intervention reduces risk by 4 percentage points (RR = 0.65): also significant, but a much larger absolute effect. The effect-size-aware reading is that exercise has the larger absolute effect by a factor of more than ten, with a lower cost and a more favorable risk profile. Significance alone obscured this; effect size revealed it.
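The arithmetic behind this comparison can be sketched as follows, using the hypothetical figures from the example. The number needed to treat (NNT = 1 / risk difference) is a standard summary not used in the text and is added here only as an illustration. The implied baseline risk is back-calculated from the stated risk difference and risk ratio.

```python
def number_needed_to_treat(risk_difference):
    """People who must receive the intervention to prevent one event."""
    return 1 / risk_difference


def implied_baseline_risk(risk_difference, risk_ratio):
    """Baseline risk implied by risk_difference = baseline * (1 - risk_ratio)."""
    return risk_difference / (1 - risk_ratio)


# (absolute risk reduction, risk ratio) taken from the hypothetical example
interventions = {"drug": (0.003, 0.97), "exercise": (0.04, 0.65)}

for name, (rd, rr) in interventions.items():
    nnt = round(number_needed_to_treat(rd))
    baseline = round(implied_baseline_risk(rd, rr), 3)
    print(f"{name}: NNT = {nnt}, implied baseline risk = {baseline}")

# Ratio of the two absolute effects
print(round(0.04 / 0.003, 1))  # 13.3
```

On these figures roughly 333 people would need to take the drug to prevent one heart attack, against about 25 for the exercise intervention. The two scenarios imply slightly different baseline risks (about 10% and 11.4%), which is unproblematic for a hypothetical but worth checking when comparing real trials.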
The lesson generalizes. Whenever a research result is reported as "significant," the next question is "how large?" — and the answer requires effect size, not just p-value. The same applies to popular framings of self-help and wellness interventions: many produce statistically detectable effects on small populations, but effect sizes that don't justify the marketing claims being built on them.
How LifeByLogic uses effect sizes
Effect sizes underlie the published research on which all LifeByLogic tools rest. The 2024 Lancet Commission's population-attributable fractions are effect size measures at the population scale; the meta-analytic correlations in the turnover literature (Steel & Ovalle 1984, r = 0.50) are effect sizes at the individual scale. Each tool's methodology page reports the underlying effect sizes explicitly, so users can gauge the strength of evidence behind any recommendation. The full methodology framework is documented on the editorial policy page.
Frequently asked questions
What is effect size?
Effect size is a quantitative measure of the magnitude of a relationship or difference observed in research. It is distinct from statistical significance, which captures only whether the effect is unlikely to be zero. Effect size is the bridge between statistics and substantive meaning: it tells you whether a finding actually matters. The concept was formalized by Jacob Cohen in 1969, and contemporary statistics increasingly treats effect size estimation, rather than significance testing, as the primary inferential goal.
How is effect size different from statistical significance?
Statistical significance tells you whether the observed data would be unlikely if the true effect were zero. Effect size tells you how large the effect is. The two answer different questions, and significance depends on sample size as well as on effect size. Large samples can produce highly significant tiny effects (a treatment that reduces dementia risk by 0.5% can be highly significant in a large trial). Small samples can produce nonsignificant large effects (a treatment that reduces risk by 20% can be statistically inconclusive in a small study). Headlines reporting "significantly reduces X" without effect size information are reporting only half the story.
What are the main effect size measures?
Different measures suit different designs. Cohen's d is the standardized mean difference between two groups (small = 0.2, medium = 0.5, large = 0.8). Pearson's r is the correlation coefficient (-1 to +1; small = 0.1, medium = 0.3, large = 0.5). Odds ratios and risk ratios are common in epidemiology. Eta-squared captures variance explained in ANOVA. Population-attributable fraction (used in the 2024 Lancet Commission on dementia) captures the proportion of cases attributable to a risk factor at population scale. Different measures are not interchangeable.
Are Cohen's small/medium/large benchmarks universal?
No. Cohen himself warned against universalizing them. "Small" and "large" depend heavily on context. An effect size of d = 0.2 in a public health intervention applied to billions of people may be enormously consequential. The 2024 Lancet Commission on dementia identifies modifiable risk factors with population-attributable fractions of 1-7% each. Cohen's benchmarks were defined for measures such as d and r rather than for PAFs, but each of these fractions looks small in isolation. Yet the cumulative population-level effect accounts for approximately 45% of global dementia cases. Small effect sizes at population scale routinely matter more than large effect sizes in narrow lab studies.
Why are single-study effect sizes unreliable?
Original effect sizes are systematically inflated. The Open Science Collaboration's 2015 Reproducibility Project attempted to replicate 100 psychology studies; while 97% of originals reported significant results, only 36% of replications did, and replication effect sizes averaged about half the magnitude of the originals. A 2024 Institute for Replication study of 110 articles in economics and political science continued to find systematic inflation. Reliable effect size estimation typically requires meta-analysis across many studies, with corrections for publication bias.
How should effect sizes be reported?
Modern best practice reports effect sizes with confidence intervals, which capture both the central estimate and the precision of that estimate. A report of "d = 0.3, 95% CI [0.05, 0.55]" tells the reader that the central estimate is small-to-medium but that the data are also consistent with effect sizes ranging from very small to medium-large. The 2025 PMC paper on transparent statistical reporting in psychology introduced the TSRP Checklist as a practical guide for implementing these reforms across hypothesis formulation, sample size, preregistration, and inferential reporting.
How to cite this entry
This entry is intended as a citable scholarly reference. Choose the format that matches your context. The retrieval date should reflect when you accessed the page, which may differ from the entry's last-reviewed date shown above.
APA 7th edition
LifeByLogic. (2026). Effect Size: Why It Matters More Than P-Values. https://lifebylogic.com/glossary/effect-size/
MLA 9th edition
LifeByLogic. "Effect Size: Why It Matters More Than P-Values." LifeByLogic, 17 May 2026, https://lifebylogic.com/glossary/effect-size/.
Chicago (author-date)
LifeByLogic. 2026. "Effect Size: Why It Matters More Than P-Values." May 17. https://lifebylogic.com/glossary/effect-size/.
BibTeX
@misc{lbleffectsize2026,
author = {{LifeByLogic}},
title = {Effect Size: Why It Matters More Than P-Values},
year = {2026},
month = {may},
publisher = {LifeByLogic},
url = {https://lifebylogic.com/glossary/effect-size/},
note = {Accessed: 2026-05-17}
}
This entry is educational and is not medical, psychological, financial, or professional advice. The concepts and research described here are intended to support informed personal reflection, not to diagnose or treat any condition or to recommend specific decisions. People with concerns that affect their health, finances, careers, or relationships should consult a qualified professional. See our editorial policy and disclaimer for the broader framework.