Statistical Significance Is Not the Same as Business Significance
Statistical significance in marketing campaigns tells you whether a result is likely real or just noise. It does not tell you whether that result matters. That distinction sounds obvious, but it gets ignored constantly, often by people who should know better.
A test can be statistically significant and commercially irrelevant at the same time. A 0.3% lift in click-through rate might clear a 95% confidence threshold with enough traffic, but if it doesn’t move revenue, it doesn’t move the business. The number is real. The implication is not.
Key Takeaways
- Statistical significance confirms a result is unlikely to be random chance. It says nothing about whether the result is worth acting on commercially.
- Sample size is the variable most marketers underestimate. Too small and your test is meaningless. Too large and trivial differences become “significant.”
- Running multiple simultaneous tests without correction inflates your false positive rate. This is how teams convince themselves they’re winning when they’re not.
- The 95% confidence threshold is a convention, not a law. Some decisions warrant 99%. Others are fine at 90%. Context determines the threshold, not habit.
- Business significance, effect size, and practical impact should sit alongside statistical significance in every test readout, not be treated as optional extras.
In This Article
- Why Marketers Keep Getting This Wrong
- What Statistical Significance Actually Means
- The Sample Size Problem Nobody Talks About Honestly
- Multiple Testing and the False Positive Trap
- Effect Size: The Number That Actually Tells You Something
- When 95% Confidence Is the Wrong Threshold
- The Attribution Problem That Sits Underneath All of This
- What Good Experimental Practice Actually Looks Like
- The Honest Limitation: Significance Can’t Fix a Bad Product
Why Marketers Keep Getting This Wrong
I spent years inside agencies where A/B testing was treated as a proxy for rigour. If you ran a test and it came back significant, you shipped the winner. The process felt scientific. The results often weren’t.
The problem wasn’t a lack of intelligence. It was a lack of statistical literacy combined with tools that made it too easy to get a number and too hard to interrogate what that number actually meant. Most A/B testing platforms are designed to give you an answer. They are not designed to tell you whether your question was worth asking.
This matters more now than it did a decade ago. Marketing teams are running more tests, faster, across more channels. The infrastructure for experimentation has improved dramatically. The thinking behind it, in many organisations, has not kept pace.
If you’re building a go-to-market approach that relies on testing and iteration, the quality of your conclusions is only as good as the quality of your statistical reasoning. The broader principles behind that are worth exploring in the Go-To-Market and Growth Strategy hub, where I cover the commercial frameworks that sit underneath decisions like these.
What Statistical Significance Actually Means
Let’s be precise. When a test result is described as statistically significant at the 95% confidence level, it means that, if there were genuinely no difference between your variants, the probability of seeing a result at least this extreme would be below 5%. That’s it. Nothing more.
It does not mean your variant is better. It does not mean the effect will hold at scale. It does not mean the improvement is large enough to matter. It means the result is unlikely to be explained by random variation alone, given your sample size and the magnitude of the observed difference.
The p-value, which is what most testing tools report, is the probability of observing your data (or something more extreme) if the null hypothesis were true. The null hypothesis is usually that there is no difference between your control and variant. A p-value below 0.05 is conventionally treated as the threshold for significance. This threshold was largely established by convention in academic statistics, not derived from first principles about marketing decisions.
Understanding this properly changes how you read test results. A p-value of 0.049 and a p-value of 0.051 are not meaningfully different. Treating one as a win and the other as a failure is a category error that has real commercial consequences.
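To make that concrete, here is a minimal sketch of the two-proportion z-test that most A/B testing tools run under the hood. The function name and the traffic figures are illustrative, not taken from any particular platform, and the normal approximation is assumed to be adequate at these sample sizes:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided tail probability

# 10,000 visitors per variant: 500 vs 560 conversions, then 500 vs 565
print(two_proportion_p_value(500, 10_000, 560, 10_000))  # ≈ 0.059, "not significant"
print(two_proportion_p_value(500, 10_000, 565, 10_000))  # ≈ 0.041, "significant"
```

Five extra conversions out of ten thousand visitors move the result across the 0.05 line. That is exactly why reading 0.049 as a win and 0.051 as a failure makes no sense: the underlying evidence is almost identical.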
The Sample Size Problem Nobody Talks About Honestly
Sample size is where most marketing tests fall apart. Too small a sample and you lack the statistical power to detect a real effect. Too large a sample and you will detect effects that are real but trivial.
I’ve sat in review meetings where a team has called a test after four days because it hit significance. The traffic was there, the numbers looked clean, and the variant was ahead. Six weeks later, when the change was fully rolled out, the improvement had evaporated. The test had been run on a Monday-to-Thursday window during a promotional period. The “winner” was measuring the promotional lift, not the variant’s effect.
The correct approach is to calculate your required sample size before you run the test, not after. You need three inputs: your baseline conversion rate, the minimum detectable effect you care about, and your chosen significance level. There are calculators that will give you a sample size from those inputs. The discipline is committing to that number before you look at the data.
The minimum detectable effect is where commercial judgment enters. If a 5% improvement in conversion rate would meaningfully change your unit economics, design your test to detect a 5% improvement. If a 5% improvement would be commercially irrelevant given your volumes, you probably shouldn’t be testing that variable at all. You’re spending experimental budget on a question that doesn’t matter.
This connects to something I noticed when judging the Effie Awards. The campaigns that demonstrated genuine effectiveness almost always had a clear commercial hypothesis before execution, not just a creative idea with measurement bolted on afterward. The same logic applies to experimentation. Start with the business question, then design the test.
Multiple Testing and the False Positive Trap
Here is a scenario that plays out in growth teams everywhere. You run ten tests simultaneously across your funnel. Three of them come back significant. You ship all three winners and attribute the subsequent revenue improvement to your testing programme.
What you may not have accounted for is that if you run ten independent tests at the 95% confidence level and there is genuinely no effect in any of them, the probability of at least one spurious significant result is roughly 40%, not 5%. On average, you should expect one false positive for every twenty true-null tests you run.
This is the multiple comparisons problem, sometimes called the problem of multiple testing. It is not exotic statistics. It is a straightforward consequence of how probability works, and it is routinely ignored in marketing experimentation because testing platforms don’t surface it and most reporting frameworks don’t account for it.
There are corrections for this, the Bonferroni correction being the most widely known, though it is conservative. More practically, you can reduce the problem by being selective about which tests you run simultaneously, by requiring replication of significant results before shipping, and by being appropriately sceptical of any result that surprises you. Surprising results are more likely to be false positives than results that confirm a well-reasoned hypothesis.
The discipline of pre-registration, committing your hypothesis and analysis plan before you look at data, is standard in clinical research for exactly this reason. It’s worth borrowing. Forrester has written about the organisational conditions needed to sustain rigorous experimentation at scale, and the structural discipline around test design is consistently underweighted.
Effect Size: The Number That Actually Tells You Something
Statistical significance is a binary judgment. Effect size is a continuous measure of how large the difference actually is. Both matter. Most marketing reporting leads with significance and buries or omits effect size entirely.
Effect size answers the question: even if this result is real, is it large enough to care about? A statistically significant improvement of 0.1% in conversion rate on a product with thin margins and modest volume is not worth the engineering time to ship. A statistically significant improvement of 12% in the same metric is worth taking seriously even if your confidence interval is wider than you’d like.
When I was running performance marketing across accounts with hundreds of millions in annual spend, the teams that made the best decisions were the ones that had learned to read confidence intervals, not just point estimates. A point estimate tells you the most likely value of the effect. A confidence interval tells you the plausible range. If your 95% confidence interval for a conversion rate improvement runs from 0.2% to 18%, the point estimate in the middle is almost meaningless. You don’t know if you’ve found something small or something substantial.
Reporting effect sizes with confidence intervals is a discipline that takes about ten minutes to learn and produces substantially better decisions. It is not standard practice in most marketing teams. It should be.
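Reading intervals rather than point estimates is easy to mechanise. A minimal sketch using the Wald approximation (an assumption on my part; exact methods such as Wilson or bootstrap intervals differ slightly at small samples, and the figures below are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference in conversion rates
    (variant minus control), in absolute percentage points."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # 1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Control 400/10,000 (4.0%) vs variant 460/10,000 (4.6%): point estimate +0.6pp
lo, hi = diff_confidence_interval(400, 10_000, 460, 10_000)
print(f"95% CI: {lo:+.4f} to {hi:+.4f}")  # ≈ +0.0004 to +0.0116
```

The interval excludes zero, so the result is "significant", but the plausible lift runs from a commercially trivial +0.04 percentage points to a substantial +1.16. Reporting only the +0.6pp point estimate hides exactly the uncertainty the decision-maker needs to see.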
When 95% Confidence Is the Wrong Threshold
The 95% confidence threshold is a default, not a principle. The appropriate threshold depends on the cost of being wrong in each direction.
If you’re testing a subject line change on a mid-tier email campaign and the cost of shipping a false positive is minimal, 90% confidence might be entirely appropriate. You’re making a low-stakes, reversible decision. Being slightly more permissive with uncertainty costs you little.
If you’re testing a pricing change that will be applied across your entire customer base and rolling it back is operationally complex, you probably want 99% confidence before you act. The asymmetry of consequences warrants a higher bar.
This sounds obvious when stated plainly. In practice, most teams apply 95% everywhere because that’s what the tool defaults to and nobody has thought carefully about whether it’s appropriate for the specific decision they’re making. The threshold should be a deliberate choice, not an inherited setting.
There’s a broader point here about how growth strategy and market penetration decisions get made. Semrush’s overview of market penetration strategy touches on the commercial conditions under which different levels of risk tolerance make sense. The same logic applies to experimental design: your risk tolerance should be calibrated to the commercial context, not set by convention.
The Attribution Problem That Sits Underneath All of This
Statistical significance in A/B testing assumes you’re measuring the right thing. Attribution problems mean you often aren’t.
Earlier in my career, I overweighted lower-funnel performance signals. Click-through rates, last-click conversions, cost per acquisition from paid search. These numbers were clean and available, and they responded to the levers I could pull. What I underappreciated was how much of what those channels were “converting” was demand that already existed. Someone who searched for your brand name was probably going to buy from you anyway. The paid click captured the intent. It didn’t create it.
When you run a test on a lower-funnel channel and it shows a significant improvement, you need to ask whether you’re measuring the channel’s contribution to demand or just its efficiency at capturing demand that was already there. These are different things, and conflating them leads to systematic overinvestment in capture and underinvestment in creation.
The same logic applies to incrementality testing, which is the more honest cousin of standard A/B testing. Incrementality tests ask not “did the variant outperform the control” but “would this conversion have happened without our intervention.” That’s the commercially relevant question, and it’s harder to answer. Semrush’s breakdown of growth examples illustrates how teams that focus on genuine incremental growth tend to build more durable positions than those optimising for captured demand.
BCG’s work on go-to-market pricing strategy makes a related point: the decisions that look cleanest in the data are not always the decisions that create the most value. Analytical rigour has to be paired with commercial judgment, or you end up optimising for the measurable at the expense of the important.
What Good Experimental Practice Actually Looks Like
I’ve seen experimentation programmes that generated genuine commercial value and ones that generated impressive-looking dashboards with no corresponding business improvement. The difference almost always came down to process discipline rather than technical sophistication.
Good experimental practice starts with a written hypothesis before any test runs. Not “let’s test the green button” but “we believe that changing the CTA colour from grey to green will increase click-through rate by at least 8% among mobile users, because our session recordings show high abandonment at this point and our qualitative research suggests the current CTA lacks visual prominence.” That specificity forces you to think about what you’re actually testing and why.
It continues with a pre-calculated sample size based on a minimum detectable effect that is commercially meaningful. It runs for a predetermined duration that covers at least one full business cycle, usually a week at minimum, to account for day-of-week variation. It evaluates the primary metric and a small number of pre-specified secondary metrics. It does not go looking for significance in subgroups after the fact.
When the test ends, the result is read against the pre-specified hypothesis, not reverse-engineered to find something worth reporting. If the primary metric didn’t move significantly, that’s a valid result. A well-designed null result tells you something useful: this variable probably isn’t the lever you thought it was. That’s worth knowing.
Hotjar’s work on growth loops captures something important here: sustainable growth comes from understanding user behaviour deeply enough to form good hypotheses, not from running more tests. The volume of experimentation matters far less than the quality of the thinking that precedes it.
The commercial rigour that sits behind good experimentation is part of a broader discipline around growth strategy. If you’re building or refining your approach to go-to-market planning, the Go-To-Market and Growth Strategy hub covers the frameworks and commercial thinking that connect experimental practice to business outcomes.
The Honest Limitation: Significance Can’t Fix a Bad Product
There’s a version of this conversation I’ve had many times, usually with a founder or a CMO who has built an impressive testing programme and is frustrated that it isn’t moving the needle on growth. The experimentation is rigorous. The results are real. The business isn’t growing.
The uncomfortable answer is often that you’re optimising the wrong thing. If customers aren’t converting because the product doesn’t solve their problem well enough, no amount of button colour testing will fix that. If retention is poor because the experience disappoints relative to the promise, improving your onboarding email sequence will slow the bleed but won’t stop it.
Marketing optimisation, including rigorous experimentation, is most valuable when the underlying product and experience are genuinely good. When they’re not, marketing is often a blunt instrument being used to compensate for something more fundamental. The statistical significance of your test results is irrelevant if you’re optimising a leaky bucket.
BCG’s research on go-to-market strategy in financial services makes a point that generalises well: the most effective go-to-market approaches are built around a genuine understanding of customer needs, not around optimising the mechanics of acquisition. Experimentation in service of a clear customer value proposition is powerful. Experimentation in service of extracting more from a weak proposition has diminishing returns.
That’s not an argument against experimentation. It’s an argument for being clear about what problem you’re actually trying to solve before you design your first test.
About the Author
Keith Lacy is a marketing strategist and former agency CEO with 20+ years of experience across agency leadership, performance marketing, and commercial strategy. He writes The Marketing Juice to cut through the noise and share what works.
