Statistical Significance Is Not the Same as Business Significance

Statistical significance in marketing campaigns tells you whether a result is likely real or just noise. It does not tell you whether that result matters. That distinction sounds obvious, but it gets ignored constantly, often by people who should know better.

A test can be statistically significant and commercially irrelevant at the same time. A 0.3% lift in click-through rate might clear a 95% confidence threshold with enough traffic, but if it doesn’t move revenue, it doesn’t move the business. The number is real. The implication is not.

Key Takeaways

  • Statistical significance confirms a result is unlikely to be random chance. It says nothing about whether the result is worth acting on commercially.
  • Sample size is the variable most marketers underestimate. Too small and your test can’t detect the effects you care about. Too large and trivial differences become “significant.”
  • Running multiple simultaneous tests without correction inflates your false positive rate. This is how teams convince themselves they’re winning when they’re not.
  • The 95% confidence threshold is a convention, not a law. Some decisions warrant 99%. Others are fine at 90%. Context determines the threshold, not habit.
  • Business significance, effect size, and practical impact should sit alongside statistical significance in every test readout, not be treated as optional extras.

Why Marketers Keep Getting This Wrong

I spent years inside agencies where A/B testing was treated as a proxy for rigour. If you ran a test and it came back significant, you shipped the winner. The process felt scientific. The results often weren’t.

The problem wasn’t a lack of intelligence. It was a lack of statistical literacy combined with tools that made it too easy to get a number and too hard to interrogate what that number actually meant. Most A/B testing platforms are designed to give you an answer. They are not designed to tell you whether your question was worth asking.

This matters more now than it did a decade ago. Marketing teams are running more tests, faster, across more channels. The infrastructure for experimentation has improved dramatically. The thinking behind it, in many organisations, has not kept pace.

If you’re building a go-to-market approach that relies on testing and iteration, the quality of your conclusions is only as good as the quality of your statistical reasoning. The broader principles behind that are worth exploring in the Go-To-Market and Growth Strategy hub, where I cover the commercial frameworks that sit underneath decisions like these.

What Statistical Significance Actually Means

Let’s be precise. When a test result is described as statistically significant at the 95% confidence level, it means there was at most a 5% probability of seeing a result at least this extreme if there were actually no difference between your variants. That’s it. Nothing more.

It does not mean your variant is better. It does not mean the effect will hold at scale. It does not mean the improvement is large enough to matter. It means the result is unlikely to be explained by random variation alone, given your sample size and the magnitude of the observed difference.

The p-value, which is what most testing tools report, is the probability of observing your data (or something more extreme) if the null hypothesis were true. The null hypothesis is usually that there is no difference between your control and variant. A p-value below 0.05 is conventionally treated as statistically significant. That threshold was established by convention in academic statistics, not derived from first principles about marketing decisions.

Understanding this properly changes how you read test results. A p-value of 0.049 and a p-value of 0.051 are not meaningfully different. Treating one as a win and the other as a failure is a category error that has real commercial consequences.
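
To make that concrete, here is a minimal Python sketch of the kind of calculation most testing tools run under the hood, a two-proportion z-test via statsmodels. The traffic and conversion figures are invented for illustration.

```python
# Minimal sketch of a two-proportion z-test, the calculation behind most
# A/B testing tools' p-values. Figures are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 525]      # control, variant
visitors = [10_000, 10_000]   # traffic per variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# p is the probability of seeing a difference at least this large if both
# variants truly converted at the same rate. Nothing more than that.
```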

The Sample Size Problem Nobody Talks About Honestly

Sample size is where most marketing tests fall apart. Too small a sample and you lack the statistical power to detect a real effect. Too large a sample and you will detect effects that are real but trivial.

I’ve sat in review meetings where a team has called a test after four days because it hit significance. The traffic was there, the numbers looked clean, and the variant was ahead. Six weeks later, when the change was fully rolled out, the improvement had evaporated. The test had been run on a Monday-to-Thursday window during a promotional period. The “winner” was measuring the promotional lift, not the variant’s effect.

The correct approach is to calculate your required sample size before you run the test, not after. You need three inputs: your baseline conversion rate, the minimum detectable effect you care about, and your chosen significance level. There are calculators that will give you a sample size from those inputs. The discipline is committing to that number before you look at the data.
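
For illustration, here is a sketch of that pre-test calculation in Python using statsmodels’ power analysis. The baseline rate and minimum detectable effect below are placeholders, not recommendations, and note that alongside the three inputs above the calculation also needs a target statistical power, conventionally 80%.

```python
# Sketch of a pre-test sample size calculation. Assumes a 4% baseline
# conversion rate and a minimum detectable effect of a 10% relative lift.
# Substitute your own commercially meaningful figures.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                      # current conversion rate
mde_relative = 0.10                  # smallest lift worth detecting (10% relative)
target = baseline * (1 + mde_relative)

effect = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,          # significance level
    power=0.80,          # chance of detecting a real effect of this size
    alternative="two-sided",
)
print(f"Required visitors per variant: {n_per_variant:,.0f}")
```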

The minimum detectable effect is where commercial judgment enters. If a 5% improvement in conversion rate would meaningfully change your unit economics, design your test to detect a 5% improvement. If a 5% improvement would be commercially irrelevant given your volumes, you probably shouldn’t be testing that variable at all. You’re spending experimental budget on a question that doesn’t matter.

This connects to something I noticed when judging the Effie Awards. The campaigns that demonstrated genuine effectiveness almost always had a clear commercial hypothesis before execution, not just a creative idea with measurement bolted on afterward. The same logic applies to experimentation. Start with the business question, then design the test.

Multiple Testing and the False Positive Trap

Here is a scenario that plays out in growth teams everywhere. You run ten tests simultaneously across your funnel. Three of them come back significant. You ship all three winners and attribute the subsequent revenue improvement to your testing programme.

What you may not have accounted for is that running ten independent tests at the 95% confidence level, with genuinely no effect in any of them, still leaves you with roughly a 40% chance of at least one spurious significant result, because 0.95 raised to the tenth power is only about 0.60. The 5% false positive rate applies to each test individually, not to the programme as a whole.

This is the multiple comparisons problem, sometimes called the problem of multiple testing. It is not exotic statistics. It is a straightforward consequence of how probability works, and it is routinely ignored in marketing experimentation because testing platforms don’t surface it and most reporting frameworks don’t account for it.

There are corrections for this, the Bonferroni correction being the most widely known, though it is conservative. More practically, you can reduce the problem by being selective about which tests you run simultaneously, by requiring replication of significant results before shipping, and by being appropriately sceptical of any result that surprises you. Surprising results are more likely to be false positives than results that confirm a well-reasoned hypothesis.
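
A back-of-envelope sketch makes the scale of the problem, and the effect of a Bonferroni adjustment, easy to see. The ten-test scenario is the hypothetical one above.

```python
# Back-of-envelope view of the multiple comparisons problem.
alpha = 0.05
n_tests = 10

# Probability of at least one false positive across ten independent tests
# when no variant has any real effect:
family_wise_error = 1 - (1 - alpha) ** n_tests
print(f"Chance of at least one spurious 'winner': {family_wise_error:.0%}")  # ~40%

# Bonferroni correction: divide the per-test threshold by the number of tests.
# Conservative, but it caps the family-wise error rate back near 5%.
bonferroni_alpha = alpha / n_tests
print(f"Per-test threshold after correction: {bonferroni_alpha:.3f}")  # 0.005
```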

The discipline of pre-registration, committing your hypothesis and analysis plan before you look at data, is standard in clinical research for exactly this reason. It’s worth borrowing. Forrester has written about the organisational conditions needed to sustain rigorous experimentation at scale, and the structural discipline around test design is consistently underweighted.

Effect Size: The Number That Actually Tells You Something

Statistical significance is a binary judgment. Effect size is a continuous measure of how large the difference actually is. Both matter. Most marketing reporting leads with significance and buries or omits effect size entirely.

Effect size answers the question: even if this result is real, is it large enough to care about? A statistically significant improvement of 0.1% in conversion rate on a product with thin margins and modest volume is not worth the engineering time to ship. A statistically significant improvement of 12% in the same metric is worth taking seriously even if your confidence interval is wider than you’d like.

When I was running performance marketing across accounts with hundreds of millions in annual spend, the teams that made the best decisions were the ones that had learned to read confidence intervals, not just point estimates. A point estimate tells you the most likely value of the effect. A confidence interval tells you the plausible range. If your 95% confidence interval for a conversion rate improvement runs from 0.2% to 18%, the point estimate in the middle is almost meaningless. You don’t know if you’ve found something small or something substantial.
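
As a sketch of what that looks like in practice, here is a point estimate with a 95% confidence interval for the difference between two conversion rates, using a simple normal (Wald) approximation. The figures are invented.

```python
# Sketch: effect size with a 95% confidence interval for the difference in
# conversion rates, using a normal (Wald) approximation. Figures are invented.
import math

conv_a, n_a = 480, 10_000    # control conversions, visitors
conv_b, n_b = 525, 10_000    # variant conversions, visitors

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a                                   # point estimate of the lift
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = 1.96                                           # ~95% two-sided

low, high = diff - z * se, diff + z * se
print(f"Lift: {diff:+.2%}  (95% CI {low:+.2%} to {high:+.2%})")
# If the interval runs from trivially small to commercially meaningful,
# the point estimate alone should not drive the decision.
```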

Reporting effect sizes with confidence intervals is a discipline that takes about ten minutes to learn and produces substantially better decisions. It is not standard practice in most marketing teams. It should be.

When 95% Confidence Is the Wrong Threshold

The 95% confidence threshold is a default, not a principle. The appropriate threshold depends on the cost of being wrong in each direction.

If you’re testing a subject line change on a mid-tier email campaign and the cost of shipping a false positive is minimal, 90% confidence might be entirely appropriate. You’re making a low-stakes, reversible decision. Being slightly more permissive with uncertainty costs you little.

If you’re testing a pricing change that will be applied across your entire customer base and rolling it back is operationally complex, you probably want 99% confidence before you act. The asymmetry of consequences warrants a higher bar.

This sounds obvious when stated plainly. In practice, most teams apply 95% everywhere because that’s what the tool defaults to and nobody has thought carefully about whether it’s appropriate for the specific decision they’re making. The threshold should be a deliberate choice, not an inherited setting.

There’s a broader point here about how growth strategy and market penetration decisions get made. Semrush’s overview of market penetration strategy touches on the commercial conditions under which different levels of risk tolerance make sense. The same logic applies to experimental design: your risk tolerance should be calibrated to the commercial context, not set by convention.

The Attribution Problem That Sits Underneath All of This

Statistical significance in A/B testing assumes you’re measuring the right thing. Attribution problems mean you often aren’t.

Earlier in my career, I overweighted lower-funnel performance signals. Click-through rates, last-click conversions, cost per acquisition from paid search. These numbers were clean and available, and they responded to the levers I could pull. What I underappreciated was how much of what those channels were “converting” was demand that already existed. Someone who searched for your brand name was probably going to buy from you anyway. The paid click captured the intent. It didn’t create it.

When you run a test on a lower-funnel channel and it shows a significant improvement, you need to ask whether you’re measuring the channel’s contribution to demand or just its efficiency at capturing demand that was already there. These are different things, and conflating them leads to systematic overinvestment in capture and underinvestment in creation.

The same logic applies to incrementality testing, which is the more honest cousin of standard A/B testing. Incrementality tests ask not “did the variant outperform the control” but “would this conversion have happened without our intervention.” That’s the commercially relevant question, and it’s harder to answer. Semrush’s breakdown of growth examples illustrates how teams that focus on genuine incremental growth tend to build more durable positions than those optimising for captured demand.

BCG’s work on go-to-market pricing strategy makes a related point: the decisions that look cleanest in the data are not always the decisions that create the most value. Analytical rigour has to be paired with commercial judgment, or you end up optimising for the measurable at the expense of the important.

What Good Experimental Practice Actually Looks Like

I’ve seen experimentation programmes that generated genuine commercial value and ones that generated impressive-looking dashboards with no corresponding business improvement. The difference almost always came down to process discipline rather than technical sophistication.

Good experimental practice starts with a written hypothesis before any test runs. Not “let’s test the green button” but “we believe that changing the CTA colour from grey to green will increase click-through rate by at least 8% among mobile users, because our session recordings show high abandonment at this point and our qualitative research suggests the current CTA lacks visual prominence.” That specificity forces you to think about what you’re actually testing and why.

It continues with a pre-calculated sample size based on a minimum detectable effect that is commercially meaningful. It runs for a predetermined duration that covers at least one full business cycle, usually a week at minimum, to account for day-of-week variation. It evaluates the primary metric and a small number of pre-specified secondary metrics. It does not go looking for significance in subgroups after the fact.

When the test ends, the result is read against the pre-specified hypothesis, not reverse-engineered to find something worth reporting. If the primary metric didn’t move significantly, that’s a valid result. A well-designed null result tells you something useful: this variable probably isn’t the lever you thought it was. That’s worth knowing.

Hotjar’s work on growth loops captures something important here: sustainable growth comes from understanding user behaviour deeply enough to form good hypotheses, not from running more tests. The volume of experimentation matters far less than the quality of the thinking that precedes it.

The commercial rigour that sits behind good experimentation is part of a broader discipline around growth strategy. If you’re building or refining your approach to go-to-market planning, the Go-To-Market and Growth Strategy hub covers the frameworks and commercial thinking that connect experimental practice to business outcomes.

The Honest Limitation: Significance Can’t Fix a Bad Product

There’s a version of this conversation I’ve had many times, usually with a founder or a CMO who has built an impressive testing programme and is frustrated that it isn’t moving the needle on growth. The experimentation is rigorous. The results are real. The business isn’t growing.

The uncomfortable answer is often that you’re optimising the wrong thing. If customers aren’t converting because the product doesn’t solve their problem well enough, no amount of button colour testing will fix that. If retention is poor because the experience disappoints relative to the promise, improving your onboarding email sequence will slow the bleed but won’t stop it.

Marketing optimisation, including rigorous experimentation, is most valuable when the underlying product and experience are genuinely good. When they’re not, marketing is often a blunt instrument being used to compensate for something more fundamental. The statistical significance of your test results is irrelevant if you’re optimising a leaky bucket.

BCG’s research on go-to-market strategy in financial services makes a point that generalises well: the most effective go-to-market approaches are built around a genuine understanding of customer needs, not around optimising the mechanics of acquisition. Experimentation in service of a clear customer value proposition is powerful. Experimentation in service of extracting more from a weak proposition has diminishing returns.

That’s not an argument against experimentation. It’s an argument for being clear about what problem you’re actually trying to solve before you design your first test.

About the Author

Keith Lacy is a marketing strategist and former agency CEO with 20+ years of experience across agency leadership, performance marketing, and commercial strategy. He writes The Marketing Juice to cut through the noise and share what works.

Frequently Asked Questions

What is statistical significance in marketing campaigns?
Statistical significance in marketing campaigns is a measure of whether an observed difference between two variants, such as a control and a test version of an ad or landing page, is likely to be real rather than a product of random chance. A result is typically described as statistically significant when the probability of observing it by chance alone falls below a chosen threshold, most commonly 5%. It tells you the result is probably real. It does not tell you whether the result is large enough to matter commercially.
How much traffic do you need to run a statistically significant A/B test?
The required sample size depends on three factors: your baseline conversion rate, the minimum improvement you want to be able to detect, and your chosen confidence level. Lower baseline rates and smaller minimum detectable effects require larger samples. As a general principle, you should calculate your required sample size before the test begins, using a sample size calculator, and commit to running the test until that sample is reached. Stopping early when results look promising is one of the most common sources of false positives in marketing experimentation.
What is the difference between statistical significance and business significance?
Statistical significance tells you a result is unlikely to be random. Business significance tells you whether the result is large enough to matter commercially. A test can be statistically significant and commercially irrelevant at the same time, particularly when sample sizes are large enough to detect very small effects. Business significance requires you to evaluate effect size and practical impact alongside the significance threshold. A 0.2% improvement in conversion rate may be statistically real but commercially meaningless depending on your volumes and margins.
Why do marketing A/B tests produce false positives?
Marketing A/B tests produce false positives for several reasons. Stopping tests early when they happen to show a positive result is one of the most common causes. Running multiple simultaneous tests without adjusting the significance threshold inflates the overall false positive rate across the programme. Testing during atypical periods, such as promotional windows or seasonal peaks, can produce results that don’t hold in normal conditions. Analysing subgroups after the fact to find something significant is another common source of spurious results. Pre-specifying your hypothesis, sample size, and analysis plan before the test begins reduces all of these risks.
Should you always use a 95% confidence threshold for marketing tests?
No. The appropriate confidence threshold depends on the cost of being wrong in each direction. For low-stakes, easily reversible decisions such as a subject line test on a small email segment, a 90% threshold may be entirely reasonable. For high-stakes decisions with significant rollback costs, such as a pricing change applied across your full customer base, a 99% threshold is more appropriate. The 95% default is a convention inherited from academic statistics, not a principle derived from marketing decision-making. The threshold should be set deliberately based on the specific commercial context of each test.
