A/B Testing: What It Is and Why Most Teams Do It Wrong
A/B testing is the practice of running two versions of something (a webpage, an email, an ad, a call-to-action) against each other simultaneously to determine which performs better with real users. One version goes to one segment of your audience, the other version goes to a second segment, and you measure the outcome you care about: clicks, conversions, sign-ups, purchases. The version that wins becomes the control for the next test.
That is the clean definition. The messier truth is that most teams run A/B tests the wrong way, draw conclusions too early, test the wrong things, and use the results to justify decisions they had already made. The mechanics are simple. The discipline required to do it properly is not.
Key Takeaways
- A/B testing only produces reliable results when you have sufficient traffic, a clear hypothesis, and the patience to reach statistical significance before calling a winner.
- Most teams test cosmetic changes first because they are easy, not because they move the needle. The highest-impact tests usually involve copy, offer structure, or page layout.
- Peeking at results early and stopping a test when it looks good is one of the most common and costly mistakes in CRO. It produces false positives that erode trust in the process.
- A/B testing is a measurement discipline before it is a conversion discipline. If your tracking is unreliable, your test results are unreliable.
- The goal of a testing programme is not to win individual tests. It is to build a compounding body of knowledge about what your audience responds to.
In This Article
- What A/B Testing Actually Measures
- The Anatomy of a Proper A/B Test
- What to Test and What to Leave Alone
- The Relationship Between A/B Testing and User Experience
- Where A/B Testing Fits in the Conversion Funnel
- The Measurement Problem Underneath A/B Testing
- Common A/B Testing Mistakes That Waste Time and Money
- How to Build a Testing Programme That Compounds Over Time
- A/B Testing Tools: What to Look For
- What Good A/B Testing Results Look Like in Practice
I spent years managing large media budgets across performance channels before I fully appreciated how little most teams actually knew about what was working. Not because the data was absent, but because the discipline around interpreting it was weak. A/B testing, done properly, is one of the few tools in marketing that forces you to be honest with yourself. It is harder than it looks, and that is exactly why it matters.
What A/B Testing Actually Measures
The most important thing to understand about A/B testing is what it does and does not tell you. A test tells you that, under these conditions, with this audience, at this point in time, version B produced more of the outcome you measured than version A. That is all it tells you.
It does not tell you why. It does not tell you whether the result will hold next month. It does not tell you whether the winning variant would perform better with a different audience segment. And it absolutely does not tell you that the thing you tested was the most important variable on the page.
This matters because teams routinely over-generalise from test results. They run a single test on a button colour, declare that green outperforms red, and write it into their design standards. Three years later, nobody remembers it was one test on one page with one audience segment at one point in the commercial calendar. It has become received wisdom.
The conversion rate optimisation work that produces lasting commercial impact is built on a programme of tests, not individual experiments. If you want to understand the broader discipline that A/B testing sits within, our CRO and Testing Hub covers the full picture, from measurement foundations through to running structured testing programmes at scale.
What a well-run A/B test does give you is a reliable, directional signal about user behaviour in a controlled environment. That is genuinely valuable. It is just not the oracle that the conversion optimisation industry sometimes presents it as.
The Anatomy of a Proper A/B Test
There are five components that every A/B test needs to have before you should trust the results. Most failing tests are missing at least one of them.
A Specific, Falsifiable Hypothesis
A hypothesis is not “let’s see if a shorter form converts better.” A hypothesis is: “Reducing the contact form from seven fields to three will increase form completions because we are asking for information the user does not need to provide at this stage of the funnel.” The hypothesis names the change, predicts the direction of the outcome, and gives a reason based on user behaviour. If you cannot articulate the reason, you are not testing a hypothesis. You are guessing.
The reason matters because it is how you learn. If the test confirms your hypothesis, you have evidence for the underlying mechanism. If it disconfirms it, you have a reason to investigate further. If you had no reason to begin with, a failed test teaches you nothing except that this particular change did not work this time.
A Single Variable
Classic A/B testing changes one thing at a time. One headline. One image. One call-to-action. One layout. The moment you change two or more things simultaneously, you lose the ability to attribute the result to a specific cause. You might know that variant B won, but you do not know whether it was the headline, the image, or the button text that drove the difference.
Multivariate testing exists precisely for situations where you want to test multiple variables simultaneously and understand how they interact. Optimizely has written clearly about the interaction effects that multivariate testing can surface, which standard A/B tests will miss entirely. But multivariate testing requires significantly more traffic to reach reliable conclusions: three elements with two variants each produce eight combinations, and every combination needs its own sample. That is why most teams should default to A/B until their volume justifies the complexity. Tools like Unbounce now offer both formats within the same platform, which makes the choice more accessible, but the traffic requirement does not change.
Sufficient Traffic and a Pre-Determined Sample Size
This is where most tests fail, not because teams do not understand statistics, but because they do not have the patience to wait. You need to calculate your required sample size before the test starts, based on your current baseline conversion rate, the minimum detectable effect you care about, and your desired confidence level. Then you run the test until you hit that sample size. Not until it looks like one version is winning.
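If you want to sanity-check what a sample size calculator is doing, the arithmetic is not complicated. Here is a minimal sketch in Python using the standard normal-approximation formula for a two-sided, two-proportion test, with the conventional defaults of 95% confidence and 80% power; the example numbers are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect an absolute lift of mde_abs
    over the baseline conversion rate, using the normal approximation
    for a two-sided two-proportion test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # 1.96 at 95% confidence
    z_power = z.inv_cdf(power)           # 0.84 at 80% power
    p1, p2 = baseline, baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

# Example: 5% baseline, and the smallest lift you care about is one
# absolute point (5% -> 6%)
print(sample_size_per_variant(0.05, 0.01))  # 8155 visitors per variant
```

Note how demanding the numbers are: detecting a one-point lift on a 5% baseline requires roughly 8,000 visitors per variant before you are entitled to an opinion about the result.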
Peeking at results and stopping a test early when it looks promising is called optional stopping, and it is the single most common cause of false positives in A/B testing. The statistical fluctuations in the early stages of a test can look convincing. They are not. I have seen teams call tests after three days on low-traffic pages, implement the “winning” variant, and then watch their conversion rate drift back to baseline within a fortnight. The test was noise, not signal.
Statistical Significance, Interpreted Correctly
Most A/B testing tools report statistical significance as a percentage, typically 95% confidence. What this means is that if the null hypothesis were true (i.e., there is no real difference between variants), you would see a result this extreme or more extreme only 5% of the time by chance. It does not mean you are 95% certain that variant B is better. That is a subtle but important distinction.
95% confidence is the industry standard, but it is worth knowing that if you run 20 tests at that threshold, chance alone will hand you roughly one false positive. This is why a testing programme matters more than any individual test result. Patterns that replicate across multiple tests carry far more weight than a single statistically significant result.
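The arithmetic behind that claim is worth seeing once. A quick sketch, assuming the tests are independent and each runs at a 5% false-positive rate:

```python
alpha = 0.05   # false-positive rate per test at 95% confidence
tests = 20

expected_false_positives = alpha * tests     # 1.0 across the programme
p_at_least_one = 1 - (1 - alpha) ** tests    # ~0.64

print(f"Expected false positives across {tests} tests: {expected_false_positives:.0f}")
print(f"Probability of at least one: {p_at_least_one:.0%}")  # ~64%
```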
A Clear Primary Metric
Every test needs one primary metric that determines the winner. Not three metrics. Not “we will look at conversion rate, time on page, and bounce rate.” One metric. If you track multiple metrics, you will find that some move in your favour and some do not, and you will be tempted to cherry-pick the ones that support the conclusion you wanted. Define the primary metric before the test starts and hold to it.
What to Test and What to Leave Alone
There is a hierarchy of testing impact that most teams ignore in favour of testing whatever is easiest to change. Button colours and font sizes are easy to test. They are also, in most cases, among the lowest-impact changes you can make to a page.
The variables that tend to produce the largest lifts are the ones that require the most creative and strategic effort to test. Headlines and primary copy. The offer itself: what you are asking people to do and what you are giving them in return. The structure and sequence of information on the page. Social proof: what kind, how much, and where it appears. Form length and field order. The match between the ad or email that drove the click and the page the user lands on.
When I was running agency teams across performance channels, the tests that moved revenue were almost never the cosmetic ones. They were the ones that changed the fundamental proposition being made to the user. Changing a headline from a feature statement to a benefit statement. Restructuring a pricing page to lead with the most popular tier rather than the cheapest. Replacing generic social proof with specific, quantified testimonials. These are the tests worth running.
The design and structure of your landing page is one of the highest-leverage areas for testing. If you are not already clear on what a well-structured landing page looks like, our landing page guide covers the core components and the decisions that affect conversion before you run a single test.
A note on what to leave alone: do not test things that are broken. If your page has a tracking issue, a broken form, or a mobile layout that does not render correctly, fix those first. Testing on top of a broken experience is a waste of traffic. You are measuring user frustration, not user preference.
The Relationship Between A/B Testing and User Experience
A/B testing and user experience work are not the same discipline, but they are deeply connected. A/B testing tells you what users do. User experience research tells you why. The best testing programmes use both.
Before you run a test, qualitative research can help you form better hypotheses. Session recordings, heatmaps, and usability testing can surface friction points that quantitative data alone would never reveal. If you can see that users consistently hover over a particular element without clicking, or that they scroll past your call-to-action without engaging, you have a specific problem to test a solution against. That is a far stronger starting point than guessing what to change.
Understanding the fundamentals of user experience is not optional if you want your A/B testing programme to produce meaningful results. The tests that win are usually the ones rooted in a genuine understanding of user behaviour, not the ones born from internal opinion about what looks better.
There is also a design infrastructure question worth addressing. The pages you test need to be built in a way that makes iteration fast and low-friction. If every variant requires a developer sprint and a two-week QA cycle, you will run fewer tests and learn more slowly. This is one reason why the tooling around page design matters. The wireframing tools your team uses upstream of development affect how quickly you can prototype and test ideas before they reach production.
And if your pages are not performing consistently across devices, your test results will be polluted. A variant that wins on desktop but loses on mobile will produce a blended result that obscures what is actually happening. Responsive design is not just a technical requirement. It is a testing prerequisite. If your page behaves differently across screen sizes, you are running multiple experiments simultaneously without knowing it.
Where A/B Testing Fits in the Conversion Funnel
A/B testing is most commonly applied at the bottom of the funnel, on landing pages, checkout flows, and sign-up forms. This makes commercial sense: the traffic is already there, the intent is high, and even small improvements in conversion rate can produce meaningful revenue impact at scale.
But the funnel has more surface area than most teams test against. The full conversion funnel, from top-of-funnel awareness through mid-funnel consideration to bottom-of-funnel conversion, offers testing opportunities at every stage. Email subject lines, ad copy, content headlines, category page layouts, product descriptions, checkout confirmation pages: all of these are testable, and all of them contribute to the overall conversion picture.
The reason teams concentrate testing at the bottom of the funnel is partly correct (that is where the money is) and partly a failure of imagination. Mid-funnel content tests can be highly revealing. Moz has explored how organic content contributes to conversion paths in ways that last-click attribution systematically misses. If you are only testing your landing pages and ignoring the content that drives users to those pages, you are optimising the last step of an experience you have not looked at properly.
I have seen this pattern repeatedly when auditing marketing programmes. Teams with sophisticated landing page testing programmes sitting on top of acquisition funnels that had never been examined. The landing page was optimised. The traffic feeding it was not. The result was a well-optimised final step in a leaky bucket.
The Measurement Problem Underneath A/B Testing
Here is the thing that most articles about A/B testing do not say clearly enough: the quality of your test results is entirely dependent on the quality of your tracking. If your conversion events are not firing consistently, if your test variants are not being assigned correctly, if your analytics setup has gaps, your test data is unreliable regardless of how well you designed the experiment.
I have audited marketing programmes where teams were running active A/B tests on pages with broken event tracking. They were making decisions based on data that was, in some cases, capturing fewer than half of actual conversions. The winning variant was not the better one. It was the one that happened to have slightly less broken tracking on a particular browser.
Fix measurement first. This is not a qualification or a caveat. It is the prerequisite. If I could give one piece of advice to any marketing team starting a testing programme, it would be to spend the first month auditing your tracking before you run a single test. Verify that your conversion events fire correctly across browsers and devices. Confirm that your test tool is assigning variants consistently and not leaking users between groups. Check that your baseline conversion rate is stable before you introduce a variable.
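One standard diagnostic for the variant-assignment check is a sample ratio mismatch (SRM) test: if you intended a 50/50 split, the observed counts should be statistically consistent with one. A minimal sketch using a chi-square test with one degree of freedom; the visitor counts are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def srm_p_value(count_a, count_b):
    """Sample ratio mismatch check against an intended 50/50 split:
    a chi-square test with one degree of freedom, computed via the
    equivalent two-sided normal test."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))

# Hypothetical counts: 10,000 visitors assigned 5,130 / 4,870
print(f"p = {srm_p_value(5_130, 4_870):.3f}")
# ~0.009: a split this lopsided is unlikely by chance, so investigate
# the assignment mechanism before trusting any result from this test
```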
This is not the exciting part of CRO. Nobody writes case studies about the month they spent fixing their tracking. But it is the part that determines whether everything that follows is real or fiction.
The same discipline applies to how you structure the information on your pages. Something as straightforward as a well-structured FAQ section can affect both user behaviour and your ability to track it cleanly. If you are building or restructuring pages as part of a testing programme, our free FAQ templates are a practical starting point for the kind of structured content that supports both UX and testing hygiene.
Common A/B Testing Mistakes That Waste Time and Money
I want to be specific here rather than generic. These are the mistakes I have seen most often, in agencies, in-house teams, and among clients who came to us after running testing programmes that had produced nothing useful.
Testing Without Enough Traffic
If your page gets 200 visitors a month, you cannot run a meaningful A/B test on it. You simply do not have the volume to detect anything other than very large effects, and very large effects are rare. Teams on low-traffic pages often run tests for weeks, declare a winner based on 40 conversions per variant, and implement changes that are statistically indistinguishable from noise. Use a sample size calculator before you start. If the required runtime is longer than three months, consider whether this page is the right place to test, or whether you should focus on driving more traffic before optimising conversion.
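The arithmetic that should precede any test on a low-traffic page is a one-liner. A sketch with illustrative numbers:

```python
# Rough runtime check before committing to a test (illustrative numbers)
required_per_variant = 8_155   # from the sample size sketch earlier
variants = 2
monthly_visitors = 200         # the low-traffic page in question

months = required_per_variant * variants / monthly_visitors
print(f"Estimated runtime: {months:.0f} months")  # ~82 months: do not run this test here
```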
Running Tests During Unusual Periods
Seasonality, promotional events, and external news events all affect user behaviour. A test that runs across a bank holiday weekend, a major sale period, or a news cycle that affects your category will produce results that may not generalise to normal conditions. Be aware of what is happening in your commercial calendar when you start and stop tests. If you cannot avoid running during an unusual period, note it in your test documentation and treat the results with appropriate caution.
Implementing Winners Without Documentation
This one is slow-burning but serious. Teams run tests, implement winners, and move on without recording what was tested, what the hypothesis was, what the result was, and what the winning variant looked like. Six months later, nobody can remember why the page looks the way it does. A year later, someone proposes testing the exact same change again. Without a testing log, you are building institutional knowledge that lives in individual heads and disappears when people leave. Keep a record. It does not need to be elaborate. A shared spreadsheet with the hypothesis, the variants, the result, the confidence level, and the implementation date is enough.
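As a sketch of how little structure is required, here is one way a log entry might look; the columns and the example row are illustrative, not a standard format:

```python
import csv

# Illustrative testing log -- the column names and the example entry
# are hypothetical, not a standard
with open("testing_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "page", "hypothesis", "variants",
                     "primary_metric", "result", "confidence", "implemented"])
    writer.writerow([
        "2024-03-01", "/pricing",
        "Leading with the most popular tier will lift plan selections",
        "A: cheapest tier first | B: most popular tier first",
        "plan selections", "B won, +11%", "96%", "yes",
    ])
```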
Optimising for the Wrong Metric
Conversion rate is the most common primary metric in A/B testing, and it is often the right one. But it is not always the right one. A test that increases form completions but decreases the quality of leads generated is not a win. A test that increases add-to-cart rate but decreases purchase completion is not a win. Make sure the metric you are optimising is connected to a business outcome you actually care about, not just the closest measurable proxy.
When I was managing large-scale performance programmes, the most dangerous metric was cost per lead. It was easy to reduce cost per lead by broadening targeting and simplifying forms. It was also a reliable way to flood the sales team with unqualified enquiries and damage the relationship between marketing and commercial. The metric was going in the right direction. The business outcome was not.
Treating A/B Testing as a Substitute for Strategy
A/B testing is an optimisation tool. It works within the space defined by your current strategy and positioning. It cannot tell you whether you are targeting the right audience, making the right offer, or operating in the right market. Teams that rely on testing to make strategic decisions are using the wrong tool for the job. Testing can refine execution. It cannot substitute for the thinking that should precede execution.
I judged the Effie Awards for several years, which gave me a window into campaigns that had produced measurable business results at scale. The ones that consistently impressed were not the ones with the most sophisticated testing programmes. They were the ones that had made sharp strategic choices upstream and then executed with discipline. Testing was part of the execution layer, not the strategic foundation.
How to Build a Testing Programme That Compounds Over Time
The difference between teams that get lasting value from A/B testing and teams that run tests indefinitely without cumulative improvement is almost always programme structure. Individual tests are interesting. A structured programme is what produces compounding returns.
A structured testing programme has four components: a backlog, a prioritisation framework, a documentation system, and a review cadence.
The backlog is a list of all the tests you want to run, with hypotheses written out for each. It should be longer than you can ever work through. If your backlog is short, you are not generating enough ideas, which usually means you are not doing enough user research or analytical investigation to surface problems worth solving.
The prioritisation framework determines which tests you run first. The most widely used framework in CRO is PIE: Potential (how much improvement is possible), Importance (how much traffic or revenue does this page generate), and Ease (how difficult is this test to implement). Score each test across these three dimensions and run the highest-scoring tests first. This is not a perfect system, but it is better than running whatever is easiest or whatever someone in a meeting suggested last week.
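The scoring is deliberately simple. A minimal sketch of PIE prioritisation, with hypothetical backlog entries and illustrative 1-10 scores:

```python
def pie_score(test):
    """PIE: the average of Potential, Importance, and Ease, each scored 1-10."""
    return (test["potential"] + test["importance"] + test["ease"]) / 3

# Hypothetical backlog entries with illustrative scores
backlog = [
    {"name": "Benefit-led headline on pricing page", "potential": 8, "importance": 9, "ease": 6},
    {"name": "Cut contact form from 7 fields to 3",  "potential": 7, "importance": 7, "ease": 8},
    {"name": "Green vs. blue CTA button",            "potential": 2, "importance": 5, "ease": 10},
]

# Run the highest-scoring tests first
for test in sorted(backlog, key=pie_score, reverse=True):
    print(f"{pie_score(test):.1f}  {test['name']}")
```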
The documentation system is your institutional memory. Every test that runs should be recorded with its hypothesis, variants, result, confidence level, and outcome. Over time, this becomes a knowledge base that tells you what works for your specific audience, not what works in general. That specificity is where the real value accumulates.
The review cadence is how you turn individual test results into strategic insight. Monthly or quarterly reviews of your testing programme should ask: what patterns are emerging across tests? Are there consistent themes in what is winning and what is losing? What does that tell us about our users that we did not know before? This is the layer of analysis that most teams skip, and it is the layer that separates teams that learn from teams that just run tests.
A/B Testing Tools: What to Look For
The tool market for A/B testing is mature and reasonably well-differentiated. The choice of tool matters less than how you use it, but there are a few things worth looking for when evaluating options.
Statistical rigour: does the tool use frequentist or Bayesian statistics, and does it handle the multiple testing problem? Some tools make it very easy to peek at results and call tests early, which is a feature that actively encourages bad practice. Look for tools that require you to set a sample size or runtime before the test starts, or that use sequential testing methods designed to handle early stopping correctly.
Segmentation capability: can you analyse results by device type, traffic source, new versus returning users, and other meaningful dimensions? A result that holds across all segments is stronger than one driven by a single segment. And a result that is strong for mobile users but negative for desktop users calls for a different decision than the blended result suggests.
Integration with your analytics stack: your testing tool should talk to your analytics platform so you can connect test results to downstream behaviour, not just the immediate conversion event. A variant that increases sign-ups but produces users with lower lifetime value is not a win, and you will only know that if your testing tool connects to your CRM or analytics platform.
CrazyEgg’s breakdown of multivariate testing is a useful reference if you are evaluating when to graduate from standard A/B tests to more complex experimental designs. The short version: when you have the traffic volume and want to understand how multiple variables interact, multivariate testing adds genuine value. Before that point, it adds complexity without proportionate insight. Their usability testing content is also worth reading as a complement to quantitative testing, particularly if your team is newer to qualitative research methods.
For teams looking at third-party support for their testing and optimisation work, it is worth understanding what conversion rate optimisation services actually involve before engaging an agency or consultant. The quality of CRO providers varies considerably, and the difference between a programme that compounds over time and one that produces a handful of inconclusive tests often comes down to the rigour of the process, not the sophistication of the technology.
What Good A/B Testing Results Look Like in Practice
There is a temptation, particularly in content written by testing platforms, to present A/B testing as a reliable engine for dramatic conversion improvements. Optimizely’s case study library includes examples of significant lifts from well-structured tests. These are real, but they are also the results that were selected for publication. The full distribution of test results includes a large proportion of inconclusive tests and a meaningful proportion of tests where the control outperforms the variant.
This is not a reason to stop testing. It is a reason to calibrate expectations correctly. A programme that runs 12 tests in a year and produces 4 clear winners, 5 inconclusive results, and 3 losses has done well. The 4 winners compound over time. The 5 inconclusive results tell you those variables are not the primary drivers of conversion on those pages. The 3 losses tell you that your hypotheses about those changes were wrong, which is useful information about your users.
The teams I have worked with that got the most from testing were not the ones chasing the biggest lifts. They were the ones that ran the most disciplined programmes, documented everything, and were genuinely curious about what the results meant rather than just whether they had won. That orientation, towards learning rather than winning, is what makes a testing programme valuable over time.
There is also a broader point here about marketing measurement that A/B testing makes concrete. When you run a properly controlled experiment, you are forced to define what success looks like before you start, measure it honestly, and accept the result regardless of whether it confirms your prior belief. That discipline, applied more broadly to how marketing teams measure their work, would improve the quality of decision-making across the board. Most marketing does not get measured with that rigour. A/B testing is one of the few places where it does, which is part of why it matters beyond its direct commercial impact.
If you want to explore the wider discipline that A/B testing supports, the CRO and Testing Hub covers everything from measurement foundations to testing programme management, with practical guidance on building the infrastructure that makes testing work at scale.
About the Author
Keith Lacy is a marketing strategist and former agency CEO with 20+ years of experience across agency leadership, performance marketing, and commercial strategy. He writes The Marketing Juice to cut through the noise and share what actually works.
