Metrics for A/B Testing: A Complete Guide

What are metrics in A/B testing?

Metrics are used to measure the success of an A/B test. They quantify the outcome of the test and drive the key result.

What are the metric types?

There are usually two types of metrics to create when setting up an A/B test:

Invariant metrics: Metrics that shouldn’t change between your test and control groups. They serve as a sanity check on the experiment setup.
Evaluation metrics: High-level business metrics that measure user experience with the product.

How do we define a metric (for evaluation or sanity checking)?

Start with a high-level concept of the metric (e.g. active users, CTR)
Work out the details (e.g. how you define user activity)
Optionally, take a set of metrics and summarize them into a single metric (e.g. an overall evaluation criterion (OEC))

For evaluation, you can choose either one metric or a whole suite of metrics. If you have multiple metrics, you can combine them into one, such as an objective function or an Overall Evaluation Criterion (OEC), a term that Microsoft uses.
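As a minimal sketch of combining several metrics into one, the Python below computes a weighted sum of relative metric changes. The function name oec, the metric names, and the weights are all hypothetical and chosen only for illustration; in practice the form of the combination and the weights are a business decision.

```python
def oec(metric_changes, weights):
    """Combine relative metric changes into one score via a weighted sum.

    metric_changes: dict of metric name -> relative change (test vs control)
    weights: dict of metric name -> weight (hypothetical values)
    """
    missing = set(weights) - set(metric_changes)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return sum(weights[name] * metric_changes[name] for name in weights)


# Hypothetical relative changes observed in an experiment.
changes = {"ctr": 0.02, "active_users": 0.01, "revenue_per_user": -0.005}
# Hypothetical weights; they do not have to sum to 1.
weights = {"ctr": 0.5, "active_users": 0.3, "revenue_per_user": 0.2}

score = oec(changes, weights)
print(round(score, 4))
```

A single combined score is easier to act on, but it can hide a drop in one metric behind gains in another, so it is worth looking at the individual metrics as well.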

A final consideration is how generally applicable the metric is. If you are running a suite of A/B tests, it is preferable to have a metric that works across the entire suite.

A user funnel describes the series of steps users take through the site. It is called a funnel because each subsequent stage has fewer users than the stage above it. Each stage can be a metric: a total count, a rate, or a probability (i.e. the probability that a unique user progresses to the next stage).
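The funnel idea can be sketched in a few lines of Python. The stage names and counts below are made up for illustration, and funnel_metrics is a hypothetical helper, not part of any library. For each stage it reports the total count and the fraction of unique users who progressed from the previous stage.

```python
def funnel_metrics(stage_counts):
    """Return (stage, count, progression) for each funnel stage.

    stage_counts: ordered list of (stage name, unique users at that stage).
    progression is count / previous count, or None for the first stage
    (or when the previous stage had no users).
    """
    results = []
    previous = None
    for stage, count in stage_counts:
        if previous is None or previous == 0:
            progression = None
        else:
            progression = count / previous
        results.append((stage, count, progression))
        previous = count
    return results


# Hypothetical funnel with made-up counts.
stages = [
    ("homepage", 1000),
    ("search", 400),
    ("product page", 200),
    ("purchase", 50),
]

funnel = funnel_metrics(stages)
for stage, count, progression in funnel:
    print(stage, count, progression)
```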

Try to avoid metrics that are difficult to measure, either because you don’t have access to the data or because the data takes too long to collect.

What data can be used for creating metrics?

External data can be used. Three categories of external sources are:

Companies that collect data (e.g. Comscore, Nielsen)
Companies that conduct surveys (e.g. Pew)
Academic papers

These sources can help you benchmark your own metrics against the industry.

Internal data can be used as well. You could use:

Retrospective analysis: Look at historical data to see how metrics changed in the past and evaluate those changes
Surveys and user experience research: These help you develop ideas about what you want to research

The problem with these studies is that they show correlation, not causation, unlike running an experiment. Talk to your colleagues about which ideas they think make sense for metrics.

You can gather additional data through:

User Experience Research (UER): High depth on a few users. This is good for brainstorming. You can also use special equipment in a UER study (e.g. an eye-tracking camera) that you cannot use on your website. You may want to validate the results using retrospective analysis.
Focus groups: Medium depth and a medium number of participants. You can get feedback on hypotheticals, but may run into the issue of groupthink.
Surveys: Low depth but a high number of participants. Useful for metrics you cannot directly measure. Survey results can’t be directly compared with other metrics, since the survey population and the population behind your internal metrics may differ.

Examples of metrics

Generally, a rate is used when you want to measure the usability of the site, and a probability when you want to measure the impact. Some commonly used metrics for A/B testing are:

Cookie probability: For each time interval, the number of cookies that click divided by the number of cookies.
Pageview probability: Number of pageviews with a click within the time interval divided by the number of pageviews.
Click-through Rate: Number of clicks divided by number of pageviews.
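To make the difference between the three definitions concrete, here is a minimal Python sketch. The input format (one tuple of cookie ID and click count per pageview, all from the same time interval), the click_metrics function, and the sample data are assumptions made for illustration.

```python
def click_metrics(pageviews):
    """Compute the three click metrics from per-pageview records.

    pageviews: list of (cookie_id, n_clicks) tuples, one per pageview,
    all taken from the same time interval (assumed input format).
    """
    if not pageviews:
        raise ValueError("no pageviews")
    cookies = {cookie for cookie, _ in pageviews}
    clicking_cookies = {cookie for cookie, n in pageviews if n > 0}
    total_clicks = sum(n for _, n in pageviews)
    pageviews_with_click = sum(1 for _, n in pageviews if n > 0)
    return {
        # cookies with at least one click / all cookies
        "cookie_probability": len(clicking_cookies) / len(cookies),
        # pageviews with at least one click / all pageviews
        "pageview_probability": pageviews_with_click / len(pageviews),
        # all clicks / all pageviews (can exceed 1)
        "click_through_rate": total_clicks / len(pageviews),
    }


# Made-up sample: cookie "a" clicks three times on one pageview.
sample = [("a", 0), ("a", 3), ("b", 0), ("b", 0), ("c", 1)]
metrics = click_metrics(sample)
print(metrics)
```

In this sample the repeated clicks from one cookie raise the click-through rate (0.8) well above the pageview probability (0.4), which is why the choice between a rate and a probability matters.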

You may have to filter out spam and fraud to de-bias the data. One way to figure out whether filtering is biasing or de-biasing the data is to slice your data and then calculate the metric for each slice before and after filtering. If filtering affects any slice disproportionately, then you may be biasing your data.
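The slicing check can be sketched as follows. The record format, the country slice, and the spam rule (more than 20 clicks on one pageview) are hypothetical; the point is only to compare a metric per slice before and after filtering.

```python
from collections import defaultdict


def ctr_by_slice(pageviews, slice_key):
    """Click-through rate (clicks / pageviews) for each value of slice_key."""
    clicks = defaultdict(int)
    views = defaultdict(int)
    for pv in pageviews:
        clicks[pv[slice_key]] += pv["clicks"]
        views[pv[slice_key]] += 1
    return {s: clicks[s] / views[s] for s in views}


def filter_effect(pageviews, is_spam, slice_key):
    """Return slice -> (CTR before filtering, CTR after filtering).

    The second value is None if the filter removed the whole slice.
    """
    before = ctr_by_slice(pageviews, slice_key)
    kept = [pv for pv in pageviews if not is_spam(pv)]
    after = ctr_by_slice(kept, slice_key)
    return {s: (before[s], after.get(s)) for s in before}


# Made-up pageview records, sliced by a hypothetical "country" field.
pageviews = [
    {"country": "US", "clicks": 1},
    {"country": "US", "clicks": 0},
    {"country": "DE", "clicks": 1},
    {"country": "DE", "clicks": 49},
]

# Hypothetical spam rule: more than 20 clicks on a single pageview.
effect = filter_effect(pageviews, lambda pv: pv["clicks"] > 20, "country")
print(effect)
```

Here the filter leaves the US slice unchanged but moves the DE slice sharply, which is the kind of disproportionate effect worth investigating before trusting the filtered data.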

To remove weekly effects when looking at, say, total active cookies over time, use week-over-week comparisons, i.e. divide the current data by the data from a week ago. Alternatively, you can use year-over-year comparisons.
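A minimal sketch of the week-over-week calculation, assuming one value per day in chronological order. The week_over_week helper and the daily counts are made up for illustration.

```python
def week_over_week(daily_values):
    """Return each day's value divided by the value from 7 days earlier.

    The first 7 days have no comparison point, so the result is 7 items
    shorter than the input. A zero baseline yields None.
    """
    ratios = []
    for i in range(7, len(daily_values)):
        previous = daily_values[i - 7]
        ratios.append(daily_values[i] / previous if previous else None)
    return ratios


# Two made-up weeks of total active cookies, with a weekend dip in each.
daily_active_cookies = [
    100, 110, 120, 115, 105, 60, 50,
    110, 121, 126, 115, 105, 66, 55,
]

ratios = week_over_week(daily_active_cookies)
print([round(r, 2) for r in ratios])
```

The raw series drops sharply every weekend, while the ratios stay close to 1, so the weekly pattern no longer dominates the trend.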