Overall, it is important to remember that each cell has to have an equal amount of spend for the comparison to be fair. If the cell with less spend has better performance, that is probably just because it spent less, not because it was genuinely better. However, if one cell spent more and still performed better, it is the real winner!
In this article, we talk about the things to consider when analyzing Facebook split test results, both while they are running and after they have ended.
1. Analyzing results while the test is running
You can analyze the statistical significance of an ad study by clicking its name in Library → Ad Studies:
When the ad study is running, the most interesting question is whether to stop it – either because a difference has been found or because there does not seem to be a difference – or whether to keep it running to collect more data. To help you make the decision, we show a clear recommendation to either stop or continue the ad study. Note that you can modify the end date as long as the ad study is running.
The recommendation to continue or stop depends on the values of the smallest interesting difference and confidence level given when the test was created.
2. Analyzing results when the test has ended
After the ad study has ended, you can see which cell was the winning one, assuming you have collected enough data to draw conclusions. There are three possible outcomes:
- There is a statistically significant difference. In this case, you should implement the better variant more widely.
- There is no statistically significant difference. You can implement either variant. It is not certain which is better, and the difference is most likely too small to be of practical importance anyway.
- There is not enough data to draw conclusions. There might be a difference that is big enough to be interesting, but the ad study ended before enough data was collected to estimate this with enough confidence. If you still want to know which alternative is better, you should create a new ad study and run it longer (and with a larger budget) to collect more data.
If a difference is found, you will also see information on how large the difference is:
In this example, it is almost certain that CTR is over 0.056% smaller in Group A, and there is also a small probability that it is over 0.126% smaller. These differences are given as percentage points because CTR is itself a percentage; if CPA had been selected, the difference would be shown in monetary values. In this case, CTR was 0.758% in Group B and 0.667% in Group A, which gives the absolute difference of 0.091 percentage points shown in the picture above. The last value on each row gives the relative difference compared to the baseline, in this case approximately -0.091/0.758 = -12% (the short sketch below reproduces this arithmetic).

When the ad study has three or more cells, there are more possible outcomes. For example, there may be no statistically significant difference between the two best cells even though both are better than the remaining cells. It is also possible that there is a statistically significant difference for CPA but not for Conversion Rate. This happens, for example, if both campaigns have received the same number of clicks and conversions, but one campaign has accomplished this with only half the spend.
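As a quick, illustrative check, the absolute and relative differences from the CTR example above can be reproduced with a plain calculation. The numbers are the ones quoted in the text, not data pulled from any real ad study:

```python
# Reproduce the difference numbers from the CTR example above.
# The CTR values are the ones quoted in the text, in percent.

ctr_b = 0.758  # CTR of Group B (the baseline)
ctr_a = 0.667  # CTR of Group A

# Absolute difference, in percentage points (CTR is itself a percentage).
absolute_diff = ctr_a - ctr_b          # -0.091 percentage points

# Relative difference compared to the baseline (Group B).
relative_diff = absolute_diff / ctr_b  # about -0.12, i.e. -12%

print(f"absolute difference: {absolute_diff:.3f} percentage points")
print(f"relative difference: {relative_diff:.1%}")
```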
"Not statistically significant"
What does it mean when we say that a difference is "not statistically significant"? Let's first be clear about what this phrase does not mean: it does not mean that there is no difference at all. For example, suppose you are testing two campaigns, A and B, and in reality, Campaign B has a 10% higher performance. If we run the ad study until we get roughly 100 conversions in each campaign, say 96 in Campaign A and 104 in Campaign B, the result of the test is that we observe no statistically significant difference, even though Campaign B is in reality 10% better!
Here is a better way to interpret "not statistically significant": it means that you do not yet have enough data to distinguish whatever difference there might be, and crucially, you do not have enough data to conclude which alternative is better. (Note, however, that you have learned something by running the ad study: if Campaign B had been 50% better, you would have gotten a different result already, so it is unlikely that there is a large difference in performance.)
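To make this concrete, here is a minimal sketch of why 96 versus 104 conversions is inconclusive. It uses a deliberately simplified model (Poisson counts with Gamma posteriors on the underlying rates) rather than our actual methodology, so treat it as an illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed conversions from the example: 96 in Campaign A, 104 in Campaign B.
conversions_a, conversions_b = 96, 104

# Simplified model: Poisson counts with a nearly flat Gamma prior on the rate,
# so the posterior of each rate is approximately Gamma(count + 1, 1).
samples = 100_000
rate_a = rng.gamma(conversions_a + 1, 1.0, samples)
rate_b = rng.gamma(conversions_b + 1, 1.0, samples)

# Posterior of the relative difference of B versus A.
rel_diff = (rate_b - rate_a) / rate_a

low, high = np.percentile(rel_diff, [2.5, 97.5])
print(f"95% interval for the relative difference: {low:+.0%} to {high:+.0%}")
print(f"P(B better than A) = {np.mean(rel_diff > 0):.2f}")
```

The resulting interval easily covers both zero and +10%, which is exactly the "not enough data to distinguish the difference" situation described above.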
Keeping all of this in mind when running ad studies can easily get overwhelming, and misunderstanding statistical significance is common even in science. This is exactly why we ask you to define the smallest interesting difference when creating an ad study: it allows us to give you more understandable results. If the smallest interesting difference had been 10% in the example above, you would have seen a recommendation to continue the ad study, because, given the data so far, it is still possible that there is a difference larger than 10%. On the other hand, if the smallest interesting difference had been 40%, you would have seen a recommendation to stop the ad study, because it would have been unlikely that there is a difference larger than 40%.
Applying learnings
How to implement changes based on the study results depends on the ad study, and how well the results can be generalized. Let's use a creative split test as an example: If you learned that for a prospecting campaign, showing the prize in the image performs better than not showing the prize, you can gradually implement this across your prospecting campaigns to minimize risk. Here, "gradually" means that you implement the change for a couple of audiences at a time. Alternatively, you can implement the change across all prospecting campaigns at once. Which approach is better depends on your risk tolerance.
Frequently Asked Questions (FAQ)
Where are the p-values?
Our calculations are based on Bayesian statistics and p-values are not relevant in this context. As to why we prefer to use Bayesian statistics instead of classical (frequentist) statistical tests, the main reasons are:
- Most people want to calculate statistical significance while the ad study is running, not only after it has ended. However, if you do this with classical statistical tests and stop the ad study as soon as a statistically significant result is reached, you will actually affect the result of the test. The fact that checking alone can change the outcome might sound surprising; the reason is that the observed result always fluctuates during the ad study, and the more often you check it, the more likely you are to check at a moment when it happens to look statistically significant (the simulation sketch after this list illustrates the effect). Merely using Bayesian statistics does not magically make the problem go away, but it allows us to be more flexible and reduce the problem to a negligible level. You can find more information here and here.
- Another problem is that p-values are, more often than not, misunderstood. A large p-value is often misunderstood to mean that there is no difference; however, it could also mean that you just do not have enough data yet to draw conclusions. Bayesian statistics allows us to calculate quantities that are more intuitive and more useful than p-values: how likely it is that there is a difference, and how large the difference is. These are, after all, what most people really want to know.
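The peeking problem mentioned in the first point can be demonstrated with a small simulation. The sketch below is a toy example with made-up conversion rates and check points, and it uses a classical two-proportion z-test rather than our Bayesian calculations:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def two_sided_p(successes_a, n_a, successes_b, n_b):
    """Classical two-proportion z-test p-value (normal approximation)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

true_rate = 0.02                                # both campaigns convert identically
looks = [2_000, 4_000, 6_000, 8_000, 10_000]    # impressions at each interim check
experiments = 2_000

significant_final_only = 0
significant_any_look = 0
for _ in range(experiments):
    clicks_a = rng.binomial(1, true_rate, looks[-1]).cumsum()
    clicks_b = rng.binomial(1, true_rate, looks[-1]).cumsum()
    p_values = [two_sided_p(clicks_a[n - 1], n, clicks_b[n - 1], n) for n in looks]
    significant_final_only += (p_values[-1] < 0.05)
    significant_any_look += any(p < 0.05 for p in p_values)

print(f"significant at the final look only: {significant_final_only / experiments:.1%}")
print(f"significant at ANY of the looks:    {significant_any_look / experiments:.1%}")
```

Even though the two simulated campaigns are identical, checking at several interim points roughly doubles the chance of seeing a "significant" difference at least once, which is why naive repeated testing with classical methods is problematic.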
I want to know if there is a difference of any size!
Are you sure about that? Suppose that Campaign A has a 0.1% higher conversion rate than Campaign B. This means that when Campaign B gets 1000 conversions, Campaign A gets 1001 (on average, that is). In any individual test, the outcome would be something different because of random variation, which is why you would need to collect approximately 4 million conversions in each campaign to be able to conclude that Campaign A is indeed better by a tiny amount. There are probably better ways to spend your time and money.
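To illustrate how easily random variation swamps such a tiny difference, here is a toy simulation using the made-up numbers from the example above (again, not our actual methodology):

```python
import numpy as np

rng = np.random.default_rng(2)

# Campaign A is truly 0.1% better: on average 1001 conversions when B gets 1000.
mean_b, mean_a = 1000.0, 1001.0
runs = 100_000

conversions_b = rng.poisson(mean_b, runs)
conversions_a = rng.poisson(mean_a, runs)

wins_for_a = np.mean(conversions_a > conversions_b)
print(f"A (the truly better campaign) shows more conversions in {wins_for_a:.0%} of runs")
```

Even with a thousand conversions per campaign, the truly better campaign comes out ahead only slightly more than half of the time, which is why pinning down a 0.1% difference requires millions of conversions.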
From a purely theoretical point of view, there is almost always some difference. However, the difference can be so small that it is irrelevant for all practical purposes, and there are a million other possible changes you could make that would have a bigger impact on your performance. When it is unlikely that a large enough difference will be found, we show a recommendation to stop the ad study so you do not waste time and money pursuing differences that are not relevant in practice.
For more details on budgeting for Facebook split tests, see our Knowledge Base article on Planning a Facebook split test.
Why is it important to define the smallest interesting difference?
The smallest interesting difference is used to decide when enough data has been collected and the ad study can be stopped. In brief, data is collected until we know the difference with sufficient precision. The exact definition is somewhat involved, but if you are still reading this, you probably want to know anyway.
In statistical terms, let θ represent the metric whose difference we want to analyze. Given two ad study cells, A and B, we first estimate the posterior distributions of this metric in each cell, θA and θB. Using these distributions, we can calculate the distribution of the relative difference between θA and θB, and then find the width of the 95% highest density interval (HDI) of this distribution (the value 95% corresponds to the confidence level chosen when creating the ad study). The ad study should be stopped when this width becomes smaller than the smallest interesting difference.
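As a rough sketch of this stopping rule, the snippet below assumes Beta posteriors for CTR in two cells, made-up click and impression counts, and a smallest interesting difference of 10%; the actual implementation may use different models and estimators:

```python
import numpy as np

rng = np.random.default_rng(3)

def hdi_width(samples, mass=0.95):
    """Width of the highest density interval, estimated from posterior samples."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    window = int(np.floor(mass * n))
    widths = sorted_samples[window:] - sorted_samples[: n - window]
    return widths.min()

# Illustrative observed data for two cells (clicks / impressions); made-up numbers.
clicks_a, impressions_a = 530, 70_000
clicks_b, impressions_b = 600, 70_000

# Beta posteriors for the CTR in each cell (uniform Beta(1, 1) prior).
samples = 200_000
theta_a = rng.beta(clicks_a + 1, impressions_a - clicks_a + 1, samples)
theta_b = rng.beta(clicks_b + 1, impressions_b - clicks_b + 1, samples)

# Posterior of the relative difference of A versus the baseline B.
relative_diff = (theta_a - theta_b) / theta_b

smallest_interesting_difference = 0.10   # 10%, as chosen when creating the ad study
width = hdi_width(relative_diff, mass=0.95)

print(f"95% HDI width of the relative difference: {width:.1%}")
print("recommendation:", "stop" if width < smallest_interesting_difference else "continue")
```

With these example numbers, the HDI of the relative difference is still wider than the smallest interesting difference, so the recommendation would be to keep collecting data.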
Because this stopping criterion does not depend on how large the difference is, but only on how well the difference can be estimated, it is possible to estimate statistical significance even while the ad study is running without affecting the outcome. A more elaborate explanation can be found here.