When designing an experiment, it’s best practice to set your success criteria upfront (so you don’t talk yourself into the outcome you want after the test ends).
A good, quantitative experiment sets a significance threshold: a maximum p-value below which you'll treat the result as real. But what should that threshold be?
It’s tempting to borrow scientific best practice (for obvious reasons) and set your threshold at p ≤ 0.05, but you can generally be much more liberal (up to 0.25, says Thomas Redman, author of Data Driven).
With a more pragmatic threshold, you can get to the practical insight of your experiment faster (often p ≤ 0.05 isn’t even achievable with the sample sizes product experiments get).
In product, stat sig is rarely the most significant piece of the puzzle (pun intended, sorry). Instead, consider the context and design of the test and its possible outcomes to determine whether there will be business significance that outweighs the statistical significance.
Say you run a before-and-after test to see which headline maximizes conversion. B results in 25% higher conversions, but you discover that someone launched a 40%-off-everything coupon on the site during the second part of the test. Now you know your results lack business significance even though the p-value might look great.
On the other hand, if you run an A/B test campaign for a month and see a conversion lift of 30%, but your p-value is 0.15, you’re probably safe calling it a win and moving forward.
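To make that concrete, here's a minimal sketch of how you might compute the p-value for an A/B test like that one, using a pooled two-proportion z-test and only the standard library. The traffic and conversion counts are hypothetical, chosen to produce roughly a 30% relative lift:

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pool both arms to estimate the shared conversion rate under the null.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical month-long campaign: 1,000 visitors per arm,
# 40 conversions on A vs 52 on B (a 30% relative lift).
p = two_proportion_p_value(40, 1000, 52, 1000)
# p lands around 0.2: it fails a 0.05 bar but clears a 0.25 one.
```

With numbers like these, the strict scientific threshold would tell you to shrug the lift off, while Redman's more liberal bar lets you act on it.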
Statistical significance accounts for sampling error (basically, getting unlucky in your random selection), but most failed product experiments fail for more visible reasons: poorly structured tests, unaccounted-for variables, or just bad analysis. When assessing business significance, you should be taking all of these factors into account.
If you get those other factors right with good experiment design and execution, you can unlock meaningful insights with a more liberal approach to statistical significance.