Not all mobile devices are equal: measuring heterogeneous performance in online controlled experiments

Published in CODE@MIT: The 6th Annual Conference on Digital Experimentation @ MIT, 2019

Abstract

It is of great importance to accurately measure various performance features in the production cycle of mobile apps in industry. Unlike measuring user engagement, whose metrics are usually aggregated on the user level (i.e. one data point per user) in A/B experiments, the industry standard is to calculate performance metrics by aggregating on the event level (i.e. one data point per event). However, since the distributions of event count from each device (or user) are usually extremely skewed, event-level performance metrics can be dominated by a small group of heavy users on relatively high-end mobile devices. Thus, optimizing for event-level performance metrics in A/B testing experiments can bias product decisions toward the experiences of heavy users. On the other hand, if we measure performance metrics in the same way as user engagement metrics, the experiences of all users can be weighted equally. We compare both the event-level and user-level approaches to measuring performance for 152 A/B experiments at Snapchat, and find that 10.06% of results show differences in statistical significance, effect direction, or both. We further show that it is the strong heterogeneity of mobile devices on the market that causes the non-trivial gap between event-level and user-level performance metrics. Based on our investigation of this device heterogeneity, we provide recommendations on how to measure and interpret performance metrics for A/B experiments of mobile apps and discuss directions for future work.

Recommended citation: Chow, Evan. ”Not All Phones Are Equal: Measuring Heterogeneous Performance in Online Controlled Experi- ments.” CODE 2019: Conference on Digital Experimentation (CODE), 2019, MIT. https://github.com/evnchw/evnchw.github.io/blob/master/files/chow_2019.pdf