How Does Regression Test Prioritization Perform
in Real-World Software Evolution?

Yafeng Lu1, Yiling Lou2, Shiyang Cheng1, Lingming Zhang1, Dan Hao2, Yangfan Zhou3, Lu Zhang2
1Department of Computer Science, The University of Texas at Dallas, TX 75080, USA
2Key Laboratory of High Confidence Software Technologies (Peking University), MoE, China
Institute of Software, School of EECS, Peking University, Beijing, 100871, China
3School of Computer Science, Fudan University, 201203, China
[email protected]

In recent years, researchers have intensively investigated various topics in test prioritization, which aims to re-order tests to increase the rate of fault detection during regression testing. While the main research focus in test prioritization has been on proposing novel prioritization techniques and evaluating them on more and larger subject systems, little effort has been put into investigating the threats to validity in existing work on test prioritization. One main threat to validity is that existing work mainly evaluates prioritization techniques based on simple artificial changes to the source code and tests. For example, the changes to the source code usually include only seeded program faults, whereas the test suite is usually not augmented at all. In contrast, in real-world software development, software systems usually undergo various source code changes as well as test suite augmentation. Therefore, it is not clear whether the conclusions drawn by existing work in test prioritization from such artificial changes remain valid under real-world software evolution. In this paper, we present the first empirical study to investigate this important threat to validity in test prioritization. We reimplemented 24 variants of both traditional and time-aware test prioritization techniques, and investigated the impacts of software evolution on those techniques based on the version history of 8 real-world Java programs from GitHub. The results show that for both traditional and time-aware test prioritization, test suite augmentation significantly hampers their effectiveness, whereas source code changes alone do not influence their effectiveness much.
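To make the prioritization goal concrete, the following is a minimal sketch of the classic "additional" greedy strategy that traditional coverage-based prioritization techniques build on: repeatedly pick the test covering the most not-yet-covered code units, resetting coverage once no remaining test adds anything new. The test names and coverage sets here are hypothetical, and this is an illustration of the general strategy, not the paper's specific implementation.

```python
def additional_greedy(coverage):
    """Order tests so each pick covers the most not-yet-covered units.

    coverage: dict mapping test name -> set of covered statements.
    Returns a prioritized list of test names.
    """
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        # Pick the test adding the most new coverage; when no remaining
        # test adds anything, reset the covered set and continue (the
        # usual convention for the "additional" strategy).
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not (remaining[best] - covered) and covered:
            covered = set()
            continue
        order.append(best)
        covered |= remaining.pop(best)
    return order

# Hypothetical tests with their statement coverage.
tests = {
    "t1": {1, 2, 3},
    "t2": {3, 4},
    "t3": {1, 2, 3, 4, 5},
    "t4": {6},
}
print(additional_greedy(tests))  # t3 first: it covers the most statements
```

Under software evolution, both inputs to this loop change: the coverage sets drift as the source code changes, and new keys appear as the test suite is augmented, which is exactly the setting the study examines.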


Experimental Data and Results