Abstract
We present LazyFS, a new fault injection tool that simplifies the debugging and reproduction of complex data durability bugs experienced by databases, key-value stores, and other data-centric systems in crashes. Our tool simulates persistence properties of POSIX file systems (e.g., operations ordering and atomicity) and enables users to inject lost and torn write faults with a precise and controlled approach. Further, it provides profiling information about the system's operations flow and persisted data, enabling users to better understand the root cause of errors.
We use LazyFS to study seven important systems: PostgreSQL, etcd, ZooKeeper, Redis, LevelDB, PebblesDB, and Lightning Network. Our fault injection campaign shows that LazyFS automates and facilitates the reproduction of five known bug reports containing manual and complex reproducibility steps. Further, it aids in understanding and reproducing seven ambiguous bugs reported by users. Finally, LazyFS is used to find eight new bugs, which lead to data loss, corruption, and unavailability.
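The two fault types named in the abstract, lost writes and torn writes, can be illustrated with a small toy model. The sketch below is not LazyFS code and does not use its interface: the class `ToyFile`, its methods, and the helper `_apply` are hypothetical names introduced only for illustration. It assumes a simplified file whose writes stay in a volatile buffer until an explicit `fsync`, which is one common way to reason about POSIX durability.

```python
class ToyFile:
    """Hypothetical in-memory model of one file; not LazyFS or its API."""

    def __init__(self):
        self.persisted = b""  # bytes assumed to survive a crash
        self.pending = []     # buffered (offset, data) writes, not yet durable

    def write(self, offset, data):
        # A write only reaches the volatile buffer.
        self.pending.append((offset, data))

    def fsync(self):
        # Flush buffered writes, in order, to the persisted image.
        for offset, data in self.pending:
            self.persisted = _apply(self.persisted, offset, data)
        self.pending.clear()

    def crash_lost(self):
        # Lost write: every write that was not fsynced disappears.
        self.pending.clear()
        return self.persisted

    def crash_torn(self, keep_bytes):
        # Torn write: only a prefix of the oldest pending write persists.
        if self.pending:
            offset, data = self.pending[0]
            self.persisted = _apply(self.persisted, offset, data[:keep_bytes])
        self.pending.clear()
        return self.persisted


def _apply(buf, offset, data):
    # Overwrite buf at offset with data, zero-padding any gap before offset.
    if offset > len(buf):
        buf = buf + b"\x00" * (offset - len(buf))
    return buf[:offset] + data + buf[offset + len(data):]


def demo():
    lost = ToyFile()
    lost.write(0, b"key=old;")
    lost.fsync()
    lost.write(0, b"key=new;")        # never fsynced
    after_lost = lost.crash_lost()    # old value survives

    torn = ToyFile()
    torn.write(0, b"key=old;")
    torn.fsync()
    torn.write(0, b"key=new;")
    after_torn = torn.crash_torn(5)   # only 5 bytes of the new write land
    return after_lost, after_torn
```

In this model, `demo()` returns `b"key=old;"` for the lost write and `b"key=nld;"` for the torn write: the first outcome silently drops an update, while the second leaves a mixed record that a recovery routine must detect, for example with a checksum.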
Publisher
Association for Computing Machinery (ACM)