Affiliation:
1. École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
2. Harvard University, Cambridge, MA
Abstract
As data collections become larger and larger, users are faced with increasing bottlenecks in their data analysis. More data means more time to prepare and to load the data into the database before executing the desired queries. Many applications, for example scientific data analysis and social networks, already avoid using database systems due to the complexity and the increased data-to-query time, that is, the time between getting the data and retrieving its first useful results. For many applications data collections keep growing fast, even on a daily basis, and this data deluge will only increase in the future, when we expect to have much more data than we can move or store, let alone analyze.
We here present the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system.
In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and lessons learned by implementing the NoDB philosophy over a modern database management system (DBMS), we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure. We conclude that NoDB systems are feasible to design and implement over a modern DBMS, bringing an unprecedented positive effect in usability and performance.
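The idea of keeping positional information and a cache over a raw file can be illustrated with a minimal sketch. This is not the authors' implementation: the class `RawCsvScanner`, its methods, and the `tokenized_fields` counter are hypothetical names chosen for illustration, and the sketch assumes a simple delimiter-separated file held in memory with no quoting or escaping. It records the character offset of each field it tokenizes (the positional information) and stores converted values (the cache), so later queries can skip repeated tokenizing and type conversion.

```python
class RawCsvScanner:
    """Hypothetical in situ scanner over raw delimiter-separated text.

    Assumption: no quoted fields or escaped delimiters.
    """

    def __init__(self, raw: str, delimiter: str = ","):
        self.raw = raw
        self.delimiter = delimiter
        # (start, end) character offsets of every non-empty row.
        self.rows = []
        pos = 0
        for line in raw.splitlines(keepends=True):
            end = pos + len(line.rstrip("\r\n"))
            if end > pos:
                self.rows.append((pos, end))
            pos += len(line)
        # Positional map: (row, col) -> offset where that field starts.
        self.field_starts = {}
        # Cache: (row, col) -> already converted value.
        self.cache = {}
        # Number of delimiters searched for so far (for demonstration only).
        self.tokenized_fields = 0

    def _field_start(self, row: int, col: int) -> int:
        row_start, row_end = self.rows[row]
        # Resume from the nearest field to the left whose offset is known.
        known = col
        while known > 0 and (row, known) not in self.field_starts:
            known -= 1
        pos = self.field_starts.get((row, known), row_start)
        while known < col:
            nxt = self.raw.find(self.delimiter, pos, row_end)
            if nxt == -1:
                raise IndexError(f"row {row} has no column {col}")
            pos = nxt + len(self.delimiter)
            known += 1
            self.field_starts[(row, known)] = pos  # remember for later queries
            self.tokenized_fields += 1
        return pos

    def value(self, row: int, col: int, convert=int):
        key = (row, col)
        if key not in self.cache:
            start = self._field_start(row, col)
            row_end = self.rows[row][1]
            stop = self.raw.find(self.delimiter, start, row_end)
            if stop == -1:
                stop = row_end
            # Convert once; later accesses reuse the cached value.
            self.cache[key] = convert(self.raw[start:stop])
        return self.cache[key]

    def column(self, col: int, convert=int):
        return [self.value(r, col, convert) for r in range(len(self.rows))]


scanner = RawCsvScanner("1,2,3\n4,5,6\n")
third = scanner.column(2)   # tokenizes up to column 2 in each row
second = scanner.column(1)  # reuses offsets recorded by the first query
```

In this sketch the second query performs no new delimiter searches, because the offsets for column 1 were recorded while the first query walked toward column 2. A real system would also need to bound the size of both structures, which the sketch omits.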
Publisher
Association for Computing Machinery (ACM)
Cited by
17 articles.
1. Significantly Improving Fixed-Ratio Compression Framework for Resource-limited Applications;Proceedings of the 53rd International Conference on Parallel Processing;2024-08-12
2. Detecting CSV file dialects by table uniformity measurement and data type inference;Data Science;2024-07-26
3. Statistical Claim Checking;Proceedings of the 31st ACM International Conference on Information & Knowledge Management;2022-10-17
4. Resource-aware adaptive indexing for in situ visual exploration and analytics;The VLDB Journal;2022-04-16
5. JSONSki: streaming semi-structured data with bit-parallel fast-forwarding;Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems;2022-02-22