Adaptive partitioning and indexing for in situ query processing

Citation

TY  - JOUR
AU  - Olma, Matthaios
AU  - Karpathiotakis, Manos
AU  - Alagiannis, Ioannis
AU  - Athanassoulis, Manos
AU  - Ailamaki, Anastasia
PY  - 2020
DA  - 2020/01/01
TI  - Adaptive partitioning and indexing for in situ query processing
JO  - The VLDB Journal
SP  - 569
EP  - 591
VL  - 29
IS  - 1
AB  - The constant flux of data and queries alike has been pushing the boundaries of data analysis systems. The increasing size of raw data files has made data loading an expensive operation that delays the data-to-insight time. To alleviate the loading cost, in situ query processing systems operate directly over raw data and offer instant access to data. At the same time, analytical workloads have increasing number of queries. Typically, each query focuses on a constantly shiftingâ€”yet smallâ€”range. As a result, minimizing the workload latency requires the benefits of indexing in in situ query processing. In this paper, we present an online partitioning and indexing scheme, along with a partitioning and indexing tuner tailored for in situ querying engines. The proposed system design improves query execution time by taking into account user query patterns, to (i) partition raw data files logically and (ii) build lightweight partition-specific indexes for each partition. We build an in situ query engine called Slalom to showcase the impact of our design. Slalom employs adaptive partitioning and builds non-obtrusive indexes in different partitions on-the-fly based on lightweight query access pattern monitoring. As a result of its lightweight nature, Slalom achieves efficient query processing over raw data with minimal memory consumption. Our experimentation with both microbenchmarks and real-life workloads shows that Slalom outperforms state-of-the-art in situ engines and achieves comparable query response times with fully indexed DBMS, offering lower cumulative query execution times for query workloads with increasing size and unpredictable access patterns.
SN  - 0949-877X
UR  - https://doi.org/10.1007/s00778-019-00580-x
DO  - 10.1007/s00778-019-00580-x
ID  - Olma2020
ER  -

どんなもの？

事前にデータを読み込んでおく必要がない(CSVファイルとかを直接扱える, in situ)データベース。クエリを何回か呼び出しているうちに、クエリ実行の統計から、データをうまく区切って読み込み量を減らしたり、索引を作成したりして、自動でチューニングしてくれる。

先行研究と比べてどこがすごい？

既存の in situ DB では、フルスキャンクエリか、事前のチューニングが必要か、のどちらかだった。
目的から、既存の自動チューニング技術より軽量(メモリ的にも、計算量的にも)である必要があった。

技術や手法のキモはどこ？

Positional Maps: あるフィードがファイルのどの位置にあったかを記録するキャッシュ
水平分割: よく使われるデータの範囲だけ索引を作成するため、まず分割を行う。分散が小さくなるまでクエリを実行するたびに分割を繰り返す。
Value-existence index: パーティションに値が存在するかの索引。これを使って、読む必要のないパーティションを飛ばせる。
Value-position index: パーティション単位で作られるB+木索引。今までのクエリ実行コストから次のクエリのコストを考慮して、索引を作成するか決定する。Snoopy cache問題に帰着させる。

どうやって有効だと検証した？

一般的なRDBMS、前作（自動索引がない）のPostgresRAWと比較。読み込み時間を含めて時間を計測すると、圧倒的に速くなる。

議論はある？

（水平分割の基準がよくわからんかった）
今までの課題だった自動索引が完成したので、この分野はひと段落ついたのではと感じた。

次に読むべき論文は？

前回: Adaptive query processing on RAW data （既読）
生のファイルに索引を作成する他の例: FastBit: interactively searching massive data