drill-user mailing list archives

From 王亮 <wanglian...@gmail.com>
Subject Drill storage plugin for IPFS, any suggestion is welcome :)
Date Sat, 06 Jul 2019 09:31:03 GMT
Hi all,

After reading that excellent book "Learning Apache Drill: Query and Analyze
Distributed Data Sources with SQL", my classmate and I also wanted to write
a Drill storage plugin. We found that Drill already supports most DFS and
NFS, so we chose a relatively new and promising distributed file system:
IPFS.

So we built Minerva, a Drill storage plugin that connects IPFS's
decentralized storage and Drill's flexible query engine. Any data file
stored on IPFS can be easily accessed from Drill's query interface, just
like a file stored on a local disk. The basic idea is very simple: run a
Drill instance alongside the IPFS daemon, and you can connect to other users
on IPFS who are also using Minerva. If one of them happens to have stored
the file you are trying to query, Drill can send the execution plan to that
node, which executes the operations locally and returns the results. Of
course, other users can benefit from your node as well, if you are
sharing the data they want. If there are enough people running Minerva,
data sharing and querying can be made distributed and more efficient!

The query process is as follows:
0 The user inputs an SQL statement, referencing a file on IPFS by its CID;
1 The Foreman resolves the CIDs of the "pieces" of the data file, as well
as the IPFS providers of these pieces, by querying the DHT of IPFS;
2 The Foreman distributes jobs to drillbits running on the providers;
3 Drillbits on the providers read data from the piece of file on their
local disk, perform any necessary relational operations, and return results
to the Foreman.
4 The Foreman returns the results to the user.
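
The steps above can be sketched as a toy, in-memory simulation. This is only
an illustration of the flow, not Minerva's or Drill's actual code: the DHT
and the providers' disks are plain dictionaries, and every name here
(DHT, LOCAL_STORE, resolve_pieces, plan_jobs, run_on_provider,
execute_query, the CIDs and node names) is made up for the example.

```python
# Toy simulation of the query flow. All names and data are hypothetical
# and do not reflect Minerva's or Drill's real API.
from collections import defaultdict

# Stand-in for the IPFS DHT: file CID -> list of (piece CID, provider node).
DHT = {
    "QmFile": [("QmPiece1", "node-a"), ("QmPiece2", "node-b")],
}

# Stand-in for each provider's local disk: node -> piece CID -> rows.
LOCAL_STORE = {
    "node-a": {"QmPiece1": [{"name": "alice", "age": 30},
                            {"name": "bob", "age": 17}]},
    "node-b": {"QmPiece2": [{"name": "carol", "age": 45}]},
}


def resolve_pieces(file_cid):
    """Step 1: the Foreman looks up a file's pieces and their providers."""
    return DHT[file_cid]


def plan_jobs(pieces):
    """Step 2: group pieces by provider so each drillbit gets one job."""
    jobs = defaultdict(list)
    for piece_cid, provider in pieces:
        jobs[provider].append(piece_cid)
    return dict(jobs)


def run_on_provider(provider, piece_cids, predicate):
    """Step 3: a drillbit reads its local pieces and filters them locally."""
    rows = []
    for cid in piece_cids:
        rows.extend(r for r in LOCAL_STORE[provider][cid] if predicate(r))
    return rows


def execute_query(file_cid, predicate):
    """Steps 0 and 4: take a query on a CID and return the merged results."""
    jobs = plan_jobs(resolve_pieces(file_cid))
    results = []
    for provider, piece_cids in jobs.items():
        results.extend(run_on_provider(provider, piece_cids, predicate))
    return results


adults = execute_query("QmFile", lambda row: row["age"] >= 18)
print([row["name"] for row in adults])
```

In this sketch the filter runs on each provider before anything is sent
back, which mirrors the idea that only results, not whole files, travel to
the Foreman.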

Thanks to the modular design of Drill, we could rather "easily" write this
storage plugin. The plugin currently supports basic query operations, both
read and write, but only works with JSON and CSV files. It is not very
stable yet, and performance is still poor, mainly because DHT queries on
IPFS take too long. We plan to address these problems in the future.

If you are interested, we have made a few slides that explain the ideas in

Any suggestion is welcome. ^_^

Find the code on GitHub: https://github.com/bdchain/Minerva

Wang Liang
