drill-user mailing list archives

From "Dobes Vandermeer" <dob...@gmail.com>
Subject Updating tables stored on s3
Date Sat, 14 Mar 2020 04:34:39 GMT

I've been thinking about how I might get good performance from Drill while still having
data that updates, and while storing that data in S3.  Maybe this is a pipe dream, but here
are some thoughts and questions.

What I would like to be able to do is update, replace, and rebalance the parquet files in
S3 without having to calculate and specify the whole list of "current" files in each
query.

I was thinking perhaps I could use a view, so when I replace a file I can add a new file,
update the view to include it, and then delete the old file.
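
As a rough sketch of that swap, the view can be regenerated from the list of current files
each time one is replaced.  This assumes an S3 storage plugin registered as `s3`, a writable
workspace `dfs.tmp` to hold the view, and made-up file paths; all of those names are
hypothetical, and `SELECT *` may need to become an explicit column list if the files'
schemas differ.

```python
def build_view_sql(view_name, files, plugin="s3"):
    """Return a CREATE OR REPLACE VIEW statement that unions the given files.

    `plugin` is the (assumed) name of the Drill storage plugin pointing at S3.
    """
    if not files:
        raise ValueError("a view needs at least one file")
    selects = [f"SELECT * FROM {plugin}.`{path}`" for path in files]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body}"


def swap_file(files, old_path, new_path):
    """Return a new file list with old_path replaced by new_path."""
    if old_path not in files:
        raise ValueError(f"{old_path} is not in the current file list")
    return [new_path if p == old_path else p for p in files]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    current = ["events/part-0001.parquet", "events/part-0002.parquet"]
    updated = swap_file(
        current, "events/part-0002.parquet", "events/part-0002-v2.parquet"
    )
    print(build_view_sql("dfs.tmp.events", updated))
```

The order of operations would be: upload the new file, run the generated statement, and
only then delete the old file, so the view never points at a missing object.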

But I'm worried that having a view with thousands of files could perform poorly.

Building on that idea, it occurred to me that perhaps I could have a hierarchy of views -
views of views.  For example, a view for each day, rolled into a view for each month, rolled
into a view for each year, rolled into a top-level view.  This could be useful if Drill could
somehow prune views, but I haven't seen any mention of that in the docs.
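
To make the hierarchy concrete, here is a minimal sketch that generates the month, year,
and top-level view statements from a list of days.  It assumes a day view already exists for
each date, named like `dfs.tmp.events_2020_03_14`; the workspace, the `events` prefix, and
the naming scheme are all hypothetical.  It only builds the SQL text and says nothing about
whether Drill would skip the branches a query doesn't need.

```python
from collections import defaultdict


def union_view_sql(view_name, sources):
    """Return a CREATE OR REPLACE VIEW statement that unions the given sources."""
    selects = [f"SELECT * FROM {src}" for src in sources]
    return f"CREATE OR REPLACE VIEW {view_name} AS\n" + "\nUNION ALL\n".join(selects)


def hierarchy_sql(days, workspace="dfs.tmp", prefix="events"):
    """Build month, year, and top-level view statements, in that order.

    `days` is an iterable of 'YYYY-MM-DD' strings; each day is assumed to
    already have a view named {workspace}.{prefix}_YYYY_MM_DD.
    """
    months = defaultdict(list)
    for day in sorted(set(days)):
        y, m, d = day.split("-")
        months[(y, m)].append(f"{workspace}.{prefix}_{y}_{m}_{d}")
    if not months:
        raise ValueError("at least one day is required")

    statements = []
    years = defaultdict(list)
    for (y, m), day_views in sorted(months.items()):
        name = f"{workspace}.{prefix}_{y}_{m}"
        statements.append(union_view_sql(name, day_views))
        years[y].append(name)

    year_views = []
    for y, month_views in sorted(years.items()):
        name = f"{workspace}.{prefix}_{y}"
        statements.append(union_view_sql(name, month_views))
        year_views.append(name)

    # The top-level view unions every year view.
    statements.append(union_view_sql(f"{workspace}.{prefix}", year_views))
    return statements


if __name__ == "__main__":
    for stmt in hierarchy_sql(["2020-03-13", "2020-03-14", "2020-02-01"]):
        print(stmt + ";\n")
```

With this layout, replacing a file would only require regenerating the one day view that
references it; the month, year, and top-level views change only when a new day, month, or
year appears.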

It seems like Apache Iceberg is designed to help with this, but it doesn't currently support
S3, and I'm not sure if it will (or can) anytime soon.

Does anyone have any thoughts or experience to share in this area?

Maybe what I am really looking for is some other database entirely: one that supports
updates and also scales horizontally.  Maybe Drill just isn't that kind of system right
now.
