drill-dev mailing list archives

From "DEMOY, Jocelyn" <Jocelyn.DE...@sage.com>
Subject Improving S3 query performance with cache capabilities
Date Thu, 01 Oct 2015 09:37:50 GMT
Hi all,

I am an architect of a PaaS BI solution and spent a week testing Apache Drill.

In our current solution, we store our customer data in column files on S3 and perform in-memory
computation on a homemade NoSQL engine. We have terabytes of data on S3, but since it's a
multi-tenant solution, an end customer's queries only touch a subset of the data, held in a
separate S3 folder of, let's say, 1 GB at most.

We try to get the best response time for real-time analytics queries (for example, a 1-second
response time for a 500K-row aggregation on 4 columns with joins). To do so, we load only the
necessary columns (we have one file per column, not per table) and cache the column values in
the local JVM for xx minutes.

I built a POC with Drill and Parquet to replace our computation engine. Local execution
times are fine and match our needs. I am quite happy with the ability to query S3 with
real SQL syntax. My main problem is the latency with S3 (from AWS instances): for every
query I have to pay the cost of downloading the Parquet file from S3, which makes the query
response time too long for a "real-time" solution.

I would like to know whether you have anything planned on the roadmap to enable native caching
of the data itself (not only metadata caching).

I saw the AbstractStoragePlugin and AbstractRecordReader classes. Would it be possible (and
a good idea) for us to create a decorator for the classic file provider (or a totally new
custom S3 provider) with in-memory caching? Would this make sense in a Drill cluster, and
would it fit the Drill philosophy?
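
To make the decorator idea concrete, here is a minimal sketch of the pattern in plain Java. It
does not use any Drill API: ObjectFetcher, CachingFetcher and the S3 key in main are hypothetical
names chosen for illustration, and how such a wrapper would plug into AbstractStoragePlugin or
AbstractRecordReader is exactly the open question. It only shows a time-limited in-memory cache
placed in front of a slow download.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch: none of these types exist in Drill.
class CachingFetcherSketch {

    /** Assumed abstraction over "download this object from S3". */
    interface ObjectFetcher {
        byte[] fetch(String key) throws IOException;
    }

    /** Decorator that keeps fetched bytes in memory for a fixed time-to-live. */
    static final class CachingFetcher implements ObjectFetcher {

        private static final class Entry {
            final byte[] data;
            final long expiresAtMillis;

            Entry(byte[] data, long expiresAtMillis) {
                this.data = data;
                this.expiresAtMillis = expiresAtMillis;
            }
        }

        private final ObjectFetcher delegate;
        private final long ttlMillis;
        private final LongSupplier clock; // injected so expiry can be tested
        private final Map<String, Entry> cache = new ConcurrentHashMap<>();

        CachingFetcher(ObjectFetcher delegate, long ttlMillis, LongSupplier clock) {
            this.delegate = delegate;
            this.ttlMillis = ttlMillis;
            this.clock = clock;
        }

        @Override
        public byte[] fetch(String key) throws IOException {
            long now = clock.getAsLong();
            Entry entry = cache.get(key);
            if (entry != null && now < entry.expiresAtMillis) {
                return entry.data; // hit: no call to the delegate
            }
            // Miss or expired entry: download again and remember the result.
            // Two threads missing at the same time may both download; the
            // cache is also unbounded, which a real version would need to fix.
            byte[] data = delegate.fetch(key);
            cache.put(key, new Entry(data, now + ttlMillis));
            return data;
        }
    }

    public static void main(String[] args) throws IOException {
        int[] downloads = {0};
        // Stand-in for a real S3 download; it only counts how often it is called.
        ObjectFetcher slow = key -> {
            downloads[0]++;
            return key.getBytes(StandardCharsets.UTF_8);
        };
        ObjectFetcher cached = new CachingFetcher(slow, 60_000L, System::currentTimeMillis);
        cached.fetch("tenant-a/col1.parquet");
        cached.fetch("tenant-a/col1.parquet"); // served from memory
        System.out.println("downloads = " + downloads[0]);
    }
}
```

In this sketch the second fetch of the same key within the time-to-live never reaches the
delegate. In a cluster, each node would hold its own copy of the cache, so the benefit would
depend on queries for the same tenant landing on the same nodes.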

Thanks in advance

Jocelyn Demoy
BI Architect, R&D and strategy
