drill-dev mailing list archives

From Stefán Baxter <ste...@activitystream.com>
Subject Re: Improving S3 query performance with cache capabilities
Date Thu, 01 Oct 2015 11:41:38 GMT
Hi Jocelyn,

I have used Tachyon as a caching layer for accessing S3 from Drill. It has
limitations, but the latest release of Tachyon addresses many of them (I have
not yet tested this).

I'm quite interested in what you are doing and it's relevant for my project
as well.

Please let me know if there is any development on your side, and if you are
interested I will share some basic documentation on what needs to be done to
connect the three (Drill, Tachyon, and S3).

Regards,
 -Stefán

On Thu, Oct 1, 2015 at 9:37 AM, DEMOY, Jocelyn <Jocelyn.DEMOY@sage.com>
wrote:

> Hi all,
>
> I am an architect of a PaaS BI solution and spent a week testing Apache
> Drill.
>
> In our current solution, we store our customer data in column files on S3
> and perform in-memory computation on a home-made NoSQL engine. We have TB of
> data on S3, but since it is a multi-tenant solution, when an end customer
> runs queries they touch only a subset of the data in a separate S3 folder
> with, say, 1 GB of data max.
>
> We aim for the best response time for real-time analytics queries (e.g. a
> 1 s response time for a 500K-row aggregation on 4 columns with joins). To do
> so, we load only the necessary columns (we have one file per column, not per
> table) and cache the column values in the local JVM for xx minutes.
>
> I built a POC with Drill & Parquet to replace our computation engine.
> Local execution times are fine and match our needs. I am quite happy with
> the ability to query S3 with real SQL syntax. My main problem is the latency
> with S3 (from AWS instances): for every query I have to pay the download
> cost of the Parquet file from S3, which makes the query response time too
> long for a "real-time" solution.
>
> I would like to know if you have planned anything in the roadmap to enable
> native caching of the data itself (not only metadata caching).
>
> I saw the AbstractStoragePlugin and AbstractRecordReader classes. Would it
> be possible (and a good idea) for us to create a decorator for the classic
> file provider (or a totally new custom S3 provider) with in-memory cache
> capability? How would this fit with a Drill cluster and the Drill
> philosophy?
>
> Thanks in advance
>
>
> Jocelyn Demoy
> BI Architect, R&D and strategy
> Sage
>
