drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Clarification regarding Apache drill setup
Date Fri, 16 Aug 2019 16:37:07 GMT
Hi Manu,

To add a bit more background... Drill uses local storage only for spilling result sets when
they are too large for memory. Otherwise, data never touches disk once read from S3.

Unlike Snowflake, Drill does not cache S3 data locally. This means that, if you query the
same file multiple times, Drill will hit S3 for each query. Adding Snowflake-like S3 caching
is an open project looking for volunteers.

Spilling can be configured to go to the DFS (distributed file system). Presumably, this can
be S3, though I don't think anyone has tried this. Information about configuring the spill
directory is in [1].
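
For concreteness, a minimal sketch of that spill configuration in
drill-override.conf, per [1] (the fs and directory values here are
placeholders to adapt):

    drill.exec.spill: {
      # File system to spill to; "file:///" is the local file system.
      # A DFS URI should also be accepted here, though, as noted above,
      # spilling to S3 is untested.
      fs: "file:///",
      # One or more directories; Drill spreads spill files across them.
      directories: [ "/tmp/drill/spill" ]
    }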

Drill does not need Hadoop; it only needs ZK (and, as Nitin pointed out, the proper configuration
for your cloud vendor).
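
As a minimal sketch, a ZooKeeper-only deployment needs little more than a
cluster ID and the ZK quorum in drill-override.conf (the hostnames here are
placeholders):

    drill.exec: {
      # All Drillbits sharing this ID join the same cluster.
      cluster-id: "drillbits1",
      # The ZooKeeper quorum Drill uses for coordination.
      zk.connect: "zk1:2181,zk2:2181,zk3:2181"
    }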

As it turns out, there is some information on AWS and S3 setup in the "Learning Apache Drill"
book. Probably not as much detail as you would like, but enough to get you started. The book
does not include GCE setup, but the details should be similar.

Drill uses the HDFS client (not server) to access a cloud vendor. So, as long as you install
the correct HDFS client libraries, you are mostly good to go. Note that the S3 libraries have
evolved over time. The book explains the most recent library at the time we wrote the book
last year. Please check the HDFS project for which library you need for GCE access.
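
For S3 in particular, the usual route is the s3a connector from the
hadoop-aws library, with credentials in Drill's core-site.xml; a minimal
sketch (the key values are placeholders, and GCE uses its own connector
and properties):

    <configuration>
      <!-- Placeholder s3a credentials, read by the hadoop-aws client. -->
      <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
      </property>
    </configuration>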

Now a request: you will learn quite a number of important details as you set up your cloud-agnostic
solution. Please post your findings here, and/or file JIRA tickets, so we can update the documentation
or fix any issues that you discover. You are benefiting from the work of others who created
Drill; please share your findings with the community so others can benefit from your work.

Thanks,
- Paul

[1] https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/

    On Friday, August 16, 2019, 05:10:00 AM PDT, Nitin Pawar <nitinpawar432@gmail.com> wrote:

This is from my own learning, and I could be wrong on a few things, so wait
for others to answer as well:


1.  When setting up the Drill cluster in a prod environment to query data
ranging from several gigabytes to a few terabytes hosted in S3/blob
storage/cloud storage, what are the considerations for disk space? I
understand Drillbits make use of data locality, but how does that work in
the case of cloud storage like S3? Will the entire data from S3 be moved to
the Drill cluster before starting the query processing?

It is advised to use Parquet as your file format; it improves performance a
lot. Drill will bring in all the data it needs to process a given query.
This can be reduced if you arrange your folder structure around filterable
columns such as dates (see the sketch below). When you are using Parquet
files, each file or block is downloaded separately by the Drillbit servers,
and then, based on your query patterns, the data localization happens, such
as when you group by, or filter and then sum. All the data generally
resides in memory and then starts spilling to disk based on your query
patterns.
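
To illustrate that folder-structure point: assuming a layout such as
/logs/2019/08/*.parquet exposed through an "s3" storage plugin (the plugin
name, path, and bytes column are invented for this example), Drill exposes
directory levels as implicit dir0/dir1 columns you can filter on:

    -- dir0 is the first directory level (year), dir1 the second (month);
    -- filtering on them lets Drill skip directories it never has to read.
    SELECT dir0 AS yr, dir1 AS mon, SUM(bytes) AS total_bytes
    FROM s3.`/logs`
    WHERE dir0 = '2019' AND dir1 = '08'
    GROUP BY dir0, dir1;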

  2.  Is it possible to use S3 or other cloud storage solutions for the Sort,
Hash Aggregate, and Hash Join operators' spill data rather than using local
disk?

As per my understanding, only local disks are used for aggregations that do
not fit in memory. Using cloud-based storage systems for intermediate
outputs adds heavy network IO and causes huge delays in queries.


  3.  Is it ok to run a Drill production cluster without Hadoop? Is just a
ZooKeeper quorum enough?

You do NOT need to set up a Hadoop cluster. Apache Drill has no
prerequisite on Hadoop for execution purposes unless you are using those
feature sets of Apache Drill. To run a Drill cluster, a ZooKeeper quorum is
more than sufficient. From there on, based on what storage systems you use,
you will need to create storage plugins and use them; a sketch of an S3
plugin definition follows below.
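
A rough sketch of such an S3 storage plugin definition (the bucket name is
a placeholder; credentials usually live in core-site.xml as above):

    {
      "type": "file",
      "connection": "s3a://your-bucket",
      "workspaces": {
        "root": {
          "location": "/",
          "writable": false,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "parquet": { "type": "parquet" }
      }
    }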

On Fri, Aug 16, 2019 at 10:38 AM Manu Mukundan <manu.mukundan@prevalent.ai>
wrote:

> Hi,
>
> My name is Manu and I am working as a Bigdata architect in a small startup
> company in Kochi, India. Our new project handles visualizing large volumes
> of unstructured data in cloud storage (it can be S3, Azure Blob Storage or
> Google Cloud Storage). We are planning to use Apache Drill as the SQL query
> execution engine so that we will be cloud agnostic. Unfortunately, we are
> finding some key questions unanswered before moving ahead with Drill as
> our platform. Hoping you can provide some clarity; it will be much
> appreciated.
>
>
>  1.  When setting up the Drill cluster in a prod environment to query data
> ranging from several gigabytes to a few terabytes hosted in S3/blob
> storage/cloud storage, what are the considerations for disk space? I
> understand Drillbits make use of data locality, but how does that work in
> the case of cloud storage like S3? Will the entire data from S3 be moved to
> the Drill cluster before starting the query processing?
>  2.  Is it possible to use S3 or other cloud storage solutions for the Sort,
> Hash Aggregate, and Hash Join operators' spill data rather than using local
> disk?
>  3.  Is it ok to run a Drill production cluster without Hadoop? Is just a
> ZooKeeper quorum enough?
>
>
> I totally understand how busy you can be, but if you get a chance, please
> help me get clarity on these items. It would be really helpful.
>
> Thanks again!
> Manu Mukundan
> Bigdata Architect,
> Prevalent AI,
> manu.mukundan@prevalent.ai
>
>
>

-- 
Nitin Pawar