Hi Users,
Drill CAN use the disk when running out of memory (a.k.a. spill to disk).
Currently only the Sort operator supports spilling, hence you’d need to force a Merge Join
for joining, or a Streaming Aggregation for aggregating, since both of those rely on Sort.
But we are currently working on expanding this functionality to other operators (Hash Aggregate,
Hash Join, windowing, etc.)
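To illustrate forcing the sort-based operators, here is a minimal Python sketch that only builds the session statements; it does not connect to Drill. The helper name is hypothetical. The option names planner.enable_hashjoin and planner.enable_hashagg are the planner switches I would expect to use for this, but please verify them against your Drill version.

```python
# Sketch only: builds ALTER SESSION statements as strings, nothing is sent to Drill.
# Assumption: turning off the hash-based operators makes the planner choose
# Merge Join and Streaming Aggregation, which rely on Sort.

def sort_based_plan_statements():
    """Return the statements that switch off Hash Join and Hash Aggregate."""
    options = {
        "planner.enable_hashjoin": False,  # planner should fall back to Merge Join
        "planner.enable_hashagg": False,   # planner should fall back to Streaming Aggregation
    }
    return [
        f"ALTER SESSION SET `{name}` = {str(value).lower()}"
        for name, value in options.items()
    ]


if __name__ == "__main__":
    # Print the statements so they can be pasted into a Drill SQL session.
    for statement in sort_based_plan_statements():
        print(statement + ";")
```

Run the printed statements in the same session as the query, since session options do not carry over to other connections.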
And Drill does not work with “raw storage” (i.e., it does not manage storage pages, etc. itself); Drill
needs the storage to be a file system, or HBase, Hive, etc.
BTW, Drill supports the Apache Parquet storage format, which is columnar – and may suit
your needs.
Boaz
On 2/21/17, 2:21 PM, "Nicolas Paris" <niparisco@gmail.com> wrote:
Hi,
Join csv, json, databases.
Your needs look like ETL processes. I am not sure Drill is well suited to
such a goal. AFAIK, it is not able to work on disk when it runs out of
memory.
Moreover, those tasks usually need some procedural code. I am not
sure UDFs are very flexible.
For such a use case, I would use an ETL tool such as Talend and load MonetDB
directly with it.
On Feb 19, 2017 at 18:02, Gustavo Brian wrote:
> Hi there,
>
> I'm a newbie to this, so I apologize if I'm asking something senseless :)
>
> Thanks for this amazing product. I'm planning to use it as the main query
> engine for data analysis. My plan is to have a raw storage area in which to drop
> different types of documents (csv, json, ...) as they are produced by the
> apps, then use Drill to query them and join against a SQL database to produce
> enriched data to drop into a columnar store: MonetDB, Druid, ...
>
> My question is: is there a preferred storage engine for this raw storage?
> Can Drill take advantage of other engines like Hadoop or YARN?
>
> Thanks in advance