drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Question on Drill Distributed Mode
Date Tue, 31 Mar 2015 15:40:58 GMT
I would suggest using a tool to split the JSON file into smaller chunks of 64-128MB,
while keeping the JSON objects intact in each file.
In the long run it will also be best to just use a distributed FS for the Drill cluster
rather than trying to manage file partitions yourself.
I personally prefer MapR-FS, as it provides a robust DFS and also has loopback NFS, which lets
you manage the files and data on the DFS as simply as on a local FS, with normal Linux tools
and scripts.

It is better to keep large JSON data in more manageable chunks, because it is very hard for any
tool to split a large file on the fly: finding where records begin and end can be time consuming
without scanning the whole file. 64-256MB per JSON file seems to be a good starting point,
depending on total data size, node config, etc.
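
As a rough illustration of the splitting step, here is a minimal Python sketch. It assumes
the source file holds one complete JSON object per line (newline-delimited JSON), which may
not match your file; a single large JSON array or pretty-printed objects would need a real
JSON parser instead. The function name, the part-NNNNN.json naming, and the 128MB default
are made up for this example and are not anything Drill requires.

```python
import os


def split_json_lines(src_path, out_dir, max_bytes=128 * 1024 * 1024):
    """Split a newline-delimited JSON file into chunks of roughly max_bytes.

    Assumption: each line holds one complete JSON object, so cutting only
    at line boundaries keeps every object intact.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []      # chunk files written so far
    out = None      # currently open chunk file
    written = 0     # bytes written to the current chunk
    with open(src_path, "rb") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            if not line.endswith(b"\n"):
                line += b"\n"  # last line of the file may lack a newline
            # Open a new chunk if none is open, or if this record would
            # push a non-empty chunk past the size limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(paths):05d}.json")
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

A record larger than max_bytes ends up alone in its own chunk rather than being cut, so
chunk sizes are a target and not a hard limit.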

—Andries

On Mar 30, 2015, at 9:58 PM, Varun Kumar Reddy B <varunk@sahajsoft.com> wrote:

> Hello Team
> 
> I started exploring Drill for our requirement to run SQL on semi-structured
> data. I have set up a 4-node Drill cluster with ZooKeeper. I have a few
> questions on how it actually works:
> 
> 1. I run Drill in distributed mode, using dfs (local file system), and
> I have a 1GB JSON file on one of the nodes (say n1). I am able to run
> the query by launching sqlline from any of the nodes (n1, n2, n3, n4)
> in spite of having data only on n1. My questions are:
> 
> a. Is the query being executed on all the nodes? i.e., will Drill
> parallelise the query execution by distributing the data to the other nodes
> n2, n3, n4?
> 
> b. If NO, will copying the same file to all the nodes n2, n3, n4 help
> in leveraging the MPP architecture of Drill?
> 
> 
> Thx
> Varun

