drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@mapr.com>
Subject Re: Distribution of workload across nodes in a cluster
Date Thu, 23 Feb 2017 15:20:53 GMT
Last I checked, CSV data is read with a single thread per file. To make matters more challenging,
Drill will typically scan the whole file (and with a select * you are requesting
a full scan of the data anyway).


Try splitting the file into several smaller files (128 MB, 256 MB, or smaller, depending on your
requirements). Also consider moving the data locally to your Drill cluster, or using Parquet. In some
use cases you may read data remotely and then write it locally for repeated access; in that case,
split the file into smaller files on the remote cluster and write the results locally in Parquet.
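
A minimal sketch of the splitting step, in case it helps. It is plain Python and does not use any Drill API; the function name split_csv, the part-NNNNN.csv naming, and the has_header switch are all made up for illustration. It assumes every record sits on one line (no quoted fields with embedded newlines) and breaks only at line ends, so each output file stays at or under max_bytes unless a single line is larger than that.

```python
import os


def split_csv(src_path, out_dir, max_bytes=128 * 1024 * 1024, has_header=False):
    """Split src_path into smaller files, breaking only at line boundaries.

    Hypothetical helper: assumes one record per line. If has_header is True,
    the first line is repeated at the top of every output file.
    Returns the list of output paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        header = src.readline() if has_header else b""
        for line in src:
            # Start a new part when the current one would exceed the limit.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(paths):05d}.csv")
                out = open(path, "wb")
                paths.append(path)
                out.write(header)
                written = len(header)
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

For a 1 GB file, the default of 128 MB gives roughly eight parts; pointing the query at the output directory instead of the single file should then give the planner several files to hand out, though how many fragments you actually get depends on your cluster settings.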


--Andries

________________________________
From: PROJJWAL SAHA <proj.saha@gmail.com>
Sent: Wednesday, February 22, 2017 11:31:27 PM
To: user@drill.apache.org
Subject: Distribution of workload across nodes in a cluster

Hello,

I am running a select * query on a 1 GB CSV file with a 5-node Drill
cluster. The CSV file is stored in another storage cluster within the
enterprise.

In the query profile, I see one major fragment and within the major
fragment, I see only 1 minor fragment. The hostname for the minor fragment
corresponds to one of the nodes of the cluster.

I think, therefore, that not all the resources of the cluster are utilized.
Are there any configuration parameters that can be tweaked to achieve more
effective workload distribution across cluster machines?

Let me know what you think.

Regards,
Projjwal
