drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Drill large data build up in fragment by using join
Date Wed, 15 Apr 2020 20:15:06 GMT
Hi Shashank,

Let me make sure I understand the question. You have two large JSON data files? You are on
a distributed Drill cluster. You want to know why you are seeing a billion rows in one fragment
rather than the work being distributed across multiple fragments? Is this an accurate summary?

The key thing to know is that Drill (and most Hadoop-based systems) relies on files being "block-splittable".
That is, if your file is 1 GB in size, Drill needs to be able to read, say, 256 MB blocks
from the file so that four Drill fragments can each read part of that single 1 GB file. This
is true even if you store the files in S3.


CSV, Parquet, SequenceFile, and others are block-splittable. As it turns out, JSON is not.
The reason is simple: there is no way to jump into the middle of a typical JSON file and scan
for the start of the next record. With CSV, newlines are record separators. Parquet has row
groups. With JSON, there may or may not be newlines between records, and there may or may not
be newlines within records.
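
To make the contrast concrete, here is a minimal sketch (not Drill's actual reader; the
function name and parameters are made up) of how a reader for a newline-delimited format can
process just one block of a large file:

    # Sketch of a block reader for a newline-delimited format.
    # Convention: a record belongs to the block in which it starts,
    # even if it ends past the block boundary. (Exact-boundary edge
    # cases are glossed over here.)
    def records_in_block(path, start, length):
        with open(path, "rb") as f:
            f.seek(start)
            if start > 0:
                # Skip the partial record; the reader for the
                # previous block owns it.
                f.readline()
            while f.tell() < start + length:
                line = f.readline()
                if not line:
                    break
                yield line

This trick works because a newline is guaranteed to be a record boundary. In an arbitrary
JSON file, the byte you land on after the seek could be inside a string, an array, or a nested
object, and there is no marker to scan for.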

It turns out that there is an emerging standard called jsonlines [1] which requires that there
be newlines between, but not within, JSON records. Using jsonlines would make JSON a
block-splittable format. Drill does not yet support this specialized JSON format, but doing
so would be a good enhancement for data files that adhere to it. Is your data
in jsonlines format?
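
For illustration, a jsonlines file is just one complete JSON object per line, with newlines
only between records (the records below are made up):

    {"id": 1, "name": "alpha", "tags": ["x", "y"]}
    {"id": 2, "name": "beta", "tags": []}
    {"id": 3, "name": "gamma", "tags": ["z"]}

Since every newline is a record boundary, the same seek-and-scan trick used for CSV works here.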


For now, the solution is simple: rather than storing your data in a single large JSON file,
split the data into multiple smaller files within a single directory. Drill will read
each file in a separate fragment, giving you the parallelism you want. Make each file on the
order of 100 MB, say. The key is to ensure that you have at least as many files as you have
minor fragments. The number of minor fragments will be 70% of your CPU count per node. If
you have 10 CPUs per node, say, Drill will create 7 fragments per node. Then multiply this
by the number of nodes. If you have 4 nodes, you'll have 28 minor fragments total. You want
at least 28 JSON files so you can keep every fragment busy.
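
If your data is already newline-delimited (or you can regenerate it that way), a rough
splitting script might look like this; the function name, the output file pattern, and the
100 MB target are all just placeholders:

    # Split one large jsonlines file into ~100 MB pieces so Drill can
    # scan the directory in parallel, one file per minor fragment.
    import os

    def split_jsonlines(src, out_dir, target_bytes=100 * 1024 * 1024):
        os.makedirs(out_dir, exist_ok=True)
        part, written, out = 0, 0, None
        for line in open(src, "rb"):  # one complete JSON record per line
            if out is None or written >= target_bytes:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, "part-%05d.json" % part), "wb")
                part += 1
                written = 0
            out.write(line)
            written += len(line)
        if out:
            out.close()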

If your code generates the JSON, then you can change the code to split the data into smaller
files. If you obtain the JSON from somewhere else, then your options may be more limited.

Will any of this help resolve your issue?


Thanks,
- Paul

 
[1] http://jsonlines.org/

    On Wednesday, April 15, 2020, 12:32:35 PM PDT, Shashank Sharma <shashank.sharma@jungleworks.com>
wrote:  
 
Hi folks,

I have two large JSON data sets and am querying them on a distributed Apache Drill system.
Can anyone explain why Drill is making (or building) billions of records to scan in a
fragment when joining the two big record sets, by hash join as well as merge join, with only
a 60,000-record data set, read through an S3 bucket distributed file system?

-- 

Shashank Sharma

Software Engineer

Phone: +91 8968101068
