drill-user mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: Nested or Array JSON
Date Thu, 02 Apr 2015 15:53:39 GMT
To answer Andries' question, with an enhancement in the 0.8 release, there
should be no hard limit on the size of Drill records supported. That being
said, Drill is not fundamentally set up for processing enormous rows, so we
do not have a clear idea of the performance impact of working with such
datasets.

This document will initially be read as a single record, and I think the
0.8 release should be able to read it in. From there, flatten should be
able to produce individual records suitable for further analysis. These
records will be a more reasonable size and should give you good
performance.
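
As a rough, untested sketch of that second step: flatten the "data" array in
a subquery, then select from the resulting records in the outer query. This
assumes each element of "data" is itself an array of column values. The
aliases t and row_data, the output names, and the array positions 0 and 1
are placeholders made up for illustration; the real positions depend on the
column layout of the dataset.

```sql
-- Sketch only: aliases, output names, and array positions are hypothetical.
-- Inner query: flatten emits one record per element of the top-level "data" array.
-- Outer query: pick individual values out of each flattened element by position.
select t.row_data[0] as first_col,
       t.row_data[1] as second_col
from (
  select flatten(data) as row_data
  from dfs.`/path/to/file.json`
) t;
```

Filters and aggregates can then go in the outer query, where they operate on
the small per-row records rather than on the single large document.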

-Jason

On Thu, Apr 2, 2015 at 8:49 AM, Jason Altekruse <altekrusejason@gmail.com>
wrote:

> Hi Muthu,
>
> Welcome to the Drill community!
>
> Unfortunately the mailing list does not allow attachments. Please send
> along the error log copied into a mail message.
>
> If you are working with the 0.7 version of Drill, I would recommend
> upgrading to the new 0.8 release that just came out; it includes a lot of
> bug fixes and enhancements.
>
> We're glad to hear you have been successful with your previous efforts
> with Drill. Unfortunately Drill is not well suited for exploring datasets
> like the one you have linked to. By default Drill supports records of the
> format accepted by MongoDB for bulk import, where individual records take
> the form of a JSON object.
>
> Looking at this dataset, it follows a pattern we have seen before, but one
> that Drill is currently not well suited to working with. All of the data is
> in a single JSON object. At the top of the object are a number of
> dataset-wide metadata fields, all nested under a field "view", with the
> main data I am guessing you want to analyze nested under the field "data"
> in an array. While this format is not ideal for Drill, given the size of
> the dataset you might be able to get it working with an operator in Drill
> that could help make the data more accessible.
>
> The operator is called flatten, and is designed to take an array and
> produce individual records for each element in the array. Optionally other
> fields from the record can be included alongside each of the newly spawned
> records to maintain a relationship between the incoming fields in the
> output of flatten.
>
> For more info on flatten, see this page in the wiki:
> https://cwiki.apache.org/confluence/display/DRILL/FLATTEN+Function
>
> For this dataset, you might be able to get access to the data simply by
> running the following:
>
> select flatten(data) from dfs.`/path/to/file.json`;
>
> If you need to have access to some of the other fields from the top of the
> dataset, you can include them alongside flatten and they will be copied
> into each record produced by the flatten operation:
>
> select flatten(t.data), t.`view`.id, t.`view`.category from
> dfs.`/path/to/file.json` t;
>
>
>
> On Wed, Apr 1, 2015 at 10:52 PM, Muthu Pandi <muthu1086@gmail.com> wrote:
>
>> Hi All
>>
>>
>> I am new to the JSON format and am exploring it. I have used Drill to
>> analyse simple JSON files, which works like a charm, but I am not able
>> to load this "
>> https://opendata.socrata.com/api/views/n2rk-fwkj/rows.json?accessType=DOWNLOAD"
>> JSON file for analysis.
>>
>> I am using the ODBC connector to connect to Drill 0.8. Kindly find the
>> attachment for the error.
>>
>>
>>
>> *Regards*
>> *Muthupandi.K*
>>
>>  Think before you print.
>>
>>
>>
>
