drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lee, David" <David....@blackrock.com>
Subject RE: Drill Query Engine for nested json or parquet
Date Tue, 05 Sep 2017 23:19:08 GMT
Ok I ran into the same problem with another set of files..

Opened a ticket and attached test files

https://issues.apache.org/jira/browse/DRILL-5769


-----Original Message-----
From: Lee, David 
Sent: Wednesday, July 05, 2017 5:49 PM
To: user <user@drill.apache.org>
Subject: RE: Drill Query Engine for nested json or parquet

Unfortunately the files A) contain licensed data B) are 403,208,254 bytes of gzipped JSON
per file which contain roughly 6,000 JSON records with a mix of ~4,000 nested arrays and objects
(up to 7 levels deep).

Running Drill SQL on a sample of individual JSON keys works so it's probably some particular
JSON key value out of the ~4,000 which is causing the IOBE. The files which are failing are
from Africa, Asia and Australia so I have a feeling it's some regional utf8 / Unicode text
values which is the problem.

-----Original Message-----
From: Jinfeng Ni [mailto:jni@apache.org]
Sent: Monday, July 03, 2017 3:39 PM
To: user <user@drill.apache.org>
Subject: Re: Drill Query Engine for nested json or parquet

This IOBE is clearly indicates a code bug. From the error message, it seems to be similar
to DRILL-1743, which was marked as fixed. Could you please open a JIRA, and post the query
/ sample data you used (if it does not contain sensitive info)? Having a reproduce for this
would make it easier for people to take a look. Thanks.


1. https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_DRILL-2D1743&d=DwIFaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTifecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=IZ97vmUqJvkHR_Xi9GS-CmRIrqr1MSe6379wL6zbZbM&s=A-_qhmjmgndvzahLAAdfUZDxUHNEft27SgzsC5OPC_E&e=


On Mon, Jul 3, 2017 at 10:07 AM, Lee, David <David.Lee@blackrock.com> wrote:

> Ok thanks, but now I got a new odd problem which I haven't been able 
> to debug..
>
> SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4
> (expected: range(0, 16384))
>
> After unwrapping the JSON records from a nested dictionary, 90% of my 
> JSON files convert successfully to parquet, but 10% are getting the error above..
>
> I've tried to narrow down which rows in my JSON file are problematic, 
> but it is both consistent and random.. Removing 1000s records here and 
> there fixes the file, but it must be a combination of multiple records 
> causing the error. This error never showed up when I converted the 
> wrapped JSON to parquet..
>
> Any idea how to get to the root of this one?
>
> -----Original Message-----
> From: Jinfeng Ni [mailto:jni@apache.org]
> Sent: Wednesday, June 28, 2017 11:42 AM
> To: user <user@drill.apache.org>
> Subject: Re: Drill Query Engine for nested json or parquet
>
> You are right that Q2 has to first apply flatten the field "details"
> before applying filter on top of the flattened rows.  I do not think 
> Drill currently supports to push filter down.
>
> The difference between Q1 and Q2:  Q1 is querying a 4-row table, while
> Q2 is querying a 1-row table; the flatten operator produces 4 rows on 
> top of which the filter is applied.
>
>
> On Wed, Jun 28, 2017 at 9:51 AM, Lee, David <David.Lee@blackrock.com>
> wrote:
>
> >
> > Is there a fundamental difference between the following queries? I 
> > can't get the second example working with parquet files which 
> > contain 400,000+ nested records..
> >
> > It seems like the system wants to flatten every possible record 
> > before applying the SQL Where clause to the flattened data structure..
> >
> > Example 1:
> >
> > select b.* from dfs.`test1.json` b where b.item  = 3
> >
> > [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> > ]
> >
> > Example 2:
> >
> > select b.* from
> > (select flatten(a.details) as details from dfs.`test2.json` a) b 
> > where b.details.item  = 3
> >
> > {
> >   "header": "my_header_info",
> >   "details": [
> >     {
> >       "item": 1,
> >       "item_name": "name_for_1"
> >     },
> >     {
> >       "item": 2,
> >       "item_name": "name_for_2"
> >     },
> >     {
> >       "item": 3,
> >       "item_name": "name_for_3"
> >     },
> >     {
> >       "item": 4,
> >       "item_name": "name_for_4"
> >     }
> >   ]
> > }
> >
> >
> >
> >
> > This message may contain information that is confidential or privileged.
> > If you are not the intended recipient, please advise the sender 
> > immediately and delete this message. See http://www.blackrock.com/ 
> > corporate/en-us/compliance/email-disclaimers for further information.
> > Please refer to http://www.blackrock.com/corporate/en-us/compliance/
> > privacy-policy for more information about BlackRock’s Privacy Policy.
> >
> > For a list of BlackRock's office addresses worldwide, see 
> > http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
> >
> > © 2017 BlackRock, Inc. All rights reserved.
> >
>
Mime
View raw message