drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lee, David" <David....@blackrock.com>
Subject RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record
Date Tue, 28 Aug 2018 19:11:20 GMT
select count(*) on a jsonl file comes back instantly

/u1/my_login=> wc -l test.jsonl
7226 test.jsonl

select count(*) from dfs.`/u1/my_login/test.jsonl`  

EXPR$0
7227

Overview
Operator ID	Type	Avg Setup Time	Max Setup Time	Avg Process Time	Max Process Time	Min Wait
Time	Avg Wait Time	Max Wait Time	% Fragment Time	% Query Time	Rows	Avg Peak Memory	Max Peak
Memory
00-xx-00	JSON_SUB_SCAN	0.000s	0.000s	1.096s	3.287s	0.000s	0.181s	0.543s	99.58%	99.58%	7,228
24KB	32KB
00-xx-01	PROJECT	0.001s	0.001s	0.000s	0.000s	0.000s	0.000s	0.000s	0.00%	0.00%	1	32KB	32KB
00-xx-02	STREAMING_AGGREGATE	0.022s	0.022s	0.001s	0.001s	0.000s	0.000s	0.000s	0.04%	0.04%
1	64KB	64KB
00-xx-03	STREAMING_AGGREGATE	0.040s	0.040s	0.011s	0.011s	0.000s	0.000s	0.000s	0.34%	0.34%
7,227	48KB	48KB
00-xx-04	PROJECT	0.032s	0.032s	0.001s	0.001s	0.000s	0.000s	0.000s	0.04%	0.04%	7,227	16KB	16KB


-----Original Message-----
From: Paul Rogers [mailto:par0328@yahoo.com.INVALID] 
Sent: Tuesday, August 28, 2018 11:23 AM
To: user@drill.apache.org
Subject: Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle
of a record

[EXTERNAL EMAIL]


Hi Scott,

Bingo. Just tried this very case with the sample file from the previous post. Got exactly
the failure in the post you provided. I notice that a "select *" query returns immediately,
but a "count(*)" query hangs for the 30+ seconds before it errors out. Mine is only a two-record
file, so taking 30 seconds to fail is excessive.

Clearly, something is wrong. At the very least, a count(*) should simply read all records
and discard the data, using exactly the same JSON parser as for a "SELECT *" query. That Drill
is not doing so suggests that perhaps the code is trying to be clever to optimize for the
"count(*)" case, and is doing so incorrectly.

Here is a clunky workaround: just add a WHERE clause that accepts all records:

SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
+---------+| EXPR$0  |+---------+| 2       |+---------+

As it turns out, I'm in the (very slow) process of issuing PRs for a revised JSON record reader
to handle other issues. A side effect of that change is that the new implementation does use
the same parse path for both the "SELECT *" an "SELECT count(*)" paths. So, even if someone
cannot fix this bug short term, there is a longer-term fix coming.

Thanks,
- Paul



    On Tuesday, August 28, 2018, 8:46:11 AM PDT, scott <tcots8888@gmail.com> wrote:

 Paul,
Thanks for prompting the right questions. I went back and took another look at my queries.
It turns out that there is some condition that causes this error when running functions like
"count(*)" on the data to cause this error, where a normal unqualified select does not. I
also ran across this article from MapR that led me to conclude Drill just doesn't support
it.

https://urldefense.proofpoint.com/v2/url?u=https-3A__mapr.com_support_s_article_Apache-2DDrill-2Dcannot-2Dread-2Dfrom-2Dmiddle-2Dof-2Da-2Drecord-3Flanguage-3Den-5FUS&d=DwIFaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTifecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=dxottkFod9H47Nc4z5FFPEXrUSmqQBXSqE_dy2vBbo8&s=ah8AI98Fb49IXVN1GkiBk3dMGzCQH8I8CZZc9dJpm_g&e=

I think if we can confirm exactly which conditions cause the problem, we should open a high
priority Jira. What do you think?


On Mon, Aug 27, 2018 at 11:58 PM Paul Rogers <par0328@yahoo.com.invalid>
wrote:

> Hi Scott,
>
> I created a file, "test.json", using the data from your e-mail:
>
> [ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]
>
> The oldest build I have readily available is Drill 1.13. I ran that as 
> a server, then connected with sqlline as a client. I ran a query:
>
> select * from `test.json`;
> +-------+-------+| var1  | var2  |+-------+-------+| foo  | bar  || fo
>  | baz  |+-------+-------+
>
> I can try with Drill 1.12, once I find and download it. Or, you can 
> try with Drill 1.14 (the latest release.)
>
> I do wonder, however, if we are talking about the same thing. My test 
> puts your JSON in a JSON file with ".json" extension so that Drill 
> choses the JSON parser. I'm using default JSON (session) options.
>
> Is this what you are doing? Or, is your JSON coming from some other 
> source? Kafka? A field from a CSV file, say?
>
> Thanks,
> - Paul
>
>
>
>    On Monday, August 27, 2018, 10:31:00 PM PDT, scott <  
>tcots8888@gmail.com> wrote:
>
>  Paul,
> I'm using version 1.12. Can you tell me what version you think that 
>was  fixed in? The ticket I referenced is still open, with no comments.
>
> Scott
>
> On Mon, Aug 27, 2018 at 5:47 PM Paul Rogers 
> <par0328@yahoo.com.invalid>
> wrote:
>
> > Hi David,
> >
> > JSON files are never splittable: there is no single-character way to 
> > find the start of a JSON record within a file.
> >
> > Drill is supposed to support two JSON formats: the array format from 
> > the earlier post, and the non-JSON (but very common) list of objects 
> > format
> in
> > this example.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David <  
> >David.Lee@blackrock.com> wrote:
> >
> >  Get rid of the opening and closing brackets and see if you can turn 
> >the  commas into newlines.. The file needs to be splittable I think 
> >to reduce  memory overhead vs parsing a giant string...
> >
> > {"var1": "foo", "var2":"bar"}
> > {"var1": "fo", "var2": "baz"}
> > {"var1": "f2o", "var2": "baz2"}
> > {"var1": "f3o", "var2": "baz3"}
> > {"var1": "f4o", "var2": "baz4"}
> > {"var1": "f5o", "var2": "baz5"}
> >
> > -----Original Message-----
> > From: scott [mailto:tcots8888@gmail.com]
> > Sent: Monday, August 27, 2018 4:59 PM
> > To: user@drill.apache.org
> > Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read 
> > from
> the
> > middle of a record
> >
> > [EXTERNAL EMAIL]
> >
> >
> > Hi All,
> > I'm getting an error querying some of my json files.
> > The error I'm getting is: Error: DATA_READ ERROR: Error parsing JSON 
> > - Cannot read from the middle of a record. Current token was 
> > START_ARRAY
> >
> > The json files are in array format, like [ { "var1": "foo", "var2":
> > "bar"},{"var1": "fo", "var2": "baz"}]
> >
> > I found a ticket that indicates this format is not supported by 
> > Drill
> yet,
> > DRILL-1755 <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__jira.apache.org_j
> ira_browse_DRILL-2D1755&d=DwIBaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTif
> ecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=G0Hsj4vSq2tBbv1c1dW6zC3pOzA_kSuhlQ
> oFvFKpdJo&s=Dh8nYVKoOA8nQ3XdDmauSethwq9x4ric2_MsYMcfDdc&e=
> >
> > , but I find it hard to believe there is no workaround or solution 
> > since this was reported
> > 4 years back. Does anyone have a solution or workaround to this problem?
> >
> > Thanks,
> > Scott
> >
> >
> > This message may contain information that is confidential or privileged.
> > If you are not the intended recipient, please advise the sender
> immediately
> > and delete this message. See
> > http://www.blackrock.com/corporate/en-us/compliance/email-disclaimer
> > s
> for
> > further information.  Please refer to 
> > http://www.blackrock.com/corporate/en-us/compliance/privacy-policy 
> > for more information about BlackRock’s Privacy Policy.
> >
> > For a list of BlackRock's office addresses worldwide, see 
> > http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
> >
> > © 2018 BlackRock, Inc. All rights reserved.
> >
Mime
View raw message