drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From scott <tcots8...@gmail.com>
Subject Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read from the middle of a record
Date Tue, 28 Aug 2018 20:00:40 GMT
Paul, Thank you for the workaround, that worked in my case perfectly !!

Scott

On Tue, Aug 28, 2018 at 12:20 PM Lee, David <David.Lee@blackrock.com> wrote:

> This is a pretty ugly json file.. 568 megs for 7227 records..
>
> => ls -l test.jsonl
> -rw-r--r-- 1 my_login users 568693075 Aug 28 15:15 test.jsonl
>
> There is one difference 7226 vs 7227, but that is from wc..
>
> wc -l is NOT counting last of the file if it does not have end of line
> character
>
> -----Original Message-----
> From: Lee, David
> Sent: Tuesday, August 28, 2018 12:11 PM
> To: user@drill.apache.org
> Subject: RE: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
> from the middle of a record
>
> select count(*) on a jsonl file comes back instantly
>
> /u1/my_login=> wc -l test.jsonl
> 7226 test.jsonl
>
> select count(*) from dfs.`/u1/my_login/test.jsonl`
>
> EXPR$0
> 7227
>
> Overview
> Operator ID     Type    Avg Setup Time  Max Setup Time  Avg Process Time
>       Max Process Time        Min Wait Time   Avg Wait Time   Max Wait
> Time   % Fragment Time % Query Time    Rows    Avg Peak Memory Max Peak
> Memory
> 00-xx-00        JSON_SUB_SCAN   0.000s  0.000s  1.096s  3.287s  0.000s
> 0.181s  0.543s  99.58%  99.58%  7,228   24KB    32KB
> 00-xx-01        PROJECT 0.001s  0.001s  0.000s  0.000s  0.000s  0.000s
> 0.000s  0.00%   0.00%   1       32KB    32KB
> 00-xx-02        STREAMING_AGGREGATE     0.022s  0.022s  0.001s  0.001s
> 0.000s  0.000s  0.000s  0.04%   0.04%   1       64KB    64KB
> 00-xx-03        STREAMING_AGGREGATE     0.040s  0.040s  0.011s  0.011s
> 0.000s  0.000s  0.000s  0.34%   0.34%   7,227   48KB    48KB
> 00-xx-04        PROJECT 0.032s  0.032s  0.001s  0.001s  0.000s  0.000s
> 0.000s  0.04%   0.04%   7,227   16KB    16KB
>
>
> -----Original Message-----
> From: Paul Rogers [mailto:par0328@yahoo.com.INVALID]
> Sent: Tuesday, August 28, 2018 11:23 AM
> To: user@drill.apache.org
> Subject: Re: RE: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
> from the middle of a record
>
> [EXTERNAL EMAIL]
>
>
> Hi Scott,
>
> Bingo. Just tried this very case with the sample file from the previous
> post. Got exactly the failure in the post you provided. I notice that a
> "select *" query returns immediately, but a "count(*)" query hangs for the
> 30+ seconds before it errors out. Mine is only a two-record file, so taking
> 30 seconds to fail is excessive.
>
> Clearly, something is wrong. At the very least, a count(*) should simply
> read all records and discard the data, using exactly the same JSON parser
> as for a "SELECT *" query. That Drill is not doing so suggests that perhaps
> the code is trying to be clever to optimize for the "count(*)" case, and is
> doing so incorrectly.
>
> Here is a clunky workaround: just add a WHERE clause that accepts all
> records:
>
> SELECT COUNT(*) FROM `test.json` WHERE 1 = 1;
> +---------+| EXPR$0  |+---------+| 2       |+---------+
>
> As it turns out, I'm in the (very slow) process of issuing PRs for a
> revised JSON record reader to handle other issues. A side effect of that
> change is that the new implementation does use the same parse path for both
> the "SELECT *" an "SELECT count(*)" paths. So, even if someone cannot fix
> this bug short term, there is a longer-term fix coming.
>
> Thanks,
> - Paul
>
>
>
>     On Tuesday, August 28, 2018, 8:46:11 AM PDT, scott <
> tcots8888@gmail.com> wrote:
>
>  Paul,
> Thanks for prompting the right questions. I went back and took another
> look at my queries. It turns out that there is some condition that causes
> this error when running functions like "count(*)" on the data to cause this
> error, where a normal unqualified select does not. I also ran across this
> article from MapR that led me to conclude Drill just doesn't support it.
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__mapr.com_support_s_article_Apache-2DDrill-2Dcannot-2Dread-2Dfrom-2Dmiddle-2Dof-2Da-2Drecord-3Flanguage-3Den-5FUS&d=DwIFaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTifecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=dxottkFod9H47Nc4z5FFPEXrUSmqQBXSqE_dy2vBbo8&s=ah8AI98Fb49IXVN1GkiBk3dMGzCQH8I8CZZc9dJpm_g&e=
>
> I think if we can confirm exactly which conditions cause the problem, we
> should open a high priority Jira. What do you think?
>
>
> On Mon, Aug 27, 2018 at 11:58 PM Paul Rogers <par0328@yahoo.com.invalid>
> wrote:
>
> > Hi Scott,
> >
> > I created a file, "test.json", using the data from your e-mail:
> >
> > [ { "var1": "foo", "var2":"bar"},{"var1": "fo", "var2": "baz"}]
> >
> > The oldest build I have readily available is Drill 1.13. I ran that as
> > a server, then connected with sqlline as a client. I ran a query:
> >
> > select * from `test.json`;
> > +-------+-------+| var1  | var2  |+-------+-------+| foo  | bar  || fo
> >  | baz  |+-------+-------+
> >
> > I can try with Drill 1.12, once I find and download it. Or, you can
> > try with Drill 1.14 (the latest release.)
> >
> > I do wonder, however, if we are talking about the same thing. My test
> > puts your JSON in a JSON file with ".json" extension so that Drill
> > choses the JSON parser. I'm using default JSON (session) options.
> >
> > Is this what you are doing? Or, is your JSON coming from some other
> > source? Kafka? A field from a CSV file, say?
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Monday, August 27, 2018, 10:31:00 PM PDT, scott <
> >tcots8888@gmail.com> wrote:
> >
> >  Paul,
> > I'm using version 1.12. Can you tell me what version you think that
> >was  fixed in? The ticket I referenced is still open, with no comments.
> >
> > Scott
> >
> > On Mon, Aug 27, 2018 at 5:47 PM Paul Rogers
> > <par0328@yahoo.com.invalid>
> > wrote:
> >
> > > Hi David,
> > >
> > > JSON files are never splittable: there is no single-character way to
> > > find the start of a JSON record within a file.
> > >
> > > Drill is supposed to support two JSON formats: the array format from
> > > the earlier post, and the non-JSON (but very common) list of objects
> > > format
> > in
> > > this example.
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >    On Monday, August 27, 2018, 5:38:32 PM PDT, Lee, David <
> > >David.Lee@blackrock.com> wrote:
> > >
> > >  Get rid of the opening and closing brackets and see if you can turn
> > >the  commas into newlines.. The file needs to be splittable I think
> > >to reduce  memory overhead vs parsing a giant string...
> > >
> > > {"var1": "foo", "var2":"bar"}
> > > {"var1": "fo", "var2": "baz"}
> > > {"var1": "f2o", "var2": "baz2"}
> > > {"var1": "f3o", "var2": "baz3"}
> > > {"var1": "f4o", "var2": "baz4"}
> > > {"var1": "f5o", "var2": "baz5"}
> > >
> > > -----Original Message-----
> > > From: scott [mailto:tcots8888@gmail.com]
> > > Sent: Monday, August 27, 2018 4:59 PM
> > > To: user@drill.apache.org
> > > Subject: Error: DATA_READ ERROR: Error parsing JSON - Cannot read
> > > from
> > the
> > > middle of a record
> > >
> > > [EXTERNAL EMAIL]
> > >
> > >
> > > Hi All,
> > > I'm getting an error querying some of my json files.
> > > The error I'm getting is: Error: DATA_READ ERROR: Error parsing JSON
> > > - Cannot read from the middle of a record. Current token was
> > > START_ARRAY
> > >
> > > The json files are in array format, like [ { "var1": "foo", "var2":
> > > "bar"},{"var1": "fo", "var2": "baz"}]
> > >
> > > I found a ticket that indicates this format is not supported by
> > > Drill
> > yet,
> > > DRILL-1755 <
> > >
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__jira.apache.org_j
> > ira_browse_DRILL-2D1755&d=DwIBaQ&c=zUO0BtkCe66yJvAZ4cAvZg&r=SpeiLeBTif
> > ecUrj1SErsTRw4nAqzMxT043sp_gndNeI&m=G0Hsj4vSq2tBbv1c1dW6zC3pOzA_kSuhlQ
> > oFvFKpdJo&s=Dh8nYVKoOA8nQ3XdDmauSethwq9x4ric2_MsYMcfDdc&e=
> > >
> > > , but I find it hard to believe there is no workaround or solution
> > > since this was reported
> > > 4 years back. Does anyone have a solution or workaround to this
> problem?
> > >
> > > Thanks,
> > > Scott
> > >
> > >
> > > This message may contain information that is confidential or
> privileged.
> > > If you are not the intended recipient, please advise the sender
> > immediately
> > > and delete this message. See
> > > http://www.blackrock.com/corporate/en-us/compliance/email-disclaimer
> > > s
> > for
> > > further information.  Please refer to
> > > http://www.blackrock.com/corporate/en-us/compliance/privacy-policy
> > > for more information about BlackRock’s Privacy Policy.
> > >
> > > For a list of BlackRock's office addresses worldwide, see
> > > http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.
> > >
> > > © 2018 BlackRock, Inc. All rights reserved.
> > >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message