drill-user mailing list archives

From Robert Hou <r...@mapr.com>
Subject Re: Json to Parquet
Date Fri, 08 Mar 2019 23:46:19 GMT
If you know the schema ahead of time, you can try creating a view.  Using
your xyz.json example:

create view newxyz as
select cast(address as varchar) address, cast(zipcode as int) zipcode
from dfs.`/xyz.json`;

select * from newxyz;
+---------------------+----------+
|       address       | zipcode  |
+---------------------+----------+
| 10 Downing Street   | null     |
+---------------------+----------+


Then create your parquet file using your view.  Or you can try creating a
parquet table directly:

create table `newxyz.parquet` as
select cast(address as varchar) address, cast(zipcode as int) zipcode
from dfs.`/xyz.json`;
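
For comparison, here is a minimal Python sketch of the same idea (fix the types up front rather than letting them be inferred), using only the standard library. The SCHEMA mapping and the apply_schema helper are hypothetical names made up for illustration, the field names follow the sample JSON quoted below (zip_code), and writing the typed rows out to Parquet is left out.

```python
import json

# Hypothetical explicit schema: field name -> Python type to cast to.
SCHEMA = {"address": str, "zip_code": int}


def apply_schema(record, schema):
    """Return a new dict with every schema field present and cast.

    Missing fields and JSON nulls both become None, so each column keeps
    a single declared type no matter what an individual file contains.
    """
    typed = {}
    for field, cast in schema.items():
        value = record.get(field)
        typed[field] = None if value is None else cast(value)
    return typed


# The two sample files from the quoted message, inlined as strings.
abc = json.loads('[{"address": "1600 Pennsylvania Avenue", "zip_code": "20500"}]')
xyz = json.loads('[{"address": "10 Downing Street ", "zip_code": null}]')

rows = [apply_schema(record, SCHEMA) for record in abc + xyz]
for row in rows:
    print(row)
```

Because the type comes from the schema and not from the data, the null zip code in the second file no longer forces a guess.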

Thanks.

--Robert

On Fri, Mar 8, 2019 at 10:09 AM Lee, David <David.Lee@blackrock.com> wrote:

> Nope, which is why I use Python with pyarrow to convert JSON to Parquet
> these days. Hopefully arrow / parquet-cpp will support parquet dictionaries
> within a couple of months.
>
> https://issues.apache.org/jira/browse/ARROW-1644
>
> All these types of JSON structures are problematic for any JSON
> schema-learning engine like Drill.
>
> File ABC.json is fine, but...
> [
> {"address": "1600 Pennsylvania Avenue", "zip_code": "20500"}
> ]
>
> File XYZ.json will bomb:
> [
> {"address": "10 Downing Street ", "zip_code": null}
> ]
>
> There is no way to figure out what datatype zip_code is in the second
> file. I think Drill by default will save this as a BOOLEAN type, and now
> you have a zip code column with string and boolean values, which creates
> chaos and will result in an exception.
>
> The only clean way to solve these problems is to stop using schema
> learning and inject a schema (https://json-schema.org/) into the query
> somehow.
>
> I just gave up trying to use Drill to work with JSON and now use Python to
> read JSON and generate Parquet datasets, which I can then use in Drill, etc.
>
>
> -----Original Message-----
> From: Dweep Sharma <dweep.sharma@redbus.com>
> Sent: Friday, March 8, 2019 1:56 AM
> To: user@drill.apache.org
> Subject: reg: Json to Parquet
>
> Hi,
>
> I have a CTAS query that converts JSON to Parquet format, and it sometimes
> fails with this error:
>
>  (org.apache.parquet.schema.InvalidSchemaException) Cannot write a schema with an empty group: optional group address
>
> I guess this happens when Drill encounters a field like "address": {}
> (an empty object).
>
> Is there a way to handle this?
>
> Thanks,
> -Dweep
>
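
For the empty-group error in the original question, one possible workaround is to strip empty objects from the JSON before running the CTAS. The following is only a sketch, under the assumption that the guess above about the cause is right and that dropping such fields is acceptable for the data; drop_empty_objects and the sample record are hypothetical.

```python
import json


def drop_empty_objects(value):
    """Recursively remove dict entries whose value is an empty object ({}).

    An object that becomes empty after its own children are pruned is
    removed as well. Empty objects that are list elements are left in
    place; only dict entries are dropped.
    """
    if isinstance(value, dict):
        cleaned = {}
        for key, child in value.items():
            child = drop_empty_objects(child)
            if isinstance(child, dict) and not child:
                continue  # skip "key": {} so no empty group is written
            cleaned[key] = child
        return cleaned
    if isinstance(value, list):
        return [drop_empty_objects(item) for item in value]
    return value


# Hypothetical record containing the kind of field mentioned in the question.
record = json.loads('{"name": "depot", "address": {}, "meta": {"tags": {}, "id": 7}}')
cleaned = drop_empty_objects(record)
print(json.dumps(cleaned))  # prints {"name": "depot", "meta": {"id": 7}}
```

Running each record through a filter like this before the conversion means the writer never sees a field such as "address": {}.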
