drill-user mailing list archives

From "Lee, David" <David....@blackrock.com>
Subject RE: Performance issue
Date Thu, 14 Feb 2019 02:09:19 GMT
These days..

I use Python to read a jsonl file line by line into Python dictionaries.
Convert the data (flatten, etc.) into a tabular record set.
Write the tabular data into parquet.
Read the parquet using tools like Drill, etc.
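
A rough sketch of that pipeline (the flatten step is stdlib-only; the parquet write, shown as a comment, assumes pandas and pyarrow are installed; file names are placeholders):

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

def jsonl_to_rows(path):
    """Stream a jsonl file one record at a time to keep memory flat."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield flatten(json.loads(line))

# Writing the tabular rows to parquet would then be something like
# (assumes pandas + pyarrow):
#   import pandas as pd
#   pd.DataFrame(jsonl_to_rows("people.jsonl")).to_parquet("people.parquet")
```

Because the file is consumed one line at a time, memory stays proportional to a single record rather than the whole file.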

JSON support is lacking in Drill and other tools because these SQL query tools are designed
around structured data. JSON is semi-structured, with a lot of gotchas.

You need a lot more flexibility to handle exceptions.

If you have in one json file something like:

[
{"name": "bob", "born": "1945-01-23", "died": "2019-04-03"},
{"name": "billy", "born": "1945-01-23", "died": "2019-04-03"},
{"name": "brian", "born": "1945-01-23", "died": "2019-04-03"}
]

This looks pretty clean and structured, but what if in a second file you have:

[
{"name": "moe", "born": "1945-01-23", "died": null},
{"name": "curly", "born": "1945-01-23", "died": null},
{"name": "larry", "born": "1945-01-23", "died": null}
]

What happens if you have a file with a lot of missing or extra data? Orphans..

[
{"name": "oliver", "born": null, "died": null, "adopted": null },
{"name": "annie", "born": null, "died": null, "adopted": {"by": "Daddy Warbucks", "when":
"1934-01-04"}}
]

Changing the memory structure on the fly with copying, transforms, etc. is very expensive,
especially if this type of operation is repeated.
Better to define up front what is possible. We really need to include something like https://json-schema.org/
in the mix.
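
As a rough illustration of the "define up front" idea, here is a stdlib-only check in the spirit of json-schema. This is a hand-rolled, simplified stand-in (a real setup would use the json-schema spec with a proper validator library); the field names follow the examples above:

```python
# Required keys and the types each value may take
# (type(None) listed where the field is nullable).
PERSON_SCHEMA = {
    "name": (str,),
    "born": (str, type(None)),
    "died": (str, type(None)),
}

def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for key, types in schema.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], types):
            problems.append(f"bad type for {key}: {type(record[key]).__name__}")
    for key in record:
        if key not in schema:
            problems.append(f"unexpected field: {key}")
    return problems
```

Validating each record up front, before it reaches the query engine, means the memory layout never has to change mid-stream: orphan fields like "adopted" get flagged instead of forcing a schema change.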

Streaming JSON data and dealing with records one at a time is a lot faster and more memory efficient..

-----Original Message-----
From: PRAVEEN DEVERACHETTY <praveendk@gmail.com> 
Sent: Wednesday, February 13, 2019 5:41 PM
To: user@drill.apache.org
Subject: Re: Performance issue



Hi Sorabh,

Here is the sample query passed using the REST API. The query is sent as the request body to
submit a job (REST). The convert_from function converts the JSON string into JSON objects,
and then FLATTEN is applied to the result set defined in the query.
Please let me know if anything else is required. Our main gao

SELECT
  'EMPSLS' as recordType,
  ems.record.revenueCenterName as rvcName,
  ems.record.revenueCenterNum as rvcNum,
  ems.record.revenueCenterMasterName as rvcMasterName
FROM (
  SELECT FLATTEN(t.jdata) as record FROM (
    SELECT convert_from(
      '[{"revenueCenterName":"Restaurant","revenueCenterNum":1,"revenueCenterMasterName":null,"revenueCenterMasterNum":null},
        {"revenueCenterName":"Restaurant","revenueCenterNum":2,"revenueCenterMasterName":null,"revenueCenterMasterNum":null}]',
      'json') as jdata
  ) as t
) ems
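
[Editor's note: for context, Drill exposes a REST endpoint for submitting queries like the one above. A minimal sketch of how the body is typically posted, using only the stdlib; the host, port 8047, and the /query.json path are Drill's defaults, so verify them against your deployment:]

```python
import json
from urllib import request

DRILL_URL = "http://localhost:8047/query.json"  # default Drill web port

def build_query_payload(sql):
    """Drill's REST API expects a JSON body with queryType and query."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")

def submit(sql):
    """POST the query to the Drillbit and return the decoded JSON result."""
    req = request.Request(
        DRILL_URL,
        data=build_query_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Note that the whole query, data included, travels as one string in the POST body, which is why large inline payloads inflate memory on the server side.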

On Wed, Feb 13, 2019 at 11:18 PM Sorabh Hamirwasia <shamirwasia@mapr.com>
wrote:

> Hi Praveen,
> I am probably missing something here because I don't understand how you
> are feeding data to Drill in memory using the REST API. As you mentioned,
> data has to be stored on disk or in some DB for Drill to fetch it. Can
> you please share the query profile for your query?
>
> P.S. Attachments are not allowed through this mailing list. Would be 
> great if you can upload the profile somewhere (dropbox, gDrive, etc) 
> and share the link with us.
>
> Thanks,
> Sorabh
>
> On Wed, Feb 13, 2019 at 7:58 AM PRAVEEN DEVERACHETTY 
> <praveendk@gmail.com>
> wrote:
>
> > As per my understanding of Apache Drill, it is based on the file store
> > only. Please help me with whether I can create any plugins for the
> > following use case:
> > 1. Create a JSON object and push it to Apache Drill in memory (cache). I
> > can create the JSON object in Java; is any API available from Drill to
> > push this file into memory?
> > 2. Read the file from memory (cache) and execute queries using that JSON
> > object from memory, instead of from the file system or any data store.
> >
> > Is this possible to do with Apache Drill? It would be really helpful to
> > understand the changes I need to make.
> >
> > thanks,
> > Praveen
> >
> > On Wed, Feb 13, 2019 at 11:46 AM PRAVEEN DEVERACHETTY <
> praveendk@gmail.com
> > >
> > wrote:
> >
> > > Hi Sorabh, the data is in JSON string format, sent over the REST API.
> > > We use the convert_from function to convert the JSON string to a JSON
> > > array and FLATTEN the resulting array into multiple rows. The data is
> > > not stored on disk; it is all in memory.
> > >
> > > Thanks,
> > > Praveen
> > >
> > > On Tue, Feb 12, 2019 at 11:49 PM Sorabh Hamirwasia < 
> > > sohami.apache@gmail.com> wrote:
> > >
> > >> Hi Praveen,
> > >> Can you also share the schema of your entire dataset and what format
> > >> it's stored in?
> > >>
> > >> Thanks,
> > >> Sorabh
> > >>
> > >> On Tue, Feb 12, 2019 at 10:02 AM Kunal Khatua <kunal@apache.org>
> wrote:
> > >>
> > >> > You'll need to edit the memory settings in DRILL_HOME/conf/drill-env.sh.
> > >> > I suspect that your 5MB of JSON data might contain a lot of objects,
> > >> > which need to be serialized in memory.
> > >> >
> > >> > FLATTEN has the problem that it replicates the parent data for each
> > >> > child node that is being flattened into a row... so the resulting data
> > >> > being constructed in memory can grow significantly.
> > >> > One way to work around this (not elegant, but worth trying) would be to
> > >> > generate intermediate flattened data, write temporary tables (if not
> > >> > using the WebUI), and keep flattening out those records until you have
> > >> > a fully flattened dataset to work with directly.
> > >> >
> > >> > On 2/11/2019 10:37:58 PM, PRAVEEN DEVERACHETTY 
> > >> > <praveendk@gmail.com
> >
> > >> > wrote:
> > >> > Thanks a lot, Kunal. I am looking into that. I have one observation.
> > >> >
> > >> > Even without FLATTEN, I tried to run a query of size 5MB, and it is
> > >> > taking 5GB of heap. How do I control heap? Are there any settings I can
> > >> > modify? I am reading a lot, but nothing is working for me. I modified
> > >> > memory parameters based on the documentation, but it is not working
> > >> > yet. It would be really helpful if I could get some help in this
> > >> > regard. Thanks in advance.
> > >> >
> > >> > Regards
> > >> > Praveen
> > >> >
> > >> > On Tue, Feb 12, 2019 at 11:18 AM Kunal Khatua wrote:
> > >> >
> > >> > > This is a good starting point for understanding LATERAL-UNNEST and
> > >> > > how it compares to the FLATTEN operator.
> > >> > >
> > >> > > https://drill.apache.org/docs/lateral-join/
> > >> > >
> > >> > >
> > >> > > On 2/11/2019 9:03:42 PM, PRAVEEN DEVERACHETTY wrote:
> > >> > > Thanks Kunal.
> > >> > > I am not getting how to use LATERAL-UNNEST, as the dataset does not
> > >> > > have child rows. All data is in an array of JSON objects (as
> > >> > > mentioned below). There are three JSON objects separated by commas
> > >> > > and enclosed in square brackets.
> > >> > >
> > >> > > [{"Location":"100","FirstName":"test1"},{"Location":"100","FirstName":"test2"},{"Location":"101","FirstName":"test3"}]
> > >> > >
> > >> > > We are using Drill from Java, through a REST invocation, not using
> > >> > > JSON files. All data is sent over POST as a string. We are using the
> > >> > > convert_from function in the query to convert it into JSON objects.
> > >> > > As we are sending an array of JSON objects, we use the FLATTEN
> > >> > > operator to convert it into multiple rows. Is there any way to avoid
> > >> > > FLATTEN? We see a huge spike for 54MB of data, going to 24GB and
> > >> > > still failing with a heap error; not sure what is wrong. Can I use
> > >> > > FLATTEN on the entire data set? There are almost 54K records getting
> > >> > > FLATTENED.
> > >> > >
> > >> > > example query: 1) first convert into an array of JSON objects,
> > >> > > 2) FLATTEN to convert into multiple rows:
> > >> > >
> > >> > > select ems.* from (select flatten(t.jdata) as record from (select
> > >> > > convert_from('[{"Location":"100","FirstName":"test1"},{"Location":"100","FirstName":"test2"},{"Location":"101","FirstName":"test3"}..]')
> > >> > > as jdata) as t) ems
> > >> > >
> > >> > >
> > >> > > On Sat, Feb 9, 2019 at 1:37 AM Kunal Khatua wrote:
> > >> > >
> > >> > > > The memory (heap) would climb as it tries to flatten the JSON data.
> > >> > > > Have you tried looking at Drill's LateralJoin-Unnest feature? It
> > >> > > > was meant to address memory issues for some use cases of the
> > >> > > > FLATTEN operator.
> > >> > > >
> > >> > > > On 2/8/2019 5:17:01 AM, PRAVEEN DEVERACHETTY wrote:
> > >> > > > I am running a query with UNION ALL, as below:
> > >> > > >
> > >> > > > select
> > >> > > > from ( select FLATTEN(t.jdata) as record from ((select
> > >> > > > convert_from(json string, json) union all (select
> > >> > > > convert_from(json_string, json) union all ...
> > >> > > > ) as jdata) ) as t) ems
> > >> > > >
> > >> > > > The reason for using UNION ALL is that we are invoking a call
> > >> > > > using a REST app, and there is a limitation of 20,000 when we use
> > >> > > > the convert_from function. Our heap size is 8GB; the server is
> > >> > > > 8-core. Profiling shows this particular query spikes from 100MB to
> > >> > > > 8GB continuously. Is there anything I am doing wrong?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Praveen
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

