drill-user mailing list archives

From Robert Hou <r...@mapr.com>
Subject Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"
Date Wed, 15 Aug 2018 20:52:52 GMT
I'm wondering if you can export the json from Postgres to a json document.
And then write it to parquet using Drill.  This link may have some ideas:


https://hashrocket.com/blog/posts/create-quick-json-data-dumps-from-postgresql
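
A rough sketch of what that could look like (table name and file path are
placeholders, and it assumes the dfs plugin can see /tmp, the dfs.tmp
workspace is writable, and store.format is left at its parquet default).
From psql:

  \copy (SELECT row_to_json(t) FROM mytable t) TO '/tmp/mytable.json'

and then in Drill:

  CREATE TABLE dfs.tmp.`mytable` AS
  SELECT * FROM dfs.`/tmp/mytable.json`;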

Thanks.

--Robert

On Wed, Aug 15, 2018 at 10:16 AM, Reid Thompson <Reid.Thompson@omnicell.com>
wrote:

> Thanks for your help.
>
> Yes, it solves the out of memory failure, but then I ran into the json
> issue.
>
> I'm not sure that the ticket represents what I currently need, i.e. at
> this point I'm attempting to do a simple pull of data from PostgreSQL
> and write it to parquet, nothing more.  I think the ticket is about
> being able to directly query content in json(b) columns from Drill.
>
> thanks,
> Reid
>
> On Wed, 2018-08-15 at 19:27 +0300, Vitalii Diravka wrote:
> >
> > Glad to see that it helped and that you've solved the issue.
> >
> > I have seen you asked about PostgreSQL JSONB in another topic,
> > but it looks like it is not supported for now and should be implemented
> > in the context of DRILL-5087
> >
> > Kind regards
> > Vitalii
> >
> >
> > On Wed, Aug 15, 2018 at 3:36 PM Reid Thompson <Reid.Thompson@omnicell.com> wrote:
> > > Vitalii,
> > >
> > > yes. Per https://jdbc.postgresql.org/documentation/head/connect.html
> > > (page lists numerous settings available)
> > >
> > > defaultRowFetchSize = int
> > >
> > > Determines the number of rows fetched in a ResultSet by one trip to
> > > the database. Limiting the number of rows fetched with each trip to
> > > the database avoids unnecessary memory consumption and, as a
> > > consequence, OutOfMemoryException.
> > >
> > > The default is zero, meaning that all rows in the ResultSet will be
> > > fetched at once. Negative numbers are not allowed.
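> > >
> > > For illustration (host and database names here are placeholders), the
> > > property is simply appended to the JDBC connection URL, as in the
> > > plugin config further down the thread:
> > >
> > >   jdbc:postgresql://myhost.mydomain.com/mydb?defaultRowFetchSize=10000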
> > >
> > >
> > > on another topic,
> > > is there any way to have drill properly recognize postgresql's json and
> > > jsonb types?  I have tables with both, and am getting this error
> > >
> > >  org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION
> > >  ERROR: A column you queried has a data type that is not currently
> > >  supported by the JDBC storage plugin. The column's name was actionjson
> > >  and its JDBC data type was OTHER.
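> > >
> > > (One possible workaround, untested here: expose the json/jsonb columns
> > > to Drill as text on the Postgres side, e.g. via a view, so the JDBC
> > > type becomes VARCHAR instead of OTHER.  View and extra column names
> > > below are placeholders; only actionjson is from the error above:
> > >
> > >   CREATE VIEW mytable_for_drill AS
> > >     SELECT id, actionjson::text AS actionjson
> > >     FROM mytable;
> > >
> > > Drill can then read the column as plain text.)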
> > >
> > >
> > > thanks,
> > > reid
> > >
> > > On Wed, 2018-08-15 at 14:44 +0300, Vitalii Diravka wrote:
> > > >
> > > > Hi Reid,
> > > >
> > > > Am I right, defaultRowFetchSize=10000 property in URL solves that OOM issue?
> > > > If so, possibly it can be useful to have this information in the Drill docs [1].
> > > >
> > > > [1] https://drill.apache.org/docs/rdbms-storage-plugin/
> > > >
> > > > Kind regards
> > > > Vitalii
> > > >
> > > >
> > > > On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson <Reid.Thompson@omnicell.com> wrote:
> > > > > using the below parameters in the URL and looking in the defined
> > > > > logfile indicates that the fetch size is being set to 10000, as
> > > > > expected.
> > > > >
> > > > > just to note that it appears that the param defaultRowFetchSize
> > > > > sets the fetch size and signifies that a cursor should be used.
> > > > > It is different from the originally noted defaultFetchSize param,
> > > > > and it appears that postgresql doesn't require the
> > > > > useCursorFetch=true or the defaultAutoCommit=false.
> > > > >
> > > > > ...snip..
> > > > >   "url": "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000",
> > > > > ...snip..
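> > > > >
> > > > > For context, a rough sketch of the full storage plugin config that
> > > > > fragment sits in (plugin name, credentials and the trimmed URL are
> > > > > placeholders; only the url line above is from this thread):
> > > > >
> > > > >   {
> > > > >     "type": "jdbc",
> > > > >     "driver": "org.postgresql.Driver",
> > > > >     "url": "jdbc:postgresql://myhost.mydomain.com/mydb?defaultRowFetchSize=10000",
> > > > >     "username": "myuser",
> > > > >     "password": "mypassword",
> > > > >     "enabled": true
> > > > >   }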
> > > > >
> > > > >
> > > > >
> > > > > On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > > > > > attempting with the below still fails.
> > > > > > looking at pg_stat_activity it doesn't appear that a cursor is
> > > > > > being created.  It's still attempting to pull all the data at
> > > > > > once.
> > > > > >
> > > > > > thanks,
> > > > > > reid
> > > > > > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > > > > > Vitalii,
> > > > > > >
> > > > > > > Ok, thanks, I had found that report, but didn't note the
> > > > > > > option related to defaultAutoCommit.
> > > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > >
> > > > > > >
> > > > > > > so, something along the lines of
> > > > > > >
> > > > > > > ..snip..
> > > > > > >   "url": "jdbc:postgresql://myhost.mydomain.com/ateb?
> useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > > > > > > ..snip..
> > > > > > >
> > > > > > >
> > > > > > > thanks,
> > > > > > > reid
> > > > > > >
> > > > > > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > > > > >
> > > > > > > > Hi Reid,
> > > > > > > >
> > > > > > > > Looks like your issue is similar to DRILL-4177 [1].
> > > > > > > > It was related to a MySQL connection; it looks like a
> > > > > > > > similar issue exists with PostgreSQL.
> > > > > > > > Looking at the Postgres documentation, the code needs to
> > > > > > > > explicitly set the connection autocommit mode to false,
> > > > > > > > e.g. conn.setAutoCommit(false) [2].  For data size of 10
> > > > > > > > million plus, this is a must.
> > > > > > > >
> > > > > > > > You could disable the "Auto Commit" option as a session
> > > > > > > > option [3], or do it within the plugin config URL with the
> > > > > > > > following property: defaultAutoCommit=false [4]
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > > > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > > > > > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > > > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > > > > > >
> > > > > > > > Kind regards
> > > > > > > > Vitalii
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Reid.Thompson@omnicell.com> wrote:
> > > > > > > > > My standalone host is configured with 16GB RAM, 8 cpus.
> > > > > > > > > Using drill-embedded (single host standalone), I am
> > > > > > > > > attempting to pull data from PostgreSQL tables to parquet
> > > > > > > > > files via CTAS.  Smaller datasets work fine, but larger
> > > > > > > > > data sets fail (for example ~11GB) with
> > > > > > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded".
> > > > > > > > > Can someone advise on how to get past this?
> > > > > > > > >
> > > > > > > > > Is there a way to have drill stream this data from
> > > > > > > > > PostgreSQL to parquet files on disk, or does the data set
> > > > > > > > > have to be completely loaded into memory before it can be
> > > > > > > > > written to disk?  The documentation indicates that drill
> > > > > > > > > will spill to disk to avoid memory issues, so I had hoped
> > > > > > > > > that it would be straightforward to extract from the DB
> > > > > > > > > to disk.
> > > > > > > > >
> > > > > > > > > Should I not be attempting this via CTAS?  What are the
> > > > > > > > > other options?
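> > > > > > > > >
> > > > > > > > > For reference, the kind of statement being attempted here,
> > > > > > > > > with the jdbc storage plugin name, schema and table as
> > > > > > > > > placeholders, and dfs.tmp assumed writable with
> > > > > > > > > store.format at its parquet default:
> > > > > > > > >
> > > > > > > > >   CREATE TABLE dfs.tmp.`mytable` AS
> > > > > > > > >   SELECT * FROM pg.public.mytable;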
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > thanks,
> > > > > > > > > reid
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
>
>
