drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reid Thompson <Reid.Thomp...@omnicell.com>
Subject Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"
Date Tue, 14 Aug 2018 13:17:01 GMT
using the below parameters in the URL and looking in the defined logfile
indicates that the fetch size is being set to 10000, as expected. 

just to note that it appears that the param defaultRowFetchSize sets the
fetch size and signifies that a cursor should be used.  It is different
from the originally noted defaultFetchSize param, and it appears that
postgresql doesn't require the useCursorFetch=true or the defaultAutoCommit=false.

...snip..
  "url": "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000",
...snip..



On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> attempting with the below still fails.
> looking at pg_stat_activity it doesn't appear that a cursor is being
> created.  It's still attempting to pull all the data at once.
> 
> thanks,
> reid
> On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > Vitalii,
> > 
> > Ok, thanks, I had found that report, but didn't note the option related
> > to defaultAutoCommit. 
> > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > 
> > 
> > so, something along the lines of
> > 
> > ..snip..
> >   "url": "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > ..snip..
> > 
> > 
> > thanks,
> > reid
> > 
> > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > [EXTERNAL SOURCE]
> > >  
> > > Hi Reid,
> > > 
> > > Look like your issue is similar to DRILL-4177 [1].
> > > It was related to MySQL connection. Looks like the similar issue is with PostgreSQL.
> > > Looking at the Postgres documentation, the code needs to explicitly set the
connection autocommit mode 
> > > to false e.g. conn.setAutoCommit(false) [2]. For data size of 10 million plus,
this is a must.
> > > 
> > > You could disable "Auto Commit" option as session option [3] 
> > > or to do it within plugin config URL with the following property: defaultAutoCommit=false
[4]
> > > 
> > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > 
> > > Kind regards
> > > Vitalii
> > > 
> > > 
> > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Reid.Thompson@omnicell.com>
wrote:
> > > > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > > > drill-embedded (single host standalone), I am attempting to pull data
> > > > from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> > > > fine, but larger data sets fail (for example ~11GB) with
> > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can someone
> > > > advise on how to get past this?
> > > > 
> > > > Is there a way to have drill stream this data from PostgreSQL to parquet
> > > > files on disk, or does the data set have to be completely loaded into
> > > > memory before it can be written to disk?  The documentation indicates
> > > > that drill will spill to disk to avoid memory issues, so I had hoped
> > > > that it would be straightforward to extract from the DB to disk.
> > > > 
> > > > Should I not be attempting this via CTAS?  What are the other options?
> > > > 
> > > > 
> > > > thanks,
> > > > reid
> > > > 
> > > > 
> > > > 
> > > > 
> > 
> > 
> 
> 



Mime
View raw message