spark-user mailing list archives

From Michael Armbrust
Subject Re: DataFrame/JDBC very slow performance
Date Mon, 24 Aug 2015 22:38:39 GMT
> Much appreciated! I am not comparing with "select count(*)" for
> performance, but it was one simple thing I tried to check the performance
> :). I think it now makes sense since Spark tries to extract all records
> before doing the count. I thought having an aggregated function query
> submitted over JDBC/Teradata would let Teradata do the heavy lifting.

We currently only push down filters, since there is a lot of variability in
which types of aggregations various databases support.  You can manually
push down whatever you want by replacing the table name with a subquery
(i.e. "(SELECT ... FROM ...)").
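A minimal sketch of that subquery trick in PySpark, assuming the JDBC driver is on the classpath. The helper names, the alias `pushed`, and the example query are hypothetical; `spark.read.jdbc(url, table, properties=...)` is the standard reader call in recent PySpark versions (older releases expose the same reader through `sqlContext.read`). Most databases want an alias on a derived table, and some (Oracle, for example) do not accept the `AS` keyword there, so adjust the wrapper for your database.

```python
def pushdown_table(sql, alias="pushed"):
    """Wrap a SQL query so it can be passed where the JDBC source expects a table name.

    The alias name is a hypothetical placeholder; any valid identifier works.
    """
    # Drop surrounding whitespace and a trailing semicolon, which is not
    # valid inside a parenthesized subquery.
    return "({}) AS {}".format(sql.strip().rstrip(";").rstrip(), alias)


def read_pushed_down(spark, url, sql, properties=None):
    """Let the database run `sql` and load only its result as a DataFrame.

    `spark` is an existing SparkSession; `url` and `properties` are whatever
    your JDBC driver requires (assumed here, not taken from the thread).
    """
    return spark.read.jdbc(url=url, table=pushdown_table(sql),
                           properties=properties or {})


# Hypothetical usage: the aggregation runs in the database, so Spark
# receives a single row instead of every record in the table.
# df = read_pushed_down(spark, jdbc_url, "SELECT COUNT(*) AS n FROM some_table")
```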

>          - How come my second query for (5B) records didn't return anything
> even after a long processing? If I understood correctly, Spark would try to
> fit it in memory and if not then might use disk space, which I have
> available?

Nothing should be held in memory for a query like this (other than a single
count per partition), so I don't think that is the problem.  There is
likely an error buried somewhere.

>          - Am I supposed to do any Spark related tuning to make it work?
> My main need is to access data from these large table(s) on demand and
> provide aggregated and calculated results much quicker, for that I was
> trying out Spark. Next step I am thinking to export data in Parquet files
> and give it a try. Do you have any suggestions for how to deal with the problem?

Exporting to Parquet will likely be a faster option than trying to query
through JDBC, since we have many more opportunities for parallelism here.
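A rough sketch of a one-time export followed by queries against the Parquet copy, under stated assumptions: the helper names, table, column, bounds, and paths are hypothetical, and the range-partitioned read assumes the table has a numeric column to split on. `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` are standard options of Spark's JDBC source.

```python
def jdbc_read_options(url, table, partition_column, lower, upper, num_partitions):
    """Build options for a JDBC read split into `num_partitions` range queries.

    All argument values are supplied by the caller; nothing here is specific
    to the setup discussed in the thread.
    """
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,  # numeric column used to split the read
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


def export_to_parquet(spark, options, path):
    """Read the table over JDBC once and write it out as Parquet files."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("overwrite").parquet(path)


def count_rows(spark, path):
    """Count rows from the Parquet copy; the files are scanned in parallel."""
    return spark.read.parquet(path).count()


# Hypothetical usage with an existing SparkSession named `spark`:
# opts = jdbc_read_options(jdbc_url, "some_table", "id", 0, 5000000000, 64)
# export_to_parquet(spark, opts, "/data/some_table.parquet")
# count_rows(spark, "/data/some_table.parquet")
```

The bounds only decide how the key range is divided among partitions; rows outside them are still read, they just land in the first or last partition.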
