spark-user mailing list archives

From Dhaval Patel <dhaval1...@gmail.com>
Subject Re: DataFrame/JDBC very slow performance
Date Wed, 26 Aug 2015 15:14:24 GMT
Thanks Michael, much appreciated!

> Nothing should be held in memory for a query like this (other than a single
> count per partition), so I don't think that is the problem.  There is
> likely an error buried somewhere.

Regarding your comments above - I don't get any error, just NULL as the
return value. I have dug deeper into the logs etc. but couldn't spot
anything. Do you have any other suggestions for spotting such buried errors?

Thanks,
Dhaval

On Mon, Aug 24, 2015 at 6:38 PM, Michael Armbrust <michael@databricks.com>
wrote:

> Much appreciated! I am not comparing with "select count(*)" for
>> performance; it was just one simple thing I tried in order to gauge
>> performance :). It now makes sense, since Spark extracts all records
>> before doing the count. I thought that submitting an aggregate query over
>> JDBC/Teradata would let Teradata do the heavy lifting.
>>
>
> We currently only push down filters, since there is a lot of variability in
> what types of aggregations various databases support.  You can manually
> push down whatever you want by replacing the table name with a subquery
> (i.e. "(SELECT ... FROM ...)")
>
>        - How come my second query for (5B) records didn't return anything
>> even after a long processing? If I understood correctly, Spark would try to
>> fit it in memory and if not then might use disk space, which I have
>> available?
>>
>
> Nothing should be held in memory for a query like this (other than a
> single count per partition), so I don't think that is the problem.  There
> is likely an error buried somewhere.
>
>
>>          - Am I supposed to do any Spark-related tuning to make it work?
>>
>> My main need is to access data from these large table(s) on demand and
>> provide aggregated and calculated results much quicker; that is why I was
>> trying out Spark. As a next step I am thinking of exporting the data to
>> Parquet files and giving it a try. Do you have any suggestions for dealing
>> with the problem?
>>
>
> Exporting to Parquet will likely be a faster option than trying to query
> through JDBC, since we have many more opportunities for parallelism there.
>
