spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Efficient filtering on Spark SQL dataframes with ordered keys
Date Mon, 31 Oct 2016 22:50:09 GMT
well I suppose one  can drop tempTable as below

scala> df.registerTempTable("tmp")

scala> spark.sql("select count(1) from tmp").show
+--------+
|count(1)|
+--------+
|  904180|
+--------+

scala> spark.sql("drop table if exists tmp")
res22: org.apache.spark.sql.DataFrame = []

Also your point

"But the thing is that I don't explicitly cache the tempTables ..".

I believe tempTable is created in-memory and is already cached

HTH


























Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 31 October 2016 at 14:16, Michael David Pedersen <
michael.d.pedersen@googlemail.com> wrote:

> Hi Mich,
>
> Thank you again for your reply.
>
> As I see you are caching the table already sorted
>>
>> val keyValRDDSorted = keyValRDD.sortByKey().cache
>>
>> and the next stage is you are creating multiple tempTables (different
>> ranges) that cache a subset of rows already cached in RDD. The data stored
>> in tempTable is in Hive columnar format (I assume that means ORC format)
>>
>
> But the thing is that I don't explicitly cache the tempTables, and I don't
> really want to because I'll only run a single query on each tempTable. So I
> expect the SQL query processor to operate directly on the underlying
> key-value RDD, and my concern is that this may be inefficient.
>
>
>> Well that is all you can do.
>>
>
> Ok, thanks - that's really what I wanted to get confirmation of.
>
>
>> Bear in mind that these tempTables are immutable and I do not know any
>> way of dropping tempTable to free more memory.
>>
>
> I'm assuming there won't be any (significant) memory overhead of
> registering the temp tables as long as I don't explicitly cache them. Am I
> wrong? In any case I'll be calling sqlContext.dropTempTable once the query
> has completed, which according to the documentation should also free up
> memory.
>
> Cheers,
> Michael
>

Mime
View raw message