spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Spark dataset cache vs tempview
Date Sun, 06 Nov 2016 09:16:03 GMT
With regard to the use of a temp view:

createOrReplaceTempView is backed by an in-memory hash table that maps a
table name (a string) to a logical query plan. Fragments of that logical
query plan may or may not be cached. However, registering the view alone
does not materialize any results.
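A minimal sketch of that point, assuming Spark 2.x and a local session. The
object name, app name and the generated "id" column are made up for
illustration; only the view name "tmp" comes from this thread. It registers a
view, asks the catalog whether anything is cached for it, and only then marks
it for caching explicitly.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone wrapper; in spark-shell the `spark` session already exists.
object TempViewIsLazy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local mode, for illustration only
      .appName("tempview-sketch")
      .getOrCreate()

    // Stand-in data; in the thread this would be the DataFrame read from Parquet.
    val df = spark.range(0, 1000).toDF("id")

    // Registers a name -> logical plan mapping. No job runs here.
    df.createOrReplaceTempView("tmp")
    println(s"cached after register: ${spark.catalog.isCached("tmp")}")

    // Caching is a separate, explicit step. cacheTable only marks the plan;
    // the data is materialized by the first action that touches it.
    spark.catalog.cacheTable("tmp")
    println(s"cached after cacheTable: ${spark.catalog.isCached("tmp")}")
    spark.sql("select count(*) from tmp").show()

    // Clean up: release the cached data, then drop the view.
    spark.catalog.uncacheTable("tmp")
    spark.sql("drop view if exists tmp")
    spark.stop()
  }
}
```

The view is just a convenient name for SQL access. Whether the data behind it
is cached is decided separately.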

If your dataset is very large, one option is to create a temp view from
that DF and use the view in your processing. My assumption here is that the
underlying data does not change, in other words that the temp view will
always be valid. You can of course drop the temp view when you are done:

scala> df.toDF.createOrReplaceTempView("tmp")

scala> spark.sql("drop view if exists tmp")

Check the Storage page of the Spark UI (port 4040 by default) to see what
is cached.

Just try both options to see which one performs better. Option 2 may be
the better choice.
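For reference, Option 2 from the quoted message (cache on first use) might
look roughly like the sketch below. It assumes Spark 2.x in local mode; the
temporary Parquet path, the object name and the generated data are
placeholders, not anything from the original application. Since the data is
said not to fit in memory, the sketch asks for MEMORY_AND_DISK explicitly so
that partitions which do not fit are spilled to local disk.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical standalone wrapper around what would be a few spark-shell lines.
object CacheOnFirstUse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local mode, for illustration only
      .appName("cache-on-first-use-sketch")
      .getOrCreate()

    // Placeholder Parquet data so the sketch is self-contained.
    val path = Files.createTempDirectory("parquet-sketch").resolve("data").toString
    spark.range(0, 100000).toDF("id").write.parquet(path)

    // Mark the dataset for caching at the point of first use.
    val ds = spark.read.parquet(path).persist(StorageLevel.MEMORY_AND_DISK)

    // The first action populates the cache; later ones can reuse it.
    val total = ds.count()
    val evens = ds.filter("id % 2 = 0").count()

    // A temp view over the same dataset refers to the same plan, so SQL
    // against the view can use the cached data as well.
    ds.createOrReplaceTempView("tmp")
    val viaView = spark.sql("select count(*) from tmp").first().getLong(0)

    println(s"total=$total evens=$evens viaView=$viaView")

    // Release the cached blocks once the dataset is no longer needed.
    ds.unpersist()
    spark.sql("drop view if exists tmp")
    spark.stop()
  }
}
```

While this runs, the Storage page of the UI should show how much of the
dataset is held in memory and how much went to disk.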

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 November 2016 at 03:44, Rohit Verma <rohit.verma@rokittech.com> wrote:

> I have a parquet file which I am reading at least 4-5 times within my
> application. I was wondering what the most efficient thing to do is.
>
> Option 1. While writing the parquet file, immediately read it back into a
> dataset and call cache. I am assuming that by doing an immediate read I
> might use some existing hdfs/spark cache left over from the write process.
>
> Option 2. In my application, when I need the dataset for the first time,
> call cache then.
>
> Option 3. After the parquet file has been written, create a temp view
> out of it. In all subsequent usage, use the view.
>
> I am also not very clear about the efficiency of reading from a tempview
> vs a parquet dataset.
>
> FYI, the datasets I am referring to cannot fit entirely in memory. They
> are very large.
>
> Regards..
> Rohit
