spark-dev mailing list archives

From Tomas Bartalos <tomas.barta...@gmail.com>
Subject Re: Access to live data of cached dataFrame
Date Sun, 19 May 2019 19:32:55 GMT
I'm trying to re-read, but I'm still getting cached data (which is a bit
confusing). To re-read, I'm issuing:
spark.read.format("delta").load("/data").groupBy(col("event_hour")).count

The cache seems to be global, also influencing new DataFrames.

So the question is: how should I re-read without losing the cached data
(that is, without using unpersist)?

As I mentioned, with SQL it's possible: I can create a cached view, so when I
access the original table I get live data, and when I access the view I get
cached data.
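
A minimal sketch of that SQL workaround, assuming the Delta data is
registered in the catalog under the hypothetical table name `events` and
using a made-up view name, `events_by_hour_cached`:

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell `spark` is predefined; getOrCreate() returns that same session.
val spark = SparkSession.builder().getOrCreate()

// Assumption: `events` is a hypothetical catalog table backed by the Delta data.
// CACHE TABLE ... AS SELECT creates a temporary view and caches its result.
spark.sql("""
  CACHE TABLE events_by_hour_cached AS
  SELECT event_hour, count(*) AS cnt
  FROM events
  GROUP BY event_hour
""")

// Reads the cached counts through the view.
spark.table("events_by_hour_cached").show()

// Accesses the original table, which is where I see live data.
spark.table("events").show()
```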

BR,
Tomas

On Fri, 17 May 2019, 8:57 pm Sean Owen, <srowen@gmail.com> wrote:

> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
>
> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos <tomas.bartalos@gmail.com>
> wrote:
> >
> > Hello,
> >
> > I have a cached dataframe:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
> >
> > I would like to access the "live" data for this data frame without
> > deleting the cache (using unpersist()). Whatever I do I always get the
> > cached data on subsequent queries. Even adding a new column to the query
> > doesn't help:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
> >
> >
> > I'm able to work around this using a cached SQL view, but I couldn't find a
> > pure DataFrame solution.
> >
> > Thank you,
> > Tomas
>
