spark-dev mailing list archives

From Wenchen Fan <cloud0...@gmail.com>
Subject Re: Access to live data of cached dataFrame
Date Tue, 21 May 2019 13:35:57 GMT
When you cache a dataframe, you actually cache a logical plan. That's why
re-creating the dataframe doesn't work: Spark finds out the logical plan is
cached and picks the cached data.

You need to uncache the dataframe, or go back to the SQL way:
df.createTempView("abc")
spark.table("abc").cache()
df.show // returns latest data.
spark.table("abc").show // returns cached data.
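
A minimal sketch of the first option (uncaching), written for spark-shell. It
assumes Delta Lake is on the classpath and a table exists at "/data", as in the
thread; the helper name hourlyCounts is hypothetical and only used for
illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a session is already running (as in spark-shell).
val spark = SparkSession.builder().getOrCreate()

// Hypothetical helper: rebuilds the aggregation from the thread on each call.
def hourlyCounts(): DataFrame =
  spark.read.format("delta").load("/data").groupBy(col("event_hour")).count()

val cached = hourlyCounts().cache()
cached.show()              // the first action materializes the cache

cached.unpersist()         // drop the cache entry for this logical plan
val live = hourlyCounts()  // same plan, but nothing cached to match it now
live.show()                // scans the table again

live.cache()               // optionally cache the fresh result
```

The trade-off is the one raised in the thread: the old cached result is gone
once unpersist() is called, so this does not keep cached and live data side by
side the way the temp view approach does.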


On Mon, May 20, 2019 at 3:33 AM Tomas Bartalos <tomas.bartalos@gmail.com>
wrote:

> I'm trying to re-read, but I still get cached data (which is a bit
> confusing). To re-read I'm issuing:
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count
>
> The cache seems to be global, also affecting new dataframes.
>
> So the question is: how should I re-read without losing the cached data
> (without using unpersist)?
>
> As I mentioned, with SQL it's possible: I can create a cached view, so when
> I access the original table I get live data, and when I access the view I get
> cached data.
>
> BR,
> Tomas
>
> On Fri, 17 May 2019, 8:57 pm Sean Owen, <srowen@gmail.com> wrote:
>
>> A cached DataFrame isn't supposed to change, by definition.
>> You can re-read each time or consider setting up a streaming source on
>> the table which provides a result that updates as new data comes in.
>>
>> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos <tomas.bartalos@gmail.com>
>> wrote:
>> >
>> > Hello,
>> >
>> > I have a cached dataframe:
>> >
>> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>> >
>> > I would like to access the "live" data for this data frame without
>> > deleting the cache (using unpersist()). Whatever I do I always get the
>> > cached data on subsequent queries. Even adding a new column to the query
>> > doesn't help:
>> >
>> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
>> >
>> >
>> > I'm able to work around this using a cached SQL view, but I couldn't find
>> > a pure DataFrame solution.
>> >
>> > Thank you,
>> > Tomas
>>
>
