spark-user mailing list archives

From András Kolbert <kolbertand...@gmail.com>
Subject Re: Use case advice
Date Sat, 09 Jan 2021 20:30:10 GMT
Sorry if my terminology is misleading.

What I meant by "driver only" is using a local pandas dataframe (collecting
the data to the driver) and updating that after each batch, instead of
holding this data in a distributed Spark dataframe.

For example, we have a dataframe with all users and their corresponding
latest activity timestamps. After each streaming batch, aggregations are
performed on the cluster and the result is collected to the driver to update
the latest activity timestamps of the affected subset of users.
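
Roughly, what I have in mind is the pattern below. This is only a minimal
sketch; the source, column names and the aggregation are placeholders, not
my actual job:

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("driver-side-state").getOrCreate()

# Driver-local state: one row per user with the latest activity timestamp.
user_state = pd.DataFrame(columns=["user_id", "last_activity"]).set_index("user_id")

def update_state(batch_df, batch_id):
    global user_state
    # Aggregate on the cluster first, so only the few thousand affected
    # rows are collected to the driver.
    updates = (batch_df
               .groupBy("user_id")
               .agg(F.max("event_time").alias("last_activity"))
               .toPandas()
               .set_index("user_id"))
    # Upsert: users present in `updates` win, all other users keep their
    # previous value.
    user_state = updates.combine_first(user_state)

# Placeholder streaming source shaped into (user_id, event_time) pairs.
events = (spark.readStream.format("rate").load()
          .withColumn("user_id", F.col("value") % 1000)
          .withColumnRenamed("timestamp", "event_time"))

query = events.writeStream.foreachBatch(update_state).start()

The appeal is that the foreachBatch function runs on the driver, so the
pandas dataframe can be updated in place without any distributed state; the
obvious downside is that the whole table has to fit in driver memory.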



On Sat, 9 Jan 2021, 6:18 pm Artemis User, <artemis@dtechspace.com> wrote:

> Could you please clarify what you mean by 1)? The driver is only
> responsible for submitting the Spark job, not performing it.
>
> -- ND
>
> On 1/9/21 9:35 AM, András Kolbert wrote:
> > Hi,
> > I would like to get your advice on my use case.
> > I have a few spark streaming applications where I need to keep
> > updating a dataframe after each batch. Each batch probably affects a
> > small fraction of the dataframe (5k out of 200k records).
> >
> > The options I have been considering so far:
> > 1) keep the dataframe on the driver, and update it after each batch
> > 2) keep the dataframe distributed, and use checkpointing to mitigate
> > lineage growth
> >
> > I solved previous use cases with option 2, but I am not sure it is
> > optimal, as checkpointing is relatively expensive. I also wondered
> > about HBase or some sort of fast-access memory store; however, it is
> > currently not in my stack.
> >
> > Curious to hear your thoughts
> >
> > Andras
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
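
For comparison, here is a rough sketch of option 2 from the quoted message:
keep the state as a distributed dataframe and checkpoint it every few
batches to truncate the lineage. Names, schema and the checkpoint path are
placeholders only.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("distributed-state").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/state-checkpoints")

# Distributed state: one row per user with the latest activity timestamp.
state_df = spark.createDataFrame([], "user_id long, last_activity timestamp")

def apply_batch(state_df, batch_updates_df, batch_id, checkpoint_every=10):
    # Upsert by taking the newest timestamp per user across the old state
    # and the new batch of updates.
    new_state = (state_df.unionByName(batch_updates_df)
                 .groupBy("user_id")
                 .agg(F.max("last_activity").alias("last_activity")))
    if batch_id % checkpoint_every == 0:
        # Materialise and cut the lineage so the query plan does not keep
        # growing batch after batch.
        new_state = new_state.checkpoint(eager=True)
    return new_state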
