spark-user mailing list archives

From András Kolbert <>
Subject Use case advice
Date Sat, 09 Jan 2021 14:35:25 GMT
I would like to get your advice on my use case.
I have a few Spark Streaming applications where I need to keep updating a
dataframe after each batch. Each batch typically affects only a small
fraction of the dataframe (roughly 5k out of 200k records).

The options I have been considering so far:
1) keep dataframe on the driver, and update that after each batch
2) keep dataframe distributed, and use checkpointing to mitigate lineage

I solved previous use cases with option 2, but I am not sure it is optimal,
since checkpointing is relatively expensive. I also wondered about HBase or
some other fast-access storage layer, but that is currently not in my
stack.

Curious to hear your thoughts

