spark-user mailing list archives

From Filip <>
Subject Spark structured streaming - efficient way to do lots of aggregations on the same input files
Date Thu, 21 Jan 2021 15:36:29 GMT

I'm considering using Apache Spark for the development of an application.
It would replace a legacy program that reads CSV files and runs many
(tens to hundreds of) aggregations on them. The aggregations are fairly simple:
counts, sums, etc., with filtering conditions applied to some of the columns.

I'd prefer structured streaming for its simplicity and low latency, and I'd
also like to use full SQL queries (via createOrReplaceTempView). However,
running multiple streaming queries means Spark re-reads the input files once
per query, which seems very inefficient for my use case.

Does anyone have any suggestions? The only approach I've found so far uses
foreachBatch and updates my aggregates manually, but I'd expect a simpler
solution for this use case.
