spark-user mailing list archives

From Jacek Laskowski <ja...@japila.pl>
Subject Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files
Date Fri, 22 Jan 2021 10:35:07 GMT
Hi Filip,

Care to share the code behind "The only thing I found so far involves using
forEachBatch and manually updating my aggregates. "?

I'm not completely sure I understand your use case and hope the code could
shed more light on it. Thank you.

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski



On Thu, Jan 21, 2021 at 5:05 PM Filip <Filip.Neculciu@enghouse.com.invalid>
wrote:

> Hi,
>
> I'm considering using Apache Spark for the development of an application.
> This would replace a legacy program which reads CSV files and does lots
> (tens/hundreds) of aggregations on them. The aggregations are fairly
> simple:
> counts, sums, etc. while applying some filtering conditions on some of the
> columns.
>
> I prefer using structured streaming for its simplicity and low latency. I'd
> also like to use full SQL queries (via createOrReplaceTempView). However,
> running multiple queries means Spark will re-read the input files for each
> one of them, which seems very inefficient for my use case.
>
> Does anyone have any suggestions? The only thing I found so far involves
> using forEachBatch and manually updating my aggregates. But, I think there
> should be a simpler solution for this use case.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
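For reference, the foreachBatch approach Filip describes usually looks roughly like the sketch below: cache each micro-batch once, then run all the aggregations against the cached data so the CSV files are only read a single time per batch. This is a minimal Scala sketch; the input path, schema, and column names are made up for illustration, and the results would still need to be merged into whatever aggregate store the application uses.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical session and schema; adjust the path and columns to the real data.
val spark = SparkSession.builder.appName("multi-agg").getOrCreate()

val input = spark.readStream
  .option("header", "true")
  .schema("country STRING, amount DOUBLE, status STRING")
  .csv("/path/to/csv-dir")

val query = input.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Cache the micro-batch so each aggregation reuses the same in-memory
    // data instead of re-reading the input files.
    batchDF.persist()

    // Run as many aggregations as needed; plain SQL works via a temp view.
    batchDF.createOrReplaceTempView("batch")
    val countsByCountry = batchDF.groupBy("country").count()
    val completedTotal = spark.sql(
      "SELECT SUM(amount) AS total FROM batch WHERE status = 'COMPLETED'")

    // ... write/merge each result into the external aggregate state here ...

    batchDF.unpersist()
  }
  .start()
```

The trade-off is exactly what Filip notes: the per-batch aggregates are partial, so the application has to update running totals itself rather than letting Spark maintain the streaming state.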
