spark-user mailing list archives

From Filip <>
Subject Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files
Date Fri, 22 Jan 2021 14:28:00 GMT

I don't have any code for the foreachBatch approach; I mentioned it because of
this response to my question on Stack Overflow:

I have added some very simple code below that I think shows what I'm trying
to do:
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("senderId1", LongType),
  StructField("senderId2", LongType),
  StructField("destId1", LongType),
  StructField("eventType", IntegerType),
  StructField("cost", LongType)
))

val fileStreamDf = spark.readStream
  .schema(schema)
  .option("delimiter", ",")    // delimiter value was cut off in the original; "," assumed
  .csv("/path/to/input")       // source format/path were cut off in the original; CSV assumed
fileStreamDf.createOrReplaceTempView("myTable")

// HAVING thresholds were cut off in the original; the numbers below are placeholders
spark.sql("""SELECT senderId1, count(*) AS num_events FROM myTable
             GROUP BY senderId1 HAVING count(*) > 10""")
spark.sql("""SELECT senderId2, sum(cost) AS total_cost FROM myTable
             WHERE eventType = 3 GROUP BY senderId2 HAVING sum(cost) > 100""")
spark.sql("""SELECT destId1, count(*) AS num_events FROM myTable
             WHERE eventType = 5 GROUP BY destId1 HAVING count(*) > 10""")

Of course, this is simplified; there are a lot more columns and the queries
should also group by time period, but I didn't want to complicate it.
With this example, I have 3 queries running on the same input files, but
Spark would need to read the files from disk 3 times. These extra reads are
what I'm trying to avoid.
In the real application, the number of queries would be a lot higher and
dynamic (they are generated in response to some configurations made by the
end users).
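One way to share a single read across all of the queries, sketched below under stated assumptions, is Spark's `foreachBatch` sink: each micro-batch is persisted once, registered as a temp view, and then every dynamically generated query runs against the cached data instead of re-reading the files. The query strings, the view name `batch`, and the parquet output path are illustrative placeholders, not anything from the original post:

```scala
import org.apache.spark.sql.DataFrame

// In the real application these would be generated from the end users' configurations.
val queries = Seq(
  "SELECT senderId1, count(*) AS num_events FROM batch GROUP BY senderId1",
  "SELECT senderId2, sum(cost) AS total_cost FROM batch WHERE eventType = 3 GROUP BY senderId2"
)

fileStreamDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.persist()                        // the input files are read once per micro-batch
    batchDf.createOrReplaceTempView("batch")
    queries.foreach { q =>
      batchDf.sparkSession.sql(q)            // each query hits the cached batch, not the files
        .write.mode("append")
        .parquet(s"/output/query-results")   // placeholder sink
    }
    batchDf.unpersist()
  }
  .start()
```

One caveat: inside `foreachBatch` each aggregation only sees the rows of the current micro-batch, so grouping across longer time periods would need the batch results to be re-aggregated downstream (or state kept by the application).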
