kafka-users mailing list archives

From Eric Beabes <mailinglist...@gmail.com>
Subject Kafka S3 Connector: Sort by a field within a partition
Date Thu, 29 Apr 2021 17:06:31 GMT
We have a use case in which many messages will arrive via AWS SQS from
various devices. We plan to read these messages with Spark Structured
Streaming, clean them up as needed, and write each message to Kafka. Later
we plan to use the Kafka S3 Connector to push them to S3 on an hourly basis,
meaning there will be a separate directory for each hour. The challenge is
that, within each hourly “partition”, the messages need to be sorted by a
certain field (let’s say device_id). The reason is that we plan to create an
EXTERNAL table on top of it, bucketed on device_id, which should speed up
the subsequent aggregation jobs.
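
To make the desired layout concrete, here is a minimal sketch in plain
Python. It does not use the connector's API; it only illustrates the output
we are after. The field names `device_id` and `ts` (assumed to be epoch
seconds in UTC), the `YYYY-MM-DD/HH` directory naming, and the helper
`layout_by_hour` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def layout_by_hour(messages):
    """Group messages into hourly partitions, each sorted by device_id."""
    partitions = defaultdict(list)
    for msg in messages:
        # Assumption: 'ts' holds epoch seconds (UTC); one directory per hour.
        hour = datetime.fromtimestamp(msg["ts"], tz=timezone.utc).strftime(
            "%Y-%m-%d/%H"
        )
        partitions[hour].append(msg)
    # sorted() is stable, so messages from the same device keep arrival order.
    return {
        hour: sorted(msgs, key=lambda m: m["device_id"])
        for hour, msgs in partitions.items()
    }


if __name__ == "__main__":
    sample = [
        {"device_id": "dev-b", "ts": 1619715600},
        {"device_id": "dev-a", "ts": 1619715700},
        {"device_id": "dev-c", "ts": 1619719200},
    ]
    for hour, msgs in layout_by_hour(sample).items():
        print(hour, [m["device_id"] for m in msgs])
```

Each key stands for one hourly directory, and the list under it is the order
in which we would want the records to appear in that hour's files.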

Questions:

1) Does the Kafka S3 Connector allow messages to be sorted by a particular
field within a partition, or do we need to extend it?
2) Is there a better way to do this?
