spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From vikram agrawal <vikram.agra...@gmail.com>
Subject Structured Streaming Support for Amazon Kinesis in Spark 2.2.x
Date Tue, 13 Mar 2018 10:08:44 GMT
Hi All,

I have implemented Kinesis Connector for Structured Streaming. The code is
available is at https://github.com/qubole/kinesis-sql.

Open Source Jira for the same is SPARK-18165
<https://issues.apache.org/jira/browse/SPARK-18165>

Design details are mentioned here - https://docs.google.com/presentation/d/
1eSX7WiZxkjRy5ZDXAKRUDIKWqhP3ob2A1L1lat2QIXY/edit#slide=id.g34f0efcb5a_2_76

I wrote and ran unit tests to validate the implementation. Apart from that,
I started 3 node spark clusters to read from Kinesis and write to S3 in
parquet format and ran the query for more than a day.

List of Features implemented

   - Re-sharding logic to add/remove new/closed shards
   - Executors fetch records from Kinesis as part of the incremental job
   execution (instead of having a receiver model where few threads are
   responsible for reading from Kinesis)
   - Various configuration to have fine-grain control depending upon your
   requirements


I have developed and tested the connector against Spark 2.2.x and
will migrate it to DataSource V2 APIs in some time.

Can anyone help me in reviewing the design/implementation? I would love it
to be part of Spark Distribution.

- Vikram

Mime
View raw message