crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Micah Whitacre (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-606) Create a KafkaSource
Date Fri, 22 Apr 2016 21:33:12 GMT


Micah Whitacre updated CRUNCH-606:
    Attachment: CRUNCH-606.patch

So the approach I'm going with the source is that callers are going to be responsible for
managing start/stop offsets.  They are also responsible for persisting that information somewhere.
 I think in a later request I'll work on adding some convenience methods for that.

Posting progress here for comments or feedback.  There is some work to be done, specifically:
* More tests around some of the utilities in KafkaUtils.
* Finish flushing out the KafkaSourceIT (hitting a ClassCastException from String to Text)
* Finish flushing out the method because it stops us from being able
to call materialize on a read PCollection.
* Reogranize the tests because right now the "long running" (but not really) tests are in
surefire vs our convention of putting them in src/it/tests

One specific feedback is if I should keep with the path of making the KafkaSource extend FileInputFormat
or if I'm trying to put a square peg in a round hole.  

> Create a KafkaSource
> --------------------
>                 Key: CRUNCH-606
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-606.patch
> Pulling data out of Kafka is a common use case and some of the ways to do it Kafka Connect,
Camus, Gobblin do not integrate nicely with existing processing pipelines like Crunch.  With
Kafka 0.9, the consuming API is a lot easier so we should build a Source implementation that
can read from Kafka.

This message was sent by Atlassian JIRA

View raw message