spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cody Koeninger <c...@koeninger.org>
Subject Re: Spark Streaming Specify Kafka Partition
Date Fri, 04 Dec 2015 15:12:06 GMT
So createDirectStream will give you a JavaInputDStream of R, where R is the
return type you chose for your message handler.

If you want a JavaPairInputDStream, you may have to call .mapToPair in
order to convert the stream, even if the type you chose for R was already
Tuple2

(note that I try to stay as far away from Java as possible, so this answer
is untested, possibly inaccurate, may throw checked exceptions etc etc)

On Thu, Dec 3, 2015 at 5:21 PM, Alan Braithwaite <alan@cloudflare.com>
wrote:

> One quick newbie question since I got another chance to look at this
> today.  We're using java for our spark applications.  The
> createDirectStream we were using previously [1] returns a
> JavaPairInputDStream, but the createDirectStream with fromOffsets expects
> an argument recordClass to pass into the generic constructor for
> createDirectStream.
>
> In the code for the first function signature (without fromOffsets) it's
> being constructed in Scala as just a tuple (K, V).   How do I pass this
> same class/type information from java as the record class to get a JavaPairInputDStream<K,
> V>?
>
> I understand this might be a question more fit for a scala mailing list
> but google is failing me at the moment for hints on the interoperability of
> scala and java generics.
>
> [1] The original createDirectStream:
> https://github.com/apache/spark/blob/branch-1.5/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L395-L423
>
> Thanks,
> - Alan
>
> On Tue, Dec 1, 2015 at 8:12 AM, Cody Koeninger <cody@koeninger.org> wrote:
>
>> I actually haven't tried that, since I tend to do the offset lookups if
>> necessary.
>>
>> It's possible that it will work, try it and let me know.
>>
>> Be aware that if you're doing a count() or take() operation directly on
>> the rdd it'll definitely give you the wrong result if you're using -1 for
>> one of the offsets.
>>
>>
>>
>> On Tue, Dec 1, 2015 at 9:58 AM, Alan Braithwaite <alan@cloudflare.com>
>> wrote:
>>
>>> Neat, thanks.  If I specify something like -1 as the offset, will it
>>> consume from the latest offset or do I have to instrument that manually?
>>>
>>> - Alan
>>>
>>> On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger <cody@koeninger.org>
>>> wrote:
>>>
>>>> Yes, there is a version of createDirectStream that lets you specify
>>>> fromOffsets: Map[TopicAndPartition, Long]
>>>>
>>>> On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite <alan@cloudflare.com>
>>>> wrote:
>>>>
>>>>> Is there any mechanism in the kafka streaming source to specify the
>>>>> exact partition id that we want a streaming job to consume from?
>>>>>
>>>>> If not, is there a workaround besides writing our a custom receiver?
>>>>>
>>>>> Thanks,
>>>>> - Alan
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message