spark-user mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: Latest 200 messages per topic
Date Wed, 20 Jul 2016 14:27:58 GMT
If they're files in a file system, and you don't actually need
multiple kinds of consumers, have you considered
streamingContext.fileStream instead of kafka?
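
A minimal sketch of that file-based alternative in Scala, using textFileStream (the text-file form of fileStream); the app name, batch interval and directory path below are hypothetical and would need adjusting to the real setup:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("XmlDirStream")
    val ssc  = new StreamingContext(conf, Minutes(15))   // one batch per 15-minute drop

    // Watches the landing directory; files that appear after the stream starts
    // are picked up, so the ftp job can keep writing new XML files into it.
    val xmlLines = ssc.textFileStream("hdfs:///data/incoming-xml")

    // Placeholder output; real XML parsing of each batch would go here.
    xmlLines.print()

    ssc.start()
    ssc.awaitTermination()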

On Wed, Jul 20, 2016 at 5:40 AM, Rabin Banerjee
<dev.rabin.banerjee@gmail.com> wrote:
> Hi Cody,
>
>     Thanks for your reply.
>
>    Let me elaborate a bit. We have a directory where small XML files (90 KB
> each) arrive continuously (pushed from another node). Each file has an ID and
> timestamp in its name and also inside the record. Data arriving in the
> directory has to be pushed to Kafka to finally get into Spark Streaming. The
> data is time series (per device, 1 file of 90 KB every 15 min; 10,000 devices
> in total, so 50,000 files per 15 min). No utility can be installed on the
> source where the data is generated, so the data will always be FTP-ed to a
> directory. In Spark Streaming we are always interested in the latest 60 min
> (window) of data, i.e. the latest 4 files per device. What do you suggest for
> getting the data into Spark Streaming reliably (probably via Kafka)? In
> streaming I am only interested in the latest 4 records (60 min) per device.
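
For the windowing part, a minimal sketch in Scala, assuming the parsed records already arrive as a DStream (whether from fileStream or a Kafka direct stream); the helper name is made up:

    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.dstream.DStream

    // A 60-minute window sliding every 15 minutes keeps the latest 4 batches,
    // i.e. the latest 4 files per device under the arrival pattern above.
    def lastHour[T](records: DStream[T]): DStream[T] =
      records.window(Minutes(60), Minutes(15))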
>
>
> I am also considering an alternative to Spark windowing: custom Java code
> would push the file ID to Kafka and push the parsed XML data to HBase, using
> the file timestamp as the HBase insert timestamp. The HBase row key would be
> just the ID, and the column family would keep 4 versions (time-series
> versions) per device ID, i.e. the 4 latest records. Since HBase keeps the
> data for each key sorted by timestamp, I would always get the latest 4
> timestamped records on get(key). Spark Streaming would get the ID from Kafka
> and then read the data from HBase with get(ID). This would eliminate the use
> of windowing in Spark Streaming. Is this a good approach?
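
A minimal sketch of the read side of that HBase idea in Scala, using the plain HBase client API; the table name, column family and qualifier are hypothetical, and the column family would have to be created with VERSIONS => 4 for this to work:

    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("device_ts"))   // hypothetical table

    // Returns the 4 newest timestamped cells for a device, newest first,
    // relying on HBase keeping cell versions ordered by timestamp.
    def latestFour(deviceId: String) = {
      val get = new Get(Bytes.toBytes(deviceId))
      get.setMaxVersions(4)
      table.get(get).getColumnCells(Bytes.toBytes("d"), Bytes.toBytes("xml"))
    }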
>
> Regards,
> Rabin Banerjee
>
>
> On Tue, Jul 19, 2016 at 8:44 PM, Cody Koeninger <cody@koeninger.org> wrote:
>>
>> Unless you're using only 1 partition per topic, there's no reasonable
>> way of doing this.  Offsets for one topicpartition do not necessarily
>> have anything to do with offsets for another topicpartition.  You
>> could do the last (200 / number of partitions) messages per
>> topicpartition, but you have no guarantee as to the time those events
>> represent, especially if your producers are misbehaving.  To be
>> perfectly clear, this is a consequence of the Kafka data model, and
>> has nothing to do with spark.
>>
>> So, given that it's a bad idea and doesn't really do what you're
>> asking...  you can do this using KafkaUtils.createRDD
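
A minimal sketch of that createRDD route in Scala against the 0.8 direct API; the helper and parameter names are made up, and looking up the log-end offset per partition is assumed to happen elsewhere (e.g. via Kafka's offset request API):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // latestOffsets: partition id -> log-end offset for the topic, looked up beforehand.
    def lastNMessages(sc: SparkContext, kafkaParams: Map[String, String],
                      topic: String, latestOffsets: Map[Int, Long], n: Long) = {
      val perPartition = n / latestOffsets.size          // last (n / #partitions) each
      val ranges = latestOffsets.map { case (partition, until) =>
        OffsetRange(topic, partition, math.max(0L, until - perPartition), until)
      }.toArray
      KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, ranges)
    }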
>>
>> On Sat, Jul 16, 2016 at 10:43 AM, Rabin Banerjee
>> <dev.rabin.banerjee@gmail.com> wrote:
>> > Just to add ,
>> >
>> >   I want to read the MAX_OFFSET of a topic, then read from MAX_OFFSET-200,
>> > every time.
>> >
>> > Also, I want to know: if I want to fetch a specific offset range for batch
>> > processing, is there any option for doing that?
>> >
>> >
>> >
>> >
>> > On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee
>> > <dev.rabin.banerjee@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >>    I have 1000 Kafka topics, each storing messages for a different
>> >> device. I want to use the direct approach for connecting to Kafka from
>> >> Spark, and I am only interested in the latest 200 messages per topic.
>> >>
>> >> How do I do that?
>> >>
>> >> Thanks.
>> >
>> >
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

