spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rabin Banerjee <dev.rabin.baner...@gmail.com>
Subject Re: Latest 200 messages per topic
Date Wed, 20 Jul 2016 10:40:05 GMT
Hi Cody,

    Thanks for your reply .

   Let Me elaborate a bit,We have a Directory where small xml(90 KB) files
are continuously coming(pushed from other node).File has  ID & Timestamp in
name and also inside record.  Data coming in the directory has to be pushed
to Kafka to finally get into Spark Streaming . Data is time series data(Per
device per 15 min 1 file of 90 KB, Total 10,000 Device. So 50,000 files per
15 min). No utility can be installed in the source where data is generated
, so data will be always ftp-ed to  a directory .In Spark streaming we are
always interested with latest 60 min(window) of data(latest 4 files per
device). What do you suggest to get them into Spark Streaming with
reliability (probably with Kafka). In streaming I am only interested with
the latest 4 data(60 min).


Also I am thinking about , instead of using Spark Windowing ,Using Custom
java code will push the ID  of the file to Kafka and push parsed XML data
to HBASE keeping Hbase insert timestamp as File Timestamp, HBASE key will
be only ID ,CF will have 4 version(Time series version) per device ID (4
latest data). As hbase keeps the data per key sorted with timestamp , I
will always get the latest 4 ts data on get(key). Spark streaming will get
the ID from Kafka, then read the data from HBASE using get(ID). This will
eliminate usage of Windowing from Spark-Streaming . Is it good to use ?

Regards,
Rabin Banerjee


On Tue, Jul 19, 2016 at 8:44 PM, Cody Koeninger <cody@koeninger.org> wrote:

> Unless you're using only 1 partition per topic, there's no reasonable
> way of doing this.  Offsets for one topicpartition do not necessarily
> have anything to do with offsets for another topicpartition.  You
> could do the last (200 / number of partitions) messages per
> topicpartition, but you have no guarantee as to the time those events
> represent, especially if your producers are misbehaving.  To be
> perfectly clear, this is a consequence of the Kafka data model, and
> has nothing to do with spark.
>
> So, given that it's a bad idea and doesn't really do what you're
> asking...  you can do this using KafkaUtils.createRDD
>
> On Sat, Jul 16, 2016 at 10:43 AM, Rabin Banerjee
> <dev.rabin.banerjee@gmail.com> wrote:
> > Just to add ,
> >
> >   I want to read the MAX_OFFSET of a topic , then read MAX_OFFSET-200 ,
> > every time .
> >
> > Also I want to know , If I want to fetch a specific offset range for
> Batch
> > processing, is there any option for doing that ?
> >
> >
> >
> >
> > On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee
> > <dev.rabin.banerjee@gmail.com> wrote:
> >>
> >> HI All,
> >>
> >>    I have 1000 kafka topics each storing messages for different devices
> .
> >> I want to use the direct approach for connecting kafka from Spark , in
> which
> >> I am only interested in latest 200 messages in the Kafka .
> >>
> >> How do I do that ?
> >>
> >> Thanks.
> >
> >
>

Mime
View raw message