kafka-users mailing list archives

From Daniel Haviv <daniel.ha...@veracity-group.com>
Subject Re: Assigning partitions to specific nodes
Date Mon, 16 Mar 2015 20:35:32 GMT
Let's say I have 3 different types of algorithms that are implemented by
three streaming apps (on Yarn).
They are completely separate, meaning they can run in parallel on the
same data rather than sequentially.

*Using Kafka:* File X is loaded into HDFS and I want Algorithms A and B to
process it, so I push its path to TopicA and TopicB, and
streamingAppA and streamingAppB will be able to consume simultaneously.
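To make the Kafka flow concrete, here is a minimal sketch of the fan-out step: one HDFS path is turned into one message per algorithm topic. The path, the topic names, and the fan_out helper are all hypothetical; the commented-out part shows how the same messages could be sent with the kafka-python client against a real broker.

```python
def fan_out(path, topics):
    """Return the (topic, value) messages that would be published for `path`."""
    return [(topic, path.encode("utf-8")) for topic in topics]

# One file, two topics -> two messages carrying the same HDFS path.
messages = fan_out("/data/incoming/fileX", ["TopicA", "TopicB"])

# With a broker available, kafka-python could send the same messages:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# for topic, value in messages:
#     producer.send(topic, value)
# producer.flush()
```

Only the paths travel through Kafka; the file itself stays in HDFS, so there is no extra copy per consumer.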

*Using Directory Listener:* File X is loaded into HDFS. Algorithm A will
listen on directoryA, B on directoryB, C on directoryC.
For the file to be processed I will have to copy it to both
directoryA and directoryB, and only then will the streaming apps
start processing it.
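As for the original question in this thread (pinning partitions to specific nodes), one way to sketch it is a stable node-to-partition mapping, so a locality-aware consumer on each node reads only "its" partition. This is only an illustration: the node list, topic name, and helper are made up, and the commented line shows where kafka-python's explicit `partition` argument would be used.

```python
# Hypothetical cluster node list; ordering must be agreed on by all producers.
NODES = ["node1", "node2", "node3"]

def partition_for(node, num_partitions=len(NODES)):
    # Stable mapping: the node's index in the cluster list picks the partition.
    return NODES.index(node) % num_partitions

# A producer that knows the file's block location could then send:
# producer.send("hdfs-paths", value=path.encode("utf-8"),
#               partition=partition_for(node))
```

A consumer on node2 would then assign itself only partition 1 and consume both the message and the file locally.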

Daniel



On Mon, Mar 16, 2015 at 10:25 PM, Gwen Shapira <gshapira@cloudera.com>
wrote:

> Probably off-topic for Kafka list, but why do you think you need
> multiple copies of the file to parallelize access?
> You'll have parallel access based on how many containers you have on
> the machine (if you are using YARN-Spark).
>
> On Mon, Mar 16, 2015 at 1:20 PM, Daniel Haviv
> <daniel.haviv@veracity-group.com> wrote:
> > Hi,
> > The reason we want to use this method is that this way a file can be
> consumed by different streaming apps simultaneously (they just consume its
> path from kafka and open it locally).
> >
> > With fileStream, to parallelize the processing of a specific file I will
> have to make several copies of it, which is wasteful in terms of space and
> time.
> >
> > Thanks,
> > Daniel
> >
> >> On 16 Mar 2015, at 22:12, Gwen Shapira <gshapira@cloudera.com> wrote:
> >>
> >> Any reason not to use SparkStreaming directly with HDFS files, so
> >> you'll get locality guarantees from the Hadoop framework?
> >> StreamingContext has a textFileStream() method you could use for this.
> >>
> >> On Mon, Mar 16, 2015 at 12:46 PM, Daniel Haviv
> >> <daniel.haviv@veracity-group.com> wrote:
> >>> Hi,
> >>> Is it possible to assign specific partitions to specific nodes?
> >>> I want to upload files to HDFS, find out on which nodes the file
> resides
> >>> and then push their path into a topic and partition it by nodes.
> >>> This way I can ensure that the consumer (Spark Streaming) will consume
> both
> >>> the message and file locally.
> >>>
> >>> Can this be achieved ?
> >>>
> >>> Thanks,
> >>> Daniel
>
