kafka-users mailing list archives

From Christian Csar <cac...@gmail.com>
Subject Re: Is kafka suitable for our architecture?
Date Sat, 11 Oct 2014 00:34:55 GMT
On 10/10/2014 05:12 AM, Albert Vila wrote:
> Some comments below.
> On 10 October 2014 11:19, cacsar@gmail.com <cacsar@gmail.com> wrote:
>> Albert, you certainly can use Kafka (and it will probably work quite well)
>> you'll just need to make sure your consumers are written to match the
>> available options. I think I may not have a good picture of what you need
>> to do. Is it that you have a stream of documents coming in and then each
>> document can go through crawling, keyword extraction, language detection,
>> and indexation in any order, or are the steps ordered?
> Some steps are ordered and some other can be done in parallel. Everything
> starts after the crawler, because it outputs documents. Then those
> documents follow a graph of steps where we enrich their data.
>> The difference between a workqueue and what Kafka provides is that with a
>> workqueue all the documents would be in one big queue and all the consumers
>> have equal access to each document and can process the documents out of
>> order if required. In Kafka to have multiple consumers on one subscription
>> the data is partitioned with consumers assigned to particular partitions
>> (though this can happen in a coordinated fashion at runtime) and each
>> consumer will only chug through the data in its partition with messages
>> marked as being processed in a strict order. You can use this partitioning
>> to control which consumer will process a message but you may not want to.
>> Does this help at all?
> I think it does. In fact, what I had in mind is to have a topic for each
> step, for example:
> keyword topic [{partition: server1, message: document content}, {...}]
> language topic [{partition: server1, message: document content}, {...}]
> ...
> Then, some of the consumers will be in the same consumer group (to act as a
> distributed queue) and some others in a different consumer group (to act as
> pub/sub when we can run some steps in parallel).
> Is that correct, or am I missing something about how Kafka works? What I see
> is that by doing so, the document content will be duplicated on different
> topics, but maybe that's not a problem for Kafka (document size around 50k)

It looks like you would be moving 300GB of data through Kafka a day.
Since there will be multiple copies, you will want to be sure to set the
data retention/deletion policies appropriately to avoid filling up the
disks on your broker nodes unnecessarily. I'm not going to claim that it
will be the most efficient use of hardware and network capacity, but I
expect it can work reasonably well.
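The 300GB/day figure, and how it interacts with retention, is easy to
sanity-check with some quick arithmetic. The sketch below uses the 6M
documents/day and ~50KB sizes from the thread; the topic-copy counts,
retention window, and replication factor are illustrative assumptions,
not numbers anyone in the thread has committed to.

```python
# Back-of-the-envelope broker disk sizing for the scenario in this
# thread: 6M documents/day at ~50 KB each, with the stream duplicated
# across several step topics.

DOCS_PER_DAY = 6_000_000
DOC_SIZE_BYTES = 50_000          # ~50 KB per document

def disk_needed_gb(topic_copies: int, retention_days: int,
                   replication: int = 1) -> float:
    """Total broker disk needed, in GB, before any compression."""
    daily_bytes = DOCS_PER_DAY * DOC_SIZE_BYTES * topic_copies
    return daily_bytes * retention_days * replication / 1e9

# One copy of the stream kept for one day: the ~300 GB/day above.
print(disk_needed_gb(topic_copies=1, retention_days=1))   # 300.0

# Hypothetical: four step topics, 7-day retention, replication factor 2.
print(disk_needed_gb(topic_copies=4, retention_days=7, replication=2))
```

The retention window itself is what the broker-side retention settings
(time- and size-based log retention) control, so numbers like these are
what you would feed into that configuration.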

> Is it ok to partition data by the crawling server (right now 14 servers)?
> Or should we use some other field to give more distribution?

I think the consideration of consumer parallelism will be the main one
as discussed further below.

>> Without forethought, using Kafka you wouldn't be
>> able to suddenly increase the number of consumers to process a backlog
>> since you could have at most one consumer per partition (but with
>> forethought you could start with a larger number of partitions so as to be
>> able to increase the number of consumers by having each consumer process
>> fewer partitions at a time).
> I don't understand the last paragraph. Can't I have a consumer group
> containing X consumer instances? So later on, can't I increase the number
> of instances?

This inelegantly named question sort of addresses this:
Basically you can't have more consumers in a consumer group than there
are partitions in a topic. This means you can't *easily* increase the
number of consumers beyond your initial number of partitions. While you
can add partitions later and so add consumers, repartitioning the data
to try and address a backlog is harder than simply spinning up new
consumers.

It will often be easiest to split the data at the start into more
partitions than you initially need (within reason). So while
partitioning your data by crawling server is an option, it means 14
would be the maximum number of consumers. If you don't have a natural
partition key you may prefer to use the random partitioner, where in
theory messages will be evenly distributed over time (however, note that
each producer sends a chunk of documents to one partition at a time
rather than choosing the partition randomly for each message).
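To make the parallelism cap concrete, here is a minimal sketch of
partition-to-consumer assignment within one consumer group. It mimics a
simple round-robin-style assignment; the real Kafka rebalancing protocol
is more involved, and the consumer names are made up for illustration.

```python
# Why partition count caps consumer parallelism in one consumer group:
# each partition is owned by exactly one consumer, so consumers beyond
# the partition count simply sit idle.

def assign(partitions: int, consumers: list) -> dict:
    """Spread partition ids over consumers; extra consumers get nothing."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment

# 14 partitions (one per crawling server), 14 consumers: one each.
print(assign(14, ["c%d" % i for i in range(14)]))

# Same 14 partitions but 20 consumers: 6 consumers own no partition at
# all, which is why over-partitioning up front buys you headroom.
print(assign(14, ["c%d" % i for i in range(20)]))
```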

>> As for Storm, I don't believe there's anything stopping each bolt from just
>> being concerned with a single document at a time, and if your stages are
>> sequential then your use of Kafka might well be equivalent to a simple
>> linear Storm topology.
> Exactly, that's why we are evaluating whether Kafka alone is enough.
> Because if Storm gives us the same benefits as Kafka, it's better to stick
> with only one technology to keep everything as simple as possible.

I think it is more a question of whether using Storm will make managing
your various consumers easier. Since I haven't used Storm in a production
environment I can't speak to that. I don't think there is any reason you
*need* to use Storm rather than just Kafka to achieve your needs though.


>> Christian
> Thanks
>> On Thu, Oct 9, 2014 at 11:57 PM, Albert Vila <albert.vila@augure.com>
>> wrote:
>>> Hi
>>> We process data in real time, and we are taking a look at Storm and Spark
>>> streaming too, however our actions are atomic, done at a document level, so
>>> I don't know if it fits with something like Storm/Spark.
>>> Regarding what you said, Christian, isn't Kafka used for scenarios like the
>>> one I described? I mean, we do have work queues right now with Gearman, but
>>> with a bunch of workers on each step. I thought we could change that to a
>>> producer and a bunch of consumers (where the message should only reach one
>>> and exactly one consumer).
>>> And what I said about the data locality, it was only an optimization we did
>>> some time ago because we were moving more data back then. Maybe now it's not
>>> necessary and we could move messages around the system using Kafka, so it
>>> would allow us to simplify the architecture a little bit. I've seen people
>>> saying they move TB of data every day using Kafka.
>>> Just to be clear on the size of each document/message, we are talking about
>>> tweets, blog posts, ... (in 90% of cases the size is less than 50Kb)
>>> Regards
>>> On 9 October 2014 20:02, Christian Csar <cacsar@gmail.com> wrote:
>>>> Apart from your data locality problem it sounds like what you want is a
>>>> workqueue. Kafka's consumer structure doesn't lend itself too well to
>>>> that use case as a single partition of a topic should only have one
>>>> consumer instance per logical subscriber of the topic, and that consumer
>>>> would not be able to mark jobs as completed except in a strict order
>>>> (while maintaining a processed-successfully-at-least-once guarantee).
>>>> This is not to say it cannot be done, but I believe your workqueue would
>>>> end up working a bit strangely if built with Kafka.
>>>> Christian
>>>> On 10/09/2014 06:13 AM, William Briggs wrote:
>>>>> Manually managing data locality will become difficult to scale. Kafka is
>>>>> one potential tool you can use to help scale, but by itself, it will not
>>>>> solve your problem. If you need the data in near-real time, you could use
>>>>> a technology like Spark or Storm to stream data from Kafka and perform
>>>>> your processing. If you can batch the data, you might be better off
>>>>> pulling it into a distributed filesystem like HDFS, and using MapReduce,
>>>>> Spark or another scalable processing framework to handle your
>>>>> transformations. Once you've paid the initial price for moving the
>>>>> document into HDFS, your network traffic should be fairly manageable;
>>>>> most clusters built for this purpose will schedule work to be run local
>>>>> to the data, and typically have separate, high-speed network interfaces
>>>>> and a dedicated switch in order to optimize intra-cluster communications
>>>>> when moving data is unavoidable.
>>>>> -Will
>>>>> On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila <albert.vila@augure.com>
>>>>> wrote:
>>>>>> Hi
>>>>>> I just came across Kafka when I was trying to find solutions to scale
>>>>>> our current architecture.
>>>>>> We are currently downloading and processing 6M documents per day from
>>>>>> online and social media. We have a different workflow for each type of
>>>>>> document, but some of the steps are keyword extraction, language
>>>>>> detection, clustering, classification, indexation, .... We are using
>>>>>> Gearman to dispatch the jobs to workers and we have some queues in a
>>>>>> database.
>>>>>> I'm wondering if we could integrate Kafka into the current workflow,
>>>>>> and if it's feasible. One of our main discussions is whether we have to
>>>>>> go to a fully distributed architecture or to a semi-distributed one. I
>>>>>> mean, distribute everything or process some steps on the same machine
>>>>>> (crawling, keyword extraction, language detection, indexation). We
>>>>>> don't know which scales better; each one has pros and cons.
>>>>>> Now we have a semi-distributed one, as we had network problems given
>>>>>> the amount of data we were moving around. So now, all documents crawled
>>>>>> on server X are later dispatched through Gearman to the same server.
>>>>>> What we dispatch through Gearman is only the document id, and the
>>>>>> document data remains on the crawling server in a Memcached, so the
>>>>>> network traffic is kept to a minimum.
>>>>>> What do you think?
>>>>>> Is it feasible to remove all database queues and Gearman and move to
>>>>>> Kafka?
>>>>>> As Kafka is mainly based on messages, I think we should move the
>>>>>> messages around; should we take the network into account? Would we face
>>>>>> the same problems?
>>>>>> If so, is there a way to isolate some steps to be processed on the same
>>>>>> machine, to avoid network traffic?
>>>>>> Any help or comment will be appreciated. And if someone has had a
>>>>>> similar problem and has knowledge about the architecture approach, that
>>>>>> will be more than welcome.
>>>>>> Thanks
