kafka-users mailing list archives

From Neha Narkhede <neha.narkh...@gmail.com>
Subject Re: random access performance of messages.
Date Fri, 21 Oct 2011 22:35:11 GMT
Marko,

> Does that mean the initial FetchRequest() is considered to be "slow"?

That is true of consumers that are not at the "tail-end" of the topic
they are consuming. This is because most likely the data they want to
consume doesn't exist in the page cache.
For real time consumers though, the latest data for that topic would
most probably be in the cache, and each fetch request would be served
from the page cache itself.

> Can you give any concrete number that would give a sense of exactly how slow
> is "slow"?

Well, assuming the consumer fetch size is 1MB and the fetch request
goes to disk, it will roughly correspond to 10 ms (seek time) + 30 ms
(time to read 1MB sequentially from disk).
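
The estimate above can be reproduced as a small calculation. This is only a rough sketch: the class name, the method name, and the sequential throughput (about 33 MB/s, backed out of the "30 ms for 1MB" figure) are assumptions made for illustration, not measurements or part of any Kafka API.

```java
// Back-of-the-envelope estimate only; the throughput figure is an assumption
// derived from "30 ms to read 1MB", not a measured value.
public class FetchLatencyEstimate {

    /** Estimated time in ms for one fetch served from disk: one seek plus a sequential read. */
    static double estimateMillis(long fetchBytes, double seekMillis, double bytesPerSecond) {
        double readMillis = (fetchBytes / bytesPerSecond) * 1000.0;
        return seekMillis + readMillis;
    }

    public static void main(String[] args) {
        long fetchBytes = 1_000_000L;              // 1MB fetch size, as in the thread
        double seekMillis = 10.0;                  // seek time quoted in the thread
        double bytesPerSecond = 1_000_000 / 0.030; // assumed: 1MB in 30 ms, about 33 MB/s

        double total = estimateMillis(fetchBytes, seekMillis, bytesPerSecond);
        System.out.println("estimated disk-bound fetch latency (ms): " + total);
    }
}
```

With these inputs the seek accounts for roughly a quarter of the total, so a smaller fetch size would make the fixed seek cost weigh more heavily on each request.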

> Is the concern also that too many random accesses will degrade write
> performance?

Too many random accesses will hurt producer throughput, as well as
consumer throughput. This is because most requests, read or write,
will hit the disk.

> If it's already in Kafka, then why not just leave it there?

You could, but you would be giving up all the benefits of a
high-performance pub-sub messaging system, since both your writes and
your reads will be slow.

Thanks
Neha

On Fri, Oct 21, 2011 at 10:19 AM,  <marko@modelcitizen.com> wrote:
> Thanks for the responses, and pardon my newbie status.
>
> @sharad
>>> Also using kafka as *long* term message store is not a good usecase.
>
> To be more specific about my message lifetime/volume, in my case storage
> would be < one month (in the range of a few terabytes in size).
>
> @neha
>>> Instead of using kafka for random message lookups, you could use it as
>>> the persistent message bus between the publishers of the messages and your
>>> indexing system.
>
> Yes, that is what I intended by the first approach I suggested.  Granted that is
> the most apparent path, but I'm trying to consider if I can save all the
> time/resources needed to essentially move the data out of Kafka into a
> secondary db.  In this case the only purpose of the secondary store would be
> to house the message data.  If it's already in Kafka, then why not just
> leave it there?
>
> @sharad
>>> kafka is more suited for sequential message reads. Not really meant for
>>> random message lookups.
>
> From my basic understanding of the API it would appear that reading (using a
> checkpoint) always begins with random access?  E.g., see the code excerpt
> below from the wiki quickstart.  I assume the FetchRequest() call is a random
> access read?
>
> Does that mean the initial FetchRequest() is considered to be "slow"?
>
> Can you give any concrete number that would give a sense of exactly how slow
> is "slow"?
>
> Is the concern also that too many random accesses will degrade write
> performance?
>
> Thank you.
>
> long offset = 0;
> while (true) {
>   // create a fetch request for topic "test", partition 0, current offset,
>   // and fetch size of 1MB
>   FetchRequest fetchRequest = new FetchRequest("test", 0, offset, 1000000);
>
>   // get the message set from the consumer and print them out
>   ByteBufferMessageSet messages = consumer.fetch(fetchRequest);
>   for (Message message : messages) {
>     System.out.println("consumed: " + Utils.toString(message.payload(), "UTF-8"));
>     // advance the offset after consuming each message
>     offset += MessageSet.entrySize(message);
>   }
> }
>
> -----Original Message-----
> From: Neha Narkhede [mailto:neha.narkhede@gmail.com]
> Sent: Friday, October 21, 2011 1:02 PM
> To: kafka-users@incubator.apache.org
> Subject: Re: random access performance of messages.
>
> Marko,
>
> I agree with Sharad. Instead of using kafka for random message lookups, you
> could use it as the persistent message bus between the publishers of the
> messages and your indexing system.
> Using the low level consumer API (SimpleConsumer), you could set up your
> indexer processes to pull from the broker partitions for a topic.
> You would have to checkpoint your Kafka
> offsets to match the data indexed and flushed to disk, and re-fetch data
> from Kafka, if/when the indexer fails.
>
> Thanks,
> Neha
>
> On Fri, Oct 21, 2011 at 9:47 AM, Sharad Agarwal <sharad.apache@gmail.com>
> wrote:
>> kafka is more suited for sequential message reads. Not really meant
>> for random message lookups.
>>
>> Also using kafka as *long* term message store is not a good usecase.
>>
>> On Fri, Oct 21, 2011 at 9:32 PM, <marko@modelcitizen.com> wrote:
>>
>>> I would like to use Kafka to process messages that need to be
>>> immutably stored for N days, and during that period the msgs need
>>> to be indexed and searched, and queried msg data needs to be retrievable.
>>>
>>>
>>>
>>> One approach is to read messages from Kafka and store the messages in
>>> a secondary db for query and data retrieval.  Once the messages are
>>> read and processed into the secondary db, then the messages can be
>>> discarded from the Kafka queue.
>>>
>>>
>>>
>>> Another approach is to read the messages, build an external index for
>>> searching that directly references the message data by Kafka-key in
>>> the Kafka queue itself.  In this case Kafka becomes the message
>>> store for the life of the message/data.
>>>
>>>
>>>
>>> The latter would be ideal for me if the performance of query-by-key
>>> and message data retrieval is very good.
>>>
>>>
>>>
>>> Is random query of message+data good for Kafka?  Is this an
>>> appropriate usecase for Kafka?
>>>
>>>
>>>
>>> Thank you.
>>>
>>>
>>>
>>> Marko.
>>>
>>> .
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Sharad Agarwal
>> Hadoop and Avro Committer
>> Technology Platforms, InMobi
>> *Disclaimer: Opinions expressed here are my own and do not represent
>> past or present employers.*
>>
>
>
