kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Performance Question
Date Fri, 01 Jul 2016 18:44:25 GMT
On Thu, Jun 30, 2016 at 5:39 PM, Benjamin Kim <bbuild11@gmail.com> wrote:

> Hi Todd,
>
> I changed the key to be what you suggested, and I can’t tell the
> difference since it was already fast. But, I did get more numbers.
>

Yeah, you won't see a substantial difference until you're inserting billions
of rows and the keys and/or bloom filters no longer fit in cache.


>
> 104M rows in Kudu table
> - read: 8s
> - count: 16s
> - aggregate: 9s
>
> Reads took much longer, going from 0.2s to 8s; counts stayed the same at
> 16s; and aggregate queries took longer, going from 6s to 9s.
>

> I’m still impressed.
>

We aim to please ;-) If you have any interest in writing up these
experiments as a blog post, it would be cool to post them for others to learn
from.

-Todd
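
To put rough numbers on the point above about bloom filters no longer
fitting in cache, here is a minimal sketch. It assumes a textbook Bloom
filter sized optimally for a 1% false-positive rate; that rate, the
function name, and the 10-billion-row figure are illustrative assumptions,
not Kudu settings or measurements taken from this thread.

```python
import math


def bloom_filter_bytes(num_keys: int, false_positive_rate: float) -> float:
    """Bytes needed by an optimally sized textbook Bloom filter."""
    # Standard result: bits per key = -ln(p) / (ln 2)^2
    bits_per_key = -math.log(false_positive_rate) / (math.log(2) ** 2)
    return num_keys * bits_per_key / 8


# Assumption for illustration only; not a value read from Kudu.
ASSUMED_FP_RATE = 0.01

# 104M rows is the table size reported in this thread; 10 billion rows is a
# hypothetical stand-in for "billions of rows".
for rows in (104_000_000, 10_000_000_000):
    mib = bloom_filter_bytes(rows, ASSUMED_FP_RATE) / 2**20
    print(f"{rows:>14,} rows -> ~{mib:,.0f} MiB of Bloom filter")
```

Under these assumptions the filter for the 104M-row table is on the order of
a hundred MiB, while the 10-billion-row case needs more than ten GiB, which
is why key design only starts to matter at the larger scale.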


> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <todd@cloudera.com> wrote:
>
> Hi Benjamin,
>
> What workload are you using for benchmarks? Are you using Spark or something
> more custom? RDD, DataFrame, SQL, etc.? Maybe you can share the schema and
> some queries.
>
> Todd
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
>
>> Hi Todd,
>>
>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
>> impressed. Compared to HBase, read and write performance are better. Write
>> performance has the greatest improvement (> 4x), while read is > 1.5x.
>> These are only preliminary tests, though. Do you know of a way to really do
>> some conclusive tests? I want to see if I can match your results on my 50
>> node cluster.
>>
>> Thanks,
>> Ben
>>
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuild11@gmail.com> wrote:
>>
>>> Todd,
>>>
>>> It sounds like Kudu can possibly top or match those numbers put out by
>>> Aerospike. Do you have any performance statistics published, or any
>>> instructions on how to measure them myself as a good way to test? In addition,
>>> this will be a test using Spark, so should I wait for Kudu version 0.9.0
>>> where support will be built in?
>>>
>>
>> We don't have a lot of benchmarks published yet, especially on the write
>> side. I've found that thorough cross-system benchmarks are very difficult
>> to do fairly and accurately, and often times users end up misguided if they
>> pay too much attention to them :) So, given a finite number of developers
>> working on Kudu, I think we've tended to spend more time on the project
>> itself and less time focusing on "competition". I'm sure there are use
>> cases where Kudu will beat out Aerospike, and probably use cases where
>> Aerospike will beat Kudu as well.
>>
>> From my perspective, it would be great if you can share some details of
>> your workload, especially if there are some areas you're finding Kudu
>> lacking. Maybe we can spot some easy code changes we could make to improve
>> performance, or suggest a tuning variable you could change.
>>
>> -Todd
>>
>>
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuild11@gmail.com>
>>> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> First of all, thanks for the link. It looks like an interesting read. I
>>>> checked that Aerospike is currently at version 3.8.2.3, and in the article,
>>>> they are evaluating version 3.5.4. The main thing that impressed me was
>>>> their claim that they can beat Cassandra and HBase by 8x for writing and
>>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M
>>>> records per second with only 50 nodes. I wanted to see if this is real.
>>>>
>>>
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well,
>>> depending on the size of your records and the insertion order. I've been
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do
>>> better.
>>>
>>>
>>>>
>>>> To answer your questions, we have a DMP with user profiles with many
>>>> attributes. We create segmentation information off of these attributes to
>>>> classify them. Then, we can target advertising appropriately for our sales
>>>> department. Much of the data processing applies models to most, if not
>>>> all, of every profile’s attributes to find similarities (nearest
>>>> neighbor/clustering), either over a large number of rows when batch
>>>> processing or over a small subset of rows for quick online scoring.
>>>> So, our use case is a
>>>> typical advanced analytics scenario. We have tried HBase, but it doesn’t
>>>> work well for these types of analytics.
>>>>
>>>> I read in the Aerospike release notes that they made many
>>>> improvements to batch and scan operations.
>>>>
>>>> I wonder what your thoughts are for using Kudu for this.
>>>>
>>>
>>> Sounds like a good Kudu use case to me. I've heard great things about
>>> Aerospike for the low latency random access portion, but I've also heard
>>> that it's _very_ expensive, and not particularly suited to the columnar
>>> scan workload. Lastly, I think the Apache license of Kudu is much more
>>> appealing than the AGPL3 used by Aerospike. But, that's not really a direct
>>> answer to the performance question :)
>>>
>>>
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpercy@cloudera.com> wrote:
>>>>
>>>> Have you considered whether you have a scan heavy or a random access
>>>> heavy workload? Have you considered whether you always access / update a
>>>> whole row vs only a partial row? Kudu is a column store so has some
>>>> awesome performance characteristics when you are doing a lot of
>>>> scanning of just a couple of columns.
>>>>
>>>> I don't know the answer to your question but if your concern is
>>>> performance then I would be interested in seeing comparisons from a perf
>>>> perspective on certain workloads.
>>>>
>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
>>>> https://aphyr.com/posts/324-jepsen-aerospike
>>>>
>>>> I wonder if they have addressed any of those issues.
>>>>
>>>> Mike
>>>>
>>>> On Friday, May 27, 2016, Benjamin Kim <bbuild11@gmail.com> wrote:
>>>>
>>>>> I am just curious. How will Kudu compare with Aerospike (
>>>>> http://www.aerospike.com)? I went to a Spark Roadshow and found out
>>>>> about this piece of software. It appears to fit our use case perfectly
>>>>> since we are an ad-tech company trying to leverage our user profiles
>>>>> data. Plus, it already has a Spark connector and has a SQL-like
>>>>> client. The tables can be accessed using Spark SQL DataFrames and,
>>>>> also, made into SQL tables for direct use with Spark SQL ODBC/JDBC
>>>>> Thriftserver. I see from the work done here
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark integration
>>>>> is well underway and, from the looks of it lately, almost complete. I
>>>>> would prefer to use Kudu since we are already a Cloudera shop, and
>>>>> Kudu is easy to deploy and configure using Cloudera Manager. I also
>>>>> hope that some of Aerospike’s speed optimization techniques can make
>>>>> it into Kudu in the future, if they have not been already thought of
>>>>> or included.
>>>>>
>>>>> Just some thoughts…
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>
>>>>
>>>>
>>>> --
>>>> Mike Percy
>>>> Software Engineer, Cloudera
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
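
As a back-of-the-envelope check on the write figures quoted earlier in the
thread (1M+ rows/second sustained and 4M in bursts on a ~70 node cluster
with 1KB rows), the sketch below works out what that means per node. It
assumes the load is spread evenly, that 1KB means 1024 bytes, and, in the
last line, a replication factor of 3; none of these assumptions is stated
in the thread, and the function name is hypothetical.

```python
def per_node_write_rate(rows_per_sec, row_bytes, nodes, replication=1):
    """Return (rows/s per node, MB/s per node) for an evenly spread load."""
    # Each row is written `replication` times across the cluster.
    rows_each = rows_per_sec * replication / nodes
    return rows_each, rows_each * row_bytes / 1e6


ROW_BYTES = 1024  # assumption: "1KB rows" taken as 1024 bytes
NODES = 70        # "~70 node cluster" from the thread

for label, rate in (("sustained", 1_000_000), ("burst", 4_000_000)):
    rows, mb = per_node_write_rate(rate, ROW_BYTES, NODES)
    print(f"{label}: ~{rows:,.0f} rows/s and ~{mb:.1f} MB/s per node")

# Assumed replication factor of 3, for illustration only.
rows, mb = per_node_write_rate(1_000_000, ROW_BYTES, NODES, replication=3)
print(f"sustained, 3 replicas: ~{mb:.1f} MB/s per node")
```

Even with the assumed replication, the sustained rate is a few tens of MB/s
per node, which is consistent with the remark that old HDD-only nodes can
keep up.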
