kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: new Kudu benchmarks
Date Fri, 05 Jan 2018 23:23:29 GMT
Hey Mauricio,

Answers inline below

On Fri, Jan 5, 2018 at 2:50 PM, Mauricio Aristizabal <
mauricio@impactradius.com> wrote:

> Todd, since you bring it up in this thread... what CDH version do you
> expect DECIMAL support to make it into? I recently asked Icaro Vazquez
> about it but still no news.  We're hoping it makes it into 5.14 otherwise
> according to the roadmap there might not be another minor release and we'd
> be waiting till Summer for CDH 6.
>

As this is an open source project mailing list, it would be inappropriate
for me to comment on a vendor's release schedule. Please note that Kudu is
a product of the Apache Software Foundation and the ASF doesn't have any
influence on or knowledge of Cloudera's release plans.

Of course it happens that I and many other contributors are also employees
of Cloudera, but we participate in the ASF as individuals and not
representatives of our employer, and so generally won't comment on
questions like this in this forum. Please refer to Cloudera's forums for
questions about CDH release plans, etc.


>
> And just in case we're forced to make do without DECIMAL initially, is the
> recommendation really to store as string and convert?  I was thinking of
> storing as int/long and dividing by 10 or 1000 as needed in an impala view
> over the kudu table.  Wouldn't a division be way more performant than a
> conversion from string, especially when aggregating over thousands of
> records in a report query?
>

You're right -- using an integer type and dividing by a power of 10 is
going to be much faster than casting from a string.  Division by a constant
would be JITted by Impala into a pretty minimal sequence of assembly
instructions (two bitshifts, an integer multiplication, and a subtraction)
that likely takes about 6 cycles total. In contrast, a cast from string to
decimal probably takes many thousands of cycles.
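
To make the "two bitshifts, a multiplication, and a subtraction" point
concrete, here is a rough Python sketch of the general compiler technique for
dividing a signed 32-bit integer by the constant 1000. It is an illustration
only, not Impala's actual generated code; the names MAGIC, SHIFT and div1000
are made up for this example, and the 32-bit width is an assumption.

```python
# Illustrative only: mimics how a compiler can replace "n / 1000" on a
# signed 32-bit integer with a multiply, two shifts, and a subtraction.
MAGIC = 274877907  # ceil(2**38 / 1000), the precomputed reciprocal
SHIFT = 38


def div1000(n: int) -> int:
    """Truncating division by 1000 for -2**31 <= n < 2**31, with no divide."""
    assert -2**31 <= n < 2**31
    q = (n * MAGIC) >> SHIFT  # multiply, then arithmetic shift (rounds down)
    return q - (n >> 31)      # n >> 31 is -1 for negatives, 0 otherwise


print(div1000(123456))   # 123
print(div1000(-123456))  # -123
```

The subtraction only corrects the rounding direction for negative inputs, so
the result truncates toward zero the way a hardware integer divide would.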

The only downside is that if you have end users using the data they might
be confused by the integer representation whereas a string representation
would be a little clearer.
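
As a small sketch of what that looks like in practice, the snippet below
stores amounts as scaled integers, aggregates them as plain integers, and
converts to a decimal string only once for display. The scale of 2 and the
helper names to_scaled and to_display are hypothetical and chosen just for
this example; in your setup the conversion would live in the Impala view.

```python
from decimal import Decimal

SCALE = 2  # hypothetical: amounts are stored as integer hundredths


def to_scaled(text: str) -> int:
    """Convert a decimal string to its scaled integer form.

    Digits beyond SCALE decimal places are truncated by int().
    """
    return int(Decimal(text).scaleb(SCALE))


def to_display(scaled: int) -> str:
    """Convert a scaled integer back to a decimal string for end users."""
    return str(Decimal(scaled).scaleb(-SCALE))


stored = [to_scaled(s) for s in ["12.34", "0.66", "100.00"]]
total = sum(stored)       # aggregation is pure integer arithmetic
print(stored)             # [1234, 66, 10000]
print(to_display(total))  # 113.00
```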

Thanks
-Todd


>
> On Fri, Jan 5, 2018 at 11:13 AM, Todd Lipcon <todd@cloudera.com> wrote:
>
>> Oh, one other piece of feedback: maybe worth editing the title to say "vs
>> Apache Parquet" instead of "vs Apache Impala" since in all cases you are
>> using Impala as the query engine?
>>
>> -Todd
>>
>> On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>>> Hey Boris,
>>>
>>> Thanks for publishing this. It's a great look at how an end user
>>> evaluates Kudu. I appreciate that you cover both the pros and cons of the
>>> technology, and glad to see that your conclusion leaves you excited about
>>> Kudu :)
>>>
>>> One quick note is that I think you'll be even more pleased when you
>>> upgrade to a later version (eg Kudu 1.5). We've improved performance in
>>> several areas and also improved scalability compared to the version you're
>>> testing. TIMESTAMP is also supported now, with DECIMAL soon to follow. It
>>> might be worth noting this as an addendum to the blog post if you feel like
>>> it.
>>>
>>> -Todd
>>>
>>> On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin <boris@boristyukin.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> we just finished testing Kudu, mostly comparing Kudu to Impala on
>>>> HDFS/parquet. I wanted to share my blog post and results. We used typical
>>>> (and real) healthcare data for the test, not synthetic data, which I
>>>> think makes it a bit more interesting.
>>>>
>>>> I welcome any feedback!
>>>>
>>>> http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
>>>>
>>>> We are really impressed with Kudu and I wanted to take an opportunity
>>>> to thank Kudu developers for such an amazing and much-needed product.
>>>>
>>>> Boris
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> *MAURICIO ARISTIZABAL*
> Architect - Business Intelligence + Data Science
> mauricio@impactradius.com (m) +1 323 309 4260
> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>
> Overview <http://www.impactradius.com/?src=slsap> | Twitter
> <https://twitter.com/impactradius> | Facebook
> <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
> <https://www.linkedin.com/company/impact-radius-inc->
>



-- 
Todd Lipcon
Software Engineer, Cloudera
