kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: abnormal high disk I/O rate when upsert into kudu table?
Date Thu, 18 Aug 2016 18:11:06 GMT
Hey Ben and Jacky,

Apologies for my late response. As you might have seen on the Kudu blog, a
lot of the contributors have been busy wrapping up the 0.10.0 release this
week. Answers inline

>
>
> On Aug 16, 2016, at 6:05 PM, jacky.he@gmail.com wrote:
>
> Thanks Todd.
>
> The Kudu cluster is running on CentOS 7.2, and each tablet node has 40 cores. The
> test table is about 140GB after 3 replicas, and it is partitioned by hash bucket; I
> have tried 24 and 120 hash buckets.
>
> I did one test:
> 1. Stop all ingestion to the cluster
> 2. Randomly upsert 3000 rows once; each upsert is either a new row or an
> update to an existing row (updating the whole row, not just one or more columns)
> 3. From the CDH monitoring dashboard, I see the cluster's disk I/O rise
> from ~300Mb/s to ~1.5Gb/s and return to ~300Mb/s 30 minutes or more later
>
> I checked some of the tablet nodes' INFO logs; they are always doing
> compaction, compacting anywhere from a single row to hundreds of thousands of rows.
>
> My questions:
> 1. Is the maintenance manager rewriting the whole table? Will upserting 3000
> rows at once trigger a rewrite of the whole table?
>
>
If those 3000 rows are spread across the whole key space, then yes, it
currently will. If, on the other hand, you had a table something like:

CREATE TABLE t (
  ts INTEGER PRIMARY KEY,
  other_data string,
  ...
) DISTRIBUTE BY HASH(ts) INTO 120 BUCKETS

and your inserts were for a small range of time (e.g. concentrated around
"now"), then the compaction would only rewrite the portion of the key space
affected by new rows (or updates).
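To make that concrete, here is a minimal sketch of the time-concentrated
upsert pattern using the Kudu Java client. The master address, table name, and
column names are hypothetical (they follow the DDL sketch above), and the
package names assume a 1.0+ client (the 0.9/0.10 releases used org.kududb
rather than org.apache.kudu):

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;
import org.apache.kudu.client.Upsert;

public class TimeConcentratedUpserts {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("t");
    KuduSession session = client.newSession();
    // Batch operations in the background instead of one RPC per row.
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

    int nowSec = (int) (System.currentTimeMillis() / 1000);
    for (int i = 0; i < 3000; i++) {
      Upsert upsert = table.newUpsert();
      PartialRow row = upsert.getRow();
      // Keys clustered in the last few minutes: flushes and compactions only
      // touch the "recent" end of each tablet's key space, not the whole table.
      row.addInt("ts", nowSec - (i % 300));
      row.addString("other_data", "value-" + i);
      session.apply(upsert);
    }
    session.flush();
    session.close();
    client.close();
  }
}

Some of those upserts hit keys that already exist and therefore act as updates,
which matches the test described above; the point is simply that the affected
key range stays narrow, so compaction only rewrites the rowsets covering it.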


> 2. Does the background I/O have an impact on scan performance?
>
Of course it has some. However, it is restricted by default to a single
thread, so it should use only a small percentage of the machine's total
capacity. I worked on a patch a few years ago that uses ioprio_set to mark
these I/Os as "low priority", but I didn't have enough time to validate that
it helped with any workload, so it didn't get committed. In case
anyone's interested in trying it, I posted the (very old) diff to
https://gist.github.com/toddlipcon/faaf8e74b4dae93e668e8bda1118b58a . It
will probably need some conflict resolution to apply it.
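For what it's worth, that thread count is controlled by a tablet server flag,
so the background I/O budget can be tuned if it ever becomes a bottleneck; a
flagfile-style sketch (the value here is purely illustrative):

# Raise the maintenance manager (flush/compaction) thread count from its
# default of 1. More threads means more concurrent background I/O.
--maintenance_manager_num_threads=2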


> 3. About the number of hash buckets: I partitioned the table into 24 or 120
> buckets. What's the difference in upsert and scan performance,
> and what are the best practices?
>
For write performance, I'd recommend several (5-10) tablets per tablet
server being actively written to. Individual tablets can handle
multi-threaded writes pretty well, though there are some bottlenecks at
very high throughputs, so having just one per tablet server would not reach
peak performance.

On the read side, it's worth noting that _currently_ both the Spark and
Impala integrations start only one scanner thread per tablet. So, the
number of tablets limits scan parallelism. Given that, I expect you would
see better performance with 120 buckets. This is, however, a temporary
limitation: we intend to add some API at some point to make it easier for
query engines to divide the scanner into smaller chunks which can be spread
across more threads regardless of the number of tablets.
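As an illustration of the bucket-count choice, here is a minimal sketch of
creating such a table through the Java client (same caveats as the earlier
sketch: the master address, table name, and schema are hypothetical):

import java.util.Arrays;
import java.util.Collections;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateBucketedTable {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT32).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("other_data", Type.STRING).build()));
    // With 12 tablet servers, 120 buckets x 3 replicas comes to ~30 tablet
    // replicas (and ~10 tablet leaders taking writes) per server, while also
    // allowing up to 120 parallel scanner threads on the read side.
    CreateTableOptions opts = new CreateTableOptions()
        .addHashPartitions(Collections.singletonList("ts"), 120)
        .setNumReplicas(3);
    client.createTable("t", schema, opts);
    client.close();
  }
}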


> 4. What is the recommended setting for the tablet server memory hard limit?
>
It depends on how much you want to devote to other applications :) On the
write-buffering side, I have seen diminishing returns after 10GB or so.
However, you can use a much bigger memory limit and devote an arbitrary
amount to the block cache, which will improve both write performance (since
you can get a 100% cache hit rate on bloom filters) and read
performance (since you will hit the cache more on reads). Of course, it's worth
noting that if you stick to a small memory limit for Kudu, the Linux page
cache is still used, and you can still get many reads serviced from memory.
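As a concrete (purely illustrative) example, both knobs are tablet server
flags; a flagfile sketch for a node where roughly 32GB can be given to Kudu
might look like this:

# Hard memory limit for the tablet server process (32GB):
--memory_limit_hard_bytes=34359738368
# Give the block cache 16GB to keep bloom filters and hot data blocks in memory:
--block_cache_capacity_mb=16384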


On Tue, Aug 16, 2016 at 6:09 PM, Benjamin Kim <bbuild11@gmail.com> wrote:

> This could be a problem… If this is a bad byproduct brought over from
> HBase, then this is a common issue for all HBase users. It would be too bad
> if this also exists in Kudu. We HBase users have been trying to eradicate
> this for a long time.
>

The main difference between compaction in Kudu and in HBase is that all of
our compaction is "incremental". We also designed compaction to always run
at the same speed. As you've noticed, you should expect that
in a live Kudu cluster, there's always some background work consuming I/O.
This has the downside that you're always seeing some performance impact due
to it. It also has the huge (IMHO) upside that you are _always_ seeing the
performance impact -- in other words, you will never be "surprised" by a
compaction starting.

This is unlike the design in HBase (and many other LSM-tree designs) where
there is a distinct "trigger point" at which compaction starts. In those
systems, you can have something performing well during testing, and then
all of a sudden reach some threshold where the performance profile
drastically changes (possibly resulting in getting paged in the middle of
the night). Personally, as an operator, I would always pick a system which
is _consistently_ a bit slower over one which is sometimes faster but goes
into a slow mode at arbitrary times.

One thing we should probably consider is having some sort of very low
threshold below which we don't trigger a compaction. The example of a large
table with a few thousand inserts is a good one - we are probably better
off just waiting until the situation is a little bit worse before starting
compaction. We just don't want to wait until it's out of control.

Hope that reasoning makes sense to you, and that your experience testing
Kudu has borne it out.

-Todd

------------------------------
> jacky.he@gmail.com
>
>
> *From:* Todd Lipcon <todd@cloudera.com>
> *Date:* 2016-08-17 01:58
> *To:* user <user@kudu.apache.org>
> *Subject:* Re: abnormal high disk I/O rate when upsert into kudu table?
> Hi Jacky,
>
> Answers inline below
>
> On Tue, Aug 16, 2016 at 8:13 AM, jacky.he@gmail.com <jacky.he@gmail.com>
> wrote:
>
>> Dear Kudu Developers,
>>
>> I am a new tester of Kudu. Our Kudu cluster has 3+12 nodes: 3 separate
>> master nodes and 12 tablet nodes.
>> Each node has 128GB of memory, 1 SSD for the WAL, and 6 1TB SAS drives for data.
>>
>> We are using CDH 5.7.0 with the impala-kudu 2.7.0 and kudu 0.9.1 parcels,
>> and we set a 16GB memory hard limit for each tablet node.
>>
> Sounds like a good cluster setup. Thanks for providing the details.
>
>
>
>> One of our test tables has about 80-100 columns and 1 key column. With the Java
>> client, we can insert/upsert into the Kudu table at about 100,000 rows/s.
>> The Kudu table has 300m rows, with about 300,000 row updates per day; we
>> also use the Java client upsert API to update the rows.
>>
>> We found that the Kudu cluster may encounter an abnormally high disk I/O rate,
>> about 1.5-2.0Gb/s, even when we just update 1,000~10,000 rows/s.
>> I would like to know: with our row update frequency, is the cluster's high
>> disk I/O rate normal or not?
>>
>
> Are your upserts randomly spread across the range of rows in the table? If
> so, then when the updates flush, they'll trigger compactions of the updates
> and inserted rows into the existing data. This will cause, over time, a
> rewrite of the whole table, in order to incorporate the updates.
>
> This background I/O is run by the "maintenance manager". You can visit
> http://tablet-server:8050/maintenance-manager to see a dashboard of
> currently running maintenance operations such as compactions.
>
> The maintenance manager runs a preset number of threads, so the amount of
> background I/O you're experiencing won't increase if you increase the
> number of upserts.
>
> I'm curious, is the background I/O causing an issue, or just unexpected?
>
> Thanks
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
