kudu-user mailing list archives

From Guoqiao He <jacky...@gmail.com>
Subject Re: abnormal high disk I/O rate when upsert into kudu table?
Date Sun, 21 Aug 2016 22:40:13 GMT
Thanks for the explanation, Todd. Now waiting for the 0.10.0 parcels to test.




On Fri, Aug 19, 2016 at 2:11 AM +0800, "Todd Lipcon" <todd@cloudera.com> wrote:










Hey Ben and Jacky,
Apologies for my late response. As you might have seen on the Kudu blog, a lot of the contributors
have been busy wrapping up the 0.10.0 release this week. Answers inline


On Aug 16, 2016, at 6:05 PM, jacky.he@gmail.com wrote:
Thanks Todd.
The Kudu cluster is running on CentOS 7.2; each tablet node has 40 cores. The test table is about 140GB after 3 replicas and is partitioned by hash bucket; I have tried 24 and 120 hash buckets.
I did one test:
1. Stop all ingestion to the cluster.
2. Randomly upsert 3000 rows once; the upserts contain new rows or updates to existing rows (updating the whole row, not just one or more columns).
3. From the CDH monitoring dashboard, I see the cluster's disk I/O rise from ~300Mb/s to ~1.5Gb/s, and return to ~300Mb/s 30 minutes or more later.

I checked some of the tablet node INFO logs; they are always doing compaction, compacting anywhere from one to hundreds of thousands of rows.

My questions:
1. Is the maintenance manager rewriting the whole table? Will a single upsert of 3000 rows trigger a rewrite of the whole table?
If those 3000 rows are spread across the whole key space, then yes, it currently will. If,
on the other hand, you had a table something like:
CREATE TABLE t (
  ts INTEGER PRIMARY KEY,
  other_data STRING,
  ...
) DISTRIBUTE BY HASH(ts) INTO 120 BUCKETS
and your inserts were for a small range of time (e.g. concentrated around "now"), then the compaction would only rewrite the portion of the key space that has new rows (or updates) affecting it.
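As a rough, hypothetical sketch (not from the original thread), time-concentrated upserts like that might look as follows with the Kudu Java client that Jacky mentions below. It assumes a table like the example above already exists, but with ts stored as a 64-bit BIGINT millisecond timestamp rather than INTEGER; the master address, table name, and column names are placeholders, and the org.apache.kudu package names correspond to Kudu 1.0 and later (the 0.x clients used org.kududb).

import java.util.UUID;

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;
import org.apache.kudu.client.Upsert;

public class TimeConcentratedUpserts {
  public static void main(String[] args) throws Exception {
    // Placeholder master address and table name.
    KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
    try {
      KuduTable table = client.openTable("t");
      KuduSession session = client.newSession();
      // Buffer writes client-side and flush them in the background.
      session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

      for (int i = 0; i < 3000; i++) {
        Upsert upsert = table.newUpsert();
        PartialRow row = upsert.getRow();
        // Keys clustered around "now": only a narrow slice of the key
        // space is touched, so later compaction rewrites only that slice.
        row.addLong("ts", System.currentTimeMillis() + i);
        row.addString("other_data", UUID.randomUUID().toString());
        session.apply(upsert);
      }
      session.flush();
      session.close();
    } finally {
      client.shutdown();
    }
  }
}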
2. Does the background I/O have an impact on scan performance?

Of course it has some. However, it is restricted by default to a single thread, so it should use only a small percentage of the machine's total capacity. I worked on a patch a few years ago to use ioprio_set to mark these I/Os as "low priority", but didn't have enough time to validate that it helped any workload, and thus it didn't get committed. In case anyone's interested in trying it, I posted the (very old) diff at https://gist.github.com/toddlipcon/faaf8e74b4dae93e668e8bda1118b58a. It will probably need some conflict resolution to apply it.

3. About the number of hash-partitioned buckets: I partitioned the table into 24 or 120 buckets. What is the difference in upsert and scan performance, and what are the best practices?

For write performance, I'd recommend several (5-10) tablets per tablet server being actively written to. Individual tablets can handle multi-threaded writes pretty well, though there are some bottlenecks at very high throughputs, so having just one per tablet server would not get peak performance.
On the read side, it's worth noting that _currently_ both the Spark and Impala integrations start only one scanner thread per tablet, so the number of tablets limits scan parallelism. Given that, I expect you would see better performance with 120 buckets. This is, however, a temporary limitation: we intend to add an API at some point to make it easier for query engines to divide a scan into smaller chunks that can be spread across more threads regardless of the number of tablets.
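As another rough sketch (again not from the original thread), creating a table like the earlier CREATE TABLE ... DISTRIBUTE BY HASH(ts) INTO 120 BUCKETS example through the Java client might look roughly like this; the master address, table name, and column names are placeholders, ts is assumed to be a BIGINT, and the packages again correspond to Kudu 1.0 and later.

import java.util.Arrays;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateHashPartitionedTable {
  public static void main(String[] args) throws Exception {
    // Placeholder master address.
    KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
    try {
      Schema schema = new Schema(Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("ts", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("other_data", Type.STRING).build()));

      // 120 hash buckets on the key column, as in the earlier example, and
      // 3 replicas to match the "140GB after 3 reps" setup described above.
      CreateTableOptions options = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("ts"), 120)
          .setNumReplicas(3);

      client.createTable("t", schema, options);
    } finally {
      client.shutdown();
    }
  }
}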
4. What is the recommended setting for the tablet server memory hard limit?
It depends on how much you want to devote to other applications :) On the write-buffering side, I have seen diminishing returns after 10GB or so. However, you can use a much bigger memory limit and devote an arbitrary amount of it to the block cache, which will improve both write performance (since you can get a 100% cache hit rate on bloom filters) and read performance (since you will hit the cache more on reads). Of course, it's worth noting that even if you stick to a small memory limit for Kudu, the Linux page cache is still used, and you can still get many reads serviced from memory.

On Tue, Aug 16, 2016 at 6:09 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
This could be a problem… If this is a bad byproduct brought over from HBase, then this is
a common issue for all HBase users. It would be too bad if this also exists in Kudu. We HBase
users have been trying to eradicate this for a long time.
The main difference between compaction in Kudu and in HBase is that all of our compaction is "incremental". We have also designed compaction to always run at a constant rate.
As you've noticed, you should expect that in a live Kudu cluster, there's always some background
work consuming IO. This has the downside that you're always seeing some performance impact
due to it. This has the huge (IMHO) upside that you are _always_ seeing the performance impact
-- in other words, you will never be "surprised" by a compaction starting.
This is unlike the design in HBase (and many other LSM-tree designs) where there is a distinct
"trigger point" at which compaction starts. In those systems, you can have something performing
well during testing, and then all of a sudden reach some threshold where the performance profile
drastically changes (possibly resulting in getting paged in the middle of the night). Personally,
as an operator, I would always pick a system which is _consistently_ a bit slower than one
which is sometimes faster and at arbitrary times goes into a slow mode.
One thing we should probably consider is having some sort of very low threshold below which
we don't trigger a compaction. The example of a large table with a few thousand inserts is
a good one - we are probably better off just waiting until the situation is a little bit worse
before starting compaction. We just don't want to wait until it's out of control.
Hope that reasoning makes sense to you, and that your experience testing Kudu has borne it
out.
-Todd
jacky.he@gmail.com

From: Todd Lipcon
Date: 2016-08-17 01:58
To: user
Subject: Re: abnormal high disk I/O rate when upsert into kudu table?

Hi Jacky,
Answers inline below
On Tue, Aug 16, 2016 at 8:13 AM, jacky.he@gmail.com <jacky.he@gmail.com> wrote:
Dear Kudu Developers, 
I am a new tester of Kudu. Our Kudu cluster has 3+12 nodes: 3 separate master nodes and 12 tablet nodes. Each node has 128GB of memory, 1 SSD for the WAL, and six 1TB SAS disks for data.
We are using CDH 5.7.0 with the impala-kudu 2.7.0 and kudu 0.9.1 parcels, and we set a 16GB memory hard limit for each tablet node.
Sounds like a good cluster setup. Thanks for providing the details. 
One of our test tables has about 80-100 columns and 1 key column. With the Java client, we can insert/upsert into the Kudu table at about 100,000 rows/s.
The Kudu table has 300M rows, with about 300,000 rows updated per day; we also use the Java client upsert API to update the rows.

We found that the Kudu cluster may encounter an abnormally high disk I/O rate, about 1.5-2.0Gb/s, even when we only update 1,000~10,000 rows/s.
I would like to know: with our row update frequency, is such a high disk I/O rate normal or not?
Are your upserts randomly spread across the range of rows in the table? If so, then when the updates flush, they'll trigger compactions that merge the updates and inserted rows into the existing data. Over time, this will cause a rewrite of the whole table in order to incorporate the updates.
This background I/O is run by the "maintenance manager". You can visit http://tablet-server:8050/maintenance-manager to
see a dashboard of currently running maintenance operations such as compactions.
The maintenance manager runs a preset number of threads, so the amount of background I/O you're
experiencing won't increase if you increase the number of upserts.
I'm curious, is the background I/O causing an issue, or just unexpected?
Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera



-- 
Todd Lipcon
Software Engineer, Cloudera






