kudu-user mailing list archives

From Jason Heo <jason.heo....@gmail.com>
Subject Apache Kudu Table is 6.6 times larger than Parquet File.
Date Sat, 11 Mar 2017 03:16:48 GMT
Hello, I'm new to Apache Kudu. I was really impressed by the concept of
Kudu and benchmark results. I'm considering using (Impala + Kudu) on my
team project.

One issue I have is that the Kudu table is much larger than the equivalent
Parquet files:

- Parquet File: 1.3TB
- Kudu Table: 8.6TB

(Both tables are configured with a replication factor of 3.)

I'm using Kudu with CDH 5.10, and most of the configuration is unchanged
(I've only changed `memory_limit_hard_bytes` and `block_cache_capacity_mb`
to prevent bulk load errors).

When I changed the `ENCODING` for some fields, the table size decreased by
only 5%. I suspect there are other optimization techniques that would reduce
the Kudu table size.

I would really appreciate any advice.

Thanks in advance.

`parquet_table` has 38 STRING fields and 6B rows.
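
As a rough back-of-the-envelope check of the figures above, the size ratio and the average bytes per row can be computed in `impala-shell`. This is only a sketch: it assumes 1 TB = 10^12 bytes and exactly 6 billion rows, and the per-row numbers include whatever replication the reported sizes already include.

```sql
-- Rough arithmetic only; the constants are the sizes and row count quoted in this message.
SELECT
  8.6 / 1.3    AS kudu_to_parquet_ratio,  -- roughly 6.6
  8.6e12 / 6e9 AS kudu_bytes_per_row,     -- roughly 1433
  1.3e12 / 6e9 AS parquet_bytes_per_row;  -- roughly 217
```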

The schema of `parquet_table` looks like this:

    ```
    > SHOW CREATE TABLE parquet_table;
    +---------------------------------------------------+
    | result                                            |
    +---------------------------------------------------+
    | CREATE EXTERNAL TABLE default.parquet_table (     |
    |   a STRING,                                       |
    |   b STRING,                                       |
    |   c STRING,                                       |
    |   d STRING,                                       |
        ...
        ...
    | )                                                 |
    | PARTITIONED BY (                                  |
    |   ymd STRING                                      |
    | )                                                 |
    | WITH SERDEPROPERTIES ('serialization.format'='1') |
    | STORED AS PARQUET                                 |
    | LOCATION 'hdfs://hostname/path/to/parquet'        |
    |                                                   |
    +---------------------------------------------------+
    ```

I created `kudu_table` and bulk loaded it using `INSERT INTO kudu_table
SELECT * FROM parquet_table`.

    ```
    > SHOW CREATE TABLE kudu_table;
    +-----------------------------------------------------------------------------+
    | result                                                                      |
    +-----------------------------------------------------------------------------+
    | CREATE TABLE default.kudu_table (                                           |
    |   a STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
    |   b STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
    |   c STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,     |
    |   d STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,     |
        ...
    |   PRIMARY KEY (a, b)                                                        |
    | )                                                                           |
    | PARTITION BY HASH (a) PARTITIONS 40                                         |
    | STORED AS KUDU                                                              |
    | TBLPROPERTIES ('kudu.master_addresses'='host1,host2',                       |
    |                'kudu.table_name'='impala::kudu_table')                      |
    +-----------------------------------------------------------------------------+
    ```
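
For reference, the per-column `ENCODING` and `COMPRESSION` attributes shown in the output above can be set explicitly when the table is created. The sketch below is illustrative only: the table name `kudu_table_lz4` is hypothetical, the choice of `DICT_ENCODING` and `LZ4` is an assumption rather than the setting actually tried here, and only the four columns visible above are included. Whether this reduces the table size for this data has not been measured.

```sql
-- Hypothetical variant of kudu_table with explicit per-column attributes.
-- DICT_ENCODING and LZ4 are example values, not a tested recommendation.
CREATE TABLE kudu_table_lz4 (
  a STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION LZ4,
  b STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION LZ4,
  c STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,
  d STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU;
```

Loading a sample of the data into such a table and comparing its on-disk size against the original would show whether the attributes make a difference.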
