Thank you both. Does it make a difference from a performance perspective, though, whether I do a bulk load through Impala or through Spark? Will the Kudu client with Spark be faster than Impala?

On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon <> wrote:
On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <> wrote:
Hi Boris.

1) I would like to bypass Impala, as the data for my bulk load comes from Sqoop and the Avro files are stored on HDFS.
What's the objection to Impala? In the example below, Impala reads from an HDFS-resident table, and writes to the Kudu table.
2) We do not want to deal with MapReduce.

You can still use Spark. The MR reference is to the Input/OutputFormat classes, which are defined in Hadoop MR; Spark can use these. See, for example:

While that's possible, I'd recommend using the DataFrames API instead. E.g., see

That should work as well as (or better than) the MR OutputFormat.
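
For illustration, here is a minimal PySpark sketch of the DataFrame route: read the Avro files from HDFS and append them to an existing Kudu table. It assumes the kudu-spark and Avro data source jars are on the Spark classpath and that the Kudu table already exists with a matching schema. The function names, and the master address, path and table name in the usage note, are hypothetical. The Avro format name is an assumption too: "avro" on newer Spark, "com.databricks.spark.avro" on older releases.

```python
def kudu_options(kudu_master, kudu_table):
    """Build the option map expected by the kudu-spark data source."""
    return {"kudu.master": kudu_master, "kudu.table": kudu_table}


def load_avro_into_kudu(spark, avro_path, kudu_master, kudu_table,
                        avro_format="avro"):
    """Read Avro files into a DataFrame and append the rows to a Kudu table.

    `spark` is an existing SparkSession. The Kudu table must already exist;
    this sketch does not create it.
    """
    # Read the Avro files (e.g. the Sqoop output on HDFS) as a DataFrame.
    df = spark.read.format(avro_format).load(avro_path)

    # Write through the kudu-spark data source; "append" inserts the rows.
    (df.write
       .format("org.apache.kudu.spark.kudu")
       .options(**kudu_options(kudu_master, kudu_table))
       .mode("append")
       .save())
    return df
```

You would call it along the lines of load_avro_into_kudu(spark, "hdfs:///path/to/avro", "kudu-master:7051", "some_kudu_table"), with your own values substituted.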


However, you'll have to write (simple) Spark code, whereas with method #1 you do effectively the same thing under the covers using SQL statements via Impala.


What’s the most efficient way to bulk load data into Kudu?

The easiest way to load data into Kudu is if the data is already managed by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table SELECT * FROM some_csv_table does the trick.

You can also use Kudu’s MapReduce OutputFormat to load data from HDFS, HBase, or any other data store that has an InputFormat.

No tool is provided to load data directly into Kudu’s on-disk data format. We have found that for many workloads, the insert performance of Kudu is comparable to bulk load performance of other systems.

Todd Lipcon
Software Engineer, Cloudera