spark-user mailing list archives

From Patrick McGloin
Subject Re: Spark SQL, Parquet and Impala
Date Fri, 01 Aug 2014 14:20:23 GMT
Sorry, sent early, wasn't finished typing.


Then we can select the data using Impala.  But this is registered as an
external table and must be refreshed if new data is inserted.

This seems clumsy and is probably not the correct approach.

How should we insert data from SparkSQL into a Parquet table which can be
directly queried by Impala?

Best regards,

On 1 August 2014 16:18, Patrick McGloin wrote:

> Hi,
> We would like to use Spark SQL to store data in Parquet format and then
> query that data using Impala.
> We have a working solution, but it seems clumsy, so I was wondering
> whether you could tell us the correct way to do this.  We are using
> Spark 1.0 and Impala 1.3.1.
> First we are registering our tables using SparkSQL:
> val sqlContext = new SQLContext(sc)
> sqlContext.createParquetFile[ParqTable]("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt", true)
> Then we are using the HiveContext to register the table and do the insert:
> val hiveContext = new HiveContext(sc)
> import hiveContext._
> hiveContext.parquetFile("hdfs://localhost:8020/user/hive/warehouse/ParqTable.pqt").registerAsTable("ParqTable")
> eventsDStream.foreachRDD(event => event.insertInto("ParqTable"))
> Now we have the data stored in a Parquet file.  To access it in Hive or
> Impala we run
