spark-user mailing list archives

From Soumitra Kumar <kumar.soumi...@gmail.com>
Subject Re: Kafka->HDFS to store as Parquet format
Date Tue, 07 Oct 2014 17:24:55 GMT
Currently I am not doing anything about it; if the schema changes, I start from scratch.

In general, I doubt there are many options to account for schema changes. If you are reading
the files using Impala, it may tolerate schema changes that are append-only. Otherwise, existing
Parquet files have to be migrated to the new schema.
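
To make "append-only" concrete, here is a minimal sketch in plain Scala. It models a schema as an ordered list of fields and treats a change as append-only when the old schema is a prefix of the new one. Field and isAppendOnly are hypothetical names for illustration, not part of the Parquet or Impala APIs, and the sketch does not verify what any particular reader actually accepts.

```scala
// Simplified stand-in for a Parquet schema: an ordered list of (name, type) fields.
// This is NOT the parquet-mr API; Field and isAppendOnly are hypothetical names.
final case class Field(name: String, tpe: String)

// True when `newer` keeps every existing field, in order and unchanged,
// and adds zero or more fields after them.
def isAppendOnly(older: Seq[Field], newer: Seq[Field]): Boolean =
  newer.startsWith(older)

val v1 = Seq(Field("ts", "int64"), Field("user", "binary"))
val v2 = v1 :+ Field("country", "binary")                    // column appended at the end
val v3 = Seq(Field("ts", "int32"), Field("user", "binary"))  // type of an existing column changed

println(isAppendOnly(v1, v2)) // true
println(isAppendOnly(v1, v3)) // false
```

A check like this could run before writing with a new schema, to decide between continuing to write and migrating the existing files.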

----- Original Message -----
From: "Buntu Dev" <buntudev@gmail.com>
To: "Soumitra Kumar" <kumar.soumitra@gmail.com>
Cc: user@spark.incubator.apache.org
Sent: Tuesday, October 7, 2014 10:18:16 AM
Subject: Re: Kafka->HDFS to store as Parquet format


Thanks for the info, Soumitra. It's a good start for me.


Just wanted to know how you are managing schema changes/evolution, since parquetSchema is passed
to setSchema in your sample code.


On Tue, Oct 7, 2014 at 10:09 AM, Soumitra Kumar <kumar.soumitra@gmail.com> wrote:



I have used it to write Parquet files as: 

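// Package names are the pre-Apache parquet-mr ones; newer releases use org.apache.parquet.*
import org.apache.hadoop.mapreduce.Job
import parquet.example.data.Group
import parquet.hadoop.ParquetOutputFormat
import parquet.hadoop.example.ExampleOutputFormat
import parquet.hadoop.metadata.CompressionCodecName
import parquet.schema.MessageTypeParser
val job = new Job()
val conf = job.getConfiguration
conf.set(ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name())
ExampleOutputFormat.setSchema(job, MessageTypeParser.parseMessageType(parquetSchema))
rdd.saveAsNewAPIHadoopFile(rddToFileName(outputDir, em, time), classOf[Void], classOf[Group], classOf[ExampleOutputFormat], conf)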
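
The snippet does not show parquetSchema itself. Presumably it is a String in Parquet's message-type text syntax, which is what MessageTypeParser.parseMessageType expects. A minimal sketch of what such a value might look like follows; the message name and all field names are hypothetical and not taken from the original application.

```scala
// Hypothetical schema text in Parquet's message-type syntax.
// The message name "event" and the three fields are made up for illustration.
val parquetSchema: String =
  """message event {
    |  required int64 timestamp;
    |  optional binary user_id (UTF8);
    |  optional binary payload (UTF8);
    |}""".stripMargin

println(parquetSchema)
```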



----- Original Message ----- 
From: "bdev" <buntudev@gmail.com>
To: user@spark.incubator.apache.org 
Sent: Tuesday, October 7, 2014 9:51:40 AM 
Subject: Re: Kafka->HDFS to store as Parquet format 

After a bit of looking around, I found that saveAsNewAPIHadoopFile could be used
to specify the ParquetOutputFormat. Has anyone used it to convert JSON to
Parquet format? Any pointers are welcome, thanks!



-- 
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-HDFS-to-store-as-Parquet-format-tp15768p15852.html

Sent from the Apache Spark User List mailing list archive at Nabble.com. 

--------------------------------------------------------------------- 
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org 
For additional commands, e-mail: user-help@spark.apache.org 


