spark-user mailing list archives

From: Sebastian Piu <sebastian....@gmail.com>
Subject: Re: Best way to store Avro Objects as Parquet using SPARK
Date: Mon, 21 Mar 2016 06:58:31 GMT
We use this, but I'm not sure how the schema is stored:

Job job = Job.getInstance();

// Use Avro write support so the Avro schema is embedded in the Parquet footer
ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
AvroParquetOutputFormat.setSchema(job, schema);

// Wrap in LazyOutputFormat so empty partitions don't produce empty part files
LazyOutputFormat.setOutputFormatClass(job, new ParquetOutputFormat<T>().getClass());

// Skip the _SUCCESS marker and the Parquet summary (_metadata) files
job.getConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
job.getConfiguration().set("parquet.enable.summary-metadata", "false");

// Save the RDD; the key is unused (Void), the value is the Avro record
rdd.mapToPair(me -> new Tuple2<>(null, me))
   .saveAsNewAPIHadoopFile(
       String.format("%s/%s", path, timeStamp.milliseconds()),
       Void.class,
       clazz,
       LazyOutputFormat.class,
       job.getConfiguration());
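
On the schema question: AvroWriteSupport should embed the Avro schema as key/value
metadata in each file's footer, so one way to verify is to read a footer back. A
rough sketch, assuming parquet-hadoop is on the classpath (the package prefix and
the exact metadata key vary between parquet-mr versions, so both lookups below are
guesses, not guarantees):

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;

public class FooterCheck {
    public static void main(String[] args) throws Exception {
        // Path to one of the written part files (passed in for illustration)
        Path file = new Path(args[0]);

        // Read only the footer metadata, not the data pages
        ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);
        Map<String, String> meta = footer.getFileMetaData().getKeyValueMetaData();

        // AvroWriteSupport stores the Avro schema under one of these keys,
        // depending on the parquet-mr version
        System.out.println(meta.get("parquet.avro.schema"));
        System.out.println(meta.get("avro.schema"));
    }
}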

On Mon, 21 Mar 2016, 05:55 Manivannan Selvadurai, <smk.manivannan@gmail.com>
wrote:

> Hi All,
>
>           In my current project there is a requirement to store Avro data
> (JSON format) as Parquet files.
> I was able to use AvroParquetWriter separately to create the Parquet
> files. Along with the data, the Parquet files also had the Avro schema
> stored as part of their footer.
>
>            But when I tried using Spark Streaming I could not find a way to
> store the data with the Avro schema information. The closest that I got was
> to create a DataFrame from the JSON RDDs and store it as Parquet. Here
> the Parquet files had a Spark-specific schema in their footer.
>
>       Is this the right approach, or is there a better one? Please guide
> me.
>
>
> We are using Spark 1.4.1.
>
> Thanks In Advance!!
>
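
For comparison, the DataFrame route described in the question would look roughly
like the sketch below in Spark 1.4 (sqlContext, jsonRdd and outputPath are
placeholders); it infers a Spark SQL schema from the JSON records, which is why
the footer ends up carrying Spark's schema rather than the Avro one:

// Sketch of the DataFrame route (placeholders: sqlContext, jsonRdd, outputPath)
DataFrame df = sqlContext.read().json(jsonRdd); // jsonRdd: JavaRDD<String> of JSON records
df.write().parquet(outputPath);                 // footer carries the inferred Spark SQL schema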
