spark-user mailing list archives

From Manivannan Selvadurai <smk.manivan...@gmail.com>
Subject Re: Best way to store Avro Objects as Parquet using SPARK
Date Mon, 21 Mar 2016 08:37:03 GMT
Hi,

Which version of Spark are you using?

On Mon, Mar 21, 2016 at 12:28 PM, Sebastian Piu <sebastian.piu@gmail.com>
wrote:

> We use this, but I'm not sure how the schema is stored:
>
> Job job = Job.getInstance();
> ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class);
> AvroParquetOutputFormat.setSchema(job, schema);
> LazyOutputFormat.setOutputFormatClass(job, new ParquetOutputFormat<T>().getClass());
> job.getConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
> job.getConfiguration().set("parquet.enable.summary-metadata", "false");
>
> // save the file
> rdd.mapToPair(me -> new Tuple2(null, me))
>     .saveAsNewAPIHadoopFile(
>         String.format("%s/%s", path, timeStamp.milliseconds()),
>         Void.class,
>         clazz,
>         LazyOutputFormat.class,
>         job.getConfiguration());
>
> On Mon, 21 Mar 2016, 05:55 Manivannan Selvadurai, <
> smk.manivannan@gmail.com> wrote:
>
>> Hi All,
>>
>> In my current project there is a requirement to store Avro data
>> (JSON format) as Parquet files.
>> I was able to use AvroParquetWriter on its own to create the Parquet
>> files. Along with the data, the Parquet files also had the Avro schema
>> stored in their footer.
>>
>> But when I tried using Spark Streaming, I could not find a way to
>> store the data with the Avro schema information. The closest I got was
>> to create a DataFrame from the JSON RDDs and store it as Parquet. Here
>> the Parquet files had a Spark-specific schema in their footer.
>>
>> Is this the right approach, or is there a better one? Please guide
>> me.
>>
>>
>> We are using Spark 1.4.1.
>>
>> Thanks in advance!
>>
>
