spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: Bzip2 to Parquet format
Date Mon, 25 Jul 2016 21:43:31 GMT
Hi,

This is the expected behavior.
The default compression codec for Parquet is `snappy`.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L215
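
If a different codec is wanted, it can be overridden. The following is a minimal sketch, assuming Spark 2.0 with a local master; the object name, sample data, and output paths are made up for illustration. It uses the `compression` write option for a single write and the `spark.sql.parquet.compression.codec` setting for the whole session.

```scala
import org.apache.spark.sql.SparkSession

object ParquetCodecExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-codec-example") // hypothetical app name
      .master("local[*]")               // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Tiny illustrative DataFrame; the real data would come from the input files.
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Override the codec for this write only.
    df.write
      .mode("overwrite")
      .option("compression", "gzip")
      .parquet("/tmp/result_gzip.parquet") // hypothetical output path

    // Change the default codec for every later Parquet write in this session.
    spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")
    df.write
      .mode("overwrite")
      .parquet("/tmp/result_uncompressed.parquet") // hypothetical output path

    spark.stop()
  }
}
```

The codec name normally shows up in the part-file names, which is why the default output ends in `.snappy.parquet`.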

// maropu

On Tue, Jul 26, 2016 at 6:33 AM, janardhan shetty <janardhanp22@gmail.com>
wrote:

> Andrew,
>
> 2.0
>
> I tried
> val inputR = sc.textFile(file)
> val inputS = inputR.map(x => x.split("`"))
> val inputDF = inputS.toDF()
>
> inputDF.write.format("parquet").save("result.parquet")
>
> The resulting part files end with *.snappy.parquet. Is that expected?
>
> On Sun, Jul 24, 2016 at 8:00 PM, Andrew Ehrlich <andrew@aehrlich.com>
> wrote:
>
>> You can load the text with sc.textFile() to an RDD[String], then use
>> .map() to convert it into an RDD[Row]. At this point you are ready to
>> apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType).
>>
>> Here is an example of how to define the StructType (schema) that you
>> will combine with the RDD[Row] to create a DataFrame.
>>
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
>>
>> Once you have the DataFrame, save it with
>> dataframe.write.parquet("/path") to create a parquet file.
>>
>> Reference for SQLContext / createDataFrame:
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
>>
>>
>>
>> On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhanp22@gmail.com>
>> wrote:
>>
>> We have data in bzip2 (.bz2) compressed format. Are there any links on
>> using Spark to convert it to Parquet, and any performance benchmarks or
>> use-case study materials?
>>
>>
>>
>
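
For reference, the steps discussed in the quoted thread (textFile, map to Row, apply a schema, write Parquet) can be sketched roughly as below. This is only an illustration under assumptions: Spark 2.0 with a local master, a backtick-delimited input with exactly two string fields, and made-up paths and column names (`col1`, `col2`). `sc.textFile` reads `.bz2` files transparently through the Hadoop compression codecs, so no separate decompression step is needed.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object Bz2ToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bz2-to-parquet") // hypothetical app name
      .master("local[*]")        // assumption: local run for illustration
      .getOrCreate()

    // The bzip2 codec is selected from the file extension when reading.
    val lines = spark.sparkContext.textFile("/data/input.txt.bz2") // hypothetical path

    // Assumed schema: two string columns; adjust to the real data.
    val schema = StructType(Seq(
      StructField("col1", StringType, nullable = true),
      StructField("col2", StringType, nullable = true)
    ))

    // Split on the backtick delimiter, keep trailing empty fields (-1),
    // and drop lines that do not have the expected number of fields.
    val rows = lines
      .map(_.split("`", -1))
      .filter(_.length == schema.length)
      .map(fields => Row.fromSeq(fields.toSeq))

    val df = spark.createDataFrame(rows, schema)

    // Written with the default codec unless it is overridden.
    df.write.mode("overwrite").parquet("/data/output.parquet") // hypothetical path

    spark.stop()
  }
}
```

Applying an explicit schema gives named columns, whereas calling `toDF()` on the result of `split` yields a single array column.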


-- 
---
Takeshi Yamamuro
