spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From janardhan shetty <janardhan...@gmail.com>
Subject Re: Bzip2 to Parquet format
Date Mon, 25 Jul 2016 21:33:22 GMT
Andrew,

2.0

I tried
val inputR = sc.textFile(file)
val inputS = inputR.map(x => x.split("`"))
val inputDF = inputS.toDF()

inputDF.write.format("parquet").save(result.parquet)

Result part files end with *.snappy.parquet *is that expected ?

On Sun, Jul 24, 2016 at 8:00 PM, Andrew Ehrlich <andrew@aehrlich.com> wrote:

> You can load the text with sc.textFile() to an RDD[String], then use
> .map() to convert it into an RDD[Row]. At this point you are ready to
> apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType)
>
> Here is an example on how to define the StructType (schema) that you will
> combine with the RDD[Row] to create a DataFrame.
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
>
> Once you have the DataFrame, save it to parquet with
> dataframe.save(“/path”) to create a parquet file.
>
> Reference for SQLContext / createDataFrame:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
>
>
>
> On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhanp22@gmail.com>
> wrote:
>
> We have data in Bz2 compression format. Any links in Spark to convert into
> Parquet and also performance benchmarks and uses study materials ?
>
>
>

Mime
View raw message