spark-user mailing list archives

From Andrew Ehrlich <and...@aehrlich.com>
Subject Re: Bzip2 to Parquet format
Date Mon, 25 Jul 2016 03:00:21 GMT
You can load the text with sc.textFile() into an RDD[String], then use .map() to convert it into
an RDD[Row]. At that point you are ready to apply a schema with sqlContext.createDataFrame(rddOfRows,
structType).
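A minimal sketch of those first two steps, assuming tab-delimited lines with two fields (the delimiter, field names, and types are assumptions, not something from your data):

```scala
import org.apache.spark.sql.Row

// sc.textFile reads .bz2 files transparently (bzip2 is a splittable codec)
val lines = sc.textFile("/data/input.txt.bz2")

// Hypothetical layout: "name<TAB>age" on each line
val rowRdd = lines.map { line =>
  val fields = line.split("\t")
  Row(fields(0), fields(1).toInt)
}
```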

Here is the documentation for StructType, the schema that you will combine with the
RDD[Row] to create a DataFrame:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType

Once you have the DataFrame, save it with dataframe.write.parquet("/path") to create
Parquet files.
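Putting the schema and the write together, it might look like this (again assuming the hypothetical two-field layout; field names and paths are placeholders):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// RDD[Row] parsed from the bz2 text, as in the step above
val rowRdd = sc.textFile("/data/input.txt.bz2").map { line =>
  val fields = line.split("\t")
  Row(fields(0), fields(1).toInt)
}

// Schema matching the Row fields by position and type
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val df = sqlContext.createDataFrame(rowRdd, schema)
df.write.parquet("/path/to/output")
```

Note that write.parquet produces a directory of part files, not a single file; downstream readers should point at the directory.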

Reference for SQLContext / createDataFrame: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext



> On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhanp22@gmail.com> wrote:
> 
> We have data in Bz2 compression format. Any links on converting it to Parquet in Spark,
and also performance benchmarks and study materials?

