spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Spark streaming RDDs to Parquet records
Date Tue, 17 Jun 2014 23:33:46 GMT
If you convert the data to a SchemaRDD, you can save it as Parquet:
http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
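
A rough sketch of how that could fit into the foreachRDD from the question, assuming the Spark 1.0 API (the implicit `createSchemaRDD` conversion on SQLContext and `saveAsParquetFile` on SchemaRDD). The `Event` case class, the UTF-8 decoding, the `ParquetSink` object, and the `baseDir` parameter are hypothetical, since the thread does not say what the messages contain or where the files should go:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type. It is a top-level case class so that Spark SQL
// can infer the schema from its fields.
case class Event(payload: String)

object ParquetSink {
  // Assumption: each message is UTF-8 text. Replace with the real decoding.
  def decode(bytes: Array[Byte]): Event = Event(new String(bytes, "UTF-8"))

  // Writes every batch of the stream as Parquet under baseDir (hypothetical).
  def saveAsParquet(ds: DStream[Array[Byte]],
                    sqlContext: SQLContext,
                    baseDir: String): Unit = {
    // Brings the implicit RDD[case class] => SchemaRDD conversion into scope.
    import sqlContext.createSchemaRDD

    ds.foreachRDD { (rdd: RDD[Array[Byte]], time: Time) =>
      val events: RDD[Event] = rdd.map(decode)
      // One output directory per batch, named after the batch time.
      events.saveAsParquetFile(s"$baseDir/batch-${time.milliseconds}")
    }
  }
}
```

With the `ds` and `sqlContext` values from the question, this would be called as `ParquetSink.saveAsParquet(ds, sqlContext, "hdfs:///some/output/dir")`, where the path is only a placeholder. Writing each batch to its own directory is a design choice made here so that no two batches target the same path.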


On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) <
mahesh.padmanabhan@twc-contractor.com> wrote:

>  Thanks Krishna. Seems like you have to use Avro and then convert that to
> Parquet. I was hoping to directly convert RDDs to Parquet files. I’ll look
> into this some more.
>
>  Thanks,
> Mahesh
>
>   From: Krishna Sankar <ksankar42@gmail.com>
> Reply-To: "user@spark.apache.org" <user@spark.apache.org>
> Date: Tuesday, June 17, 2014 at 2:41 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Spark streaming RDDs to Parquet records
>
>  Mahesh,
>
>    - One direction could be: create a Parquet schema, then convert and save
>    the records to HDFS.
>    - This might help:
>    https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
>
>  Cheers
> <k/>
>
>
> On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc <
> mahesh.padmanabhan@twc-contractor.com> wrote:
>
>> Hello,
>>
>> Is there an easy way to convert RDDs within a DStream into Parquet
>> records?
>> Here is some incomplete pseudo code:
>>
>> // Create streaming context
>> val ssc = new StreamingContext(...)
>>
>> // Obtain a DStream of events
>> val ds = KafkaUtils.createStream(...)
>>
>> // Get Spark context to get to the SQL context
>> val sc = ds.context.sparkContext
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>
>> // For each RDD
>> ds.foreachRDD((rdd: RDD[Array[Byte]]) => {
>>
>>     // What do I do next?
>> })
>>
>> Thanks,
>> Mahesh
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
