spark-user mailing list archives

From maheshtwc <mahesh.padmanab...@twc-contractor.com>
Subject Re: Spark streaming RDDs to Parquet records
Date Fri, 20 Jun 2014 03:33:20 GMT
Unfortunately, I couldn’t figure it out without involving Avro.

Here is something that may be useful: it uses Avro generic records (so no case classes are
needed) and converts them to Parquet.

http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/

HTH,
Mahesh
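
[Editor's sketch] The Avro route mentioned above could look roughly like the following. This is a minimal sketch, not the code from the linked post: it assumes a recent parquet-avro (package org.apache.parquet.avro) and Avro on the classpath, and the schema, the field names and the object name AvroToParquet are hypothetical.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

object AvroToParquet {
  // Hypothetical two-field schema, built at runtime rather than derived from a case class.
  val schema: Schema = SchemaBuilder.record("Event").fields()
    .requiredLong("id")
    .requiredString("body")
    .endRecord()

  // Build one generic record that conforms to the schema above.
  def toRecord(id: Long, body: String): GenericRecord = {
    val record = new GenericData.Record(schema)
    record.put("id", Long.box(id))
    record.put("body", body)
    record
  }

  // Write the records to a single Parquet file at `path` (local or HDFS URI).
  def write(records: Seq[GenericRecord], path: String): Unit = {
    val writer = AvroParquetWriter.builder[GenericRecord](new Path(path))
      .withSchema(schema)
      .build()
    try records.foreach(r => writer.write(r))
    finally writer.close()
  }
}
```

This writes from a single JVM. Inside a Spark job the same records would normally be written per partition or through a Hadoop output format, as the linked examples do.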

From: "Anita Tailor [via Apache Spark User List]" <ml-node+s1001560n7939h76@n3.nabble.com>
Date: Thursday, June 19, 2014 at 12:53 PM
To: Mahesh Padmanabhan <mahesh.padmanabhan@twc-contractor.com>
Subject: Re: Spark streaming RDDs to Parquet records

I have a similar case: I have an RDD[List[Any], List[Long]] and want to save it as a Parquet
file.
My understanding is that only an RDD of case classes can be converted to a SchemaRDD. So is there
any way I can save this RDD as a Parquet file without using Avro?

Thanks in advance
Anita
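
[Editor's sketch] For readers on later Spark releases: Spark SQL gained a way to attach a schema programmatically to an RDD of rows, so case classes are not strictly required. The sketch below assumes Spark 2.x or later (SparkSession) and does not apply to the Spark version discussed in this thread. The two-column schema and the names RowsToParquet, toRow and save are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object RowsToParquet {
  // Hypothetical schema describing the positions in each List[Any].
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("count", LongType, nullable = false)
  ))

  // Turn one untyped list into a Row; the values must match the schema order and types.
  def toRow(values: List[Any]): Row = Row.fromSeq(values)

  // Attach the schema to the RDD and write the result as Parquet.
  def save(spark: SparkSession, data: RDD[List[Any]], path: String): Unit = {
    val rows: RDD[Row] = data.map(toRow)
    spark.createDataFrame(rows, schema).write.parquet(path)
  }
}
```

Because the lists are untyped, a value that does not match the declared column type only fails when the job runs, not at compile time.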


On 18 June 2014 05:03, Michael Armbrust <[hidden email]> wrote:
If you convert the data to a SchemaRDD you can save it as Parquet: http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
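
[Editor's sketch] Applied to the streaming question at the bottom of this thread, the suggestion above might look like the following. It is a minimal sketch assuming the Spark 1.0 to 1.2 API (SchemaRDD, the sqlContext.createSchemaRDD implicit, saveAsParquetFile); from Spark 1.3 on, SchemaRDD became DataFrame and the write goes through write.parquet instead. The Event case class, the decode function and the output directory layout are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type; the real fields depend on the event payload.
case class Event(payload: String, sizeBytes: Int)

object ParquetSink {
  // Hypothetical decoder: treats each message as UTF-8 text.
  def decode(bytes: Array[Byte]): Event =
    Event(new String(bytes, "UTF-8"), bytes.length)

  // Write every batch of the stream to its own Parquet directory under baseDir.
  def saveAsParquet(ds: DStream[Array[Byte]], sqlContext: SQLContext, baseDir: String): Unit = {
    // Implicit conversion from an RDD of case classes to a SchemaRDD.
    import sqlContext.createSchemaRDD

    ds.foreachRDD { (rdd: RDD[Array[Byte]], time: Time) =>
      val events: RDD[Event] = rdd.map(decode)
      // One directory per batch, named after the batch time in milliseconds.
      events.saveAsParquetFile(s"$baseDir/batch-${time.milliseconds}")
    }
  }
}
```

Each batch goes to a separate directory because saveAsParquetFile expects a path that does not exist yet.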


On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor) <[hidden email]> wrote:
Thanks Krishna. Seems like you have to use Avro and then convert that to Parquet. I was hoping
to directly convert RDDs to Parquet files. I’ll look into this some more.

Thanks,
Mahesh

From: Krishna Sankar <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Tuesday, June 17, 2014 at 2:41 PM
To: "[hidden email]" <[hidden email]>
Subject: Re: Spark streaming RDDs to Parquet records

Mahesh,

 *   One direction could be: create a Parquet schema, then convert and save the records to HDFS.
 *   This might help: https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala

Cheers
<k/>


On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc <[hidden email]> wrote:
Hello,

Is there an easy way to convert RDDs within a DStream into Parquet records?
Here is some incomplete pseudo code:

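import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Create streaming context
val ssc = new StreamingContext(...)

// Obtain a DStream of events
val ds = KafkaUtils.createStream(...)

// Get the Spark context from the DStream in order to build the SQL context
val sc = ds.context.sparkContext

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// For each RDD in the stream
ds.foreachRDD((rdd: RDD[Array[Byte]]) => {

    // What do I do next?
})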

Thanks,
Mahesh



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-RDDs-to-Parquet-records-tp7762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

