spark-user mailing list archives

From Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Subject Re: Best way to read XML data from RDD
Date Sat, 20 Aug 2016 04:41:19 GMT
Yes. It accepts an XML file as a source, but not an RDD. The XML data is embedded inside JSON that is streamed
from a Kafka cluster, so I can get it as an RDD.
Right now I am using the scala.xml XML.loadString method inside an RDD map function, but
I am not happy with the performance: it takes 4 minutes to parse the XML from 2 million messages
in a 3-node environment with 100G and 4 CPUs each.
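
For scale, the figures above can be turned into a rough per-core rate. The sketch below is only arithmetic on the numbers quoted in this message. It assumes that all 12 cores (3 nodes x 4 CPUs) are busy parsing for the full 4 minutes, which the message does not state, so treat the per-core figures as a lower bound on parser speed rather than a measurement.

```python
# Back-of-the-envelope throughput from the figures quoted in the message.
MESSAGES = 2_000_000
ELAPSED_SECONDS = 4 * 60
CORES = 3 * 4  # assumption: every core on every node is parsing


def throughput(messages, elapsed_seconds, cores):
    """Return (messages/s overall, messages/s per core, ms per message per core)."""
    total_rate = messages / elapsed_seconds
    per_core_rate = total_rate / cores
    ms_per_message = 1000.0 / per_core_rate
    return total_rate, per_core_rate, ms_per_message


total_rate, per_core_rate, ms_per_message = throughput(
    MESSAGES, ELAPSED_SECONDS, CORES
)
print(f"overall:  {total_rate:.0f} messages/s")
print(f"per core: {per_core_rate:.0f} messages/s")
print(f"per core: {ms_per_message:.2f} ms per message")
```

If fewer cores were actually in use, the per-message cost is lower than this estimate and the bottleneck may lie elsewhere (for example, in partitioning or in reading from Kafka).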



-------- Original message --------
From: Felix Cheung <felixcheung_m@hotmail.com>
Date: 20/08/2016 09:49 (GMT+05:30)
To: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>, user <user@spark.apache.org>
Subject: Re: Best way to read XML data from RDD

Have you tried https://github.com/databricks/spark-xml ?




On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <diwakar.dhanuskodi@gmail.com>
wrote:

Hi, 

There is an RDD with JSON data. I can read the JSON data using read.json on the RDD. The JSON data
has XML data in a couple of key-value pairs.

What is the best way to read and parse XML from an RDD? Are there any specific XML libraries
for Spark? Could anyone help with this?

Thanks. 
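
To make the question concrete, here is a minimal sketch of pulling XML out of a JSON record and parsing it, using only the Python standard library. The key names ("id", "payload"), the XML layout, and the helper names parse_record and parse_partition are all hypothetical and chosen for illustration; nothing in the thread specifies the real schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: the key names and the XML layout are illustrative
# assumptions, not taken from the thread.
SAMPLE = json.dumps({
    "id": 42,
    "payload": '<order><item sku="A1">2</item><item sku="B7">5</item></order>',
})


def parse_record(line, xml_key="payload"):
    """Parse one JSON string and flatten its embedded XML into a dict."""
    record = json.loads(line)
    root = ET.fromstring(record[xml_key])
    # Map each <item> element's sku attribute to its integer text content.
    items = {item.get("sku"): int(item.text) for item in root.iter("item")}
    return {"id": record["id"], "items": items}


def parse_partition(lines):
    """Generator form: parse every JSON string from an iterator of lines."""
    for line in lines:
        yield parse_record(line)


result = parse_record(SAMPLE)
print(result)
```

With PySpark, and assuming each RDD element is one JSON string, a function shaped like parse_partition could be passed to rdd.mapPartitions so the parsing runs once per partition iterator. Whether that is faster than a plain map depends on the parser and the data, so it is worth measuring rather than assuming.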