spark-user mailing list archives

From Diwakar Dhanuskodi
Subject Re: Best way to read XML data from RDD
Date Sat, 20 Aug 2016 04:41:19 GMT
Yes. It accepts an XML file as a source, but not an RDD. The XML data is embedded inside JSON
that is streamed from a Kafka cluster, so I can get it as an RDD.
Right now I am using the scala.xml XML.loadString method inside an RDD map function, but
I am not happy with the performance: it takes 4 minutes to parse the XML from 2 million messages
in a 3-node environment (100G, 4 CPUs each).
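
For illustration only, and not part of the original thread: a minimal sketch of the per-message approach described above, where each JSON message is decoded and the XML strings inside it are parsed. It uses only the Python standard library so it can be run without a cluster. The key name `payload`, the sample message, and the helper names `parse_record` and `parse_partition` are all hypothetical, since the message does not say which keys carry XML or what the XML looks like. No claim is made that this is faster than the XML.loadString approach.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical key name: the thread does not say which JSON keys hold XML.
XML_KEYS = ("payload",)


def parse_record(line, xml_keys=XML_KEYS):
    """Decode one JSON message and parse the XML strings stored under xml_keys."""
    record = json.loads(line)
    parsed = {}
    for key in xml_keys:
        raw = record.get(key)
        if raw is None:
            continue  # this message has no XML under this key
        root = ET.fromstring(raw)
        # Flatten the root's direct children into a tag -> text mapping.
        parsed[key] = {child.tag: child.text for child in root}
    return parsed


def parse_partition(lines, xml_keys=XML_KEYS):
    """Generator over an iterable of messages; parses each one lazily."""
    for line in lines:
        yield parse_record(line, xml_keys)


if __name__ == "__main__":
    # Made-up sample message, for demonstration only.
    sample = json.dumps(
        {"id": 1, "payload": "<order><sku>A1</sku><qty>2</qty></order>"}
    )
    print(list(parse_partition([sample])))
```

If the messages were held in a PySpark RDD of JSON strings, a generator like `parse_partition` could be passed to `rdd.mapPartitions`, which processes a whole partition per call instead of one element at a time.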


-------- Original message --------
From: Felix Cheung
Date: 20/08/2016 09:49 (GMT+05:30)
To: Diwakar Dhanuskodi, user
Subject: Re: Best way to read XML data from RDD

Have you tried

On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" wrote:


There is an RDD with JSON data. I could read json data using . The JSON data
has XML data in a couple of key-value pairs.

Which is the best method to read and parse XML from an RDD? Are there any specific XML libraries
for Spark? Could anyone help with this?
