spark-user mailing list archives

From Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Subject Re: Best way to read XML data from RDD
Date Mon, 22 Aug 2016 10:49:13 GMT

Hi Kwon, 

I was trying out the spark-xml library, but I keep getting errors when it infers the schema.
It looks like it cannot infer a schema from single-line XML data.

Sent from Samsung Mobile.

-------- Original message --------
From: Hyukjin Kwon <gurwls223@gmail.com>
Date:21/08/2016 15:40 (GMT+05:30)
To: Jörn Franke <jornfranke@gmail.com>
Cc: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>, Felix Cheung <felixcheung_m@hotmail.com>,
user <user@spark.apache.org>
Subject: Re: Best way to read XML data from RDD

Hi Diwakar,

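The spark-xml library can take an RDD as its source.

```
import com.databricks.spark.xml.XmlReader

// rdd: RDD[String] containing the XML records
val df = new XmlReader()
  .withRowTag("book")
  .xmlRdd(sqlContext, rdd)
```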

If performance is critical, I would also recommend paying attention to how often the parser
is created and destroyed.

If the parser is not serializable, you can create it once per partition inside mapPartitions,
just as is done here:

https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
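
The per-partition pattern described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the thread does not say which parser is in use, so the JDK's `DocumentBuilder` stands in for it, and the names `XmlPartitionParsing` and `parsePartition` and the `<title>` element are hypothetical.

```scala
import java.io.StringReader
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import org.xml.sax.InputSource

object XmlPartitionParsing {
  // Hypothetical helper: parses every XML string of one partition with a single
  // DocumentBuilder and returns the text of the first <title> element of each
  // record (or an empty string when there is none).
  def parsePartition(records: Iterator[String]): Iterator[String] = {
    // Created once per call, i.e. once per partition when used with mapPartitions,
    // and created on the executor, so the parser never has to be serialized.
    val builder: DocumentBuilder =
      DocumentBuilderFactory.newInstance().newDocumentBuilder()

    records.map { xml =>
      val doc = builder.parse(new InputSource(new StringReader(xml)))
      val titles = doc.getElementsByTagName("title")
      if (titles.getLength > 0) titles.item(0).getTextContent else ""
    }
  }
}
```

Assuming `rdd` is an `RDD[String]` with one XML document per element, this could be applied as `rdd.mapPartitions(XmlPartitionParsing.parsePartition)`, so that the parser is built once per partition rather than once per message.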

I hope this is helpful.



2016-08-20 15:10 GMT+09:00 Jörn Franke <jornfranke@gmail.com>:
I fear the issue is that this will create and destroy an XML parser object 2 million times, which
is very inefficient; it does not really look like a parser performance issue. Can't you do
something about the format choice? Could you ask your supplier to deliver another format (ideally
Avro or something similar)?
Otherwise you could create just one XML parser object per node, but sharing it among the parallel
tasks on the same node is tricky.
The other possibility could simply be more hardware ...
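
One possible way to approximate "one parser per node" without sharing a parser between threads is a per-JVM holder with a thread-local parser. The sketch below is an assumption-laden illustration, not something proposed in the thread: it uses the JDK's `DocumentBuilder` (which is not guaranteed to be thread-safe), and `ParserHolder` and `rootTag` are hypothetical names.

```scala
import java.io.StringReader
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import org.xml.sax.InputSource

// Hypothetical holder: a Scala object is initialized once per JVM (so once per
// executor process), and the ThreadLocal gives each task thread its own
// DocumentBuilder instead of sharing one parser across threads.
object ParserHolder {
  private val local: ThreadLocal[DocumentBuilder] =
    new ThreadLocal[DocumentBuilder] {
      override def initialValue(): DocumentBuilder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
    }

  // Parses one XML string with the calling thread's parser and returns the
  // name of the root element.
  def rootTag(xml: String): String = {
    val doc = local.get().parse(new InputSource(new StringReader(xml)))
    doc.getDocumentElement.getTagName
  }
}
```

Calling `ParserHolder.rootTag` from inside a map function would then reuse one parser per task thread for the lifetime of the executor JVM, at the cost of keeping that parser alive between tasks.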

On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com> wrote:

Yes. It accepts an XML file as a source, but not an RDD. The XML data is embedded inside JSON that
is streamed from a Kafka cluster, so I can get it as an RDD.
Right now I am using the scala.xml XML.loadString method inside an RDD map function, but I am not
happy with the performance: it takes 4 minutes to parse the XML from 2 million messages in an
environment of 3 nodes, each with 100G and 4 CPUs.


Sent from Samsung Mobile.


-------- Original message --------
From: Felix Cheung <felixcheung_m@hotmail.com>
Date:20/08/2016 09:49 (GMT+05:30) 
To: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>, user <user@spark.apache.org>
Cc:
Subject: Re: Best way to read XML data from RDD

Have you tried

https://github.com/databricks/spark-xml
?




On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <diwakar.dhanuskodi@gmail.com>
wrote:

Hi, 

There is an RDD with JSON data. I can read the JSON data using sqlContext.read.json(rdd). The JSON
data has XML data in a couple of key-value pairs.

What is the best way to read and parse XML from an RDD? Are there any specific XML libraries
for Spark? Could anyone help with this?

Thanks. 
