spark-user mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Best way to read XML data from RDD
Date Mon, 22 Aug 2016 11:04:13 GMT
Would you mind sharing your code and sample data? It should work with
single-line XML, if I remember correctly.

2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>:

> Hi Darin,
>
> Are you using this utility to parse single-line XML?
>
>
> -------- Original message --------
> From: Darin McBeath <ddmcbeath@yahoo.com>
> Date:21/08/2016 17:44 (GMT+05:30)
> To: Hyukjin Kwon <gurwls223@gmail.com>, Jörn Franke <jornfranke@gmail.com>
>
> Cc: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>, Felix Cheung <
> felixcheung_m@hotmail.com>, user <user@spark.apache.org>
> Subject: Re: Best way to read XML data from RDD
>
> Another option would be to look at spark-xml-utils.  We use this
> extensively in the manipulation of our XML content.
>
> https://github.com/elsevierlabs-os/spark-xml-utils
>
>
>
> There are quite a few examples.  Depending on your preference (and what
> you want to do), you could use xpath, xquery, or xslt to transform,
> extract, or filter.
>
> As mentioned below, you should initialize the parser inside a mapPartitions
> call (one of the examples shows this).
>
> Hope this is helpful.
>
> Darin.
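A minimal sketch of the XPath route, for illustration only. It does not use the spark-xml-utils API; it assumes the JDK's built-in `javax.xml.xpath` as a stand-in, and the object name, the function name, and the `/book/title` path are all hypothetical. The function has the `Iterator[String] => Iterator[String]` shape that `mapPartitions` expects, so the expression is compiled once per partition rather than once per record.

```scala
import java.io.StringReader
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource

object XPathPerPartition {
  // Hypothetical helper: extracts /book/title text from each XML string
  // and drops records where the path matches nothing.
  def titles(messages: Iterator[String]): Iterator[String] = {
    // Compiled once per call (i.e. once per partition), reused for every record.
    val expr = XPathFactory.newInstance().newXPath().compile("/book/title/text()")
    messages
      .map(xml => expr.evaluate(new InputSource(new StringReader(xml))))
      .filter(_.nonEmpty) // evaluate returns "" when nothing matches
  }
}
```

With an `RDD[String]` named `rdd` (an assumption), this would be applied as `rdd.mapPartitions(XPathPerPartition.titles)`.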
>
>
>
>
>
> ________________________________
> From: Hyukjin Kwon <gurwls223@gmail.com>
> To: Jörn Franke <jornfranke@gmail.com>
> Cc: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>; Felix Cheung <
> felixcheung_m@hotmail.com>; user <user@spark.apache.org>
> Sent: Sunday, August 21, 2016 6:10 AM
> Subject: Re: Best way to read XML data from RDD
>
>
>
> Hi Diwakar,
>
> The spark-xml library can take an RDD as its source.
>
> ```
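> import com.databricks.spark.xml.XmlReader
>
> // rdd is an RDD[String] holding the XML; each <book> element becomes one row
> val df = new XmlReader()
>   .withRowTag("book")
>   .xmlRdd(sqlContext, rdd)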
> ```
>
> If performance is critical, I would also recommend paying attention to the
> creation and destruction of the parser.
>
> If the parser is not serializable, you can create it once per partition
> inside mapPartitions, just like
>
> https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
>
>
> I hope this is helpful.
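A rough sketch of the per-partition pattern described above, not code from the thread or from Spark itself. It assumes the JDK's `DocumentBuilder` as the non-serializable parser; the object and function names are hypothetical, and returning the root element name is only a placeholder for real extraction logic. Because the parser is created inside the function, it is built on the executor and never has to be serialized.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

object PerPartitionParsing {
  // Shape Iterator[T] => Iterator[U], so it can be passed to RDD.mapPartitions.
  def rootElementNames(messages: Iterator[String]): Iterator[String] = {
    // One parser per call (i.e. per partition) instead of one per message.
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    messages.map { xml =>
      val doc = builder.parse(new InputSource(new StringReader(xml)))
      doc.getDocumentElement.getTagName // placeholder for real field extraction
    }
  }
}
```

With an `RDD[String]` named `rdd` (an assumption), the call would look like `rdd.mapPartitions(PerPartitionParsing.rootElementNames)`, which creates one parser per partition rather than one for each of the 2 million messages.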
>
>
>
>
> 2016-08-20 15:10 GMT+09:00 Jörn Franke <jornfranke@gmail.com>:
>
> I fear the issue is that this will create and destroy an XML parser object
> 2 million times, which is very inefficient; it does not really look like a
> parser performance issue. Can't you do something about the format choice?
> Ask your supplier to deliver another format (ideally Avro or something
> similar)?
> >Otherwise you could just create one XML parser object per node, but sharing
> this among the parallel tasks on the same node is tricky.
> >The other possibility could be simply more hardware ...
> >
> >On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <
> diwakar.dhanuskodi@gmail.com> wrote:
> >
> >
> >Yes. It accepts an XML file as a source, but not an RDD. The XML data is
> embedded inside JSON that is streamed from a Kafka cluster, so I can get it
> as an RDD.
> >>Right now I am using scala.xml's XML.loadString method inside an RDD map
> function, but I am not happy with the performance: it takes 4 minutes to
> parse the XML from 2 million messages in a 3-node environment (100G, 4 CPUs
> each).
> >>
> >>
> >>
> >>-------- Original message --------
> >>From: Felix Cheung <felixcheung_m@hotmail.com>
> >>Date:20/08/2016  09:49  (GMT+05:30)
> >>To: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com> , user <
> user@spark.apache.org>
> >>Cc:
> >>Subject: Re: Best way to read XML data from RDD
> >>
> >>
> >>Have you tried
> >>
> >>https://github.com/databricks/spark-xml
> >>?
> >>
> >>
> >>
> >>
> >>
> >>On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <
> diwakar.dhanuskodi@gmail.com> wrote:
> >>
> >>
> >>Hi,
> >>
> >>
> >>There is an RDD with JSON data. I could read the JSON data using
> rdd.read.json. The JSON data has XML data in a couple of key-value pairs.
> >>
> >>
> >>What is the best way to read and parse XML from an RDD? Are there any
> specific XML libraries for Spark? Could anyone help with this?
> >>
> >>
> >>Thanks.
>
