spark-user mailing list archives

From Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Subject Re: Best way to read XML data from RDD
Date Mon, 22 Aug 2016 15:29:04 GMT
Below is the source code for parsing an RDD of XML records, where each record is a single line of XML.

import scala.collection.mutable.ArrayBuffer
import scala.xml.{Elem, Node, Text, XML}

// toDF needs the SQL implicits in scope, e.g. import sqlContext.implicits._

// Walk the tree and record "<prefix>-<path>,<text>" for every element
// whose only child is a text node.
def processNode(node: Node, fp1: String, out: ArrayBuffer[String]): Unit = node match {
  case Elem(_, _, _, _, Text(text)) =>
    // Prepend, as the original +=: call did.
    ("Cust.001.001.03-" + fp1 + "," + text) +=: out
  case _ =>
    for (n <- node.child) processNode(n, fp1 + "/" + n.label, out)
}

// xmlData holds rows of (XML string, type); utils.getXSD is a helper not shown here.
val dataDF = xmlData
  .map { x =>
    val p = XML.loadString(x.get(0).toString)
    val xsd = utils.getXSD(p)
    println(s"xsd -- $xsd")
    val msgId = (p \\ "Fnd" \ "Mesg" \ "Paid" \ "Record" \ "CustInit" \ "GroupFirst" \ "MesgId").text
    // Use a fresh buffer per record so entries do not leak between messages.
    val dataArray = new ArrayBuffer[String]()
    processNode(p, "/" + p.label, dataArray)
    (msgId, dataArray, x.get(1).toString)
  }
  .flatMap { case (msgId, entries, msgType) =>
    entries.map { entry =>
      // Split on the first comma only, so commas inside the text survive.
      val Array(attribute, value) = entry.split(",", 2)
      (msgId, attribute, value, msgType)
    }
  }
  .toDF("key", "attribute", "value", "type")
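
For comparison, here is a rough sketch of the suggestion made further down the thread: create the parser once per partition instead of once per record. It is not taken from spark-xml or spark-xml-utils. It swaps scala.xml for the JDK's DocumentBuilder so that it runs without extra dependencies, and the names PartitionXmlParsing, leaves and parsePartition are made up for illustration.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, Node}
import org.xml.sax.InputSource

// Hypothetical helper object, for illustration only.
object PartitionXmlParsing {

  // Emit (path, text) for every element that has no child elements.
  def leaves(node: Node, path: String): List[(String, String)] = {
    val children = node.getChildNodes
    val elems = (0 until children.getLength)
      .map(i => children.item(i))
      .collect { case e: Element => e }
      .toList
    if (elems.isEmpty) List(path -> node.getTextContent.trim)
    else elems.flatMap(e => leaves(e, path + "/" + e.getTagName))
  }

  // Parse all records of one partition with a single DocumentBuilder,
  // so the parser is created once per partition rather than once per record.
  def parsePartition(records: Iterator[String]): Iterator[(String, String)] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    records.flatMap { xml =>
      val root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement
      leaves(root, "/" + root.getTagName)
    }
  }
}
```

Assuming rdd is an RDD[String] with one XML document per element, this would be wired in as rdd.mapPartitions(PartitionXmlParsing.parsePartition). Whether it actually beats XML.loadString on the 2 million messages would have to be measured.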

On Mon, Aug 22, 2016 at 4:34 PM, Hyukjin Kwon <gurwls223@gmail.com> wrote:

> Would you mind sharing your code and sample data? It should work with
> single-line XML, if I remember correctly.
>
> 2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi <
> diwakar.dhanuskodi@gmail.com>:
>
>> Hi Darin,
>>
>> Are you using this utility to parse single-line XML?
>>
>>
>>
>> -------- Original message --------
>> From: Darin McBeath <ddmcbeath@yahoo.com>
>> Date:21/08/2016 17:44 (GMT+05:30)
>> To: Hyukjin Kwon <gurwls223@gmail.com>, Jörn Franke <jornfranke@gmail.com>
>>
>> Cc: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>, Felix Cheung <
>> felixcheung_m@hotmail.com>, user <user@spark.apache.org>
>> Subject: Re: Best way to read XML data from RDD
>>
>> Another option would be to look at spark-xml-utils.  We use this
>> extensively in the manipulation of our XML content.
>>
>> https://github.com/elsevierlabs-os/spark-xml-utils
>>
>>
>>
>> There are quite a few examples.  Depending on your preference (and what
>> you want to do), you could use xpath, xquery, or xslt to transform,
>> extract, or filter.
>>
>> As mentioned below, you want to initialize the parser in a
>> mapPartitions call (one of the examples shows this).
>>
>> Hope this is helpful.
>>
>> Darin.
>>
>>
>>
>>
>>
>> ________________________________
>> From: Hyukjin Kwon <gurwls223@gmail.com>
>> To: Jörn Franke <jornfranke@gmail.com>
>> Cc: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>; Felix Cheung <
>> felixcheung_m@hotmail.com>; user <user@spark.apache.org>
>> Sent: Sunday, August 21, 2016 6:10 AM
>> Subject: Re: Best way to read XML data from RDD
>>
>>
>>
>> Hi Diwakar,
>>
>> The Spark XML library can take an RDD as its source.
>>
>> ```
>> val df = new XmlReader()
>>   .withRowTag("book")
>>   .xmlRdd(sqlContext, rdd)
>> ```
>>
>> If performance is critical, I would also recommend paying attention to
>> the creation and destruction of the parser.
>>
>> If the parser is not serializable, you can create it once per partition
>> inside mapPartitions, just like
>>
>> https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
>>
>>
>> I hope this is helpful.
>>
>>
>>
>>
>> 2016-08-20 15:10 GMT+09:00 Jörn Franke <jornfranke@gmail.com>:
>>
>> I fear the issue is that this will create and destroy an XML parser object
>> 2 million times, which is very inefficient; it does not really look like a
>> parser performance issue. Can't you do something about the format choice?
>> Could you ask your supplier to deliver another format (ideally Avro or
>> something similar)?
>> >Otherwise you could create just one XML parser object per node, but
>> sharing it among the parallel tasks on the same node is tricky.
>> >The other possibility could simply be more hardware ...
>> >
>> >On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <
>> diwakar.dhanuskodi@gmail.com> wrote:
>> >
>> >
>> >Yes. It accepts an XML file as a source but not an RDD. The XML data
>> embedded inside JSON is streamed from a Kafka cluster, so I get it as an
>> RDD.
>> >>Right now I am using the scala.xml XML.loadString method inside an RDD
>> map function, but I am not happy with the performance: it takes 4 minutes
>> to parse the XML from 2 million messages in a 3-node environment (100G,
>> 4 CPUs each).
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>-------- Original message --------
>> >>From: Felix Cheung <felixcheung_m@hotmail.com>
>> >>Date:20/08/2016  09:49  (GMT+05:30)
>> >>To: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com> , user <
>> user@spark.apache.org>
>> >>Cc:
>> >>Subject: Re: Best way to read XML data from RDD
>> >>
>> >>
>> >>Have you tried
>> >>
>> >>https://github.com/databricks/spark-xml
>> >>?
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <
>> diwakar.dhanuskodi@gmail.com> wrote:
>> >>
>> >>
>> >>Hi,
>> >>
>> >>
>> >>There is an RDD with JSON data. I can read the JSON data using
>> rdd.read.json. The JSON data has XML data in a couple of key-value pairs.
>> >>
>> >>
>> >>What is the best way to read and parse XML from an RDD? Are there any
>> XML libraries specifically for Spark? Could anyone help with this?
>> >>
>> >>
>> >>Thanks.
>>
>
>
