spark-user mailing list archives

From Darin McBeath <>
Subject Re: Reading xml in java using spark
Date Tue, 01 Sep 2015 18:58:12 GMT
Another option might be to leverage spark-xml-utils (

This is a collection of xml utilities that I've recently revamped that make it relatively
easy to use xpath, xslt, or xquery within the context of a Spark application (or at least
I think so).  My previous attempt was not overly friendly, but as I've learned more about
Spark (and needed easier to use xml utilities) I've hopefully made this easier to use and
understand.  I hope others find it useful.

Back to your problem.  Assuming you have a bunch of xml records in an RDD, you should be able
to do something like the following to count the number of elements of a given type.  In the
example below, I'm counting the number of references in documents.  The xmlKeyPair is an RDD
of type (String,String) where the first item is the 'key' and the second item is the xml record.
The xpath expression identifies the 'reference' element I want to count.

import com.elsevier.spark_xml_utils.xpath.XPathProcessor
import scala.collection.JavaConverters._
import java.util.HashMap

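xmlKeyPair.mapPartitions(recsIter => {
                 val xpath = "count(/xocs:doc/xocs:meta/xocs:references/xocs:ref-info)"
                 // The namespace URI for the 'xocs' prefix is missing from this archived copy.
                 val namespaces = new HashMap[String,String](Map(
                                            "xocs" -> ""
                                        ).asJava)
                 // Create the processor once per partition, then evaluate each record.
                 val proc = XPathProcessor.getInstance(xpath,namespaces)
                 // Yields one reference count per xml record.
                 recsIter.map(rec => proc.evaluateString(rec._2).toInt)
})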

There is more documentation on the spark-xml-utils github site.  Let me know if the documentation
is not clear or if you have any questions. 
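
Since the original question asked for Java, here is a minimal sketch of the same per-record idea (compile a namespace-aware XPath count once, then evaluate it against each xml string) using only the JDK's javax.xml.xpath rather than spark-xml-utils.  The class name XPathElementCount, the prefix 'd', the namespace URI urn:example:docs, and the sample document are all made up for illustration; they are not from the library or from the xocs schema.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathElementCount {

    /** Compiles an XPath expression whose prefixes are resolved through the given map. */
    public static XPathExpression compile(String expr, final Map<String, String> prefixToUri)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                String uri = prefixToUri.get(prefix);
                return uri != null ? uri : XMLConstants.NULL_NS_URI;
            }
            // Reverse lookups are not needed for evaluating an expression.
            @Override public String getPrefix(String uri) { return null; }
            @Override public Iterator<String> getPrefixes(String uri) { return null; }
        });
        return xpath.compile(expr);
    }

    /** Parses one xml record (namespace-aware) and evaluates a count(...) expression on it. */
    public static int count(XPathExpression expr, String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefixed paths to match
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        Double n = (Double) expr.evaluate(doc, XPathConstants.NUMBER);
        return n.intValue();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical prefix, namespace URI, and record, for illustration only.
        Map<String, String> ns = Collections.singletonMap("d", "urn:example:docs");
        XPathExpression expr = compile("count(/d:doc/d:refs/d:ref)", ns);
        String record = "<d:doc xmlns:d=\"urn:example:docs\">"
                + "<d:refs><d:ref/><d:ref/><d:ref/></d:refs></d:doc>";
        System.out.println(count(expr, record));
    }
}
```

In a Spark job you would probably want to build the compiled expression inside mapPartitions, as in the Scala snippet above, so that it is created once per partition on the executor rather than once per record.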


From: Rick Hillegas <>
To: Sonal Goyal <> 
Cc: rakesh sharma <>; 
Sent: Monday, August 31, 2015 10:51 AM
Subject: Re: Reading xml in java using spark

Hi Rakesh,

You might also take a look at the Derby code.
   org.apache.derby.vti.XmlVTI provides a number of static methods for
   turning an XML resource into a JDBC ResultSet.


On 8/31/15 4:44 AM, Sonal Goyal wrote: 

> I think the Mahout project had an XmlInputFormat which you can leverage.
> On Aug 31, 2015 5:10 PM, "rakesh sharma" <> wrote:
>> I want to parse an xml file in Spark.
>> But as far as the example is concerned, it reads it as a text file. The mapping to xml will
>> be a tedious job.
>> How can I find the number of elements of a particular type using that? Any help in
>> Java/Scala code is also welcome.
