spark-user mailing list archives

From Ram Sriharsha <sriharsha....@gmail.com>
Subject Re: XML Parsing
Date Sun, 19 Jul 2015 23:59:27 GMT
You would need to write an XML input format that splits the file into
records based on start/end tags.
Mahout has an XmlInputFormat implementation you should be able to import:
https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
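
To make the start/end tag idea concrete, here is a minimal plain-Scala sketch of the splitting logic. It is not the Mahout code; TagSplitter and extractRecords are hypothetical names used only for illustration, and it works on an in-memory string rather than on Hadoop file splits.

```scala
// Hypothetical helper, not part of Mahout or Spark: returns every
// startTag ... endTag span found in a string, in document order.
object TagSplitter {
  def extractRecords(xml: String, startTag: String, endTag: String): List[String] = {
    @annotation.tailrec
    def loop(from: Int, acc: List[String]): List[String] = {
      val start = xml.indexOf(startTag, from)
      if (start < 0) acc.reverse // no more records
      else {
        val end = xml.indexOf(endTag, start + startTag.length)
        if (end < 0) acc.reverse // unterminated record: drop it
        else {
          val stop = end + endTag.length
          loop(stop, xml.substring(start, stop) :: acc)
        }
      }
    }
    loop(0, Nil)
  }
}
```

Calling TagSplitter.extractRecords(text, "<review>", "</review>") on the file contents would give one string per review element, including the multi-line ones.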

Once you have such a format, you can use Spark's Hadoop API to read each
XML record in as a String:

sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text])
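
A fuller sketch of the same call, under a few assumptions: the Mahout integration jar that contains XmlInputFormat is on the classpath, the record tags are <review> and </review> as in the question, and the input path arrives as the first program argument. ReadReviews is a hypothetical name. Mahout's XmlInputFormat reads its start and end tags from the Hadoop configuration, so they are set before the read.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ReadReviews {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("ReadReviews"))

    // Copy the Hadoop configuration and tell XmlInputFormat which
    // tags delimit one record.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set(XmlInputFormat.START_TAG_KEY, "<review>")
    hadoopConf.set(XmlInputFormat.END_TAG_KEY, "</review>")

    val path = args(0) // assumption: input path passed as first argument

    val reviews = sc
      .newAPIHadoopFile(
        path,
        classOf[XmlInputFormat],
        classOf[LongWritable], // key: byte offset of the record
        classOf[Text],         // value: the text between the tags, inclusive
        hadoopConf)
      // Hadoop reuses Writable objects, so copy each value to a String.
      .map { case (_, text) => text.toString }

    reviews.take(5).foreach(println)
    sc.stop()
  }
}
```

Each element of reviews is then one complete <review>...</review> block, which can be handed to any XML parser inside a map.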

Ram


On Sun, Jul 19, 2015 at 10:38 AM, Ashish Soni <asoni.learn@gmail.com> wrote:

> Hi All,
>
> I have an XML file with the same tag repeated multiple times, as below.
> Please suggest the best way to process this data inside Spark as ...
>
> How can I extract each opening and closing tag and process them, or how
> can I combine multiple lines into a single line?
>
> <review>
> </review>
> <review>
> </review>
> ...
> ..
> ..
>
> Thanks,
>
