spark-user mailing list archives

From "Sreekanth Jella" <srikanth.je...@gmail.com>
Subject Re: Flattening XML in a DataFrame
Date Sun, 14 Aug 2016 15:45:36 GMT
Hi Hyukjin Kwon,

Thank you for your reply.

There are several types of XML documents with different schemas that need to be parsed, and
the tag names are not known in advance. All we know is the XSD for the given XML.

Is it possible to get the same results even when we do not know the XML tags, such as manager.id
and manager.name? Or is it possible to read the tag names from the XSD and use them?

Thanks, 
Sreekanth

 

On Aug 12, 2016 9:58 PM, "Hyukjin Kwon" <gurwls223@gmail.com> wrote:

Hi Sreekanth,

 

Assuming you are using Spark 1.x,

 

I believe this code below:

sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml")
  .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk")
  .selectExpr("id", "name", "clerk.cid", "clerk.cname")
  .show()

would print the results below as you want:

+---+----+---+-----+
| id|name|cid|cname|
+---+----+---+-----+
|  1| foo|  1|  foo|
|  1| foo|  1|  foo|
+---+----+---+-----+
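
If the column names are not known ahead of time, one possible approach is to walk the inferred schema instead of naming columns by hand. The following is only a rough sketch, not something spark-xml provides: the object name FlattenSketch, the function flatten, and the underscore-joined output names are all made up for illustration. It assumes tag names contain no dots, and it explodes one array column per pass because a select accepts only one generator.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper, not part of spark-xml or Spark itself.
object FlattenSketch {

  // Repeatedly explodes array columns and expands struct columns
  // until only leaf (non-array, non-struct) columns remain.
  def flatten(df: DataFrame): DataFrame = {
    val fields = df.schema.fields

    fields.find(_.dataType.isInstanceOf[ArrayType]) match {
      case Some(arr) =>
        // Explode a single array column: one output row per element,
        // with the other columns repeated. Rows whose array is null
        // or empty are dropped by explode.
        val cols = fields.map { f =>
          if (f.name == arr.name) explode(col(f.name)).as(f.name)
          else col(f.name)
        }
        flatten(df.select(cols: _*))

      case None if fields.exists(_.dataType.isInstanceOf[StructType]) =>
        // Promote each struct's children to top-level columns,
        // naming them parent_child (assumed naming convention).
        val cols = fields.flatMap { f =>
          f.dataType match {
            case st: StructType =>
              st.fieldNames.map(n => col(s"${f.name}.$n").as(s"${f.name}_$n"))
            case _ =>
              Array(col(f.name))
          }
        }
        flatten(df.select(cols: _*))

      case None =>
        // Nothing left to expand.
        df
    }
  }
}
```

Applied to the DataFrame loaded with rowTag "emp", FlattenSketch.flatten(df) should give columns named along the lines of manager_id, manager_name, manager_subordinates_clerk_cid and manager_subordinates_clerk_cname. Note that when a document has several sibling arrays, exploding them one after another yields every combination of their elements, which may or may not be what you want.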

 

I hope this is helpful.

 

Thanks!

 

 

 

 

2016-08-13 9:33 GMT+09:00 Sreekanth Jella <srikanth.jella@gmail.com>:

Hi Folks,

 

I am trying to flatten a variety of XMLs using DataFrames. I’m using the spark-xml package, which
automatically infers my schema and creates a DataFrame.

 

I do not want to hard-code any column names in the DataFrame, as I have many varieties of XML
documents and each may have deeply nested child nodes. I simply want to flatten any type
of XML and then write the output data to a Hive table. Could you please give some expert advice
on this?

 

An example XML and the expected output are given below.

 

Sample XML:

<emplist>
  <emp>
    <manager>
      <id>1</id>
      <name>foo</name>
      <subordinates>
        <clerk>
          <cid>1</cid>
          <cname>foo</cname>
        </clerk>
        <clerk>
          <cid>1</cid>
          <cname>foo</cname>
        </clerk>
      </subordinates>
    </manager>
  </emp>
</emplist>

 

Expected output:

id, name, clerk.cid, clerk.cname
1, foo, 2, cname2
1, foo, 3, cname3

 

Thanks,

Sreekanth Jella

 

 

