spark-user mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Flattening XML in a DataFrame
Date Sat, 13 Aug 2016 01:58:17 GMT
Hi Sreekanth,

Assuming you are using Spark 1.x,

I believe this code below:

sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "emp")
  .load("/tmp/sample.xml")
  .selectExpr("manager.id", "manager.name",
    "explode(manager.subordinates.clerk) as clerk")
  .selectExpr("id", "name", "clerk.cid", "clerk.cname")
  .show()

would print the results below as you want:

+---+----+---+-----+
| id|name|cid|cname|
+---+----+---+-----+
|  1| foo|  1|  foo|
|  1| foo|  1|  foo|
+---+----+---+-----+
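
For the generic case you asked about (no hardcoded column names), the same idea can be driven off the inferred schema: recursively walk the struct fields and build the dotted paths to select. Below is a minimal sketch in plain Python, using a nested dict to stand in for the inferred schema; with a real DataFrame you would walk `df.schema.fields` instead, and this helper is hypothetical, not part of the spark-xml API:

```python
def flatten_fields(schema, prefix=""):
    """Collect dotted column paths from a nested schema.

    `schema` is a plain dict ({field: subdict or None}) standing in
    for the StructType that spark-xml infers. Hypothetical helper,
    shown only to illustrate the recursion.
    """
    cols = []
    for name, sub in schema.items():
        path = prefix + name
        if isinstance(sub, dict):
            # Nested struct: recurse, extending the dotted prefix.
            cols.extend(flatten_fields(sub, path + "."))
        else:
            # Leaf field: emit its full dotted path.
            cols.append(path)
    return cols

# Shape of the schema inferred from the sample XML in your mail:
schema = {
    "manager": {
        "id": None,
        "name": None,
        "subordinates": {"clerk": {"cid": None, "cname": None}},
    }
}

print(flatten_fields(schema))
# ['manager.id', 'manager.name',
#  'manager.subordinates.clerk.cid', 'manager.subordinates.clerk.cname']
```

In a real job, array-typed fields (like clerk here) would additionally need explode(), as in the snippet above, before their leaf paths can be selected.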


I hope this is helpful.

Thanks!




2016-08-13 9:33 GMT+09:00 Sreekanth Jella <srikanth.jella@gmail.com>:

> Hi Folks,
>
>
>
> I am trying to flatten a variety of XMLs using DataFrames. I’m using the spark-xml
> package, which automatically infers my schema and creates a DataFrame.
>
> I do not want to hard-code any column names in the DataFrame, as I have many
> varieties of XML documents and each may have much more deeply nested child nodes.
> I simply want to flatten any type of XML and then write the output data to a
> Hive table. Can you please give some expert advice on this?
>
> Example XML and expected output are given below.
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
> Expected output:
>
> id, name, clerk.cid, clerk.cname
> 1, foo, 2, cname2
> 1, foo, 3, cname3
>
> Thanks,
>
> Sreekanth Jella
>
>
>
