spark-user mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: Flattening XML in a DataFrame
Date Wed, 17 Aug 2016 05:25:35 GMT
Sorry for the late reply.

Currently, the library only supports loading XML documents as they are.

Would you mind opening an issue with some more explanation here,
https://github.com/databricks/spark-xml/issues?
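
In the meantime, flattening without hard-coded column names can be approached at the DataFrame level by walking the inferred schema. The following is only a minimal sketch, not a spark-xml feature: the object name FlattenSketch, the helper q, and the underscore-joined column names (for example manager_id) are illustrative assumptions. It uses only the standard DataFrame, StructType, ArrayType, col, and explode APIs.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper, not part of spark-xml.
object FlattenSketch {
  // Backtick-quote a name so that special characters in it are not
  // parsed as nested field access.
  private def q(name: String): String = s"`$name`"

  // Rewrites one nested column per pass until no struct or array
  // columns remain:
  //   - a struct column is replaced by one column per field,
  //     named <parent>_<field>;
  //   - an array column is exploded into one row per element.
  // Caveats: explode drops rows whose array is null or empty, sibling
  // arrays multiply rows, generated names may collide with existing
  // columns, and map columns are left untouched.
  @annotation.tailrec
  def flatten(df: DataFrame): DataFrame = {
    val fields = df.schema.fields
    val nested = fields.find { f =>
      f.dataType.isInstanceOf[StructType] || f.dataType.isInstanceOf[ArrayType]
    }
    nested match {
      case None => df // nothing left to flatten
      case Some(target) =>
        val columns: Array[Column] = fields.flatMap { f =>
          (f.name == target.name, f.dataType) match {
            case (true, st: StructType) =>
              st.fieldNames.map(n => col(s"${q(f.name)}.${q(n)}").as(s"${f.name}_$n"))
            case (true, _: ArrayType) =>
              Array(explode(col(q(f.name))).as(f.name))
            case _ =>
              Array(col(q(f.name)))
          }
        }
        flatten(df.select(columns: _*))
    }
  }
}

// Usage, assuming the same sample file and rowTag as in the thread:
//   val df = sqlContext.read.format("com.databricks.spark.xml")
//     .option("rowTag", "emp").load("/tmp/sample.xml")
//   FlattenSketch.flatten(df).show()
```

Each pass handles only one nested column because a single select can hold at most one explode. The resulting flat DataFrame has only top-level columns, so it can be written out like any other DataFrame.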




2016-08-17 7:22 GMT+09:00 Sreekanth Jella <srikanth.jella@gmail.com>:

> Hi Experts,
>
>
>
> Please suggest. Thanks in advance.
>
>
>
> Thanks,
>
> Sreekanth
>
>
>
> *From:* Sreekanth Jella [mailto:srikanth.jella@gmail.com]
> *Sent:* Sunday, August 14, 2016 11:46 AM
> *To:* 'Hyukjin Kwon' <gurwls223@gmail.com>
> *Cc:* 'user @spark' <user@spark.apache.org>
> *Subject:* Re: Flattening XML in a DataFrame
>
>
>
> Hi Hyukjin Kwon,
>
> Thank you for the reply.
>
> We have several types of XML documents with different schemas that need
> to be parsed, and the tag names are not known in advance. All we know is
> the XSD for the given XML.
>
> Is it possible to get the same results even when we do not know the XML
> tags, such as manager.id and manager.name? Or is it possible to read the
> tag names from the XSD and use them?
>
> Thanks,
> Sreekanth
>
>
>
> On Aug 12, 2016 9:58 PM, "Hyukjin Kwon" <gurwls223@gmail.com> wrote:
>
> Hi Sreekanth,
>
>
>
> Assuming you are using Spark 1.x,
>
>
>
> I believe this code below:
>
> sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml")
>   .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk")
>   .selectExpr("id", "name", "clerk.cid", "clerk.cname")
>   .show()
>
> would print the results below as you want:
>
> +---+----+---+-----+
> | id|name|cid|cname|
> +---+----+---+-----+
> |  1| foo|  1|  foo|
> |  1| foo|  1|  foo|
> +---+----+---+-----+
>
>
>
>
> I hope this is helpful.
>
>
>
> Thanks!
>
>
>
>
>
>
>
>
>
> 2016-08-13 9:33 GMT+09:00 Sreekanth Jella <srikanth.jella@gmail.com>:
>
> Hi Folks,
>
>
>
> I am trying to flatten a variety of XML documents using DataFrames. I'm
> using the spark-xml package, which automatically infers the schema and
> creates a DataFrame.
>
>
>
> I do not want to hard-code any column names in the DataFrame, as I have
> many varieties of XML documents and each may have deeply nested child
> nodes. I simply want to flatten any type of XML and then write the output
> data to a Hive table. Could you please give some expert advice on this?
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
>
>
> Expected output:
>
> id, name, clerk.cid, clerk.cname
> 1, foo, 2, cname2
> 1, foo, 3, cname3
>
>
>
> Thanks,
>
> Sreekanth Jella
>
>
>
>
>
>
