Hi Folks,

 

I am trying to flatten a variety of XML documents using DataFrames. I'm using the spark-xml package, which automatically infers the schema and creates a DataFrame.

I do not want to hard-code any column names in the DataFrame, because I have many kinds of XML documents and each may have deeply nested child nodes. I simply want to flatten any type of XML and then write the output to a Hive table. Could you please give some expert advice on how to do this?

An example XML document and the expected output are given below.

Sample XML:

<emplist>
  <emp>
    <manager>
      <id>1</id>
      <name>foo</name>
      <subordinates>
        <clerk>
          <cid>2</cid>
          <cname>cname2</cname>
        </clerk>
        <clerk>
          <cid>3</cid>
          <cname>cname3</cname>
        </clerk>
      </subordinates>
    </manager>
  </emp>
</emplist>

 

Expected output:

id, name, clerk.cid, clerk.cname
1, foo, 2, cname2
1, foo, 3, cname3
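
To make the kind of flattening I mean concrete, here is a minimal sketch in plain Python using only the standard library. It is not Spark code and not a DataFrame solution; it only illustrates the rule I am after: nested elements become dotted column names, and a repeated element (such as clerk) produces one output row per occurrence. The names `flatten` and `SAMPLE` are made up for this illustration, and the clerk values in `SAMPLE` are chosen to match the expected output above. The sketch assumes attributes and mixed text content can be ignored, and it cross-joins sibling groups when more than one of them repeats. It also keeps the full path below emp, so the columns come out as manager.id and manager.subordinates.clerk.cid rather than the shortened names shown above.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Hypothetical sample; clerk values chosen to match the expected output.
SAMPLE = """
<emplist>
  <emp>
    <manager>
      <id>1</id>
      <name>foo</name>
      <subordinates>
        <clerk><cid>2</cid><cname>cname2</cname></clerk>
        <clerk><cid>3</cid><cname>cname3</cname></clerk>
      </subordinates>
    </manager>
  </emp>
</emplist>
"""


def flatten(elem, prefix=""):
    """Return a list of flat rows (dicts) for one element, with no
    hard-coded tag names. Attributes and mixed content are ignored."""
    children = list(elem)
    if not children:
        # Leaf element: a single column holding its text.
        return [{prefix: (elem.text or "").strip()}]

    # Group children by tag so repeated tags can be treated like arrays.
    groups = {}
    for child in children:
        groups.setdefault(child.tag, []).append(child)

    per_tag_rows = []
    for tag, elems in groups.items():
        path = f"{prefix}.{tag}" if prefix else tag
        rows = []
        for e in elems:
            # Each occurrence of a repeated tag contributes its own rows.
            rows.extend(flatten(e, path))
        per_tag_rows.append(rows)

    # Combine sibling groups: one merged row per combination.
    result = []
    for combo in product(*per_tag_rows):
        merged = {}
        for part in combo:
            merged.update(part)
        result.append(merged)
    return result


root = ET.fromstring(SAMPLE)
rows = [row for emp in root.iter("emp") for row in flatten(emp)]
for row in rows:
    print(row)
```

In Spark terms I would expect the equivalent to walk the inferred schema, promote struct fields to top-level columns, and explode array columns, but I do not know the best way to do that generically.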

 

Thanks,

Sreekanth Jella