spark-user mailing list archives

From Eran Witkon <eranwit...@gmail.com>
Subject Extract compressed JSON within JSON
Date Thu, 24 Dec 2015 09:42:27 GMT
Hi,

I have a JSON file with the following row format:
{"cty":"United Kingdom","gzip":"H4sIAAAAAAAAAKtWystVslJQcs4rLVHSUUouqQTxQvMyS1JTFLwz89JT8nOB4hnFqSBxj/zS4lSF/DQFl9S83MSibKBMZVExSMbQwNBM19DA2FSpFgDvJUGVUwAAAA==","nm":"Edmund lronside","yrs":"1016"}

The gzip field is itself a gzip-compressed, Base64-encoded JSON document.

I want to read the file and build the full nested JSON as a row:

{"cty":"United Kingdom","hse":{"nm": "Cnut","cty": "United Kingdom","hse": "House of Denmark","yrs": "1016-1035"},"nm":"Edmund lronside","yrs":"1016"}

I already have a function that extracts the compressed field to a string.
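
For context, here is a minimal sketch of what such a function might look like. It is not the actual GZipHelper used below: the object name GZipSketch is hypothetical, and it assumes the field is standard Base64 over gzip-compressed UTF-8 text. It returns a Try so the caller can call .get on the result, as the code further down does.

```scala
import java.io.ByteArrayInputStream
import java.util.Base64
import java.util.zip.GZIPInputStream
import scala.io.Source
import scala.util.Try

// Hypothetical stand-in for GZipHelper; assumes Base64 + gzip + UTF-8.
object GZipSketch {
  // Base64-decode the field, gunzip the bytes, and read them as UTF-8 text.
  // Any decoding or decompression error is captured as a Failure.
  def unCompress(encoded: String): Try[String] = Try {
    val bytes = Base64.getDecoder.decode(encoded)
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }
}
```

Calling GZipSketch.unCompress(row.getString(1)) gives a Success holding the inner JSON text, or a Failure if the field is not valid Base64 or not valid gzip data.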

Questions:

*If I use the following code to build the RDD:*

import org.apache.spark.sql.Row

val jsonData = sqlContext.read.json(sourceFilesPath)

// Loop through the DataFrame and decompress the gzip field of each row
val jsonUnGzip = jsonData.map(r => Row(r.getString(0),
  GZipHelper.unCompress(r.getString(1)).get, r.getString(2),
  r.getString(3)))

*I get a row with 4 columns (String,String,String,String)*

 org.apache.spark.sql.Row = [United Kingdom,{"nm": "Cnut","cty": "United Kingdom","hse": "House of Denmark","yrs": "1016-1035"},Edmund lronside,1016]

*Now, I can't tell Spark to "re-parse" Col(1) as JSON, right?*

I have seen some posts about using case classes or explode, but I don't
understand how they would help here.

Eran
