spark-issues mailing list archives

From "Corey J. Nolet (JIRA)" <>
Subject [jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
Date Wed, 04 Feb 2015 03:15:35 GMT


Corey J. Nolet commented on SPARK-5260:

I'm thinking all the schema-specific functions should be pulled out into an object called
JsonSchemaFunctions. The allKeysWithValueTypes() and createSchema() functions should be exposed
via the public API and documented well based on their use.
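As a rough illustration of what exposing allKeysWithValueTypes() might look like, here is a minimal, self-contained sketch. The object name JsonSchemaFunctions comes from the proposal above, but the signature is hypothetical: it walks a plain Map[String, Any] stand-in for a parsed JSON record and returns flattened key paths with type names, whereas the real JsonRDD method produces Spark SQL DataTypes.

```scala
// Hypothetical sketch of the proposed public surface. A parsed JSON record is
// modeled as Map[String, Any]; the real implementation would return
// org.apache.spark.sql.types.DataType values rather than class-name strings.
object JsonSchemaFunctions {

  // Recursively collect every (flattened key path, type name) pair in a record.
  def allKeysWithValueTypes(record: Map[String, Any],
                            prefix: String = ""): Set[(String, String)] =
    record.flatMap {
      // Nested JSON object: recurse, extending the dotted key path.
      case (k, v: Map[_, _]) =>
        allKeysWithValueTypes(v.asInstanceOf[Map[String, Any]], s"$prefix$k.")
      // Primitive value: record its runtime type name.
      case (k, v) =>
        Set((s"$prefix$k", v.getClass.getSimpleName))
    }.toSet
}

// Usage: nested keys come back as dotted paths with their value types.
val elements = JsonSchemaFunctions.allKeysWithValueTypes(
  Map("name" -> "corey", "stats" -> Map("count" -> 1L)))
```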

For the project I have that's using these functions, I'm actually applying allKeysWithValueTypes()
over my entire RDD as it's being saved to a sequence file, and I'm using an Accumulator[Set[(String,
DataType)]] to aggregate all the schema elements for the RDD into a final Set. I can then store
off the schema and later call createSchema() to get the final StructType that can be used with
the SQL table. I also had to write an isConflicted(Set[(String, DataType)]) function to determine
whether a JSON object or JSON array was encountered as a primitive type in one of the records
in the RDD, or vice versa.
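The conflict check described above can be sketched in a few lines. This is a simplified model, not the actual project code: type names are plain Strings instead of Spark's org.apache.spark.sql.types.DataType, and the object name SchemaConflict is made up for the example. The idea is just that a flattened key is conflicted when the aggregated Set contains it with more than one distinct type.

```scala
// Simplified stand-in for the isConflicted(Set[(String, DataType)]) check:
// a key conflicts when the same path was observed with more than one distinct
// type across records, e.g. both as a struct/array and as a primitive.
object SchemaConflict {
  def isConflicted(elements: Set[(String, String)]): Boolean =
    elements
      .groupBy { case (key, _) => key } // group observed types by key path
      .values
      .exists(_.size > 1)               // >1 distinct type for one key
}

// "age" was seen as both a long and a string somewhere in the RDD.
val merged = Set(("name", "StringType"), ("age", "LongType"), ("age", "StringType"))
val conflicted = SchemaConflict.isConflicted(merged)
```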

> Expose JsonRDD.allKeysWithValueTypes() in a utility class 
> ----------------------------------------------------------
>                 Key: SPARK-5260
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Corey J. Nolet
>            Assignee: Corey J. Nolet
> I have found this method extremely useful when implementing my own strategy for inferring
a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD
class into my own project, but I think it would be immensely useful to keep the code in Spark
and expose it publicly somewhere else, like an object called JsonSchema.

This message was sent by Atlassian JIRA

