spark-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Quirk in how Spark DF handles JSON input records?
Date Wed, 02 Nov 2016 18:50:13 GMT
This may be a silly mistake on my part…

I'm working through an example using Chicago’s crime data. (There’s a lot of it going around. ;-)

The goal is to read a file containing a JSON record that describes the crime data CSV for
ingestion into a DataFrame, and then to write the output to a Parquet file.
(Pretty simple, right?)

I ran this both in Zeppelin and in the spark-shell (2.0.1).

// Setup of environment
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

// Load the JSON from file:
val df = spark.read.json("~/datasets/Chicago_Crimes.json")
df.show()


The output
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "metadata": {|
| "source": "CSV_...|
| "table": "Chica...|
| "compression": ...|
| },|
| "columns": [{|
| "col_name": "Id",|
| "data_type": "I...|
| }, {|
| "col_name": "Ca...|
| "data_type": "B...|
| }, {|

I checked the JSON file against a JSONLint tool (two, actually).
My JSON record is valid, with no errors (see below).

So what’s happening?  What am I missing?
The goal is to create an ingestion schema for each source. From this I can build the schema
for the Parquet file or other data target.
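
To make that goal concrete, here is a minimal sketch of how the descriptor could be turned into a Spark schema. It rests on several assumptions. The helper names (readDescriptor, toSparkType, ingestionSchema) are hypothetical, and the mapping from the data_type strings to Spark types is only a guess at what is intended (BYTE_ARRAY is treated as a string here). It also reads each file as one whole string through wholeTextFiles, on the assumption that the JSON reader treats every input line as a separate record, which might be related to the _corrupt_record rows above when the record is pretty-printed across many lines.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types._

// Hypothetical helper: read each file under `path` as ONE JSON document.
// wholeTextFiles yields (fileName, fileContents) pairs, so a record that
// spans many lines reaches the JSON reader as a single string.
// (On Spark 2.2 or later, spark.read.option("multiLine", true).json(path)
// should be an alternative.)
def readDescriptor(spark: SparkSession, path: String): DataFrame =
  spark.read.json(spark.sparkContext.wholeTextFiles(path).values)

// Hypothetical mapping from the descriptor's data_type names to Spark types.
// Treating BYTE_ARRAY as StringType is an assumption. Unknown names fail
// loudly so that typos in the descriptor are caught early.
def toSparkType(name: String): DataType = name match {
  case "INT32"      => IntegerType
  case "INT64"      => LongType
  case "DOUBLE"     => DoubleType
  case "BOOLEAN"    => BooleanType
  case "BYTE_ARRAY" => StringType
  case other        => throw new IllegalArgumentException(s"Unmapped data_type: $other")
}

// Build a StructType from the "columns" array of a one-row descriptor.
def ingestionSchema(descriptor: DataFrame): StructType = {
  val columns = descriptor.select(explode(col("columns")).as("c"))
  val rows = columns.select(col("c.col_name"), col("c.data_type")).collect()
  StructType(rows.map { r =>
    StructField(r.getString(0), toSparkType(r.getString(1)), nullable = true)
  })
}
```

With these in place, ingestionSchema(readDescriptor(spark, path)) would give a StructType that could be handed to spark.read.schema(...) when loading the CSV itself.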

Thx

-Mike

My JSON record:
{
"metadata": {
"source": "CSV_FILE",
"table": "Chicago_Crime",
"compression": "SNAPPY"
},
"columns": [{
"col_name": "Id",
"data_type": "INT64"
}, {
"col_name": "Case_No.",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Date",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Block",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "IUCR",
"data_type": "INT32"
}, {
"col_name": "Primary_Type",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Description",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Location_Description",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Arrest",
"data_type": "BOOLEAN"
}, {
"col_name": "Domestic",
"data_type": "BOOLEAN"
}, {
"col_name": "Beat",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "District",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Ward",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Community",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "FBI_Code",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "X_Coordinate",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Y_Coordinate",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Year",
"data_type": "INT32"
}, {
"col_name": "Updated_On",
"data_type": "BYTE_ARRAY"
}, {
"col_name": "Latitude",
"data_type": "DOUBLE"
}, {
"col_name": "Longitude",
"data_type": "DOUBLE"
}, {
"col_name": "Location",
"data_type": "BYTE_ARRAY"


}]
}