spark-user mailing list archives

From "Mendelson, Assaf" <Assaf.Mendel...@rsa.com>
Subject RE: Quirk in how Spark DF handles JSON input records?
Date Thu, 03 Nov 2016 07:41:05 GMT
I agree this can be a little annoying. The reason it is done this way is to support cases
where the JSON file is huge. To allow splitting it, a separator is needed, and newline is
the separator used (as it is for all text files in Hadoop and Spark).
I have always wondered why support was never added for the case where each file is small
(e.g. contains a single object), but the current implementation assumes each line holds a
valid JSON object.
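For example, spark.read.json will only parse input where each line is a complete object,
something like this (people.json and its fields are just hypothetical placeholders):

# people.json, line-delimited as Spark expects:
#   {"name": "alice", "age": 30}
#   {"name": "bob", "age": 25}
df = spark.read.json("people.json")
# A pretty-printed object spread over several lines would instead come back
# under the _corrupt_record column.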

What I do to overcome this is use RDDs (using pyspark):

# Get an RDD of the text content. The map is needed because wholeTextFiles
# returns tuples of (filename, file content).
jsonRDD = sc.wholeTextFiles(filename).map(lambda x: x[1])

# Remove whitespace. This can actually be too aggressive, since it also strips
# whitespace inside string values, so you may want to remove just the
# line-ending characters (e.g. \r, \n) instead.
import re
js = jsonRDD.map(lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE))

# Convert the RDD to a dataframe. If you have your own schema, this is where
# you should add it.
df = spark.read.json(js)
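If you do want to supply your own schema at that last step, something like this should
work (the field names here are just placeholders):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder fields; replace with whatever your JSON actually contains.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("count", LongType(), True),
])
df = spark.read.schema(schema).json(js)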

Assaf.

From: Michael Segel [mailto:msegel_hadoop@hotmail.com]
Sent: Wednesday, November 02, 2016 9:39 PM
To: Daniel Siegmann
Cc: user @spark
Subject: Re: Quirk in how Spark DF handles JSON input records?


On Nov 2, 2016, at 2:22 PM, Daniel Siegmann <dsiegmann@securityscorecard.io> wrote:

Yes, it needs to be on a single line. Spark (or Hadoop, really) treats the newline as the
record separator by default. While it is possible to use a different string as a record
separator, what would you use in the case of JSON?
If you do some Googling I suspect you'll find some possible solutions. Personally, I would
just use a separate JSON library (e.g. json4s) to parse this metadata into an object, rather
than trying to read it in through Spark.
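In pyspark the analogous approach would just be the standard json module, e.g. (with a
hypothetical metadata.json sitting on the driver's local filesystem):

import json

# Plain Python parsing on the driver; multi-line / pretty-printed JSON is fine
# here because Spark never needs to split the file.
with open("metadata.json") as f:
    metadata = json.load(f)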


Yeah, that’s the basic idea.

This JSON is metadata to help drive the process, not row records… although the column
descriptors are row records, so in the short term I could cheat and just store those in a file.

:-(


--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001
