I agree this can be a little annoying. The reason it is done this way is to support cases where the JSON file is huge. To allow splitting it, a separator is needed, and newline is the separator used (as with all text files in Hadoop and Spark).
I have always wondered why support was never added for the case where each file is small (e.g. contains a single object), but the current implementation assumes each line holds a legal JSON object.
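For illustration (made-up records), a file that spark.read.json handles out of the box has one complete object per line:

{"id": 1, "name": "a"}
{"id": 2, "name": "b"}

whereas a pretty-printed object that spans several lines trips it up, because each line is parsed on its own:

{
  "id": 1,
  "name": "a"
}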
What I do to overcome this is use RDDs (in pyspark):
import re

# get an RDD of the text content. The map is needed because wholeTextFiles
# returns (filename, file content) pairs and we only want the content
jsonRDD = sc.wholeTextFiles(filename).map(lambda x: x[1])
# remove whitespace. This can be too aggressive, since it also removes whitespace
# inside string values, so you may prefer to strip just the end-of-line characters (e.g. \r, \n)
js = jsonRDD.map(lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE))
# convert the RDD to a DataFrame. If you have your own schema, this is where you should add it
df = spark.read.json(js)
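Note that wholeTextFiles also accepts a directory or a glob, so the same snippet works for a folder full of small one-object files. Once you have the DataFrame you can sanity-check it the usual way:

df.printSchema()
df.show(5, truncate=False)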
On Nov 2, 2016, at 2:22 PM, Daniel Siegmann <firstname.lastname@example.org> wrote:
Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a record separator by default. While it is possible to use a different string as a record separator, what would you use in the case of JSON?
If you do some Googling I suspect you'll find some possible solutions. Personally, I would just use a separate JSON library (e.g. json4s) to parse this metadata into an object, rather than trying to read it in through Spark.
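If you go that route from pyspark, the same idea is just the standard json module on the driver rather than json4s; a minimal sketch (the path and field name below are made up):

import json

# read the whole (possibly pretty-printed, multi-line) metadata file on the driver
with open("/path/to/metadata.json") as f:   # hypothetical path
    meta = json.load(f)

column_descriptors = meta["columns"]        # hypothetical field name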
Yeah, that’s the basic idea.
This JSON is metadata to help drive the process, not row records… although the column descriptors are row records, so in the short term I could cheat and just store those in a separate file.