spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phillip Henry <>
Subject Re: structured streaming handling validation and json flattening
Date Tue, 12 Feb 2019 09:58:49 GMT

I'm in a somewhat similar situation. Here's what I do (it seems to be
working so far):

1. Stream in the JSON as a plain string.
2. Feed this string into a JSON library to validate it (I use Circe).
3. Using the same library, parse the JSON and extract fields X, Y and Z.
4. Create a dataset with fields X, Y, Z and the JSON as a String/
5. Write this dataset to HDFS as Parquet partitioned on X and sorted on Y.

Obviously, this is not exactly the same as your use case (for instance, I
have no idea what your requirements are regarding "flattening the nesting
jsons"). Also, I extract only a few fields that I use as columns in the
resulting Dataset but then store the rest of the JSON as a string. However,
the principle should be the same for you.



On Mon, Feb 11, 2019 at 2:59 PM Jacek Laskowski <> wrote:

> Hi Lian,
> "What have you tried?" would be a good starting point. Any help on this?
> How do you read the JSONs? readStream.json? You could use readStream.text
> followed by filter to include/exclude good/bad JSONs.
> Pozdrawiam,
> Jacek Laskowski
> ----
> Mastering Spark SQL
> Spark Structured Streaming
> Mastering Kafka Streams
> Follow me at
> On Sat, Feb 9, 2019 at 8:25 PM Lian Jiang <> wrote:
>> Hi,
>> We have a structured streaming job that converting json into parquets. We
>> want to validate the json records. If a json record is not valid, we want
>> to log a message and refuse to write it into the parquet. Also the json has
>> nesting jsons and we want to flatten the nesting jsons into other parquets
>> by using the same streaming job. My questions are:
>> 1. how to validate the json records in a structured streaming job?
>> 2. how to flattening the nesting jsons in a structured streaming job?
>> 3. is it possible to use one structured streaming job to validate json,
>> convert json into a parquet and convert nesting jsons into other parquets?
>> I think unstructured streaming can achieve these goals but structured
>> streaming is recommended by spark community.
>> Appreciate your feedback!

View raw message