spark-user mailing list archives

From Phillip Henry <londonjava...@gmail.com>
Subject Re: structured streaming handling validation and json flattening
Date Tue, 12 Feb 2019 09:58:49 GMT
Hi,

I'm in a somewhat similar situation. Here's what I do (it seems to be
working so far):

1. Stream in the JSON as a plain string.
2. Feed this string into a JSON library to validate it (I use Circe).
3. Using the same library, parse the JSON and extract fields X, Y and Z.
4. Create a dataset with fields X, Y and Z plus the full JSON as a String.
5. Write this dataset to HDFS as Parquet partitioned on X and sorted on Y.
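The steps above can be sketched roughly like this in Scala. Field names (x, y, z) and paths are placeholders, error handling is minimal, and this is a sketch rather than production code:

```scala
import io.circe.parser.parse
import org.apache.spark.sql.SparkSession

object JsonStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("json-validate").getOrCreate()
    import spark.implicits._

    // 1. Stream in the JSON as a plain string.
    val raw = spark.readStream.text("/in/json").as[String]

    // 2-4. Validate and parse each line with circe; keep only records
    // where parsing succeeds and the expected fields are present,
    // carrying the original JSON string along as a column.
    val parsed = raw.flatMap { line =>
      parse(line).toOption.flatMap { json =>
        val c = json.hcursor
        for {
          x <- c.get[String]("x").toOption
          y <- c.get[String]("y").toOption
          z <- c.get[String]("z").toOption
        } yield (x, y, z, line)
      }
    }.toDF("x", "y", "z", "json")

    // 5. Write as Parquet partitioned on x. (Sorting rows on y within each
    // file is not available on a plain streaming sink; it would need
    // foreachBatch with sortWithinPartitions per micro-batch.)
    parsed.writeStream
      .format("parquet")
      .option("path", "/out/parquet")
      .option("checkpointLocation", "/chk")
      .partitionBy("x")
      .start()
      .awaitTermination()
  }
}
```

Dropping invalid records in the flatMap is the simplest choice; routing them to a separate "dead letter" sink for logging is an easy extension of the same shape.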

Obviously, this is not exactly the same as your use case (for instance, I
have no idea what your requirements are regarding "flattening the nesting
jsons"). Also, I extract only a few fields that I use as columns in the
resulting Dataset but then store the rest of the JSON as a string. However,
the principle should be the same for you.

HTH.

Phillip





On Mon, Feb 11, 2019 at 2:59 PM Jacek Laskowski <jacek@japila.pl> wrote:

> Hi Lian,
>
> "What have you tried?" would be a good starting point. Any help on this?
>
> How do you read the JSONs? readStream.json? You could use readStream.text
> followed by filter to include/exclude good/bad JSONs.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Feb 9, 2019 at 8:25 PM Lian Jiang <jiangok2006@gmail.com> wrote:
>
>> Hi,
>>
>> We have a structured streaming job that converts JSON into Parquet. We
>> want to validate the JSON records: if a record is not valid, we want to
>> log a message and skip writing it to Parquet. Also, the JSON contains
>> nested JSONs, which we want to flatten into other Parquet outputs using
>> the same streaming job. My questions are:
>>
>> 1. how to validate the JSON records in a structured streaming job?
>> 2. how to flatten the nested JSONs in a structured streaming job?
>> 3. is it possible to use one structured streaming job to validate the JSON,
>> convert it into one Parquet output, and flatten the nested JSONs into others?
>>
>> I think unstructured (DStream-based) streaming could achieve these goals,
>> but structured streaming is what the Spark community recommends.
>>
>> Appreciate your feedback!
>>
>
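For reference, Jacek's readStream.text-followed-by-filter suggestion could look roughly like this. Paths are placeholders, and invalid lines are simply dropped here rather than logged, so treat it as a sketch of the idea only:

```scala
import io.circe.parser.parse
import org.apache.spark.sql.SparkSession

object FilterBadJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("filter-bad-json").getOrCreate()
    import spark.implicits._

    // Read each line of input as an untyped string.
    val lines = spark.readStream.text("/in/json").as[String]

    // Keep only lines that parse as valid JSON; the complement
    // (parse(line).isLeft) would give the bad records for logging.
    val good = lines.filter(line => parse(line).isRight)

    good.writeStream
      .format("parquet")
      .option("path", "/out/good")
      .option("checkpointLocation", "/chk")
      .start()
      .awaitTermination()
  }
}
```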
