spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joarley Wanzeler de Moraes <jmor...@babbel.com>
Subject Multi-schema data pipeline
Date Fri, 10 Sep 2021 13:56:32 GMT
We want to create a Spark-based streaming data pipeline that consumes from a source (e.g. Kinesis),
apply some basic transformations, and write the data to a file-based sink (e.g. s3). We have
thousands of different event types coming in and the transformations would take place on a
set of common fields. Once the events are transformed, they need to be split by writing them
to different output locations according to the event type. This pipeline is described in the
figure below:

https://ibb.co/8jq3MR6

Goals:

- To infer schema safely in order to apply transformations based on the merged schema. The
assumption is that the event types are compatible with each other (i.e. without overlapping
schema structure) but the schema of any of them can change at unpredictable times. The pipeline
should handle it dynamically.
- To split the output after the transformations while keeping the original individual schema?


What we considered:
- Schema inference seems to work fine on sample data. But is it safe for production usecases
and for a large number of different event types?
- Simply using `partitionBy("name")` while writing out is not enough because it would use
the merged schema.




---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Mime
View raw message