spark-user mailing list archives

From Benjamin Kim
Subject Glue-like Functionality
Date Sat, 08 Jul 2017 17:49:35 GMT
Has anyone looked at AWS Glue? I was wondering whether something similar is going to be built into
Spark Structured Streaming. I like the Data Catalog idea for storing and tracking every data source and destination.
It profiles the data to derive the schema and data types. It also does a sort of automated
schema evolution when the schema changes, which leaves only the transformation logic to
the ETL developer. I think some of this could enhance or simplify Structured Streaming. For
example, AWS S3 could be catalogued as a Data Source; in Structured Streaming, the Input DataFrame
would be created like a SQL view based on the S3 Data Source; lastly, the Transform logic, if
any, would just manipulate the data going from the Input DataFrame to the Result DataFrame, which
is another view based on a catalogued Data Destination. This would relieve the ETL developer
of caring about any Data Source or Destination. All server information, access credentials,
data schemas, folder structures, file formats, and any other properties could be securely
stored away, accessible to only a select few.
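
To make the idea concrete, here is a rough sketch in Python of what I have in mind. It is not an existing Spark or Glue API: CatalogEntry, DataCatalog, run_pipeline, the bucket paths, and the entry names are all made up for illustration, and the catalog is just an in-memory dict standing in for a secured service. The only real Spark calls are the readStream/writeStream chains inside run_pipeline, and passing the schema as a DDL string assumes a PySpark version that accepts one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record: everything needed to reach one data set."""
    fmt: str                     # file format, e.g. "json" or "parquet"
    path: str                    # location, e.g. an S3 prefix
    schema_ddl: str = ""         # schema as a DDL string; unused for sinks
    options: Dict[str, str] = field(default_factory=dict)


class DataCatalog:
    """Hypothetical in-memory catalog; a real one would be a secured service."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, name: str, entry: CatalogEntry) -> None:
        self._entries[name] = entry

    def resolve(self, name: str) -> CatalogEntry:
        # Raises KeyError when the name has not been catalogued.
        return self._entries[name]


def run_pipeline(spark, catalog: DataCatalog, source: str, destination: str,
                 transform: Callable = lambda df: df):
    """Wire a catalogued source to a catalogued destination.

    The ETL developer supplies only `transform`; formats, paths, schemas
    and options all come from the catalog. `spark` is a SparkSession.
    """
    src = catalog.resolve(source)
    dst = catalog.resolve(destination)

    # Input DataFrame: a streaming view over the catalogued source.
    input_df = (spark.readStream
                .format(src.fmt)
                .schema(src.schema_ddl)
                .options(**src.options)
                .load(src.path))

    # Result DataFrame: whatever the transformation logic produces.
    result_df = transform(input_df)

    # Start a streaming query that writes to the catalogued destination.
    return (result_df.writeStream
            .format(dst.fmt)
            .options(**dst.options)
            .start(dst.path))


# Example registrations; bucket names and paths are placeholders.
catalog = DataCatalog()
catalog.register("events_in", CatalogEntry(
    fmt="json",
    path="s3a://example-bucket/events/",
    schema_ddl="user STRING, ts TIMESTAMP",
    options={"maxFilesPerTrigger": "10"},
))
catalog.register("events_out", CatalogEntry(
    fmt="parquet",
    path="s3a://example-bucket/clean/",
    options={"checkpointLocation": "s3a://example-bucket/checkpoints/clean/"},
))
```

With something like this, a job would reduce to run_pipeline(spark, catalog, "events_in", "events_out", my_transform), and the developer would never see a path or a credential. Profiling and schema evolution are left out of the sketch entirely.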

I'm just curious to know if anyone has thought the same thing.
