spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott W <defy...@gmail.com>
Subject Spark Dataframe validating column names
Date Tue, 05 Jul 2016 05:02:59 GMT
Hello,

I'm processing events using Dataframes converted from a stream of JSON
events (Spark streaming) which eventually gets written out as as Parquet
format. There are different JSON events coming in so we use schema
inference feature of Spark SQL

The problem is some of the JSON events contains spaces in the keys which I
want to log and filter/drop such events from the data frame before
converting it to Parquet because ,;{}()\n\t= are considered special
characters in Parquet schema (CatalystSchemaConverter) as listed in [1]
below and thus should not be allowed in the column names.

How can I do such validations in Dataframe on the column names and drop
such an event altogether without erroring out the Spark Streaming job?

[1] Spark's CatalystSchemaConverter

def checkFieldName(name: String): Unit = {
    // ,;{}()\n\t= and space are special characters in Parquet schema
    checkConversionRequirement(
      !name.matches(".*[ ,;{}()\n\t=].*"),
      s"""Attribute name "$name" contains invalid character(s) among "
,;{}()\\n\\t=".
         |Please use alias to rename it.
       """.stripMargin.split("\n").mkString(" ").trim)
  }

Mime
View raw message