There is a way to obtain these malformed/rejected records. A record can be rejected not only because of a column-count mismatch but also because a value's data type does not match the type declared in the schema.
To obtain the rejected records, you can do the following:
1. Add an extra column (e.g. CorruptRecCol) of type StringType() to your schema.
2. In the DataFrame reader, set the mode to 'PERMISSIVE' and set the option columnNameOfCorruptRecord to CorruptRecCol.
3. The column CorruptRecCol will contain the complete raw record if it is malformed/corrupted, and will be null if the record is valid. So you can use a filter (CorruptRecCol IS NOT NULL) to obtain the malformed/corrupted records, and the opposite filter (CorruptRecCol IS NULL) to keep only the valid ones.
You can use any column name to hold the invalid records. I have used CorruptRecCol just as an example.
This example is for PySpark. A similar approach works in Java/Scala as well.