spark-dev mailing list archives

From "Lotkowski, Michael" <mllot...@amazon.co.uk.INVALID>
Subject Support subdirectories when accessing partitioned Parquet Hive table
Date Tue, 03 Dec 2019 10:27:35 GMT
Originally https://issues.apache.org/jira/browse/SPARK-30024


Hi all,

We have run into issues when trying to read a partitioned Parquet table created by Hive. I
think I have narrowed down the cause to how InMemoryFileIndex builds its parent -> file mapping.

The folder structure created by Hive is as follows:

s3://bucket/table/date=2019-11-25/subdir1/data1.parquet

s3://bucket/table/date=2019-11-25/subdir2/data2.parquet
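
For context, here is a minimal repro sketch of what we observe, assuming a partitioned Hive
metastore table (the name my_db.my_table is hypothetical) defined over the layout above:

    // Hypothetical repro sketch; assumes spark-shell, so `spark` is a SparkSession
    // with Hive support. With spark.sql.hive.convertMetastoreParquet left at its
    // default ("true"), the query comes back empty even though data1.parquet and
    // data2.parquet exist under the partition.
    val df = spark.sql("SELECT * FROM my_db.my_table WHERE date = '2019-11-25'")
    df.show()  // 0 rows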

Looking through the code, it seems that InMemoryFileIndex creates a mapping from leaf
directories to their child files, yielding the following:

  val leafDirToChildrenFiles = Map(
    "s3://bucket/table/date=2019-11-25/subdir1" -> Seq("s3://bucket/table/date=2019-11-25/subdir1/data1.parquet"),
    "s3://bucket/table/date=2019-11-25/subdir2" -> Seq("s3://bucket/table/date=2019-11-25/subdir2/data2.parquet")
  )
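
To illustrate where those keys come from, here is a simplified sketch of the grouping (my
assumption about what InMemoryFileIndex effectively does: group each leaf file by its
immediate parent directory):

    // Simplified sketch: group leaf files by their immediate parent directory.
    val leafFiles = Seq(
      "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
      "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet")
    val grouped = leafFiles.groupBy(f => f.substring(0, f.lastIndexOf('/')))
    // keys: .../date=2019-11-25/subdir1 and .../date=2019-11-25/subdir2,
    // not the partition directory itself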

This mapping is then used in PartitioningAwareFileIndex to prune partitions. From my
understanding, pruning works by looking up the partition path, which in this case is
s3://bucket/table/date=2019-11-25, in leafDirToChildrenFiles; since only the subdirectories
appear as keys, it fails to find any files for this partition.
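
In other words (a hypothetical illustration of the failed lookup):

    // The partition path known to the metastore is the partition directory itself.
    val partitionPath = "s3://bucket/table/date=2019-11-25"
    leafDirToChildrenFiles.get(partitionPath)  // None: only the subdirX paths are keys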

My suggested fix is to update how InMemoryFileIndex builds the mapping: instead of mapping
each parent directory to its files, map the rootPath to the files. More concretely:
https://gist.github.com/lotkowskim/76e8ff265493efd0b2b7175446805a82
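
In spirit, the resulting mapping would look like this (a rough sketch of the idea only, not
the gist's actual change):

    // Sketch: group every leaf file under the rootPath being listed, so the
    // partition directory itself becomes the key and pruning can find the files.
    val rootPathToChildrenFiles = Map(
      "s3://bucket/table/date=2019-11-25" -> Seq(
        "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
        "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet"))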

I have tested this by updating the jar running on EMR, and we can now correctly read the data
from these partitioned tables. It's also worth noting that we can read the data, without any
modifications to the code, if we use the following settings (applied as sketched after the list):

"spark.sql.hive.convertMetastoreParquet" to "false",
"spark.hive.mapred.supports.subdirectories" to "true",
"spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive" to "true"

However, with these settings we lose the ability to prune partitions, causing us to read the
entire table every time, since we aren't using a Spark relation.

I want to start a discussion on whether this is the correct change, or whether we are missing
something more obvious. In either case, I would be happy to fully implement the change.

Thanks,

Michael




Amazon Development Centre (Scotland) Limited registered office: Waverley Gate, 2-4 Waterloo
Place, Edinburgh EH1 3EG, Scotland. Registered in Scotland Registration number SC26867


