spark-dev mailing list archives

From Dhrubajyoti Hati <dhruba.w...@gmail.com>
Subject Re: Error while reading hive tables with tmp/hidden files inside partitions
Date Wed, 22 Apr 2020 19:15:22 GMT
Just wondering if anyone could help me out with this.

Thank you!




*Regards, Dhrubajyoti Hati.*


On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati <dhruba.work@gmail.com>
wrote:

> Hi,
>
> Is there any way to skip files that start with a dot (.) or end with .tmp
> in a Hive partition when reading a Hive table with the spark.read.table
> method?
>
> I tried using a PathFilter, but it didn't work. I am using spark-submit
> and passing my Python file (PySpark) containing the source code.
>
> spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.input.pathFilter.class",
> "com.abc.hadoop.utility.TmpFileFilter")
>
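> import org.apache.hadoop.fs.{Path, PathFilter}
>
> // Rejects any path whose file name ends with ".tmp".
> class TmpFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = !path.getName.endsWith(".tmp")
> }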
>
> Still, the detailed logs show that .tmp files are being considered:
> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
> maprfs:///a/hour=05/host=abc/FlumeData.1587559137715
> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
> maprfs:///a/hour=05/host=abc/FlumeData.1587556815621
> 20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus
> maprfs:///a/hour=05/host=abc/.FlumeData.1587560277337.tmp
>
>
> Is there any way to skip temporary (.tmp) files or hidden files (file names
> starting with a dot or underscore) in Hive partitions when reading from
> Spark?
>
>
>
>
> *Regards, Dhrubajyoti Hati.*
>
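
For reference, the rule described in the quoted question (skip file names that start with a dot or underscore, or end with .tmp) can be sketched in plain Python. This is only an illustration of the predicate, applied to the paths from the log excerpt above; the function name is_hidden_or_tmp is hypothetical, and the sketch does not show how Spark or Hive would apply such a filter when listing a partition.

```python
import posixpath


def is_hidden_or_tmp(path: str) -> bool:
    """Return True if the last path component looks hidden or temporary.

    Hypothetical helper: hidden means the name starts with "." or "_",
    temporary means the name ends with ".tmp".
    """
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith((".", "_")) or name.endswith(".tmp")


# Paths taken from the debug log in the quoted message.
paths = [
    "maprfs:///a/hour=05/host=abc/FlumeData.1587559137715",
    "maprfs:///a/hour=05/host=abc/FlumeData.1587556815621",
    "maprfs:///a/hour=05/host=abc/.FlumeData.1587560277337.tmp",
]

# Keep only the files that the rule would not discard.
visible = [p for p in paths if not is_hidden_or_tmp(p)]
for p in visible:
    print(p)
```

Note that the TmpFileFilter in the quoted message checks only the .tmp suffix; the sketch also covers the dot and underscore prefixes mentioned in the question.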
