hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15542) S3AFileSystem - FileAlreadyExistsException when prefix is a file and part of a directory tree
Date Mon, 18 Jun 2018 06:30:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515391#comment-16515391 ]

Steve Loughran commented on HADOOP-15542:
-----------------------------------------

As far as directory logic is concerned, the path {{s3://mybucket/d1/d2/d3/d4/d5/d6/d7}} is
equivalent to the path {{s3://mybucket/d1/d2/d3/d4/d5/d6/d7/}}, so the file at d7 is treated
as a direct parent of everything under {{d7/}}.
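
To make the failure concrete, here is a minimal sketch of an ancestor check over a flat set of keys. It is a toy model, not the S3A source: {{check_ancestors}} and {{FileAlreadyExists}} are hypothetical names, the keys are the ones from the report with the bucket prefix dropped, and the message text is copied from the stack trace.

```python
class FileAlreadyExists(Exception):
    """Hypothetical stand-in for Hadoop's FileAlreadyExistsException."""


def check_ancestors(keys, new_key):
    """Walk up from new_key and fail if any ancestor path is itself an object."""
    parts = new_key.strip("/").split("/")
    # Visit ancestors from the nearest parent up to the top-level entry.
    for depth in range(len(parts) - 1, 0, -1):
        ancestor = "/".join(parts[:depth])
        # "d7" and "d7/" name the same slot, so compare without the slash.
        if ancestor in keys:
            raise FileAlreadyExists(
                f"Can't make directory for path '{ancestor}' since it is a file."
            )


# Keys from the report: d7 is a text file, a.parquet lives beneath "d7/".
keys = {
    "d1/d2/d3/d4/d5/d6/d7",
    "d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet",
}

try:
    check_ancestors(keys, "d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet")
    outcome = "ok"
except FileAlreadyExists as e:
    outcome = str(e)
```

In this model the write of b.parquet is rejected as soon as the walk reaches d7, even though a.parquet already sits below the same prefix.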

View it like this. If you were in the local fs in a directory {{d1/d2/d3/d4/d5/d6}}, you couldn't
have a file d7 and a directory d7/, as {{ls}}, {{mv}} and {{rm}} wouldn't know what to do.
The Hadoop FS API has the same model on HDFS, localfs, maprfs, etc., and on other object stores
(adl://, for example). We have to keep that metaphor consistent, even when, if you look closely
at S3, you can create sets of objects which break the metaphor.
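
The local-filesystem half of that analogy can be checked directly. A small sketch, assuming only the Python standard library and a throwaway temp directory: it creates a file named d7 and then tries to create directories beneath it.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    d6 = os.path.join(root, "d1", "d2", "d3", "d4", "d5", "d6")
    os.makedirs(d6)

    # d7 is created as a regular file.
    d7 = os.path.join(d6, "d7")
    with open(d7, "w") as f:
        f.write("text\n")

    # Creating d7/d8/d9 needs d7 to be a directory, so the OS refuses.
    try:
        os.makedirs(os.path.join(d7, "d8", "d9"))
        error = None
    except OSError as e:
        error = e

    is_file = os.path.isfile(d7)
    is_dir = os.path.isdir(d7)
```

The exact exception subclass depends on the platform, which is why the sketch only catches {{OSError}}.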

bq. hive does the reads and does not seem to complain with the file/dir same name).

Not something we've ever tested for. If you have a directory structure set up and, say, {{s3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/}}
is used as the base of a query, nothing will notice that file up the tree. Pass in a query
with the base of {{s3://mybucket/d1/d2/d3/d4/d5/d6/d7}} and it will find the file, not any
of the children, because we do a HEAD before a LIST; the file gets found first. 
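
The probe order described above can be sketched as follows. This is a simplified model under stated assumptions, not the S3A implementation: {{get_status}} is a hypothetical helper, the bucket is a plain set of keys, and only the two probes mentioned here (exact-key lookup, then prefix listing) are modelled.

```python
def get_status(keys, path):
    """Classify a path the way a HEAD-then-LIST probe sequence would."""
    # A trailing slash is ignored: "d7" and "d7/" are the same path.
    key = path.strip("/")

    # Probe 1: equivalent of a HEAD on the exact key.
    if key in keys:
        return "file"

    # Probe 2: equivalent of a LIST of everything under "key/".
    prefix = key + "/"
    if any(k.startswith(prefix) for k in keys):
        return "directory"

    return "not found"


keys = {
    "d1/d2/d3/d4/d5/d6/d7",
    "d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet",
}

at_d7 = get_status(keys, "d1/d2/d3/d4/d5/d6/d7")
at_d9 = get_status(keys, "d1/d2/d3/d4/d5/d6/d7/d8/d9/")
```

Because the exact-key probe runs first, d7 is reported as a file and its children are never listed, while a base path further down the tree never triggers a lookup of d7 at all.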

Anyway, WONTFIX. Sorry. If you look at HADOOP-9565, we've discussed in the past what
a blobstore-specific API would look like, but have never come up with a good model for it.

> S3AFileSystem - FileAlreadyExistsException when prefix is a file and part of a directory tree
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15542
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15542
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.7.5, 3.1.0
>            Reporter: t oo
>            Priority: Blocker
>
> We are running Apache Spark jobs with aws-java-sdk-1.7.4.jar  hadoop-aws-2.7.5.jar
> to write parquet files to an S3 bucket. We have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7'
> in s3 (d7 being a text file). We also have keys 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet'
> (a.parquet being a file)
> When we run a spark job to write b.parquet file under 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/'
> (ie would like to have 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet'
> get created in s3) we get the below error
>  
>  
> org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://mybucket/d1/d2/d3/d4/d5/d6/d7' since it is a file.
> at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:861)
> at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
>  




