hive-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-22054) Avoid recursive listing to check if a directory is empty
Date Tue, 30 Jul 2019 11:09:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896009#comment-16896009
] 

Steve Loughran commented on HIVE-22054:
---------------------------------------

You are correct: the getContentSummary call will be horribly slow on S3; I didn't know anyone
used it. I filed HADOOP-16468 to speed it up, but it will still issue {{descendants/1000}}
LIST calls, which cost money as well as time.
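
To put the {{descendants/1000}} figure in concrete terms, here is a rough back-of-envelope sketch. It assumes each LIST request returns at most 1000 entries (the S3 page size) and that even an empty listing costs one request; the class and method names are hypothetical and not part of Hadoop or Hive.

```java
final class ListCallEstimate {
    // Assumption: one LIST request returns at most 1000 entries.
    static final long PAGE_SIZE = 1000;

    /** Rough number of LIST requests needed to enumerate the given number of descendants. */
    static long listCalls(long descendants) {
        if (descendants < 0) {
            throw new IllegalArgumentException("descendants must be non-negative");
        }
        // Ceiling division; an empty listing still needs one request.
        long pages = (descendants + PAGE_SIZE - 1) / PAGE_SIZE;
        return Math.max(1, pages);
    }

    public static void main(String[] args) {
        // A tree with 2.5 million descendants needs about 2500 LIST requests.
        System.out.println(listCalls(2_500_000L));
    }
}
```

An emptiness check, by contrast, only ever needs the first page of a single listing.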

For directories where the parent is deleted, things are low cost today. This patch will deliver
significant speedups in the case where the parent directory is not empty and one or more
subdirectories have a deep tree; it's the depth which is potentially more expensive than the
number of entries in a directory.



> Avoid recursive listing to check if a directory is empty
> --------------------------------------------------------
>
>                 Key: HIVE-22054
>                 URL: https://issues.apache.org/jira/browse/HIVE-22054
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.13.0, 1.2.0, 2.1.0, 3.1.1, 2.3.5
>            Reporter: Prabhas Kumar Samanta
>            Assignee: Prabhas Kumar Samanta
>            Priority: Major
>         Attachments: HIVE-22054.2.patch, HIVE-22054.patch
>
>
> During drop partition on a managed table, we first delete the directory corresponding
> to the partition. After that, we recursively delete the parent directory as well if it
> becomes empty. To do this emptiness check, we call Warehouse::getContentSummary(),
> which in turn recursively checks all files and subdirectories. This is a costly operation
> when a directory has a lot of files or subdirectories. The overhead is even more prominent
> for cloud-based file systems like S3, and it is unnecessary for an emptiness check.
> This recursive listing was introduced as part of HIVE-5220. Code snippet for reference:
> {code:java}
> // Warehouse.java
> public boolean isEmpty(Path path) throws IOException, MetaException {
>   ContentSummary contents = getFs(path).getContentSummary(path);
>   if (contents != null && contents.getFileCount() == 0 && contents.getDirectoryCount() == 1) {
>     return true;
>   }
>   return false;
> }
> // HiveMetaStore.java
> private void deleteParentRecursive(Path parent, int depth, boolean mustPurge, boolean needRecycle)
>   throws IOException, MetaException {
>   if (depth > 0 && parent != null && wh.isWritable(parent)) {
>     if (wh.isDir(parent) && wh.isEmpty(parent)) {
>       wh.deleteDir(parent, true, mustPurge, needRecycle);
>     }
>     deleteParentRecursive(parent.getParent(), depth - 1, mustPurge, needRecycle);
>   }
> }
> // Note: FileSystem::getContentSummary() performs a recursive listing.{code}
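
For illustration, an emptiness check only needs to know whether a directory has at least one direct child, so it can stop at the first entry instead of walking the whole tree. The sketch below shows that idea using java.nio.file on a local file system as a stand-in for Hadoop's FileSystem API. This is an assumption made to keep the example self-contained; it is not the code in the attached patch, and the class name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ShallowEmptyCheck {
    /**
     * Returns true if dir has no direct children.
     * Only the first entry of a single, non-recursive listing is examined,
     * so the cost does not depend on how deep any subdirectory tree is.
     */
    static boolean isEmpty(Path dir) throws IOException {
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            return !children.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("shallow-empty");
        System.out.println(isEmpty(root)); // prints "true"

        // A deep subtree under root: the check still looks at one entry only.
        Files.createDirectories(root.resolve("a/b/c"));
        System.out.println(isEmpty(root)); // prints "false"
    }
}
```

On an object store, the equivalent would be a single listing of the directory that is abandoned after the first result, rather than a full content summary.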



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
