beam-commits mailing list archives

From "Devon Meunier (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2150) Support for recursive wildcards in GcsPath
Date Thu, 04 May 2017 14:18:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996790#comment-15996790
] 

Devon Meunier commented on BEAM-2150:
-------------------------------------

This is what I went off: https://github.com/GoogleCloudPlatform/gsutil/blob/cfe99899910375bf3695b7e4a119ee0074110259/gslib/addlhelp/wildcards.py

And then I ran some tests with {{gsutil}} using the {{-D}} flag to see the kinds of requests
that were being made.

When you do normal globbing, {{delimiter=%2F}} is passed along, which truncates the results
from the API so that you get {{prefix[^delimiter]*[delimiter]}}. However, this isn't actually
what Beam is doing: if you work against a bucket with a lot of files under the prefix, you can
see that it pages through every single object key and then filters down. The end behaviour
is the same because of how we filter by regex, but it also made it really easy to relax
the regex.
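
To illustrate what "relaxing the regex" means here, below is a minimal sketch of a glob-to-regex
translation in which {{*}} stays within one path segment and {{**}} is allowed to cross {{/}}.
It is not the code in GcsUtil or in the PR: the class name {{GlobRegexSketch}}, the methods
{{globToRegex}} and {{filter}}, and the sample object names are all hypothetical, and the full
listing is simulated with an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not Beam's implementation.
final class GlobRegexSketch {

  /** Translates a glob to a regex: "**" may cross "/", while "*" and "?" may not. */
  static String globToRegex(String glob) {
    StringBuilder regex = new StringBuilder();
    int i = 0;
    while (i < glob.length()) {
      char c = glob.charAt(i);
      if (c == '*' && i + 1 < glob.length() && glob.charAt(i + 1) == '*') {
        regex.append(".*");      // recursive wildcard: any characters, including "/"
        i += 2;
      } else if (c == '*') {
        regex.append("[^/]*");   // ordinary wildcard: stays inside one path segment
        i++;
      } else if (c == '?') {
        regex.append("[^/]");    // exactly one character that is not the delimiter
        i++;
      } else {
        regex.append(Pattern.quote(String.valueOf(c))); // literal character
        i++;
      }
    }
    return regex.toString();
  }

  /** Filters an already-fetched list of object names, keeping full matches only. */
  static List<String> filter(List<String> names, String glob) {
    Pattern pattern = Pattern.compile(globToRegex(glob));
    List<String> matched = new ArrayList<>();
    for (String name : names) {
      if (pattern.matcher(name).matches()) {
        matched.add(name);
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList(
        "logs/a.txt", "logs/2017/a.txt", "logs/2017/05/b.txt", "other/a.txt");
    System.out.println(filter(names, "logs/*.txt"));  // [logs/a.txt]
    System.out.println(filter(names, "logs/**.txt")); // all three names under logs/
  }
}
```

Because every object key under the prefix is fetched before filtering, switching {{[^/]*}} to
{{.*}} for {{**}} is the only change the matching step needs in this sketch.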

One approach would be to specify a delimiter [here|https://github.com/apache/beam/blob/f43b61af4d5a3ee77a610d8b11ef80d421c34501/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/util/GcsUtil.java#L371]
so that we actually see some efficiency gains when not using recursive globbing, but I figured
that could happen in a follow-up PR.
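
For reference, the effect of passing a delimiter can be sketched against an in-memory list of
object names. This only simulates the truncation described above and makes no calls to the
real API; the class {{DelimiterListingSketch}}, the {{Listing}} holder and the {{list}} method are
hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation of a delimiter-truncated listing; no real API calls.
final class DelimiterListingSketch {

  /** Result of one listing: direct objects plus the collapsed "directory" prefixes. */
  static final class Listing {
    final List<String> items = new ArrayList<>();
    final TreeSet<String> prefixes = new TreeSet<>();
  }

  /**
   * Names with no delimiter after the prefix are returned as items. Names that do
   * contain one are cut just after the first delimiter and reported once as a prefix.
   */
  static Listing list(List<String> allNames, String prefix, String delimiter) {
    Listing result = new Listing();
    for (String name : allNames) {
      if (!name.startsWith(prefix)) {
        continue; // outside the requested prefix
      }
      int cut = name.indexOf(delimiter, prefix.length());
      if (cut < 0) {
        result.items.add(name);
      } else {
        result.prefixes.add(name.substring(0, cut + delimiter.length()));
      }
    }
    return result;
  }
}
```

With names {{logs/a.txt}}, {{logs/2017/a.txt}} and {{logs/2017/05/b.txt}}, listing prefix {{logs/}}
with delimiter {{/}} yields one item ({{logs/a.txt}}) and one prefix ({{logs/2017/}}), so a
non-recursive glob would not need to page through the deeper keys.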

> Support for recursive wildcards in GcsPath
> ------------------------------------------
>
>                 Key: BEAM-2150
>                 URL: https://issues.apache.org/jira/browse/BEAM-2150
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core, sdk-java-gcp
>            Reporter: Devon Meunier
>            Assignee: Devon Meunier
>            Priority: Minor
>
> When working with heavily nested folder structures in Google Cloud Storage, it's great to make use of recursive wildcards, which the current API explicitly does not support.
> This code hasn't been touched in 2 years so it's likely that simply no one's gotten around to it yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
