beam-commits mailing list archives

From "Devon Meunier (JIRA)" <>
Subject [jira] [Commented] (BEAM-2150) Support for recursive wildcards in GcsPath
Date Thu, 04 May 2017 14:18:04 GMT


Devon Meunier commented on BEAM-2150:

This is what I went off:

And then I ran some tests with {{gsutil}} using the {{-D}} flag to see the kinds of requests
that were being made.

When you're doing normal globbing, {{delimiter=%2F}} is passed along, which truncates the results
from the API so you get {{prefix[^delimiter]*[delimiter]}}. However, this isn't actually what
Beam is doing: if you work against a bucket with a lot of files under the prefix, you can
see that it actually pages through every single object key and then filters down. The end behaviour
is the same because of how we filter by regex, and that made it really easy to relax
the regex.
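
To illustrate the "list everything, then filter by regex" behaviour, here is a minimal sketch in Python rather than Beam's Java code. The translation rules ({{*}} stops at {{/}}, {{**}} does not) and the names {{glob_to_regex}} and {{filter_keys}} are assumptions made for this example, not the actual GcsPath implementation.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a glob into a regex string (hypothetical rules for this sketch)."""
    out = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            # Recursive wildcard: may cross '/' boundaries.
            out.append(".*")
            i += 2
        elif glob[i] == "*":
            # Plain wildcard: stays within one path segment.
            out.append("[^/]*")
            i += 1
        elif glob[i] == "?":
            # Single character, not a '/'.
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(glob[i]))
            i += 1
    return "".join(out)


def filter_keys(keys, glob):
    """Filter an already-listed set of object keys against the glob."""
    pattern = re.compile(glob_to_regex(glob))
    return [k for k in keys if pattern.fullmatch(k)]


# Stand-in for "every object key under the bucket" (made-up names).
keys = ["logs/a.txt", "logs/2017/b.txt", "logs/2017/05/c.txt", "data/d.txt"]

single = filter_keys(keys, "logs/*.txt")      # only the top-level match
recursive = filter_keys(keys, "logs/**.txt")  # matches at any depth under logs/
```

Because the filtering happens client-side on the full listing, supporting {{**}} in a sketch like this only requires relaxing {{[^/]*}} to {{.*}}.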

An approach would be to specify a delimiter [here|]
so that we actually see some efficiency gains when not using recursive globbing, but I figured
that could happen in a followup PR.
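
As a rough picture of what a delimiter buys you, the following Python sketch simulates delimiter-style listing over an in-memory set of keys. It does not call the real API; {{list_with_delimiter}} and the sample keys are hypothetical, and the collapsing rule simply mirrors the {{prefix[^delimiter]*[delimiter]}} shape described above.

```python
def list_with_delimiter(keys, prefix, delimiter=None):
    """Simulate a listing that collapses keys at the first delimiter after the prefix.

    Returns (items, prefixes): items are keys with no delimiter after the
    prefix; prefixes are the collapsed "sub-directory" entries.
    """
    items, prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            items.append(key)
        else:
            # Keep everything up to and including the first delimiter.
            collapsed = prefix + rest[:cut + len(delimiter)]
            if collapsed not in prefixes:
                prefixes.append(collapsed)
    return items, prefixes


keys = ["logs/a.txt", "logs/2017/b.txt", "logs/2017/05/c.txt", "data/d.txt"]

# With a delimiter, nested keys collapse into a single prefix entry.
with_delim = list_with_delimiter(keys, "logs/", "/")

# Without one, every key under the prefix comes back and must be filtered.
without_delim = list_with_delimiter(keys, "logs/")
```

In this simulation the delimiter case returns one item and one collapsed prefix, while the no-delimiter case returns all three keys under {{logs/}}, which is where the efficiency gain for non-recursive globs would come from.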

> Support for recursive wildcards in GcsPath
> ------------------------------------------
>                 Key: BEAM-2150
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core, sdk-java-gcp
>            Reporter: Devon Meunier
>            Assignee: Devon Meunier
>            Priority: Minor
> When working with heavily nested folder structures in Google Cloud Storage, it's great
> to make use of recursive wildcards, which the current API explicitly does not support.
> This code hasn't been touched in 2 years so it's likely that simply no one's gotten around
> to it yet.

This message was sent by Atlassian JIRA
