beam-commits mailing list archives

From "Devon Meunier (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2150) Support for recursive wildcards in GcsPath
Date Sun, 07 May 2017 01:24:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999640#comment-15999640
] 

Devon Meunier commented on BEAM-2150:
-------------------------------------

[~dhalperi@google.com] noticed that gsutil's globbing semantics don't quite match my PR.

He noted:

{quote}
[11:12:18 dhalperi@dhalperi:beam a3cbf5905* ] gsutil ls 'gs://clouddfe-dhalperi/gcs-recursive/**/*.txt'
gs://clouddfe-dhalperi/gcs-recursive/file1.txt
gs://clouddfe-dhalperi/gcs-recursive/somedir/file2.txt

However that same glob passed to TextIO only gets the second file.
{quote}

However, running the same globs in a shell gives results that differ from gsutil's:

{code}
[I] » tree glob/
glob/
├── dir
│   └── file2.txt
└── file1.txt

1 directory, 2 files
[I] » ls glob/**/*.txt
glob/dir/file2.txt
[I] » ls glob/**.txt
glob/dir/file2.txt glob/file1.txt
[I] »
{code}

My PR matches the behaviour of a shell, so gsutil seems like the odd one out. I think we can
commit to it with more tests to make this behaviour explicit. What do you think?
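
As a rough sketch of what such tests could pin down, the snippet below checks the two globs from the shell session against the two file paths. It uses the JDK's own java.nio glob matcher, not the code in my PR or anything in GcsPath, so it only illustrates the semantics; the class name GlobSemanticsSketch and the helper matches() are made up for this example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobSemanticsSketch {

    // Hypothetical helper (not part of Beam): returns true if the path
    // matches the glob under the JDK's default-file-system glob rules,
    // where "*" stays within one path segment and "**" may cross
    // directory boundaries.
    static boolean matches(String glob, String path) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Same layout as the shell session: one file at the top level,
        // one file in a subdirectory.
        String[] paths = {"glob/file1.txt", "glob/dir/file2.txt"};
        String[] globs = {"glob/**/*.txt", "glob/**.txt"};

        for (String glob : globs) {
            for (String path : paths) {
                System.out.println(
                    glob + " vs " + path + " -> " + matches(glob, path));
            }
        }
    }
}
```

On a Unix-like file system, the JDK matcher agrees with the shell output above: glob/**/*.txt matches only glob/dir/file2.txt, because the literal "/" after "**" still has to match a separator, while glob/**.txt matches both files. gsutil's result for the first glob is the one case that differs.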

> Support for recursive wildcards in GcsPath
> ------------------------------------------
>
>                 Key: BEAM-2150
>                 URL: https://issues.apache.org/jira/browse/BEAM-2150
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core, sdk-java-gcp
>            Reporter: Devon Meunier
>            Assignee: Devon Meunier
>            Priority: Minor
>
> When working with heavily nested folder structures in Google Cloud Storage, it's great to make use of recursive wildcards, which the current API explicitly does not support.
> This code hasn't been touched in 2 years so it's likely that simply no one's gotten around to it yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
