beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (BEAM-2750) Read whole files as one PCollection element each
Date Sun, 03 Sep 2017 23:52:00 GMT

     [ https://issues.apache.org/jira/browse/BEAM-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Kirpichov closed BEAM-2750.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

This has been fixed to a sufficient extent by FileIO.read() in https://github.com/apache/beam/pull/3799

> Read whole files as one PCollection element each
> ------------------------------------------------
>
>                 Key: BEAM-2750
>                 URL: https://issues.apache.org/jira/browse/BEAM-2750
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Christopher Hebert
>            Assignee: Eugene Kirpichov
>             Fix For: 2.2.0
>
>
> I'd like to read whole files as one element each.
> If my input files are hi.txt, what.txt, and yes.txt, then the whole contents of hi.txt are an element of the returned PCollection, the whole contents of what.txt are the next element, etc., giving me a PCollection with three elements.
> This contrasts with TextIO which reads a new element for every line of text in the input files.
> This read (I'll call it WholeFileIO for now) would work like so:
> {code:java}
> PCollection<KV<String, byte[]>> fileNamesAndBytes = p.apply("Read", WholeFileIO.read().from("/path/to/input/dir/*"));
> {code}
> The above example passes the raw file contents and the filename.
> Alternatively, we could pass a PCollection of some sort of FileWrapper around an InputStream to support lazy loading.
> This ticket complements [BEAM-2751].
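
The difference between one element per file and one element per line, as described in the issue above, can be illustrated with a minimal plain-JDK sketch. This is not Beam code and not the API from the linked pull request; the class `WholeFileSketch` and its methods `readWholeFiles` and `readLines` are hypothetical names introduced only for illustration, and the sketch reads a local directory eagerly rather than building a PCollection.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: plain JDK, not the Beam API.
public class WholeFileSketch {

    // Whole-file semantics: one entry per file, keyed by file name,
    // with the entire file contents as raw bytes.
    public static Map<String, byte[]> readWholeFiles(Path dir) throws IOException {
        Map<String, byte[]> result = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    result.put(file.getFileName().toString(), Files.readAllBytes(file));
                }
            }
        }
        return result;
    }

    // Line-oriented semantics, for contrast: one entry per line of text
    // across all files in the directory (file boundaries are lost).
    public static List<String> readLines(Path dir) throws IOException {
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the three example files from the issue in a temp directory.
        Path dir = Files.createTempDirectory("whole-file-sketch");
        Files.write(dir.resolve("hi.txt"), "hello\nworld".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("what.txt"), "what".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("yes.txt"), "a\nb\nc".getBytes(StandardCharsets.UTF_8));

        System.out.println("whole-file elements: " + readWholeFiles(dir).size());
        System.out.println("per-line elements:   " + readLines(dir).size());
    }
}
```

With the three example files, the whole-file read yields three entries regardless of how many lines each file contains, while the line-oriented read yields one entry per line. Loading every file fully into memory is the simplest form of this; the lazy `InputStream` wrapper mentioned in the issue would avoid that cost for large files.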



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
