beam-commits mailing list archives

From "Daniel Halperin (JIRA)" <>
Subject [jira] [Updated] (BEAM-48) BigQueryIO.Read reimplemented as BoundedSource
Date Sat, 25 Jun 2016 17:10:37 GMT


Daniel Halperin updated BEAM-48:
    Fix Version/s: 0.1.0-incubating

> BigQueryIO.Read reimplemented as BoundedSource
> ----------------------------------------------
>                 Key: BEAM-48
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Assignee: Pei He
>             Fix For: 0.1.0-incubating
> BigQueryIO.Read is currently implemented in a hacky way: the DirectPipelineRunner streams
all rows in the table or query result directly using the JSON API, in a single-threaded manner.
> In contrast, the DataflowPipelineRunner uses an entirely different code path implemented
in the Google Cloud Dataflow service (a BigQuery export job to GCS, followed by a parallel
read from GCS).
> We need to reimplement BigQueryIO as a BoundedSource in order to support other runners
in a scalable way.
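For context, the BoundedSource contract the ticket refers to boils down to a source that can estimate its size, split itself into sub-sources, and let each be read independently in parallel. A dependency-free sketch of that shape (the interface and class names here are illustrative stand-ins, not the real org.apache.beam API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for Beam's BoundedSource contract: a source that
// knows its approximate size and can split itself into shards that are
// readable independently (and hence in parallel by a runner).
interface SimpleBoundedSource<T> {
    long estimatedSizeBytes();
    List<? extends SimpleBoundedSource<T>> split(long desiredShardBytes);
    List<T> readAll(); // stands in for createReader()/BoundedReader
}

// A toy source over an in-memory row range, mimicking how a BigQuery
// export to GCS could yield files that are then read in parallel.
class RangeRowSource implements SimpleBoundedSource<Long> {
    private final long start, end; // rows in [start, end)

    RangeRowSource(long start, long end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public long estimatedSizeBytes() {
        return (end - start) * 8; // assume 8 bytes per row
    }

    @Override
    public List<RangeRowSource> split(long desiredShardBytes) {
        long rowsPerShard = Math.max(1, desiredShardBytes / 8);
        List<RangeRowSource> shards = new ArrayList<>();
        for (long s = start; s < end; s += rowsPerShard) {
            shards.add(new RangeRowSource(s, Math.min(s + rowsPerShard, end)));
        }
        return shards;
    }

    @Override
    public List<Long> readAll() {
        List<Long> rows = new ArrayList<>();
        for (long i = start; i < end; i++) rows.add(i);
        return rows;
    }
}
```

Any runner can drive such a source the same way, which is exactly the portability argument the ticket makes against the service-side code path.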
> I additionally suggest that we revisit the design of the BigQueryIO source in the process.
A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String, Object>
with well-defined types, for example, or an Avro GenericRecord. Dropping TableRow will get
around a variety of issues with types, fields named 'f', etc., and it will also reduce confusion
as we use TableRow objects differently than usual (for good reason).
> * We could also directly add support for a RowParser that converts rows to a user's POJO.
> * We should expose TableSchema as a side output from the BigQueryIO.Read.
> * Our builders for BigQueryIO.Read are useful and we should keep them. Where possible
we should also allow users to provide the JSON objects that configure the underlying intermediate
tables, query export, etc. This would let users directly control result flattening, location
of intermediate tables, table decorators, etc., and also optimistically let users take advantage
of some new BigQuery features without code changes.
> * We could switch between a BigQuery export + parallel scan and a direct API read
based on factors such as the size of the table at pipeline construction time.
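The RowParser idea in the list above could look roughly like the following; the interface name and the Map-based row representation are assumptions for illustration, not the eventual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser hook: instead of handing users raw TableRow objects,
// BigQueryIO.Read could apply a user-supplied function that maps a row with
// well-defined types (here Map<String, Object>) to the user's own POJO.
interface RowParser<T> {
    T parse(Map<String, Object> row);
}

class User {
    final String name;
    final long id;
    User(String name, long id) { this.name = name; this.id = id; }
}

class RowParserDemo {
    // A user-supplied parser for a hypothetical table with 'name' and 'id' columns.
    static final RowParser<User> USER_PARSER =
        row -> new User((String) row.get("name"), (Long) row.get("id"));

    static User parseExample() {
        Map<String, Object> row = new HashMap<>();
        row.put("name", "ada");
        row.put("id", 7L);
        return USER_PARSER.parse(row);
    }
}
```

This sidesteps the TableRow quirks mentioned above (fields named 'f', loose typing) by never exposing TableRow to user code at all.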

This message was sent by Atlassian JIRA
