beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Halperin (JIRA)" <>
Subject [jira] [Commented] (BEAM-50) BigQueryIO.Write: reimplement in Java
Date Mon, 28 Mar 2016 17:05:25 GMT


Daniel Halperin commented on BEAM-50:

Other things to include:

When users exceed BigQuery's #files or #bytes limits, chunk into multiple import jobs.

> BigQueryIO.Write: reimplement in Java
> -------------------------------------
>                 Key: BEAM-50
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Priority: Minor
> BigQueryIO.Write is currently implemented in a somewhat hacky way.
> Unbounded sink:
> * The DirectPipelineRunner and the DataflowPipelineRunner use StreamingWriteFn and BigQueryTableInserter
to insert rows using BigQuery's streaming writes API.
> Bounded sink:
> * The DirectPipelineRunner still uses streaming writes.
> * The DataflowPipelineRunner uses a different code path in the Google Cloud Dataflow
service that writes to GCS and the initiates a BigQuery load job.
> * Per-window table destinations do not work scalably. (See Beam-XXX).
> We need to reimplement BigQueryIO.Write fully in Java code in order to support other
runners in a scalable way.
> I additionally suggest that we revisit the design of the BigQueryIO sink in the process.
A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String, Object>
with well-defined types, for example, or an Avro GenericRecord. Dropping TableRow will get
around a variety of issues with types, fields named 'f', etc., and it will also reduce confusion
as we use TableRow objects differently than usual (for good reason).
> * Possibly support not-knowing the schema until pipeline execution time.
> * Our builders for BigQueryIO.Write are useful and we should keep them. Where possible
we should also allow users to provide the JSON objects that configure the underlying table
creation, write disposition, etc. This would let users directly control things like table
expiration time, table location, etc., Would also optimistically let users take advantage
of some new BigQuery features without code changes.
> * We could choose between streaming write API and load jobs based on user preference
or dynamic job properties . We could use streaming write in a batch pipeline if the data is
small. We could use load jobs in streaming pipelines if the windows are large enough to make
this practical.
> * When issuing BigQuery load jobs, we could leave files in GCS if the import fails, so
that data errors can be debugged.
> * We should make per-window table writes scalable in batch.
> Caveat, possibly blocker:
> * (Beam-XXX): cleanup and temp file management. One advantage of the Google Cloud Dataflow
implementation of BigQueryIO.Write is cleanup: we ensure that intermediate files are deleted
when bundles or jobs fail, etc. Beam does not currently support this.

This message was sent by Atlassian JIRA

View raw message