beam-commits mailing list archives

From Jérémie Vexiau (JIRA) <>
Subject [jira] [Created] (BEAM-2803) JdbcIO read is very slow when query return a lot of rows
Date Thu, 24 Aug 2017 09:34:00 GMT
Jérémie Vexiau created BEAM-2803:

             Summary: JdbcIO read is very slow when query return a lot of rows
                 Key: BEAM-2803
             Project: Beam
          Issue Type: Improvement
          Components: sdk-java-extensions
    Affects Versions: Not applicable
            Reporter: Jérémie Vexiau
            Assignee: Reuven Lax
             Fix For: Not applicable


I'm using the JdbcIO reader in batch mode with the PostgreSQL driver.
My SELECT query returns more than 5 million rows,
which are read through a cursor by setting Statement.setFetchSize().

These ParDo steps run fine:
          .apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
          .apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
            private Random random;
            @Setup
            public void setup() {
              random = new Random();
            }
            @ProcessElement
            public void processElement(ProcessContext context) {
              // Pair each row with a random key so the GroupByKey that follows breaks fusion.
              context.output(KV.of(random.nextInt(), context.element()));
            }
          }))

But the reshuffle that follows is very slow.
The cause is most likely the GroupByKey, which has to handle more than 5 million distinct keys:
.apply(GroupByKey.<Integer, T>create())
Is there a way to optimize the reshuffle, or another method to prevent fusion?

Thanks in advance,
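
To illustrate the key-count concern above, here is a minimal plain-Java sketch (no Beam dependency) comparing how many distinct keys result from random.nextInt() versus keys drawn from a bounded range. The class name ShardKeySketch, the helper methods, and the numShards value are hypothetical and only for illustration. Whether a bounded key range actually speeds up the reshuffle depends on the runner, so treat this as an assumption to test rather than a confirmed fix.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class ShardKeySketch {

  /** Counts distinct keys when every element gets an unbounded random int key. */
  public static int distinctUnboundedKeys(int elements, long seed) {
    Random random = new Random(seed);
    Set<Integer> keys = new HashSet<>();
    for (int i = 0; i < elements; i++) {
      keys.add(random.nextInt());
    }
    return keys.size();
  }

  /** Counts distinct keys when every key is drawn from [0, numShards). */
  public static int distinctBoundedKeys(int elements, int numShards, long seed) {
    Random random = new Random(seed);
    Set<Integer> keys = new HashSet<>();
    for (int i = 0; i < elements; i++) {
      keys.add(random.nextInt(numShards));
    }
    return keys.size();
  }

  public static void main(String[] args) {
    int elements = 1_000_000; // smaller than the 5 million rows in the report, to keep the demo quick
    int numShards = 512;      // hypothetical shard count, not a recommended value
    System.out.println("unbounded keys: " + distinctUnboundedKeys(elements, 42L));
    System.out.println("bounded keys:   " + distinctBoundedKeys(elements, numShards, 42L));
  }
}
```

In the DoFn above, the equivalent change would be random.nextInt(numShards) in place of random.nextInt(). The trade-off is that the work after the GroupByKey can be spread over at most numShards groups, so a value that is too small limits parallelism.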

This message was sent by Atlassian JIRA
