sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Herman Schistad <herman.schis...@gmail.com>
Subject Fwd: Split query on timestamp or using LIMIT and OFFSET?
Date Wed, 17 Dec 2014 10:57:46 GMT
Hey guys and gals.

Imagine you have a SQL table with three columns:

    username varchar(80),
    registered timestamp,
    description text

Now, you would like to select all users that registered in the
previous month and import them to HDFS. How would you do it? Note that
in this example I do not have any primary keys.

My closest attempt is something along the lines of:

    --table
    tablename

    --split-by
    registered

    --where
    "registered >= TIMESTAMP '2014-11-01 00:00:00' AND registered <
TIMESTAMP '2014-12-01 00:00:00'"

    --boundary-query
    "SELECT TIMESTAMP '2014-11-01 00:00:00', TIMESTAMP '2014-12-01 00:00:00'"

But this does not work, as I get an SQL exception in the
mapreduce.DBRecordReader.nextKeyValue (similar to this one:
http://mail-archives.apache.org/mod_mbox/sqoop-user/201209.mbox/%3CCAE0_eQEBMdP=6uACbSes-dfJ6jVTPnPWUCTX8aDxyexUQScrgg@mail.gmail.com%3E)

I could have skipped the --boundary-query there as well, as it
successfully figures out the min() and max() of the --split-by key,
but since I anyways limit it in the --where clause I already know what
the boundaries are.

Now what I figured would be a better approach (since I don't have
primary keys to --split-by) in order to create input splits for Hadoop
was to first do a count(*) as a boundary query, then create splits
based on LIMIT and OFFSET. Is this possible? The procedure then
becomes:

    1. Get number of rows matching the query
       SELECT count(*) FROM tablename WHERE ${WHERECLAUSE}
     Here the $WHERECLAUSE is "registered >= TIMESTAMP '...' AND
registered < TIMESTAMP '...'"

    2. Using the number of resulting rows, calculate splits and
offsets based on the number of mappers. I.e. if the number is 4 and
the number of rows are 20000:
     (0, 5000)
     (5000, 10000)
     (10000, 15000)
     (15000, 20000)

    3. Using these offsets, generate 4 sql queries and map to workers:
      SELECT username,registered,description FROM tablename WHERE
${CONDITIONS} LIMIT 5000 OFFSET 0
      SELECT username,registered,description FROM tablename WHERE
${CONDITIONS} LIMIT 10000 OFFSET 5000
      SELECT username,registered,description FROM tablename WHERE
${CONDITIONS} LIMIT 15000 OFFSET 10000
      SELECT username,registered,description FROM tablename WHERE
${CONDITIONS} LIMIT 20000 OFFSET 15000

For all I know all or some of this is possible already, and I just
suck at reading documentation. Hopefully some of you can advise me and
put me on the right track. Any help is appreciated,

Herman

Mime
View raw message