sqoop-user mailing list archives

From Arvind Prabhakar <arv...@apache.org>
Subject Re: denormalize data during import
Date Tue, 23 Aug 2011 16:39:25 GMT
Hi Bijoy,

Your suggested solution sounds good to me, as it gives the user the flexibility to
decide how best to modify the data.

One alternative that works in a limited way is to import from database views that
already contain the denormalized relations, instead of from the base tables. That
would allow the user to run a single Sqoop import and get the necessary data set
out in one shot.
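As a concrete sketch of that alternative (hypothetical table and column names; this
assumes each address row carries a sequence number `addr_seq`, which your schema may
not have):

```sql
-- Hypothetical schema: students(id, name), addresses(student_id, addr_seq, addr).
-- The view pivots up to two addresses per student into fixed columns.
CREATE VIEW student_denorm AS
SELECT s.id,
       s.name,
       MAX(CASE WHEN a.addr_seq = 1 THEN a.addr END) AS addr_1,
       MAX(CASE WHEN a.addr_seq = 2 THEN a.addr END) AS addr_2
FROM students s
LEFT JOIN addresses a ON a.student_id = s.id
GROUP BY s.id, s.name;

-- MySQL exposes the view like a table, so Sqoop can import it directly, e.g.:
-- sqoop import --connect jdbc:mysql://host/db --table student_denorm
```

The limitation Arvind mentions shows up here: the number of addr_N columns must be
fixed in the view definition up front.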


On Sun, Aug 21, 2011 at 12:33 PM, <bejoyks@gmail.com> wrote:

> Hi Srivathsan,
> I did get your requirement, and if you are open to writing a MapReduce
> program, you can achieve this in two steps:
> 1. Sqoop-import the data into HDFS.
> 2. Use a custom MapReduce job to denormalize it.
> In your MapReduce job, join the two data sets on the foreign key. For this:
> In the Mapper
> - use MultipleInputs to parse the two different inputs
> - emit records with the foreign key as the key, and the combination of
> required column values as the value in each case
> - prefix each value with an identifier so the reducer can tell which source
> it came from
> In the Reducer
> - the identifier tells you the source of the data (which table it is from)
> - append the data from the different sources as required (choosing the
> output key and value as needed)
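The tag-and-join steps above can be sketched as follows. This is not Hadoop code;
it is plain Python standing in for the mapper, shuffle, and reducer, with made-up
student/address records, just to illustrate the logic:

```python
from collections import defaultdict

# Hypothetical inputs: (student_id, name) and (student_id, addr) records,
# as Sqoop might land them in two HDFS directories.
students = [("1", "Alice"), ("2", "Bob")]
addresses = [("1", "12 Oak St"), ("1", "9 Elm Ave"), ("2", "4 Pine Rd")]

def map_phase():
    # Emit (foreign key, tagged value); the tag marks which source
    # each value came from, since the reducer sees them mixed together.
    for sid, name in students:
        yield sid, ("S", name)
    for sid, addr in addresses:
        yield sid, ("A", addr)

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:        # stand-in for the shuffle: group by key
        groups[key].append(value)
    for sid, values in sorted(groups.items()):
        name = next(v for tag, v in values if tag == "S")
        addrs = [v for tag, v in values if tag == "A"]
        # One denormalized row per student: name, then addr_1, addr_2, ...
        yield (sid, name, *addrs)

rows = list(reduce_phase(map_phase()))
```

In real Hadoop the grouping is done by the shuffle, and the two inputs would be
wired up with MultipleInputs rather than read from in-memory lists.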
> This is a crude solution and has its flaws; I just scribbled down a possible
> approach from my limited experience.
> Experts, please share whatever more sophisticated approach you feel is
> optimal for handling such requirements.
> Regards
> Bejoy K S
> ------------------------------
> From: Srivathsan Srinivas <periya.data@gmail.com>
> Date: Sat, 20 Aug 2011 23:06:06 -0700
> To: <sqoop-user@incubator.apache.org>
> Reply-To: sqoop-user@incubator.apache.org
> Subject: denormalize data during import
> Hi,
>    I am looking for a way to denormalize my tables while importing from
> MySQL into HDFS/Hive. What is the best approach? Is there a way to add
> columns dynamically when using Sqoop, or by any other means?
> For example, an RDBMS might hold student names and address(es) in normalized
> form across different tables. What I want from my import process is a single
> row per student, with multiple columns holding all the addresses for that
> student (addr_1, addr_2, etc.). I could write UDFs for this, but I do not
> know how to attach them to Sqoop. I understand that Sqoop can run SQL
> queries (such as joins) before importing, but a join will add more rows,
> whereas I want a single row with multiple columns.
> Should I use UDFs in Sqoop, or put Oozie in my workflow and denormalize on
> the fly? I have not tried either. Is there a suggested approach? I am also
> beginning to look at Solr's DataImportHandler; perhaps that could solve
> this.
> Suggestions are appreciated.
> Thanks,
> PD.
