spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ayan guha <guha.a...@gmail.com>
Subject Re: importing data into hdfs/spark using Informatica ETL tool
Date Wed, 09 Nov 2016 19:59:07 GMT
Yes, it can be done and a standard practice. I would suggest a mixed
approach: use Informatica to create files in hdfs and have hive staging
tables as external tables on those directories. Then that point onwards use
spark.

Hth
Ayan
On 10 Nov 2016 04:00, "Mich Talebzadeh" <mich.talebzadeh@gmail.com> wrote:

> Thanks Mike for insight.
>
> This is a request landed on us which is rather unusual.
>
> As I understand Informatica is an ETL tool. Most of these are glorified
> Sqoop with GUI where you define your source and target.
>
> In a normal day Informatica takes data out of an RDBMS like Oracle table
> and lands it on Teradata or Sybase IQ (DW).
>
> So in our case we really need to redefine the map. Customer does not want
> the plug in from the Informatica for Hive etc which admittedly will make
> life far easier. They want us to come up with a solution.
>
> In the absence of the fact that we cannot use JDBC for Hive etc as target
> (?), the easiest option is to dump it into landing zone and then do
> whatever we want with it.
>
> Also I am not sure we can use Flume for it? That was a thought in my mind.
>
> So sort of stuck between Hard and Rock here. So in short we want a plug in
> to be consumer of Informatica.
>
> cheers
>
> Mich
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 9 November 2016 at 16:14, Michael Segel <michael_segel@hotmail.com>
> wrote:
>
>> Mich,
>>
>> You could do that. But really?
>>
>> Putting on my solutions architect hat…
>>
>> You or your client is spending $$$ for product licensing and you’re not
>> really using the product to its fullest.
>>
>> Yes, you can use Informatica to pull data from the source systems and
>> provide some data cleansing and transformations before you drop it on your
>> landing zone.
>>
>> If you’re going to bypass Hive, then you have to capture the schema,
>> including data types.  You’re also going to have to manage schema evolution
>> as they change over time. (I believe the ETL tools will do this for you or
>> help in the process.)
>>
>> But if you’re already working on the consumption process for ingestion on
>> your own… what is the value that you derive from using Informatica?  Is the
>> unloading and ingestion process that difficult that you can’t write that as
>> well?
>>
>> My point is that if you’re going to use the tool, use it as the vendor
>> recommends (and they may offer options…) or skip it.
>>
>> I mean heck… you may want to take the flat files (CSV, etc) that are
>> dropped in the landing zone, and then ingest and spit out parquet files via
>> spark. You just need to know the Schema(s) of ingestion and output if they
>> are not the same. ;-)
>>
>> Of course you may decide that using Informatica to pull and transform the
>> data and drop it on to the landing zone provides enough value to justify
>> its expense.  ;-) YMMV
>>
>> Just my $0.02 worth.
>>
>> Take it with a grain of Kosher Sea Salt.  (The grains are larger and the
>> salt taste’s better) ;-)
>>
>> -Mike
>>
>> On Nov 9, 2016, at 7:56 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I am exploring the idea of flexibility with importing multiple RDBMS
>> tables using Informatica that customer has into HDFS.
>>
>> I don't want to use connectivity tools from Informatica to Hive etc.
>>
>> So this is what I have in mind
>>
>>
>>    1. If possible get the tables data out using Informatica and use
>>    Informatica ui  to convert RDBMS data into some form of CSV, TSV file (Can
>>    Informatica do it?) I guess yes
>>    2. Put the flat files on an edge where HDFS node can see them.
>>    3. Assuming that a directory can be created by Informatica daily,
>>    periodically run a cron that ingest that data from directories into HDFS
>>    equivalent daily directories
>>    4. Once the data is in HDFS one can use, Spark csv, Hive etc to query
>>    data
>>
>> The problem I have is to see if someone has done such thing before.
>> Specifically can Informatica create target flat files on normal directories.
>>
>> Any other generic alternative?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>

Mime
View raw message