spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: importing data into hdfs/spark using Informatica ETL tool
Date Wed, 09 Nov 2016 17:00:24 GMT
Thanks, Mike, for the insight.

This is a rather unusual request that has landed on us.

As I understand it, Informatica is an ETL tool. Most of these are a
glorified Sqoop with a GUI, where you define your source and target.

On a normal day, Informatica takes data out of an RDBMS table, say Oracle,
and lands it on a data warehouse such as Teradata or Sybase IQ.

So in our case we really need to redefine the mapping. The customer does not
want Informatica's plug-in for Hive etc., which admittedly would make life
far easier. They want us to come up with a solution ourselves.

Given that (as far as I know?) we cannot use JDBC for Hive etc. as the
target, the easiest option is to dump the data into a landing zone and then
do whatever we want with it.
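If we go the landing-zone route, the ingest side could be a simple daily
cron job. A minimal sketch of that step, where all paths, file patterns and
the directory layout are my own assumptions, not anything agreed with the
customer:

```python
# Sketch only: stage Informatica flat files from a local landing zone into
# daily HDFS directories. All paths and naming conventions are assumptions.
import datetime
import glob


def daily_hdfs_dir(base: str, day: datetime.date) -> str:
    """HDFS target directory for a given day, e.g. /data/staging/2016-11-09."""
    return f"{base}/{day:%Y-%m-%d}"


def ingest_commands(landing_glob: str, hdfs_base: str, day: datetime.date) -> list:
    """Build the hdfs CLI commands a daily cron job would run."""
    target = daily_hdfs_dir(hdfs_base, day)
    cmds = [["hdfs", "dfs", "-mkdir", "-p", target]]
    for path in sorted(glob.glob(landing_glob)):
        cmds.append(["hdfs", "dfs", "-put", "-f", path, target])
    return cmds


# The cron job (e.g. "0 1 * * *") would then execute each command, roughly:
#   import subprocess
#   for cmd in ingest_commands("/data/landing/*.csv", "/data/staging",
#                              datetime.date.today()):
#       subprocess.run(cmd, check=True)
```

Once the files are in the daily directories, Spark or Hive can read them
back with an external table or a CSV reader pointed at the day's path.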

I am also not sure whether we can use Flume for this? That was a thought I
had in mind.
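On the Flume thought: Flume's spooling-directory source is built for
roughly this pattern (files dropped into a local directory, shipped on to
an HDFS sink). A rough agent configuration, with every name and path below
a placeholder assumption:

```properties
# Sketch of a Flume agent: watch the landing zone, write into daily HDFS dirs.
# All names and paths are placeholders.
agent.sources = landing
agent.channels = ch1
agent.sinks = hdfsout

agent.sources.landing.type = spooldir
agent.sources.landing.spoolDir = /data/landing
agent.sources.landing.channels = ch1

agent.channels.ch1.type = file

agent.sinks.hdfsout.type = hdfs
agent.sinks.hdfsout.channel = ch1
agent.sinks.hdfsout.hdfs.path = hdfs://namenode:8020/data/staging/%Y-%m-%d
agent.sinks.hdfsout.hdfs.fileType = DataStream
# spooldir events carry no timestamp header, so use the local clock for %Y-%m-%d
agent.sinks.hdfsout.hdfs.useLocalTimeStamp = true
```

One caveat: the spooling-directory source expects files to be immutable
once they appear, so Informatica would have to write to a temporary name
and rename on completion.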

So we are somewhat stuck between a rock and a hard place here. In short, we
want a plug-in that acts as a consumer of Informatica's output.

cheers

Mich

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 November 2016 at 16:14, Michael Segel <michael_segel@hotmail.com>
wrote:

> Mich,
>
> You could do that. But really?
>
> Putting on my solutions architect hat…
>
> You or your client is spending $$$ for product licensing and you’re not
> really using the product to its fullest.
>
> Yes, you can use Informatica to pull data from the source systems and
> provide some data cleansing and transformations before you drop it on your
> landing zone.
>
> If you’re going to bypass Hive, then you have to capture the schema,
> including data types.  You’re also going to have to manage schema evolution
> as they change over time. (I believe the ETL tools will do this for you or
> help in the process.)
>
> But if you’re already working on the consumption process for ingestion on
> your own… what is the value that you derive from using Informatica?  Is the
> unloading and ingestion process that difficult that you can’t write that as
> well?
>
> My point is that if you’re going to use the tool, use it as the vendor
> recommends (and they may offer options…) or skip it.
>
> I mean heck… you may want to take the flat files (CSV, etc) that are
> dropped in the landing zone, and then ingest and spit out parquet files via
> spark. You just need to know the Schema(s) of ingestion and output if they
> are not the same. ;-)
>
> Of course you may decide that using Informatica to pull and transform the
> data and drop it on to the landing zone provides enough value to justify
> its expense.  ;-) YMMV
>
> Just my $0.02 worth.
>
> Take it with a grain of Kosher Sea Salt.  (The grains are larger and the
> salt tastes better) ;-)
>
> -Mike
>
> On Nov 9, 2016, at 7:56 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> Hi,
>
> I am exploring the idea of using Informatica to import into HDFS multiple
> RDBMS tables that the customer has.
>
> I don't want to use connectivity tools from Informatica to Hive etc.
>
> So this is what I have in mind
>
>
>    1. If possible, get the table data out using Informatica and use the
>    Informatica UI to convert the RDBMS data into some form of CSV or TSV
>    file (can Informatica do this? I guess yes).
>    2. Put the flat files on an edge node where an HDFS client can see them.
>    3. Assuming that a directory can be created by Informatica daily,
>    periodically run a cron job that ingests the data from those directories
>    into equivalent daily directories in HDFS.
>    4. Once the data is in HDFS, one can use Spark CSV, Hive, etc. to query
>    the data.
>
> The problem I have is finding out whether someone has done such a thing
> before. Specifically, can Informatica create target flat files in ordinary
> directories?
>
> Any other generic alternative?
>
> Thanks
>
> Dr Mich Talebzadeh
>
