spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: importing data into hdfs/spark using Informatica ETL tool
Date Thu, 10 Nov 2016 09:01:43 GMT
Sounds like the only option Informatica has for Hadoop is a connector to Hive,
and as I read it, it connects to the Hive Thrift server.

The tool is called the Informatica Cloud connector and it is an add-on, which
means that it is not part of the standard Informatica offering.

Anyway, if we can use Informatica to get the data out of the source as is and
create a file system format (CSV, etc.) as the target, then those files can
land in a directory. A cron job can then put those files into HDFS directories,
and the rest is our choice as to how to treat those files. Hive external tables
can be defined over them, and Hive or Spark can be used to ingest the data into
target tables in Hive periodically. I would still go for ORC tables. The data
will be append only.

That is my conclusion, but I am still open to suggestions.
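
For illustration, a minimal sketch of the Spark side of that flow (Spark 2.x
with Hive support). The HDFS path and table name below are made up for the
example, and the ORC target table is assumed to already exist:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LandingZoneIngest")
  .enableHiveSupport()
  .getOrCreate()

// Read the CSV files that the cron job has already copied into HDFS
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///landing/trades/2016-11-10/")

// Append-only load into the existing Hive table (created as STORED AS ORC)
df.write.mode("append").insertInto("accounts.trades_orc")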

Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 10 November 2016 at 08:11, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> Forwarded conversation
> Subject: importing data into hdfs/spark using Informatica ETL tool
> ------------------------
>
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 13:56
> To: "user @spark" <user@spark.apache.org>
>
>
> Hi,
>
> I am exploring the flexibility of importing the multiple RDBMS tables that
> the customer has into HDFS using Informatica.
>
> I don't want to use the connectivity tools from Informatica to Hive etc.
>
> So this is what I have in mind:
>
>
>    1. If possible, get the table data out using Informatica and use the
>    Informatica UI to convert the RDBMS data into some form of CSV or TSV file
>    (can Informatica do it? I guess yes).
>    2. Put the flat files on an edge node where HDFS can see them.
>    3. Assuming that a directory can be created by Informatica daily,
>    periodically run a cron that ingests that data from those directories into
>    equivalent daily HDFS directories.
>    4. Once the data is in HDFS, one can use Spark CSV, Hive etc. to query
>    the data.
>
> The problem I have is to see whether someone has done such a thing before.
> Specifically, can Informatica create target flat files in normal directories?
>
> Any other generic alternative?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> ----------
> From: Michael Segel <michael_segel@hotmail.com>
> Date: 9 November 2016 at 16:14
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Cc: "user @spark" <user@spark.apache.org>
>
>
> Mich,
>
> You could do that. But really?
>
> Putting on my solutions architect hat…
>
> You or your client is spending $$$ for product licensing and you’re not
> really using the product to its fullest.
>
> Yes, you can use Informatica to pull data from the source systems and
> provide some data cleansing and transformations before you drop it on your
> landing zone.
>
> If you’re going to bypass Hive, then you have to capture the schema,
> including data types.  You’re also going to have to manage schema evolution
> as they change over time. (I believe the ETL tools will do this for you or
> help in the process.)
>
> But if you’re already working on the consumption process for ingestion on
> your own… what is the value that you derive from using Informatica?  Is the
> unloading and ingestion process that difficult that you can’t write that as
> well?
>
> My point is that if you’re going to use the tool, use it as the vendor
> recommends (and they may offer options…) or skip it.
>
> I mean heck… you may want to take the flat files (CSV, etc) that are
> dropped in the landing zone, and then ingest and spit out parquet files via
> spark. You just need to know the Schema(s) of ingestion and output if they
> are not the same. ;-)
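>
> A minimal sketch of that CSV-to-Parquet step in Spark (Scala), with an
> explicitly declared schema; the paths and field names here are only
> examples:
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types._
>
> val spark = SparkSession.builder().appName("CsvToParquet").getOrCreate()
>
> // Declare the ingestion schema up front rather than inferring it
> val schema = StructType(Seq(
>   StructField("trade_id", LongType, nullable = false),
>   StructField("amount", DoubleType, nullable = true),
>   StructField("trade_ts", TimestampType, nullable = true)
> ))
>
> spark.read
>   .schema(schema)
>   .option("header", "true")
>   .csv("hdfs:///landing/trades/")           // flat files dropped in the landing zone
>   .write
>   .mode("append")
>   .parquet("hdfs:///data/trades_parquet/")  // output layout is an assumption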
>
> Of course you may decide that using Informatica to pull and transform the
> data and drop it on to the landing zone provides enough value to justify
> its expense.  ;-) YMMV
>
> Just my $0.02 worth.
>
> Take it with a grain of Kosher sea salt.  (The grains are larger and the
> salt tastes better.) ;-)
>
> -Mike
>
> ----------
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 17:00
> To: Michael Segel <michael_segel@hotmail.com>
> Cc: "user @spark" <user@spark.apache.org>
>
>
> Thanks Mike for insight.
>
> This is a request that has landed on us, which is rather unusual.
>
> As I understand it, Informatica is an ETL tool. Most of these are glorified
> Sqoop with a GUI, where you define your source and target.
>
> On a normal day Informatica takes data out of an RDBMS table, say Oracle,
> and lands it on Teradata or Sybase IQ (the data warehouse).
>
> So in our case we really need to redefine the mapping. The customer does not
> want the Informatica plug-in for Hive etc., which admittedly would make
> life far easier. They want us to come up with a solution.
>
> Given that we cannot (?) use JDBC to Hive etc. as the target, the easiest
> option is to dump the data into a landing zone and then do whatever we want
> with it.
>
> Also, I am not sure whether we can use Flume for this? That was a thought in
> my mind.
>
> So we are sort of stuck between a rock and a hard place here. In short, we
> want a plug-in to be the consumer of Informatica.
>
> cheers
>
> Mich
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
> ----------
> From: Michael Segel <michael_segel@hotmail.com>
> Date: 9 November 2016 at 17:16
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
>
>
> Mich,
>
> Taking this offline… for obvious reasons…
>
>
> Gotcha. Customer probably already has Informatica and they don’t want to
> pay for the adaptor to go directly into Hive.
> ( My guess is that they created an ‘optimized’ adaptor to connect to Hive
> … probably a customized thrift server…)
>
>
> So…
>
> To do the ingestion… let Informatica query the source systems and do
> whatever transformations they need. You would need to work with their
> Informatica guy to tell them what you need or want.
>
>
> Then take the flat files in the landing zone and process them. Since this
> is batch… you could use a Spark Streaming job to process each file for
> ingestion.
> You will need to know for each file, which schema matches the input and
> the target system.  You can then write a simple spark file that takes the
> URL of the flat file, reads it in to an RDD and then pushes it to the
> store, whether it's an HDFS Parquet file or HBase.  The trick is to make
> this a data driven process.
>
> You wouldn't need Flume, just a daemon that wakes up every period (cron??)
> and sees what's new.  There are a couple of ways to do this, along with
> choices about what sort of metadata you will want to create and track.
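>
> For instance, the "what's new" check could be a small scheduled Scala job
> that uses the Hadoop FileSystem API; the directory name and the watermark
> helper below are assumptions for illustration:
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> val fs = FileSystem.get(new Configuration())
> val landing = new Path("/landing/trades")
>
> // Hypothetical helper that returns the timestamp recorded by the previous run
> val lastRun: Long = loadWatermark()
>
> // Pick up only the files that have arrived since the previous run
> val newFiles = fs.listStatus(landing)
>   .filter(s => s.isFile && s.getModificationTime > lastRun)
>   .map(_.getPath.toString)
>
> newFiles.foreach(f => println(s"to ingest: $f"))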
>
> There's more to this… what's in the file? Is it net new data? Or is it an
> update to an existing row?  If it's an update, then you have to decide whether
> there's a temporal aspect to the data, or whether you just want to update the
> existing row.
>
> This is a basic part of building a clean data lake and avoiding the
> cesspool / sewer.
>
> The interesting thing is that you have a couple of design options… e.g.
> Spark Streaming that takes the list of files to be processed, or just
> launch a spark job each time because you will check on input files at a
> relatively slow rate. (e.g. every 1/2 hour, hour, 4 hours, or even once a
> day… )
>
> Not really a rock or a hard place.
>
> The trick is how / when you migrate from an RDBMS-centric view to a
> hierarchical (record) view, and whether you want to transform this into JSON
> or manage the 1:m subtables differently.
>
> HTH
>
> -Mike
>
>
> The opinions expressed here are mine, while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com
>
>
>
>
>
>
> ----------
> From: Michael Segel <michael_segel@hotmail.com>
> Date: 9 November 2016 at 17:19
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Cc: "user @spark" <user@spark.apache.org>
>
>
> Had to take this offline…
>
> This is really a classic Data Lake problem when dealing with ingestion.
>
> The opinions expressed here are mine, while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com
>
>
>
>
>
>
> ----------
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 19:40
> To: Michael Segel <michael_segel@hotmail.com>
>
>
> Thanks again, Mike, for the valuable info.
>
> This is the way I may proceed; please correct me if needed:
>
>
>    1. Informatica will be used for the "E" or "ET" in ETL.
>    2. For each input table we have to specify the output to be a CSV-type
>    file, to be standard across all the activities. If there is some "T" and
>    enrichment, then the output should be as the user wants it. In short, it
>    should reflect what they load into their data warehouse.
>    3. As you mentioned, it could be new data, updates or deletes. I have
>    found that the best approach is to add two columns: op_time (timestamp) and
>    op_type (INSERT/UPDATE/DELETE, flagged as 1/2/3).
>    4. Let us assume that all these files land in the landing zone as flat
>    files. We can easily create a cron that runs, say, every hour, puts the new
>    files into the respective HDFS sub-directories and then launches a Spark job
>    (I guess we will need a Spark job for each flat file type, i.e. per schema)
>    that reads from the said directory, creates an RDD from the CSV, transforms
>    it to a DataFrame, creates a tempTable and inserts/selects from the
>    tempTable into an ORC table in Hive, or simply puts the data into HBase
>    (see the sketch after this list).
>    5. The customer's IT guy does not want to use Phoenix on HBase, so I
>    suggested using a Hive external table on HBase so that we can run Tableau
>    on HBase via Hive (if we decide to go the HBase way).
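>
> A rough sketch of one such per-schema Spark job (point 4), including the
> op_time/op_type columns from point 3. The directory, table and column names
> are illustrative only, and the ORC target table is assumed to already exist:
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.{current_timestamp, lit}
>
> val spark = SparkSession.builder()
>   .appName("TradesCsvToOrc")
>   .enableHiveSupport()
>   .getOrCreate()
>
> // Directory filled by the hourly cron job
> val df = spark.read
>   .option("header", "true")
>   .option("inferSchema", "true")
>   .csv("hdfs:///staging/trades/2016-11-09/")
>   .withColumn("op_time", current_timestamp())
>   .withColumn("op_type", lit(1))   // 1 = INSERT for a plain append batch
>
> df.createOrReplaceTempView("trades_tmp")
>
> // Hive target table created beforehand as STORED AS ORC
> spark.sql("INSERT INTO TABLE accounts.trades_orc SELECT * FROM trades_tmp")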
>
> I guess we have to work with Informatica guys to ask them to create target
> files for us. I agree with you that using Informatica organic connectors to
> Hive would be much better. So we have to work around it.
>
> Please let me know your thoughts.
>
> Regards,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
> ----------
> From: Michael Segel <michael_segel@hotmail.com>
> Date: 9 November 2016 at 19:54
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
>
>
> Yeah, I’d avoid Phoenix.
>
> I'm not impressed with it; it lacks commercial support and doesn't have
> enough critical mass.
> I'm also hesitant to use HBase in certain environments, and I have concerns
> about its stability. (It depends on who's administering the system.)
> (But it's still a viable tool. I prefer MapR-DB over HBase.)
>
> If you want SQL, you need to either use Drill, Spark or HiveServer2 and
> put Hive tables over HBase.
>
> The problem with SQL over HBase… you end up doing full table scans unless
> you use an inverted table as an index and do a map side join. (Secondary
> indexes that you have to maintain.)
>
> The other option is to dump it from HBase in to parquet files. Again it
> depends on the data and how you intend to use it.
>
>
> I’m seriously going to have to write a book on this…  ;-)
>
> -Mike
>
> ----------
> From: ayan guha <guha.ayan@gmail.com>
> Date: 9 November 2016 at 19:59
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Cc: user <user@spark.apache.org>, Michael Segel <michael_segel@hotmail.com
> >
>
>
> Yes, it can be done and it is standard practice. I would suggest a mixed
> approach: use Informatica to create files in HDFS and have Hive staging
> tables defined as external tables on those directories. From that point
> onwards, use Spark.
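>
> A minimal sketch of that hand-off, with hypothetical table names (the
> external staging table is assumed to already sit on the directory that
> Informatica writes to):
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("StagingToOrc")
>   .enableHiveSupport()
>   .getOrCreate()
>
> // Read the Hive external staging table and append into the managed ORC table
> spark.table("staging.trades_ext")
>   .write
>   .mode("append")
>   .insertInto("warehouse.trades_orc")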
>
> Hth
> Ayan
>
> ----------
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 20:05
> To: Michael Segel <michael_segel@hotmail.com>
>
>
> LOL Mike agreed. Keep a copy for me :)
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
> ----------
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 20:10
> To: ayan guha <guha.ayan@gmail.com>
>
>
> Thanks Ayan. Taking this offline.
>
> I gather you know about Informatica :)
>
>
> This is the way I may proceed; please correct me if needed:
>
>
>    1. Informatica will be used for the "E" or "ET" in ETL.
>    2. For each input table we have to specify the output to be a CSV-type
>    file, to be standard across all the activities. If there is some "T" and
>    enrichment, then the output should be as the user wants it. In short, it
>    should reflect what they load into their data warehouse.
>    3. It could be new data, updates or deletes. I have found that the
>    best approach is to add two columns: op_time (timestamp) and op_type
>    (INSERT/UPDATE/DELETE, flagged as 1/2/3).
>    4. Let us assume that all these files land in the landing zone as flat
>    files. We can easily create a cron that runs, say, every hour, puts the new
>    files into the respective HDFS sub-directories and then launches a Spark job
>    (I guess we will need a Spark job for each flat file type, i.e. per schema)
>    that reads from the said directory, creates an RDD from the CSV, transforms
>    it to a DataFrame, creates a tempTable and inserts/selects from the
>    tempTable into an ORC table in Hive, or simply puts the data into HBase.
>    5. The customer's IT guy does not want to use Phoenix on HBase, so I
>    suggested using a Hive external table on HBase so that we can run Tableau
>    on HBase via Hive (if we decide to go the HBase way).
>
> I guess we have to work with Informatica guys to ask them to create target
> files for us. I agree with you that using Informatica organic connectors to
> Hive would be much better. So we have to work around it.
>
> Let me know your thoughts. I gather you have done something similar?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
>
> ----------
> From: Jörn Franke <jornfranke@gmail.com>
> Date: 9 November 2016 at 21:26
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Cc: "user @spark" <user@spark.apache.org>
>
>
> Basically you have mentioned the options. However, there are several ways in
> which Informatica can extract from (or store to) an RDBMS. If the native
> option is not available then you need to go via JDBC, as you have described.
> Alternatively (but only if it is worth it) you can schedule fetching of
> the files via Oozie and use it to convert the CSV into ORC/Parquet etc.
> If this is a common use case in the company, you can extend Informatica
> with Java classes that, for instance, convert the data directly into Parquet
> or ORC. However, that is some effort.
>
> ----------
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 22:16
> To: Jörn Franke <jornfranke@gmail.com>
> Cc: "user @spark" <user@spark.apache.org>
>
>
> Thanks guys,
>
> It sounds like we should let Informatica get the data out of the RDBMS and
> create a mapping to flat files that will be delivered to a directory visible
> to the HDFS host, and then push the CSV files into HDFS. From there, there
> are a number of options to work on:
>
>
>    1. Run cron or Oozie to get the data out of HDFS (or build an external
>    Hive table on that directory) and do an insert/select into a Hive managed
>    table.
>    2. Alternatively, use a Spark job to get the CSV data into an RDD, then
>    create a tempTable and do an insert/select from the tempTable into the
>    Hive table. Bear in mind that we will need a Spark job tailored to each
>    table schema.
>
>
> I believe the above is feasible?
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
> ----------
> From: ayan guha <guha.ayan@gmail.com>
> Date: 9 November 2016 at 22:34
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
>
>
> Hi Mich
>
> Not recently, but my team did exactly this a few years back.
> Essentially, Informatica is pretty good in terms of writing to HDFS. So I
> would suggest NOT using a landing zone, but going one step further and
> writing to HDFS from Informatica. This will be an append-only step, so no
> updates. Once the files are written to HDFS, develop a few patterns such as
> append-event, dim-update, dim-replace, dim-scd2 and so on, and apply them to
> go from the source data to the final datasets.
>
> Hive on HBase is doable, yes, but not recommended. However, if you want to
> handle updates, you can use it.
> --
> Best Regards,
> Ayan Guha
>
> ----------
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 22:46
> To: ayan guha <guha.ayan@gmail.com>
>
>
> Thanks Ayan.
>
> Our problem is that the business does not want to pay for the Informatica
> plug-in for HDFS. I don't think this plug-in comes as standard, so we would
> have to save the data as flat files?
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
> ----------
> From: ayan guha <guha.ayan@gmail.com>
> Date: 9 November 2016 at 22:55
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
>
>
> Just to mention the difference: the HDFS and Hive plug-ins are very different.
> When I last looked at it in 2013, the HDFS connector (please note the
> distinction: it is NOT a plug-in but a connector, in Informatica terms) was
> fairly mature and integrated directly with PowerCenter. However, the Hive
> plug-ins were a little fuzzy and required an Informatica Developer
> license etc.
>
> So, yes, if you do not have the HDFS connector then you are kind of stuck.  In
> that case, it is essentially a file system source that you are ingesting into
> Hadoop :)
>
> ----------
> From: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Date: 9 November 2016 at 23:04
> To: ayan guha <guha.ayan@gmail.com>
>
>
> Is the HDFS connector standard with Informatica, Ayan? In other words, can we
> write data to an HDFS directory using Informatica and nothing else?
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
>
> ----------
> From: ayan guha <guha.ayan@gmail.com>
> Date: 9 November 2016 at 23:19
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
>
>
> The HDFS connector is standard for Informatica, yes... in the same sense that
> it has connectors to non-standard data sources such as SAP, mainframes,
> HTTP, Lotus Notes etc.
>
> It depends on licensing, but I am fairly confident that even if it is sold
> separately, it is much cheaper than the Hive plug-in.
>
> ----------
> From: Michael Segel <msegel_hadoop@hotmail.com>
> Date: 9 November 2016 at 23:25
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com>
> Cc: Jörn Franke <jornfranke@gmail.com>, "user @spark" <
> user@spark.apache.org>
>
>
> Oozie, a product only a mad Russian would love. ;-)
>
> Just say no to hive. Go from Flat to Parquet.
> (This sounds easy, but there’s some work that has to occur…)
>
> Sorry for being cryptic, Mich’s question is pretty much generic for anyone
> building a data lake so it ends up overlapping with some work that I have
> to do…
>
> -Mike
>
>
>
