spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Data from PostgreSQL to Spark
Date Tue, 28 Jul 2015 17:57:22 GMT
Can you put a transparent cache in front of the database? Or a JDBC
proxy?
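
To make the cache idea concrete, here is a minimal read-through cache
sketch in Python. It is only an illustration: the class name
ReadThroughCache, the loader callback (standing in for whatever query
hits PostgreSQL) and the ttl_seconds parameter are all hypothetical and
not taken from this thread or from any particular library.

```python
import time


class ReadThroughCache:
    """Serve repeated lookups from memory; only misses reach the database."""

    def __init__(self, loader, ttl_seconds=5.0, clock=time.monotonic):
        self._loader = loader      # hypothetical function that queries PostgreSQL
        self._ttl = ttl_seconds    # how long a cached value is considered fresh
        self._clock = clock        # injectable so the expiry logic can be tested
        self._entries = {}         # key -> (expiry_time, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh entry: no database call
        value = self._loader(key)  # miss or expired entry: one database call
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is staleness: a cached value can be up to ttl_seconds old,
so the TTL has to be chosen against how fast the source data changes.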

On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele <gangele397@gmail.com>
wrote:

> "Can the source write to Kafka/Flume/HBase in addition to Postgres?"
> No, it can't, because many applications produce this PostgreSQL data.
> I can't really ask all of those teams to start writing to some other
> sink as well.
>
> The velocity of the application is too high.
>
> On 28 July 2015 at 21:50, <santoshv98@gmail.com> wrote:
>
>> Sqoop's incremental data fetch will reduce the amount of data you need to
>> pull from the source, but by the time that incremental fetch completes,
>> won't the copy be out of date again if the velocity of the data is high?
>>
>> Maybe you can put a trigger in Postgres to send data to the big data
>> cluster as soon as changes are made. Or, as I was saying in another email,
>> can the source write to Kafka/Flume/HBase in addition to Postgres?
>>
>> Sent from Windows Mail
>>
>> *From:* Jeetendra Gangele <gangele397@gmail.com>
>> *Sent:* Tuesday, July 28, 2015 5:43 AM
>> *To:* santoshv98@gmail.com
>> *Cc:* ayan guha <guha.ayan@gmail.com>, felixcheung_m@hotmail.com,
>> user@spark.apache.org
>>
>> I am trying to do that, but there will always be a data mismatch, since
>> the main database receives many updates while Sqoop is fetching. Sqoop
>> does have an incremental data fetch, but it queries the database rather
>> than reading the WAL.
>>
>>
>>
>> On 28 July 2015 at 02:52, <santoshv98@gmail.com> wrote:
>>
>>> Why can't you bulk pre-fetch the data to HDFS (for example with Sqoop)
>>> instead of hitting Postgres multiple times?
>>>
>>> Sent from Windows Mail
>>>
>>> *From:* ayan guha <guha.ayan@gmail.com>
>>> *Sent:* Monday, July 27, 2015 4:41 PM
>>> *To:* Jeetendra Gangele <gangele397@gmail.com>
>>> *Cc:* felixcheung_m@hotmail.com, user@spark.apache.org
>>>
>>> You can open the DB connection once per partition. Please have a look at
>>> the design patterns for using foreachRDD in the Spark Streaming
>>> documentation.
>>> How big is your data in the DB? How quickly does that data change? You
>>> would be better off if the data were already in Spark.
>>> On 28 Jul 2015 04:48, "Jeetendra Gangele" <gangele397@gmail.com> wrote:
>>>
>>>> Thanks for your reply.
>>>>
>>>> I will be making around 6000 parallel calls to PostgreSQL, which is not
>>>> good: my database will die.
>>>> The number of calls to the database will keep increasing.
>>>> Handling millions of requests is not an issue with HBase/NoSQL.
>>>>
>>>> Is there any other alternative?
>>>>
>>>>
>>>>
>>>>
>>>> On 27 July 2015 at 23:18, <felixcheung_m@hotmail.com> wrote:
>>>>
>>>>> You can have Spark read from PostgreSQL through the data access
>>>>> API. Do you have any concerns with that approach, since you mention
>>>>> copying that data into HBase?
>>>>>
>>>>> From: Jeetendra Gangele
>>>>> Sent: Monday, July 27, 6:00 AM
>>>>> Subject: Data from PostgreSQL to Spark
>>>>> To: user
>>>>>
>>>>> Hi All
>>>>>
>>>>> I have a use case where I am consuming events from RabbitMQ using
>>>>> Spark Streaming. Each event has some fields that I want to use to query
>>>>> PostgreSQL, then join the event data with the PostgreSQL data and
>>>>> write the aggregated data to HDFS, so that I can run analytics queries
>>>>> over it using Spark SQL.
>>>>>
>>>>> My concern is that the PostgreSQL data is production data, so I don't
>>>>> want to hit the database too many times.
>>>>>
>>>>> In any given second I may have 3000 events, which means I need to
>>>>> fire 3000 parallel queries at PostgreSQL, and this data keeps on
>>>>> growing, so my database will go down.
>>>>>
>>>>> I can't migrate this PostgreSQL data since lots of systems use it, but
>>>>> I can copy it to a NoSQL store like HBase and query HBase. The issue
>>>>> then is: how can I make sure that HBase has up-to-date data?
>>>>>
>>>>> Can anyone suggest the best approach to handle this case?
>>>>>
>>>>> Regards
>>>>>
>>>>> Jeetendra
>>>>>
>>>>>
>
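
The "once per partition" suggestion quoted above can be sketched as
follows. This is a hedged illustration, not code from the thread: the
function enrich_partition, the connect factory, and the customers table
with id and name columns are all hypothetical, and sqlite3 stands in for
PostgreSQL so that the example runs without a server. With a PostgreSQL
driver such as psycopg2 the placeholder would be %s instead of ?.

```python
import sqlite3


def enrich_partition(events, connect):
    """Join one partition of (key, payload) events against a lookup table
    using a single connection and a single batched query."""
    events = list(events)
    if not events:
        return iter([])            # empty partition: do not connect at all
    conn = connect()               # one connection for the whole partition
    try:
        keys = sorted({key for key, _ in events})
        placeholders = ",".join("?" for _ in keys)
        # One IN query per partition instead of one query per event.
        rows = conn.execute(
            f"SELECT id, name FROM customers WHERE id IN ({placeholders})",
            keys,
        ).fetchall()
    finally:
        conn.close()
    names = dict(rows)             # id -> name
    # Events without a matching row are kept, with None as the joined value.
    return ((key, payload, names.get(key)) for key, payload in events)
```

In PySpark a function like this would typically be passed to
mapPartitions, for example
rdd.mapPartitions(lambda part: enrich_partition(part, connect)), so that
the number of database connections scales with the number of partitions
rather than with the number of events. Very large partitions would need
the key list split into chunks to stay within the driver's parameter
limit.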
