spark-user mailing list archives

From Aakash Basu <aakash.spark....@gmail.com>
Subject Re: Multiple Kafka Spark Streaming Dataframe Join query
Date Thu, 15 Mar 2018 05:22:57 GMT
Thanks to TD, the savior!

Shall look into it.

On Thu, Mar 15, 2018 at 1:04 AM, Tathagata Das <tathagata.das1565@gmail.com>
wrote:

> Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
>
> This is true stream-stream join which will automatically buffer delayed
> data and appropriately join stuff with SQL join semantics. Please check it
> out :)
>
> TD
>
>
>
> On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes <djmgguedes@gmail.com>
> wrote:
>
>> I misread it, and thought that your question was whether PySpark supports
>> Kafka, lol. Sorry!
>>
>> On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu <aakash.spark.raj@gmail.com>
>> wrote:
>>
>>> Hey Dylan,
>>>
>>> Great!
>>>
>>> Can you revert back to my initial and also the latest mail?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On 15-Mar-2018 12:27 AM, "Dylan Guedes" <djmgguedes@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've been using Kafka with PySpark since 2.1.
>>>>
>>>> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu <
>>>> aakash.spark.raj@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm yet to.
>>>>>
>>>>> Just want to know: when will Spark 2.3 with the Kafka 0.10 Spark package
>>>>> allow Python? I read somewhere that, as of now, Scala and Java are the
>>>>> languages to be used.
>>>>>
>>>>> Please correct me if I am wrong.
>>>>>
>>>>> Thanks,
>>>>> Aakash.
>>>>>
>>>>> On 14-Mar-2018 8:24 PM, "Georg Heiler" <georg.kf.heiler@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Did you try Spark 2.3 with Structured Streaming? There, watermarking
>>>>>> and plain SQL might be really interesting for you.
>>>>>> On Wed, 14 Mar 2018 at 14:57, Aakash Basu <aakash.spark.raj@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Info (Using): Spark Streaming Kafka 0.8 package*
>>>>>>>
>>>>>>> *Spark 2.2.1*
>>>>>>> *Kafka 1.0.1*
>>>>>>>
>>>>>>> As of now, I am feeding paragraphs into the Kafka console producer, and
>>>>>>> my Spark job, which is acting as a receiver, is printing the flattened
>>>>>>> words, which is a complete RDD operation.
>>>>>>>
>>>>>>> *My motive is to read two tables continuously (being updated) as two
>>>>>>> distinct Kafka topics being read as two Spark DataFrames and join them
>>>>>>> based on a key and produce the output.* (I am from Spark-SQL
>>>>>>> background, pardon my Spark-SQL-ish writing)
>>>>>>>
>>>>>>> *It may happen that the first topic receives new data 15 mins prior
>>>>>>> to the second topic. In that scenario, how do I proceed? I should not
>>>>>>> lose any data.*
>>>>>>>
>>>>>>> As of now, I want to simply pass paragraphs, read them as RDD,
>>>>>>> convert to DF and then join to get the common keys as the output.
>>>>>>> (Just for R&D).
>>>>>>>
>>>>>>> Started using Spark Streaming and Kafka today itself.
>>>>>>>
>>>>>>> Please help!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aakash.
>>>>>>>
>>>>>>
>>>>
>>
>
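
The buffering behaviour TD describes in the thread above (rows from one topic
waiting for their match from the other) can be illustrated with a toy,
single-process sketch. This is not Spark code and not how Spark implements
stream-stream joins; the class name BufferedInnerJoin, the method on_row and
the parameter max_delay_minutes are all hypothetical names made up for this
illustration. It assumes event times are plain integers in minutes and that a
row may be dropped from the buffer once it is older than the newest event time
seen minus max_delay_minutes.

```python
from collections import defaultdict


class BufferedInnerJoin:
    """Toy model of an inner join over two streams with time-bounded buffers.

    Hypothetical illustration only; not Spark's API or implementation.
    """

    def __init__(self, max_delay_minutes):
        self.max_delay = max_delay_minutes
        # One buffer per side: join key -> list of (event_time, value).
        self.buffers = {"left": defaultdict(list), "right": defaultdict(list)}
        self.max_event_time = 0

    def _evict(self, watermark):
        # Drop buffered rows whose event time is older than the watermark.
        for buf in self.buffers.values():
            for key in list(buf):
                buf[key] = [(t, v) for t, v in buf[key] if t >= watermark]
                if not buf[key]:
                    del buf[key]

    def on_row(self, side, key, event_time, value):
        """Buffer one incoming row and return the joined rows it produces."""
        other = "right" if side == "left" else "left"
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.max_delay
        self._evict(watermark)
        if event_time < watermark:
            return []  # the row itself arrived too late to be kept
        self.buffers[side][key].append((event_time, value))
        joined = []
        for _, other_value in self.buffers[other].get(key, []):
            if side == "left":
                joined.append((key, value, other_value))
            else:
                joined.append((key, other_value, value))
        return joined


if __name__ == "__main__":
    join = BufferedInnerJoin(max_delay_minutes=30)
    # The first topic delivers key "a" at minute 0; nothing to join with yet.
    print(join.on_row("left", "a", 0, "L0"))
    # The second topic delivers key "a" 15 minutes later; the buffered row matches.
    print(join.on_row("right", "a", 15, "R15"))
```

With max_delay_minutes=30, the row that arrives 15 minutes late still finds
its partner because the earlier row is still buffered. A larger allowed delay
keeps more state in memory, so the value is a trade-off between memory use and
tolerance for late data.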
