spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Spark Structured Streaming for Twitter Streaming data
Date Thu, 01 Feb 2018 03:36:15 GMT
The code uses the "socket" format, which is only for text sent over a simple
socket. That is completely different from how the Twitter APIs work, so this
won't work at all.
Fundamentally, for Structured Streaming, we have focused only on those
streaming sources that support record-level offset tracking (e.g. Kafka
offsets) and replayability, in order to give strong exactly-once
fault-tolerance guarantees. Hence we have focused on files, Kafka, and Kinesis
(socket is just for testing, as is documented). The Twitter APIs as a source
do not provide those capabilities, hence we have not focused on building one.
In general, for such sources (ones that are not perfectly replayable), there
are two possible solutions.

1. Build your own source: A quick Google search shows that others in the
community have attempted to build Structured Streaming sources for Twitter.
Such a source won't provide the same fault-tolerance guarantees as Kafka, etc.
However, I don't recommend this now because the DataSource APIs for building
streaming sources are not public yet, and are in flux.

2. Use Kafka/Kinesis as an intermediate system: Write something simple that
uses the Twitter APIs directly to read tweets and write them into
Kafka/Kinesis. Then just read from Kafka/Kinesis in Structured Streaming.
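
As a rough sketch of the reading side of option 2, assuming a separate
process is already writing one tweet per Kafka record: the broker address
localhost:9092, the topic name "tweets", and the object name below are all
placeholders I made up for illustration, and the job needs the
spark-sql-kafka-0-10 package on its classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical job name; not part of the original thread.
object TweetsFromKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("TweetsFromKafka")
      .getOrCreate()

    // Kafka tracks an offset per record, which is what makes this source
    // replayable after a failure.
    val tweets = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "tweets")                        // assumed topic name
      .option("startingOffsets", "latest")
      .load()
      // Kafka delivers key and value as binary; cast the value to a string.
      .selectExpr("CAST(value AS STRING) AS tweet")

    // Print each micro-batch to the console. This sink is for debugging only.
    val query = tweets
      .writeStream
      .format("console")
      .option("truncate", "false")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

The payload format is whatever the writing process puts into Kafka, so any
parsing of the tweet column depends on that choice.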

Hope this helps.

TD

On Wed, Jan 31, 2018 at 7:18 PM, Divya Gehlot <divya.htconex@gmail.com>
wrote:

> Hi,
> I see. Does that mean Spark Structured Streaming doesn't work with
> Twitter streams?
> I could see that people used Kafka or other streaming tools and used Spark to
> process the data in Structured Streaming.
>
> So the code below won't work directly with a Twitter stream until I set up Kafka?
>
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession
>>   .builder()
>>   .appName("Spark SQL basic example")
>>   .config("spark.some.config.option", "some-value")
>>   .getOrCreate()
>>
>> // For implicit conversions like converting RDDs to DataFrames
>> import spark.implicits._
>>
>> // Read text from socket
>> val socketDF = spark
>>   .readStream
>>   .format("socket")
>>   .option("host", "localhost")
>>   .option("port", 9999)
>>   .load()
>>
>> socketDF.isStreaming    // Returns true for DataFrames that have streaming sources
>>
>> socketDF.printSchema
>
>
> Thanks,
> Divya
>
> On 1 February 2018 at 10:30, Tathagata Das <tathagata.das1565@gmail.com>
> wrote:
>
>> Hello Divya,
>>
>> To add further clarification, Apache Bahir does not have any
>> Structured Streaming support for Twitter. It only has support for Twitter +
>> DStreams.
>>
>> TD
>>
>>
>>
>> On Wed, Jan 31, 2018 at 2:44 AM, vermanurag <anurag.verma@fnmathlogic.com> wrote:
>>
>>> Twitter functionality is not part of core Spark. We have successfully used
>>> the following package from Maven Central in the past:
>>>
>>> org.apache.bahir:spark-streaming-twitter_2.11:2.2.0
>>>
>>> Earlier there used to be a Twitter package under Spark, but I find that it
>>> has not been updated beyond Spark 1.6:
>>>
>>> org.apache.spark:spark-streaming-twitter_2.11:1.6.0
>>>
>>> Anurag
>>> www.fnmathlogic.com
>>
>
