spark-issues mailing list archives

From "Andrew Davidson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json
Date Tue, 26 Jan 2016 20:04:39 GMT
Andrew Davidson created SPARK-13009:
---------------------------------------

             Summary: spark-streaming-twitter_2.10 does not make it possible to access the
raw twitter json
                 Key: SPARK-13009
                 URL: https://issues.apache.org/jira/browse/SPARK-13009
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 1.6.0
            Reporter: Andrew Davidson
            Priority: Blocker


The Streaming-twitter package makes it easy for Java programmers to work with twitter. The
implementation does not return the raw twitter JSON; it returns each tweet as a twitter4J
StatusJSONImpl object:

JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);

The Status class is different from the raw JSON, i.e. serializing the Status object will not
produce the same JSON as the original. I have downstream systems that can only process raw
tweets, not twitter4J Status objects.

Here is the bug/RFE request I made to Twitter4J <twitter4j@googlegroups.com>. They asked
me to create a Spark tracking issue.


On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
Hi All

Quick problem summary:

My system uses the Status objects to do some analysis; however, I need to store the raw JSON.
There are other systems that process that data that are not written in Java.
Currently we are serializing the Status object. The resulting JSON is going to break downstream systems.
I am using the Apache Spark Streaming package spark-streaming-twitter_2.10: http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Request For Enhancement:
I imagine easy access to the raw JSON is a common requirement. Would it be possible to add
a member function getRawJson() to StatusJSONImpl? By default the returned value would be null
unless jsonStoreEnabled=True is set in the config.


Alternative implementations:
 

It should be possible to modify spark-streaming-twitter_2.10 to provide this support.
The solution is not very clean:

It would require Apache Spark to define its own Status POJO. The current StatusJSONImpl
class is marked final.
The wrapper is not going to work nicely with existing code.
spark-streaming-twitter_2.10 does not expose all of the twitter streaming API, so many developers
are writing their own implementations of org.apache.spark.streaming.twitter.TwitterInputDStream.
This makes maintenance difficult. It's not easy to know when the Spark implementation for twitter
has changed.
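
To make the wrapper alternative concrete, here is a minimal sketch of what such a type might look like. The class name WithRawJson and its methods are hypothetical; nothing like this exists in Spark or twitter4j, and it uses only the Java standard library.

```java
import java.util.Objects;

/**
 * Hypothetical wrapper (not part of Spark or twitter4j): pairs a parsed
 * object with the raw JSON text it was parsed from.
 */
public final class WithRawJson<T> {
    private final T parsed;
    private final String rawJson; // null when the raw JSON was not retained

    public WithRawJson(T parsed, String rawJson) {
        this.parsed = Objects.requireNonNull(parsed, "parsed");
        this.rawJson = rawJson;
    }

    /** Returns the parsed object, e.g. a twitter4j Status. */
    public T getParsed() {
        return parsed;
    }

    /** Returns the original JSON text, or null if it was not retained. */
    public String getRawJson() {
        return rawJson;
    }
}
```

With a type like this, code written against JavaDStream<Status> would have to be changed to work with JavaDStream<WithRawJson<Status>> and call getParsed() everywhere, which is why the wrapper does not fit nicely with existing code.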
Code listing for spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala

private[streaming]
class TwitterReceiver(
    twitterAuth: Authorization,
    filters: Seq[String],
    storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

  @volatile private var twitterStream: TwitterStream = _
  @volatile private var stopped = false

  def onStart() {
    try {
      val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
      newTwitterStream.addListener(new StatusListener {
        def onStatus(status: Status): Unit = {
          store(status)
        }

Ref: https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html

What do people think?

Kind regards

Andy

From: <twit...@googlegroups.com> on behalf of Igor Brigadir <igor.b...@ucdconnect.ie>
Reply-To: <twit...@googlegroups.com>
Date: Tuesday, January 19, 2016 at 5:55 AM
To: Twitter4J <twit...@googlegroups.com>
Subject: Re: [Twitter4J] trouble writing unit test

The main issue is that the JSON object is in the wrong format.

eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 +0000 2015", ...

It looks like the JSON you have was serialized from a Java Status object, which makes the JSON
objects different from what you get from the API. TwitterObjectFactory expects JSON from Twitter
(I haven't had any problems using TwitterObjectFactory instead of the deprecated DataObjectFactory).

You could "fix" it by matching the keys & values you have with the correct twitter API
JSON. It should look like the example here: https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
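
As a rough illustration of that key and value matching, the sketch below handles only the createdAt example quoted above. The class and method names are hypothetical, and it assumes a key can be mapped by a plain camelCase to snake_case conversion, which may not hold for every field of a serialized Status.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper; not part of twitter4j or Spark. */
public final class TwitterJsonKeys {
    // Pattern matching the "created_at" example quoted above, rendered in UTC.
    private static final DateTimeFormatter CREATED_AT =
        DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    private TwitterJsonKeys() {}

    /** Converts a camelCase key such as "createdAt" to "created_at". */
    public static String toSnakeCase(String key) {
        StringBuilder out = new StringBuilder();
        for (char c : key.toCharArray()) {
            if (Character.isUpperCase(c)) {
                out.append('_').append(Character.toLowerCase(c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Formats epoch milliseconds in the style of the API's created_at value. */
    public static String toCreatedAt(long epochMillis) {
        return CREATED_AT.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

For the values in this thread, toSnakeCase("createdAt") gives "created_at" and toCreatedAt(1449775664000L) gives "Thu Dec 10 19:27:44 +0000 2015". Every other field would need the same kind of per-field attention.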

But it might be easier to download the tweets again, this time using TwitterObjectFactory.getRawJSON(status)
to get the original JSON from the Twitter API, and save that for later. (You must have jsonStoreEnabled=True
in your config, and call getRawJSON in the same thread as .showStatus() or lookup() or whatever
you're using to load tweets.)







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

