spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: filtering out non English tweets using TwitterUtils
Date Tue, 11 Nov 2014 19:45:41 GMT
You could receive all the tweets in the stream and then apply the
"filter" transformation to the DStream of tweets to drop the
non-English ones. The tweets in the DStream are of type
twitter4j.Status, which has a field describing the language. You can
use that field in the filter.
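
A rough sketch of such a language predicate, not taken from the original
message. Tweet here is a hypothetical stand-in for twitter4j.Status so the
example runs without Spark or twitter4j, and the class and method names are
made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only. Tweet is a hypothetical stand-in for twitter4j.Status so the
// example runs without Spark or twitter4j on the classpath.
final class LangTagFilter {

    static final class Tweet {
        final String lang;  // language tag as reported by Twitter, e.g. "en"
        final String text;

        Tweet(String lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    // True when the tag is "en" or a regional variant such as "en-GB".
    static boolean isTaggedEnglish(String lang) {
        if (lang == null) {
            return false;
        }
        String tag = lang.toLowerCase(Locale.ROOT);
        return tag.equals("en") || tag.startsWith("en-") || tag.startsWith("en_");
    }

    // Plain-Java equivalent of applying the predicate in a DStream filter.
    static List<Tweet> keepEnglish(List<Tweet> tweets) {
        List<Tweet> kept = new ArrayList<>();
        for (Tweet t : tweets) {
            if (isTaggedEnglish(t.lang)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

On a real stream the same predicate would go inside the filter call, roughly
tweets.filter(status -> LangTagFilter.isTaggedEnglish(status.getLang())).
That assumes a twitter4j version whose Status exposes getLang(); with older
versions the closest field may be the author's profile language,
status.getUser().getLang(), which says less about the tweet itself.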

In practice, though, Twitter also marks a lot of non-English tweets as
English. To really filter out ALL non-English tweets, you will probably
have to do some machine learning to "identify" English tweets.
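
As a much simpler stand-in for that kind of content-based check, the text
itself can be scored against common English words. This is a hand-written
heuristic, not a trained model and not something the original message
proposes. The stopword list, the class name, and the threshold are all
assumptions picked for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Heuristic sketch: the fraction of tokens that are common English words.
// The word list below is a small illustrative sample, not a tuned resource.
final class EnglishStopwordScore {

    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "the", "and", "is", "are", "to", "of", "in", "that", "it",
        "for", "you", "this", "with", "on", "was"));

    // Returns hits / tokens, or 0.0 when the text has no ASCII-letter tokens.
    static double score(String text) {
        if (text == null) {
            return 0.0;
        }
        // Anything other than a-z or an apostrophe acts as a separator.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z']+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (STOPWORDS.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // The threshold is a free parameter and would need tuning on real tweets.
    static boolean looksEnglish(String text, double threshold) {
        return score(text) >= threshold;
    }
}
```

A check like this could be combined with the language field in the same
filter, so that a tweet is kept only when it is tagged as English and also
looks English. Short tweets, slang, and hashtags will still fool it, which is
why a properly trained language identifier is the more robust option.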

On Tue, Nov 11, 2014 at 11:41 AM, SK <skrishna.id@gmail.com> wrote:
> Hi,
>
> Is there a way to extract only the English language tweets when using
> TwitterUtils.createStream()? The "filters" argument specifies the strings
> that need to be contained in the tweets, but I am not sure how this can be
> used to specify the language.
>
> thanks
