spark-user mailing list archives

From xnts@o2.pl <x...@o2.pl>
Subject list of documents sentiment analysis - problem with defining proper approach with Spark
Date Wed, 17 Sep 2014 13:57:35 GMT
Hi,

For the last few days I have been working on an exercise where I want to understand the sentiment
of a set of articles.

As input I have an XML file with the articles and the AFINN-111.txt file defining the sentiment
of a few hundred words.

What I am able to do without any problem is load the data and put it into structures
(classes for articles, tuples of (word, sentiment-value) for sentiments).

Then what I think I need to do (from a logical point of view) is:

foreach article
   articleWords = split the body by " " 
   join the two lists (articleWords and sentimentWords) together.
   calculate the sentiment for the article by summing up sentiments of all words that it includes
dump the article id, sentiment into a flat file
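
In Spark terms, I imagine this plan could look roughly like the sketch below (assumptions on
my side: a case class Article(id: Long, body: String), articles as an RDD[Article], and
sentiments as an RDD[(String, Int)] mapping each AFINN word to its score):

import org.apache.spark.SparkContext._ // pair-RDD operations (join, reduceByKey)

// pair every word with the id of the article it came from
val wordToArticle = articles.flatMap(a => a.body.split(" ").map(w => (w, a.id)))

// join on the word, keep (articleId, score), and sum the scores per article
val articleSentiment = wordToArticle
  .join(sentiments)                              // (word, (articleId, score))
  .map { case (_, (id, score)) => (id, score) }
  .reduceByKey(_ + _)

articleSentiment.saveAsTextFile("article-sentiment") // hypothetical output path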

And this is where I am stuck :) I tried multiple combinations of map/reduceByKey; they all
either didn't make much sense (like computing the sentiment for all articles combined) or
failed with errors that the function cannot be serialised. Today I even tried to implement
this with a brute-force approach, doing:

articles.foreach(calculateSentiment)

where calculateSentiment looks like this:

val words = sc.parallelize(post.body.split(" ")) // split body by " "
val wordPairs = words.map(w => (w, 1)).reduceByKey(_+_, 1) // tuples of (word, #occurrences in article)
val joinedValues = wordPairs.join(sentiments_) // join

But somehow I had a feeling this is not the best idea, and I think I was right, since the job
has been running for about an hour (and I have only a few hundred GBs to process).
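
Since the AFINN list is only a few hundred words, I am also wondering whether I should skip
the join entirely and broadcast the sentiment map to every worker instead. A rough sketch of
what I mean (assuming sentiments is small enough to collect to the driver as a Map[String, Int]):

// ship the small lookup table to every worker once
val sentimentMap = sc.broadcast(sentiments.collectAsMap().toMap)

val articleScores = articles.map { a =>
  // sum the scores of the known words in the body; unknown words count as 0
  val score = a.body.split(" ").map(w => sentimentMap.value.getOrElse(w, 0)).sum
  (a.id, score)
}

articleScores.saveAsTextFile("article-sentiment") // hypothetical output path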

So the question is: what am I doing wrong? Any hints or suggestions on direction are really
appreciated!

Thank you,
Leszek




