mahout-user mailing list archives

From Luca Filipponi <luca.filippon...@gmail.com>
Subject Re: Naive Bayes Classifier Sentiment Analysis
Date Tue, 29 Jul 2014 15:36:30 GMT
I appreciate your help, but I’m afraid I didn’t understand it, given my limited experience.

Let me try to explain my problem better :D

I created a sequence file starting from a CSV like this (these are Italian tweets :D):

negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!

positivo,471685170698149888,RT @carlucci_cc: @valy_s renzie si preoccupa di chi gli garantisce
voti...ma stanno scoprendo il prezzo di quei fottutissimi #80euro dagli …

neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre schifezze... http://t.co/euFbtP7hQ1
… #Europee2014 via 

I write the sequence file like this:
 

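// line is one CSV row: label,tweetId,text
// Limit the split to 3 parts so commas inside the tweet stay in the message.
String[] tokens = line.split(",", 3);
String label = tokens[0];
String id = tokens[1];
String message = tokens[2];
key.set("/" + label + "/" + id);   // key: /label/documentID
value.set(message);                // value: original tweet text
writer.append(key, value);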


So the sequence file has the form <Text,Text>, where the key is built
as “/label/documentID” and the value contains the original text of the document.
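
A minimal, self-contained sketch of that key construction, without the Hadoop writer, assuming each CSV row is label,id,text. The class name LabeledRecord and the method toKeyValue are made up for illustration, and the trim() calls are an addition that the snippet above does not have:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: turns one labeled CSV row into a (key, value) pair.
final class LabeledRecord {

    static Map.Entry<String, String> toKeyValue(String csvLine) {
        // Split into at most 3 parts so commas inside the tweet are preserved.
        String[] tokens = csvLine.split(",", 3);
        if (tokens.length < 3) {
            throw new IllegalArgumentException("Expected label,id,text but got: " + csvLine);
        }
        // Assumption: surrounding whitespace is not meaningful, so it is trimmed.
        String key = "/" + tokens[0].trim() + "/" + tokens[1].trim();
        return new SimpleEntry<>(key, tokens[2].trim());
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = toKeyValue(
            "neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre schifezze...");
        System.out.println(kv.getKey());
        System.out.println(kv.getValue());
    }
}
```

In the real job the pair would be appended to the sequence file with writer.append(key, value) as shown above.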

After this step I create the tf-idf vectors using the Mahout utilities, which gives me a
<Text,VectorWritable> sequence file like this:

Key: /negativo/468437278663409666 Value:/negativo/468437278663409666:{143:0.2884088933275849,233:0.2884088933275849,241:0.2772479861583959,309:0.22061363650715415}

Then I run the training command on the newly created vectors:

./mahout trainnb -i tfidf-vectors -el -li labelindex -o model -ow -c

And then:

./mahout testnb -i tfidf-vector -m model -l labelindex -ow -o trainingVectorTest-result -c

and this is the output:

14/07/25 15:44:04 INFO test.TestNaiveBayesDriver: Complementary Results: 
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :        112    99,115%
Incorrectly Classified Instances        :          1    0,885%
Total Classified Instances              :        113

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    <--Classified as
47   0    0     |  47    a     = negativo
0    41   0     |  41    b     = neutrale
0    1    24    |  25    c     = positivo

=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0,9361
Accuracy                                    99,115%
Reliability                                     74%
Reliability (standard deviation)            0,4937
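
As a sanity check, the accuracy figure can be recomputed from the confusion matrix: the diagonal sums to 47 + 41 + 24 = 112 out of 113 instances. A small sketch of that calculation (the class and method names are hypothetical, not part of Mahout):

```java
import java.util.Locale;

// Hypothetical helper: accuracy = sum of the diagonal / sum of all cells.
final class ConfusionMatrixAccuracy {

    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("Confusion matrix has no instances");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Rows: negativo, neutrale, positivo, copied from the output above.
        int[][] matrix = {
            {47, 0, 0},
            {0, 41, 0},
            {0, 1, 24},
        };
        System.out.println(String.format(Locale.ROOT, "%.3f%%", 100.0 * accuracy(matrix)));
    }
}
```

This prints 99.115%, matching the summary; the decimal comma in the log presumably comes from the machine's locale.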


What I want to do now is run the classifier on a new, unlabeled dataset, so I have
a CSV like this:

471685156584292353,@beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!

So I wrote a sequence file with:

key = /documentID/, value = content of the document

and then used the Mahout utilities to create the tf-idf vectors:

Key: /471685156584292353/ Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
...

But when I run the testnb command on this new dataset I get this exception:

java.lang.IllegalArgumentException: Label not found: 471685156584292353

I know this happens because the document ID is being treated as the label, but I don’t
know how to fix it. It would be great if you could point me to a similar example, because I
can’t find anything like it.
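
The exception is consistent with the key being split on "/" and the second element being taken as the expected label, as in the BayesTestMapper line quoted further down. A small sketch of that behaviour, assuming SLASH is simply a pattern matching a single "/" (the class and method names here are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo of how the expected label is derived from the key.
final class KeyLabelDemo {

    // Assumption: equivalent to the SLASH pattern used in BayesTestMapper.
    private static final Pattern SLASH = Pattern.compile("/");

    static String expectedLabel(String key) {
        // For "/a/b" the split yields ["", "a", "b"], so index 1 is "a".
        return SLASH.split(key)[1];
    }

    public static void main(String[] args) {
        // Labeled key: the label is picked up.
        System.out.println(expectedLabel("/negativo/468437278663409666"));
        // Unlabeled key: the document ID ends up where the label is expected.
        System.out.println(expectedLabel("/471685156584292353/"));
    }
}
```

With a key of the form /label/documentID the label comes out as expected, while /documentID/ yields the document ID, which is not in the label index.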

Thank you so much in advance, your help is really appreciated.

Luca Filipponi.


On 29 Jul 2014, at 16:43, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:

> Hi
> The sequence file format will be Text and VectorWritable.
> Suppose you have test documents named 1, 2, 3, 4.
> Then the sequence file can have entries like Key: /test/1 Value: <vectors1>,
> Key: /test/2 Value: <vectors2>.
> 
> this line in BayesTestMapper
> //the key is the expected value
> 
>    context.write(new Text(SLASH.split(key.toString())[1]), new VectorWritable(result));
> 
> 
> and TestNaiveBayesDriver.java might help you. If you remove the following part
> from that code, you will not get the confusion matrix, and the initial labels
> are not required.
> 
> 
> 
> 
> if (bestIdx != Integer.MIN_VALUE) {
>   ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>   analyzer.addInstance(pair.getFirst().toString(), classifierResult);
> }
> 
> 
> Your output file will then contain the document name (say, 1) and the label
> vector with its values.
> 
> 
> Hope this helps.
> 
> Thanks,
> 
> Vaibhav
> 
> vaibhavcse30@gmail.com
> 
> 
> 
> 
> On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi <luca.filipponi89@gmail.com>
> wrote:
> 
>> I am using Mahout 0.9; which part of the source code should I look at?
>> 
>> My problem is that I don't know how the sequence file should be
>> structured without the label.
>> 
>> Do you have any hints?
>> 
>> On 29 Jul 2014, at 15:24, vaibhav srivastava <vaibhavcse30@gmail.com> wrote:
>> 
>>> Hi,
>>> If you want to create a test set and do not want to measure accuracy,
>>> then you can create an instance of the classifier, load your model into
>>> it, and find the best score.
>>> Look at the naive Bayes test code.
>>> Hope this helps. Thanks.
>>> On 29 Jul 2014 12:53, "Luca Filipponi" <luca.filipponi89@gmail.com> wrote:
>>> 
>>>> Hi, I am trying to do sentiment analysis on Italian tweets from
>>>> Twitter using the naive Bayes classifier, but I'm having some trouble.
>>>> 
>>>> My idea was to label a lot of tweets as positive, negative or neutral
>>>> and use them as the training set for the classifier. To do that I wrote a
>>>> sequence file in the format <Text,Text>, where the key is
>>>> /label/tweetID and the value is the text, and then converted the text of
>>>> the whole dataset to tf-idf vectors using the Mahout utilities.
>>>> 
>>>> Then I'm using the commands:
>>>> 
>>>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
>>>> score looks right (I get nearly 100% because the test set is the same as
>>>> the training set).
>>>> 
>>>> My question is: if I want to use an unlabeled test set, how should it
>>>> be created? Because if the format isn't like:
>>>> 
>>>> key = /label/  the classifier can't find the label and I get an
>>>> exception
>>>> 
>>>> but a new dataset will obviously be unlabeled, because I need to
>>>> classify it, so I don't know what to put in the key of the sequence file.
>>>> 
>>>> I've searched online for examples, but the only ones I've found
>>>> use the split command on the original dataset and then test on part of
>>>> it, which isn't my case.
>>>> 
>>>> 
>>>> Any ideas for developing a better sentiment analysis are welcome; thanks
>>>> in advance for the help.
>>>> 
>>>> 
>> 
>> 
> 
> 
> -- 
> Thanks and Regards,
> Vaibhav Srivastava
> Email-id: vaibhavcse30@gmail.com
> Mobile no.: 9552543029
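
The suggestion in the quoted reply (score the document against the model and keep the best-scoring label, skipping the confusion matrix) can be sketched in plain Java. This is only an illustration under assumptions: the scores array stands in for whatever per-class scores the loaded classifier returns, the index-to-label map stands in for the label index, and the class and method names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: picks the label with the highest score, no expected label needed.
final class BestLabelDemo {

    static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("No scores to choose from");
        }
        int bestIdx = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[bestIdx]) {
                bestIdx = i;
            }
        }
        String label = labelIndex.get(bestIdx);
        if (label == null) {
            throw new IllegalArgumentException("No label for index " + bestIdx);
        }
        return label;
    }

    public static void main(String[] args) {
        // Assumed index order; the real one comes from the labelindex file.
        Map<Integer, String> labelIndex = new LinkedHashMap<>();
        labelIndex.put(0, "negativo");
        labelIndex.put(1, "neutrale");
        labelIndex.put(2, "positivo");

        // Made-up scores for one unlabeled document.
        double[] scores = {-12.5, -9.1, -10.3};
        System.out.println("471685156584292353 -> " + bestLabel(scores, labelIndex));
    }
}
```

Because nothing is compared against an expected label, the document ID in the key is only used to report which tweet received which label.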

