References: <2012050421082168717948@gmail.com>
Date: Tue, 8 May 2012 19:11:37 +0530
Subject: Re: A Mahout Naive Bayes classifier problem
From: Nimesh Parikh
To: user@mahout.apache.org

Well, you could try changing the "UTF-8" parameter to something else.

Thanks,
Nimesh

On Sat, May 5, 2012 at 5:53 AM, Lance Norskog wrote:

> Yes, it could be a charset problem. It could also be the total
> number of terms you supply.
>
> Which analyzer do you use? Is it the Lucene "CJKAnalyzer"? This
> creates bigrams of all successive characters, so the number of unique
> terms explodes. This will cause the Hadoop job to explode. The "Smart
> Chinese Analyzer" uses a trained model to split text into words of
> 1, 2 or 3 characters. The "Standard Analyzer" will split all CJK text
> into single-character terms. Given that this is a Bayesian model, the
> Bayesian assumption would be that single terms are good enough. I would
> go with the StandardAnalyzer.
>
> (I learned all of this just now in my day job in the Lucene business.)
>
> On Fri, May 4, 2012 at 6:32 AM, Robin Anil wrote:
> > Can you provide the console output when you run train or test?
> > On May 4, 2012 8:09 AM, "Zehao Jin" wrote:
> >
> >> Dear all,
> >> I'm a Mahout beginner, and I need to use the Mahout Naive Bayes
> >> classifier for text classification. To get started, I followed the
> >> Twenty Newsgroups example:
> >> 1. Start the Hadoop cluster.
> >> 2. Run the 20 newsgroups example by executing the script
> >> $ ./examples/bin/build-20news-bayes.sh and choose the Naive Bayes method.
> >> 3. Finally, I got the same confusion matrix as the one shown here:
> >> https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
> >> But I have to classify Chinese texts and had no clue how, so I read the
> >> shell script examples/bin/build-20news-bayes.sh to learn how the example
> >> is processed. Then I followed the same steps as the script:
> >> 1. Prepare the training data.
> >> The script uses
> >> org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups to format the
> >> e-mail texts into one document per line: the label, then the words.
> >> Chinese differs from English: words are not separated by spaces, and
> >> different character combinations have different meanings. So I used a
> >> Chinese text analyzer to split the words and matched the format. Each
> >> line looks like this: Label+'\t'+word1 word2 ....+'\n';
> >> The example's analyzer output:
> >> And the Chinese analyzer output:
> >>
> >> 2. Put the formatted training data and test data into HDFS. (My Hadoop
> >> platform has 1 namenode and 4 datanodes on Fedora 14.) The example has 20
> >> categories, and my corpus has 10 categories:
> >> The example: My categories:
> >>
> >> 3. Train the classifier and test it on Hadoop.
> >> The example does this:
> >>
> >> ./bin/mahout trainclassifier -i /20news-bydate/bayes-train-input -o /20news-bydate/bayes-model -type bayes -ng 1 -source hdfs
> >>
> >> ./bin/mahout testclassifier -m /20news-bydate/bayes-model -d /20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
> >> My commands follow the example exactly; the only difference is
> >> the directory.
> >>
> >> Strangely, I cannot get a result like the example's. I ran the program
> >> several times, but the MapReduce job always fails:
> >> Task xxx failed to report status for 600 seconds. Killing.
> >>
> >> What I want to ask is: are the Mahout trainclassifier
> >> (./bin/mahout trainclassifier xxx) and testclassifier (./bin/mahout
> >> testclassifier xxx) commands suitable for my program, or can they only
> >> be used with the 20 newsgroups example? If they cannot be used, it will
> >> be really hard for me to implement the Naive Bayes algorithm. Or is it a
> >> charset problem? Many problems are caused by that. Can you give me some
> >> support? I have been scratching my head for a few days. Thank you very
> >> much!
> >> ------------------------------
> >> Zehao Jin, SCUT, China.
> >>
> >
>
> --
> Lance Norskog
> goksron@gmail.com
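
The input format described in the thread (label, tab, space-separated terms, newline) and the term-count concern Lance raises can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every function name below is hypothetical, the two tokenizers are rough character-level stand-ins for the unigram and bigram behaviour discussed above (not the actual Lucene analyzers), and the sample documents are made up.

```python
import io


def character_unigrams(text):
    """One term per non-space character (stand-in for unigram splitting)."""
    return [ch for ch in text if not ch.isspace()]


def character_bigrams(text):
    """Overlapping two-character terms within each whitespace-separated run."""
    terms = []
    for run in text.split():
        if len(run) < 2:
            terms.append(run)
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


def format_line(label, terms):
    """Build one training line: Label + '\\t' + 'term1 term2 ...' + '\\n'."""
    return label + "\t" + " ".join(terms) + "\n"


def write_training_file(path, documents, tokenizer):
    """Write (label, text) pairs, one document per line, explicitly as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as out:
        for label, text in documents:
            out.write(format_line(label, tokenizer(text)))


def vocabulary_size(documents, tokenizer):
    """Count the distinct terms a tokenizer produces over all documents."""
    vocabulary = set()
    for _label, text in documents:
        vocabulary.update(tokenizer(text))
    return len(vocabulary)


if __name__ == "__main__":
    # Made-up sample corpus: (label, raw text) pairs.
    sample = [("sports", "足球比赛很精彩"), ("finance", "股票市场波动")]
    print(vocabulary_size(sample, character_unigrams))
    print(vocabulary_size(sample, character_bigrams))
```

Running `vocabulary_size` over a real corpus with each tokenizer gives a quick sense of how many unique terms the training job would have to handle before anything is submitted to Hadoop. Writing the file with an explicit `encoding="utf-8"` keeps the bytes on disk consistent with whatever charset the training step is told to expect.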