mahout-user mailing list archives

From Debashis Das <debashis...@gmail.com>
Subject Split giving wrong test and training partitions
Date Fri, 21 Jun 2013 16:05:18 GMT
Hi,
  I have a question about Mahout's split program. I have a set of
tf-idf vectors, created with seq2sparse, that I want to partition into
training and test sets using a 60-40 split (ultimately I want to run a
Naive Bayes classifier for text classification). There are 19 tf-idf
vectors in total. Here is the seqdumper output for the part-r-00000
tfidf-vectors file:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop and
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB: /opt/mapr/mahout/mahout-distribution-0.7/mahout-examples-0.7-mapr-job.jar
13/06/21 10:55:11 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[TwitterSentimentAnalysis/VSMComp/tfidf-vectors/part-r-00000],
--startPhase=[0], --tempDir=[temp]}
13/06/21 10:55:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/21 10:55:11 INFO security.JniBasedUnixGroupsMapping: Using
JniBasedUnixGroupsMapping for Group resolution
Input Path: TwitterSentimentAnalysis/VSMComp/tfidf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: /0/14: Value:
{30:2.8458266258239746,28:2.335000991821289,15:2.152679443359375,13:1.5465437173843384,10:2.8458266258239746,6:2.335000991821289,36:2.8458266258239746,35:3.617762565612793,3:1.9985288381576538,32:2.335000991821289,1:2.6375045776367188}
Key: /0/17: Value:
{29:2.5581445693969727,25:2.8458266258239746,19:2.152679443359375,14:1.641853928565979,38:2.5581445693969727,3:1.9985288381576538,33:1.641853928565979,32:2.335000991821289,1:1.864997386932373}
Key: /0/19: Value:
{35:2.5581445693969727,34:2.335000991821289,32:3.302190065383911,24:2.8458266258239746,21:4.598021030426025,18:2.152679443359375,17:2.8458266258239746,14:1.641853928565979,13:1.5465437173843384,11:5.691653251647949,3:1.9985288381576538,1:1.864997386932373,37:3.617762565612793}
Key: /0/21: Value:
{20:2.8458266258239746,18:2.152679443359375,14:1.641853928565979,13:1.5465437173843384,33:2.321932077407837,6:2.335000991821289,1:2.6375045776367188}
Key: /0/9: Value:
{3:1.9985288381576538,13:1.5465437173843384,34:2.335000991821289}
Key: /2/13: Value:
{12:2.8458266258239746,34:2.335000991821289,33:1.641853928565979,8:2.8458266258239746,7:2.8458266258239746,28:2.335000991821289}
Key: /4/10: Value:
{14:1.641853928565979,23:2.5581445693969727,22:2.5581445693969727,4:2.8458266258239746}
Key: /4/12: Value:
{19:2.152679443359375,39:2.8458266258239746,11:2.8458266258239746,28:2.335000991821289,27:2.5581445693969727,2:2.8458266258239746}
Key: /4/15: Value:
{28:2.335000991821289,24:2.8458266258239746,17:2.8458266258239746,12:2.8458266258239746,8:2.8458266258239746,7:2.8458266258239746,37:2.5581445693969727,1:1.864997386932373}
Key: /4/16: Value:
{25:2.8458266258239746,23:2.5581445693969727,18:2.152679443359375,38:2.5581445693969727,37:2.5581445693969727,34:2.335000991821289,1:1.864997386932373,31:2.8458266258239746}
Key: /4/18: Value:
{0:2.8458266258239746,29:2.5581445693969727,27:2.5581445693969727,35:2.5581445693969727,13:1.5465437173843384,36:2.8458266258239746,4:2.8458266258239746,2:2.8458266258239746,31:2.8458266258239746}
Key: /4/20: Value:
{19:2.152679443359375,18:2.152679443359375,13:1.5465437173843384}
Key: /4/22: Value:
{18:2.152679443359375,14:1.641853928565979,33:1.641853928565979}
Key: /4/3: Value:
{27:2.5581445693969727,22:2.5581445693969727,16:2.152679443359375,14:2.321932077407837,13:1.5465437173843384,6:2.335000991821289,33:2.321932077407837,32:2.335000991821289,0:2.8458266258239746}
Key: /4/4: Value:
{16:2.152679443359375,15:2.152679443359375,14:1.641853928565979,22:2.5581445693969727,19:2.152679443359375}
Key: /4/5: Value:
{29:2.5581445693969727,15:2.152679443359375,33:1.641853928565979}
Key: /4/6: Value:
{33:1.641853928565979,26:2.8458266258239746,23:2.5581445693969727,19:2.152679443359375,16:2.152679443359375,14:1.641853928565979,13:1.5465437173843384,9:4.598021030426025,5:2.8458266258239746,3:1.9985288381576538,39:2.8458266258239746,1:1.864997386932373}
Key: /4/7: Value:
{30:2.8458266258239746,16:2.152679443359375,15:2.152679443359375,13:2.187143087387085,10:2.8458266258239746,6:2.335000991821289,3:1.9985288381576538,33:1.641853928565979}
Key: /4/8: Value:
{26:2.8458266258239746,20:2.8458266258239746,16:2.152679443359375,15:2.152679443359375,14:1.641853928565979,13:1.5465437173843384,38:2.5581445693969727,5:2.8458266258239746,33:1.641853928565979}
Count: 19
13/06/21 10:55:12 INFO driver.MahoutDriver: Program took 1102 ms
(Minutes: 0.018366666666666667)
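(For reference, the dump above was produced with seqdumper; I haven't pasted the exact command, but reconstructed from the --input argument in the log it was along these lines, assuming the mahout script is on the PATH:)

```shell
# Dump the tf-idf sequence file; prints each Key/Value pair and a
# trailing "Count: N" line (19 records in this case).
mahout seqdumper \
  -i TwitterSentimentAnalysis/VSMComp/tfidf-vectors/part-r-00000
```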

  I ran:

mahout split \
  -i /user/ddas/TwitterSentimentAnalysis/VSMComp/tfidf-vectors \
  -te /user/ddas/TwitterSentimentAnalysis/TwitterTesting \
  -tr /user/ddas/TwitterSentimentAnalysis/TwitterTraining \
  -xm sequential -rp 40 -ow -seq

  yielding:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop and
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB: /opt/mapr/mahout/mahout-distribution-0.7/mahout-examples-0.7-mapr-job.jar
13/06/21 10:18:34 WARN driver.MahoutDriver: No split.props found on
classpath, will use command-line arguments only
13/06/21 10:18:35 INFO common.AbstractJob: Command line arguments:
{--endPhase=[2147483647],
--input=[/user/ddas/TwitterSentimentAnalysis/VSMComp/tfidf-vectors],
--method=[sequential], --overwrite=null, --randomSelectionPct=[40],
--sequenceFiles=null, --startPhase=[0], --tempDir=[temp],
--testOutput=[/user/ddas/TwitterSentimentAnalysis/TwitterTesting],
--trainingOutput=[/user/ddas/TwitterSentimentAnalysis/TwitterTraining]}
13/06/21 10:18:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/21 10:18:35 INFO security.JniBasedUnixGroupsMapping: Using
JniBasedUnixGroupsMapping for Group resolution
13/06/21 10:18:35 INFO common.HadoopUtil: Deleting
/user/ddas/TwitterSentimentAnalysis/TwitterTraining
13/06/21 10:18:35 INFO common.HadoopUtil: Deleting
/user/ddas/TwitterSentimentAnalysis/TwitterTesting
13/06/21 10:18:35 INFO utils.SplitInput: part-r-00000 has 16 lines
13/06/21 10:18:35 INFO utils.SplitInput: part-r-00000 test split size
is 6 based on random selection percentage 40
13/06/21 10:18:35 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
13/06/21 10:18:35 INFO compress.CodecPool: Got brand-new compressor
13/06/21 10:18:35 INFO compress.CodecPool: Got brand-new compressor
13/06/21 10:18:35 INFO utils.SplitInput: file: part-r-00000, input: 16
train: 13, test: 6 starting at 0
13/06/21 10:18:35 INFO driver.MahoutDriver: Program took 505 ms
(Minutes: 0.008416666666666666)
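(As a sanity check, assuming the two output directories are ordinary sequence files readable by seqdumper, the partition sizes can be compared against the 19 records dumped above by looking at seqdumper's trailing Count line:)

```shell
# Each seqdumper run ends with a "Count: N" line; the training and test
# counts should sum to the input's record count (19 here), but per the
# log above they come out as 13 + 6 = 19 against a reported 16 "lines".
mahout seqdumper -i /user/ddas/TwitterSentimentAnalysis/TwitterTraining | grep '^Count:'
mahout seqdumper -i /user/ddas/TwitterSentimentAnalysis/TwitterTesting  | grep '^Count:'
```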

  Note that split reports that my tfidf-vectors file part-r-00000 has
16 lines, not 19. This is a small reproduction of my real problem,
where split returns varying random partitions because it reads
different (incorrect) line counts from the tfidf-vectors file.
Sometimes that count is even larger than the total number of training
examples in the entire dataset. Can someone please point out what I am
doing wrong here?

  Thanks in advance!
