mahout-user mailing list archives

From Sarang Deshpande <sar...@Shopzilla.com>
Subject How to convert input key:value text format to mahout digestible format
Date Tue, 25 Sep 2012 18:56:23 GMT
Dear mahout users,
I am trying to use the Bayes classifier from the Mahout 0.7 distribution. My training set is a
text file with one document per line: the first entry on each line is the label (key) and the
rest is the evidence (value = document contents). In Mahout 0.5, the trainclassifier command
accepted a directory of files in this format as input, but in Mahout 0.7 the seqdirectory
command needs an input directory with one file per document. My training set contains
millions of small documents, so I am trying to avoid having millions of tiny files on HDFS.
Is there an easy way to convert files in the above format into sequence files that the
seq2sparse command can then consume?

Thanks much
~Sarang
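
For illustration, the input format described above could be split into key/value pairs roughly as follows. This is only a sketch of the parsing step, not a confirmed Mahout recipe. It assumes the label is separated from the document text by whitespace (the message does not say which separator is used), and it assumes a key of the form /label/docId, which is a guess at the convention the downstream tools expect and should be checked against the Mahout 0.7 sources. The class name LabeledLineParser and the method toEntry are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: turns one "label text..." line into a (key, value) pair.
final class LabeledLineParser {

    // Assumption: the first whitespace-delimited token is the label and
    // everything after it is the document contents.
    // Assumption: the key layout "/label/docId" is illustrative only.
    static Map.Entry<String, String> toEntry(String line, long docId) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("empty line, no label found");
        }
        String label = parts[0];
        // A line that holds only a label yields an empty document.
        String text = parts.length > 1 ? parts[1] : "";
        return new SimpleEntry<>("/" + label + "/" + docId, text);
    }
}
```

Each resulting pair would then be appended to a single Hadoop sequence file with text keys and text values, so that millions of documents end up in a few large files rather than one file each. That writing step is not shown here because it depends on the Hadoop version in use.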

