mahout-user mailing list archives

From Nick Woodward
Subject Converting one large text file with multiple documents to SequenceFile format
Date Tue, 30 Oct 2012 20:00:59 GMT

I have searched the web for this at length and found nothing, even though it seems
like it should be a fairly common task. In the past I have used Mahout's 'seqdirectory'
command to convert a folder of text files (each file a separate document). In this
case, though, there are so many documents (hundreds of thousands) that I have one very
large text file in which each line is a document. How can I convert this large file to
SequenceFile format so that Mahout treats each line as a separate document? Would it
be better if the file were structured like this?

docId1 {tab} document text
docId2 {tab} document text
docId3 {tab} document text
...
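
To make the one-line-per-document layout concrete, here is a minimal Python sketch of how each line could map to one (key, value) record. It covers only the parsing step. The function name `iter_documents` and the fallback ids of the form `line-N` are hypothetical, chosen for illustration. Writing the pairs out is not shown: that would presumably go through Hadoop's SequenceFile writer with Text keys and Text values (the types 'seqdirectory' produces), which is an assumption on my part about what the downstream Mahout jobs expect.

```python
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield one (doc_id, text) pair per non-empty input line.

    Assumed layout: "docId<TAB>document text". Lines with no tab (or an
    empty id) get a synthetic id based on their line number.
    """
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines entirely
        # Split on the first tab only, so tabs inside the text are preserved.
        doc_id, sep, text = line.partition("\t")
        if not sep or not doc_id.strip():
            # Hypothetical fallback: no usable id, so the whole line is the text.
            doc_id, text = f"line-{line_no}", line
        yield doc_id.strip(), text.strip()


if __name__ == "__main__":
    sample = ["docId1\tfirst document text\n", "\n", "a line without an id\n"]
    for doc_id, text in iter_documents(sample):
        print(doc_id, "->", text)
```

Each yielded pair would then become one SequenceFile record, with the document id as the key and the document text as the value.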

Thank you very much for any help.

Nick