mahout-user mailing list archives

From Andy Schlaikjer <>
Subject Re: Converting one large text file with multiple documents to SequenceFile format
Date Fri, 02 Nov 2012 01:42:57 GMT
Lots of ways to do this, but I'd use Pig plus elephant-bird-pig to (1) load
the data from TSV format into Pig and (2) convert the Pig tuples to
writables and store them in a SequenceFile:

-- params
%default MY_DATA_FILE '/path/to/docs.tsv';
%default OUTPUT_PATH '/path/to/output';

-- constants
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';

-- pull in EB
REGISTER '/path/to/elephant-bird-pig.jar';

-- load data
doc = LOAD '$MY_DATA_FILE' USING PigStorage() AS (doc_id: long, text: chararray);

-- store data as a SequenceFile of (LongWritable, Text) pairs
STORE doc INTO '$OUTPUT_PATH' USING com.twitter.elephantbird.pig.store.SequenceFileStorage(
    '-c com.twitter.elephantbird.pig.util.LongWritableConverter',
    '-c $TEXT_CONVERTER');
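If you'd rather keep using Mahout's seqdirectory command instead, another option is to split the big file back into one small file per document first, so seqdirectory can ingest the directory as usual. A minimal stdlib Python sketch (the split_tsv_to_docs name and paths are my own invention; for hundreds of thousands of documents the Pig route above will scale better):

```python
import os


def split_tsv_to_docs(tsv_path, out_dir):
    """Split a docId<TAB>text file into one file per document,
    named after the doc id, for consumption by seqdirectory."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            doc_id, _, text = line.partition("\t")
            with open(os.path.join(out_dir, doc_id), "w", encoding="utf-8") as out:
                out.write(text)
            count += 1
    return count
```

Then point `mahout seqdirectory -i <out_dir> ...` at the resulting directory as before.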

On Tue, Oct 30, 2012 at 1:00 PM, Nick Woodward <> wrote:

> I have done a lot of searching on the web for this, but I've found
> nothing, even though I feel like it has to be somewhat common. I have used
> Mahout's 'seqdirectory' command to convert a folder containing text files
> (each file is a separate document) in the past. But in this case there are
> so many documents (in the 100,000s) that I have one very large text file in
> which each line is a document. How can I convert this large file to
> SequenceFile format so that Mahout understands that each line should be
> considered a separate document? Would it be better if the file was
> structured like so?
>
>     docId1 {tab} document text
>     docId2 {tab} document text
>     docId3 {tab} document text
>     ...
>
> Thank you very much for any help.
> Nick
