mahout-user mailing list archives

From Andy Schlaikjer <andrew.schlaik...@gmail.com>
Subject Re: Converting one large text file with multiple documents to SequenceFile format
Date Fri, 02 Nov 2012 01:42:57 GMT
Lots of ways to do this, but I'd use pig + elephant-bird-pig to (1) load
the data from tsv format into pig and (2) convert from pig tuples to
writables and store in sequence file:

{code}
-- params
%default MY_DATA_FILE '/path/to/docs.tsv';
%default OUTPUT_PATH '/path/to/output';

-- constants
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';

-- pull in EB
REGISTER '/path/to/elephant-bird-pig.jar';

-- load data
doc = LOAD '$MY_DATA_FILE' USING PigStorage AS (doc_id: long, text: chararray);

-- store data
rmf '$OUTPUT_PATH'
STORE doc INTO '$OUTPUT_PATH' USING $SEQFILE_STORAGE (
  '-c $LONG_CONVERTER', '-c $TEXT_CONVERTER'
);
{code}
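
To sanity-check the input before running the job, here is a minimal Python sketch of the record layout the LOAD statement above expects: one document per line, a numeric id, a tab, then the text. The function name parse_docs and the sample data are hypothetical and are not part of Pig or elephant-bird. One assumed difference: this sketch splits only on the first tab, whereas PigStorage splits on every tab, so text that itself contains tabs may not load as a single field in Pig.

```python
import io


def parse_docs(lines):
    """Yield (doc_id, text) pairs from lines of the form 'docId<TAB>text'.

    Hypothetical helper for checking input; not part of Pig or elephant-bird.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only; the rest of the line is the document text.
        doc_id, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: no tab separator found")
        # int() raises ValueError if the id is not numeric, matching doc_id: long.
        yield int(doc_id), text


# Made-up sample input standing in for docs.tsv.
sample = io.StringIO("1\tfirst document text\n2\tsecond document text\n")
records = list(parse_docs(sample))
print(records)
```

Running it over the real file (for example, for doc_id, text in parse_docs(open(path))) would surface lines with a missing tab or a non-numeric id before they reach the Pig job.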

https://github.com/kevinweil/elephant-bird/

Andy


On Tue, Oct 30, 2012 at 1:00 PM, Nick Woodward <levar1@hotmail.com> wrote:

>
> I have done a lot of searching on the web for this, but I've found
> nothing, even though I feel like it has to be somewhat common. I have used
> Mahout's 'seqdirectory' command to convert a folder containing text files
> (each file is a separate document) in the past. But in this case there are
> so many documents (in the 100,000s) that I have one very large text file in
> which each line is a document. How can I convert this large file to
> SequenceFile format so that Mahout understands that each line should be
> considered a separate document?  Would it be better if the file was
> structured like so:
>
> docId1 {tab} document text
> docId2 {tab} document text
> docId3 {tab} document text
> ...
>
> Thank you very much for any help.
> Nick
>
