mahout-user mailing list archives

From: David Rahman <drahman1...@googlemail.com>
Subject: Re: text classification using mahout and lucene index
Date: Thu, 03 Nov 2011 14:50:21 GMT
Hi,

it took some time. We used an older version of lucene; it's not the same
as the one in Mahout. So before we create a new lucene index of the data,
we will try another approach using the xml-data. I looked into the
wikipedia example and I have a few questions:

1. The first step of the example is to chunk the data into pieces. Is this
necessary, given that I already have the data in pieces? Each xml-file
contains ~1000 documents, and I want to use ~250 xml-files in a first test.
Could I just put the existing xml-files into an HDFS folder in Hadoop?
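
(A minimal sketch of what I have in mind, using the standard Hadoop
FileSystem API; the paths are made up:)

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class UploadChunks {
    public static void main(String[] args) throws IOException {
      FileSystem fs = FileSystem.get(new Configuration());
      // copy the existing ~250 xml-files into the folder the
      // dataset-creator job would then read from
      fs.copyFromLocalFile(new Path("/local/data/xml"),
                           new Path("/user/david/wikipediainput"));
      fs.close();
    }
  }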

2. The second step runs the wikipediaDataSetCreator on the chunk files
(chunk-****.xml). I found the WikipediaDataSetCreatorDriver, -Mapper and
-Reducer. Can someone explain how they work? For example, I don't
understand how the label (the category "country") is selected. In my case
there would also be more than one label per document.
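
(My current understanding of the mapper, as a simplified sketch; this is
NOT the actual Mahout source, and the configuration key is made up. For
multi-label I would emit the document once per matching category:)

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class MultiLabelDatasetMapper extends Mapper<Object, Text, Text, Text> {

    // matches the [[Category:...]] markup inside a wikipedia document
    private static final Pattern CATEGORY =
        Pattern.compile("\\[\\[Category:([^\\]|]*)");
    private Set<String> labels;

    @Override
    protected void setup(Context ctx) {
      // label list passed in through the job configuration
      labels = new HashSet<String>(Arrays.asList(
          ctx.getConfiguration().getStrings("dataset.labels", new String[0])));
    }

    @Override
    protected void map(Object key, Text doc, Context ctx)
        throws IOException, InterruptedException {
      Matcher m = CATEGORY.matcher(doc.toString());
      while (m.find()) {
        String label = m.group(1).trim().toLowerCase();
        if (labels.contains(label)) {
          ctx.write(new Text(label), doc); // one (label, document) pair per match
        }
      }
    }
  }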

3. And in the third step the classifier is trained; here I would use
Complementary Bayes. When I test the classifier, I would also need all
possible candidates (not only the top one). How can I list all possible
candidates with their weights? I only found a way to list the top
candidates; did I miss something?
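
(The closest I found is a classifyDocument variant that takes a numResults
argument; if I pass the total number of labels, every candidate should come
back with its score. A rough, untested sketch; the package and class names
are how I read the Bayes API and may well be wrong:)

  import org.apache.mahout.classifier.ClassifierResult;
  import org.apache.mahout.classifier.bayes.algorithm.CBayesAlgorithm;
  import org.apache.mahout.classifier.bayes.common.BayesParameters;
  import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
  import org.apache.mahout.classifier.bayes.model.ClassifierContext;

  public class ListAllCandidates {
    public static void main(String[] args) throws Exception {
      BayesParameters params = new BayesParameters(1);  // gram size 1
      params.setBasePath("wikipediamodel");             // made-up model path
      ClassifierContext ctx = new ClassifierContext(
          new CBayesAlgorithm(), new InMemoryBayesDatastore(params));
      ctx.initialize();

      String[] doc = {"some", "tokenized", "test", "document"};
      int numLabels = 12;  // however many labels were trained
      // ask for every label, not just the best one
      for (ClassifierResult r : ctx.classifyDocument(doc, "unknown", numLabels)) {
        System.out.println(r.getLabel() + "\t" + r.getScore());
      }
    }
  }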

But overall it should be the same as the wikipedia example, only with more
labels (xml + text + possible categories).

Thanks and regards,
David

2011/10/18 David Rahman <drahman1985@googlemail.com>

> Hi,
>
> thanks for your directions. Right now I can't tell you how the indexes
> were made, because I just got the resulting data. I first have to find
> the person responsible, which will take some time. Also, I have a lot to
> learn regarding Lucene. As soon as I have the information I will write
> back, maybe in a week or two.
>
> But your points are helping me a lot, so thanks again!
>
> Regards,
> David
>
>
> 2011/10/18 Lance Norskog <goksron@gmail.com>
>
>> Yes, you can try starting the job at the right place. (I did not write the
>> script.)
>>
>> A few points:
>> a. The Lucene version in Mahout is given in the pom.xml file. Your index
>> has to be made with the same version of Lucene (a quick check is in the
>> snippet below).
>> b. I don't know exactly what the index "schema" is in Mahout. To learn
>> this, you need to create indexes with the sample job and examine them in
>> Luke (a UI browser for Lucene). You would compare this design to your
>> index.
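>>
>> For a., a quick way to see which Lucene is actually on the classpath
>> (plain java.lang.Package API, nothing Mahout-specific):
>>
>>   // prints e.g. "3.4.0"; compare with the lucene version in the pom.xml
>>   System.out.println(
>>       org.apache.lucene.LucenePackage.get().getImplementationVersion());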
>>
>> The Lucene index reading code is in the appropriate job. I don't have the
>> source in front of me. The "right thing to do" would be to enhance the
>> existing Lucene index reader to take a Lucene query from the command line.
>> This would open up integration options.
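>>
>> As a starting point, selecting a document subset with a query looks
>> roughly like this (a sketch against the Lucene 3.x API; the field name
>> "contents" and the Version constant are assumptions to adjust):
>>
>>   import java.io.File;
>>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>   import org.apache.lucene.index.IndexReader;
>>   import org.apache.lucene.queryParser.QueryParser;
>>   import org.apache.lucene.search.IndexSearcher;
>>   import org.apache.lucene.search.Query;
>>   import org.apache.lucene.search.ScoreDoc;
>>   import org.apache.lucene.search.TopDocs;
>>   import org.apache.lucene.store.FSDirectory;
>>   import org.apache.lucene.util.Version;
>>
>>   public class QuerySubset {
>>     public static void main(String[] args) throws Exception {
>>       // args[0] = index directory, args[1] = query from the command line
>>       IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
>>       IndexSearcher searcher = new IndexSearcher(reader);
>>       Query q = new QueryParser(Version.LUCENE_34, "contents",
>>           new StandardAnalyzer(Version.LUCENE_34)).parse(args[1]);
>>       TopDocs hits = searcher.search(q, reader.maxDoc());
>>       for (ScoreDoc hit : hits.scoreDocs) {
>>         System.out.println(hit.doc);  // document ids to hand to the vectorizer
>>       }
>>       searcher.close();
>>       reader.close();
>>     }
>>   }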
>>
>> How are your indexes made?
>>
>> On Tue, Oct 18, 2011 at 2:52 AM, David Rahman
>> <drahman1985@googlemail.com> wrote:
>>
>> > Hi,
>> >
>> > thank you Lance. I'm going through your example. At first look, you
>> > are creating those sequence files, and from those you create the
>> > vectors. Then you start the training by choosing one algorithm.
>> > Since I already have the data as a lucene index / vectors, I could
>> > skip the first part and start right away with the training, after
>> > reading the data, or am I wrong here?
>> >
>> > Also, is there some sample code for reading a lucene index (looking
>> > for a function something like "readLucene(indexfile)")?
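>> >
>> > Just to make concrete what I mean, a rough sketch (Lucene 3.x API,
>> > untested; the field name "contents" is made up):
>> >
>> >   import java.io.File;
>> >   import org.apache.lucene.index.IndexReader;
>> >   import org.apache.lucene.index.TermFreqVector;
>> >   import org.apache.lucene.store.FSDirectory;
>> >
>> >   public class ReadLucene {
>> >     public static void main(String[] args) throws Exception {
>> >       IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
>> >       for (int doc = 0; doc < reader.maxDoc(); doc++) {
>> >         if (reader.isDeleted(doc)) continue;
>> >         TermFreqVector tfv = reader.getTermFreqVector(doc, "contents");
>> >         if (tfv == null) continue;  // index built without term vectors
>> >         String[] terms = tfv.getTerms();
>> >         int[] freqs = tfv.getTermFrequencies();
>> >         for (int i = 0; i < terms.length; i++) {
>> >           System.out.println(doc + "\t" + terms[i] + "\t" + freqs[i]);
>> >         }
>> >       }
>> >       reader.close();
>> >     }
>> >   }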
>> >
>> > Regards,
>> > David
>> >
>> > 2011/10/15 Lance Norskog <goksron@gmail.com>
>> >
>> > > If you are using the trunk, look at examples/bin/build-asf-email.sh.
>> > > This does the "three C's": classification, clustering, and
>> > > collaborative filtering, all on an archive of apache.org mailing
>> > > lists.
>> > >
>> > > The 'classification' path at the end goes through the high-level
>> > > jobs. It should show you how to get to where you want to go. You may
>> > > have to write alternate code to read your Lucene index.
>> > >
>> > > Lance
>> > >
>> > > On Fri, Oct 14, 2011 at 5:17 AM, David Rahman
>> > > <drahman1985@googlemail.com> wrote:
>> > >
>> > > > Ok, I discovered that I have to check whether my data contains
>> > > > TermFreq vectors. That has to wait until next week, I think...
>> > > >
>> > > > Do I have to convert the lucene index files into lucene vector
>> > > > files in order to use the data for training?
>> > > >
>> > > > Regards,
>> > > > David
>> > > >
>> > > > 2011/10/14 David Rahman <drahman1985@googlemail.com>
>> > > >
>> > > > > Ok, thanks.
>> > > > > Just to make it clear to me: I take the data with the lucene
>> > > > > vectors and run a training algorithm on them, and this should
>> > > > > result in a model. I don't need any preprocessing steps or
>> > > > > anything else?
>> > > > >
>> > > > > Another question: your book MiA gives a good explanation and
>> > > > > overview of mahout. Can you tell me if there is more coming
>> > > > > about mahout+lucene? I'm new to this stuff, and I need some
>> > > > > more reading.
>> > > > >
>> > > > > I did find "Taming Text", but from the abstract I could not
>> > > > > determine if it applies to my problem.
>> > > > >
>> > > > > Thanks and regards,
>> > > > > David
>> > > > >
>> > > > > take lucene vectors --> train on them with nBayes or another
>> > > > > algorithm --> get a model
>> > > > >
>> > > > >
>> > > > > 2011/10/13 Ted Dunning <ted.dunning@gmail.com>
>> > > > >
>> > > > >> I just meant that there are separate components to do the
>> > > > >> different steps. Historically, some glue code was required
>> > > > >> between them, but I think that the gap has been narrowed
>> > > > >> lately.
>> > > > >>
>> > > > >> On Thu, Oct 13, 2011 at 12:41 PM, David Rahman
>> > > > >> <drahman1985@googlemail.com> wrote:
>> > > > >>
>> > > > >> > @Ted: Could you explain the last part of your response,
>> > > > >> > please? I didn't understand it:
>> > > > >> >
>> > > > >> > >You will need to glue the lucene document vector extraction
>> > > > >> > >to the naive bayes and you may want to adapt it to use
>> > > > >> > >feature hashing for the SGD classifiers.
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Lance Norskog
>> > > goksron@gmail.com
>> > >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
>
