mahout-user mailing list archives

From Frank Scholten <fr...@frankscholten.nl>
Subject Re: Annotation based vectorizer
Date Mon, 03 Feb 2014 21:53:33 GMT
The second field of NewsgroupPost should be called bodyText, of course.


On Mon, Feb 3, 2014 at 10:52 PM, Frank Scholten <frank@frankscholten.nl> wrote:

> Hi all,
>
> I put together a utility which vectorizes plain old Java objects annotated
> with @Feature and @Target via Mahout's vector encoders.
>
> See my Github branch:
> https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer
>
> and the unit test:
> https://github.com/frankscholten/mahout/blob/annotation-based-vectorizer/core/src/test/java/org/apache/mahout/classifier/sgd/AnnotationBasedVectorizerTest.java
>
> Use it like this:
>
> class NewsgroupPost {
>
>   @Target
>   private String newsgroup;
>
>   @Feature(encoder = TextValueEncoder.class)
>   private String newsgroup;
>
>   // Getters & setters
>
> }
>
> AnnotationBasedVectorizer<NewsgroupPost> vectorizer = new
> AnnotationBasedVectorizer<NewsgroupPost>(new
> TypeReference<NewsgroupPost>(){});
>
> Here the vectorizer scans the NewsgroupPost's annotations. Then you can do
> this:
>
> NewsgroupPost post = ...
>
> Vector vector = vectorizer.vectorize(post);
> int target = vectorizer.getTarget(post);
> int numFeatures = vectorizer.getNumberOfFeatures();
>
> Note that the vectorize() and getTarget() methods are generically typed.
> Because of the type token passed to the constructor, we can enforce that
> only NewsgroupPosts are accepted.
>
> The vectorizer uses a Dictionary for encoding the target.
>
> Thoughts?
>
> Cheers,
>
> Frank
>
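
For readers without the branch checked out, the mechanism described above can be sketched roughly as follows. This is not the code from the branch and it does not use Mahout: the Feature and Target annotations here are local stand-ins, SketchVectorizer is a hypothetical name, a plain double[] replaces Mahout's Vector, a simple hashing trick replaces the vector encoders, a HashMap replaces the Dictionary, and a Class<T> token replaces the TypeReference. It only illustrates the idea of scanning annotated fields once and then encoding features and targets from them.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the @Feature and @Target annotations in the branch.
@Retention(RetentionPolicy.RUNTIME)
@interface Feature {}

@Retention(RetentionPolicy.RUNTIME)
@interface Target {}

// Example POJO, using the corrected field name bodyText.
final class NewsgroupPost {
    @Target
    private final String newsgroup;

    @Feature
    private final String bodyText;

    NewsgroupPost(String newsgroup, String bodyText) {
        this.newsgroup = newsgroup;
        this.bodyText = bodyText;
    }
}

// Assumed sketch: scans annotations once, then hashes tokens into a fixed-size array.
final class SketchVectorizer<T> {
    private final List<Field> featureFields = new ArrayList<>();
    private final Map<String, Integer> dictionary = new HashMap<>();
    private final int numFeatures;
    private Field targetField;

    SketchVectorizer(Class<T> type, int numFeatures) {
        this.numFeatures = numFeatures;
        for (Field field : type.getDeclaredFields()) {
            if (field.isAnnotationPresent(Feature.class)) {
                field.setAccessible(true);
                featureFields.add(field);
            }
            if (field.isAnnotationPresent(Target.class)) {
                field.setAccessible(true);
                targetField = field;
            }
        }
        if (targetField == null) {
            throw new IllegalArgumentException("No @Target field on " + type.getName());
        }
    }

    // Adds 1.0 at a hashed index for every whitespace-separated token of each feature.
    double[] vectorize(T object) {
        double[] vector = new double[numFeatures];
        try {
            for (Field field : featureFields) {
                Object value = field.get(object);
                if (value == null) {
                    continue;
                }
                for (String token : value.toString().toLowerCase().split("\\s+")) {
                    if (token.isEmpty()) {
                        continue;
                    }
                    String key = field.getName() + ":" + token;
                    vector[Math.floorMod(key.hashCode(), numFeatures)] += 1.0;
                }
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return vector;
    }

    // Maps each distinct target label to the next free integer, like a dictionary.
    int getTarget(T object) {
        try {
            String label = String.valueOf(targetField.get(object));
            Integer id = dictionary.get(label);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(label, id);
            }
            return id;
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    int getNumberOfFeatures() {
        return numFeatures;
    }
}
```

With this sketch, new SketchVectorizer<>(NewsgroupPost.class, 16) followed by vectorize(post) and getTarget(post) mirrors the calls shown in the message. Because the class is parameterized on T, passing anything other than a NewsgroupPost fails at compile time.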
