mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Mahout Clustering Help Please
Date Thu, 13 Aug 2015 01:35:34 GMT
You have a lot of problems to solve here.

1) Can you find the price? Is it in text or in structured data? If it is in text, you have
an NLP problem, though a regex can pick out prices.
2) How do you associate a price with the object? There may be several money amounts in the
ad. Some people do this by proximity: how many characters away from the item id is the price?
3) Can you find the item id? Some ads say iphone, some iPhone, some "iPhone 6 plus", and it
gets worse for things with lots of numbers and modifiers in the name, like "super whiz bang
deLux 5G XLS". The right level of de-duplication vs. fragmentation is a deep and hard problem.
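
As a rough illustration of points 1 to 3, here is a minimal Python sketch, not a tested extraction pipeline. It assumes prices are written with a leading "$" and that the item can be located with a hand-written pattern. The names PRICE_RE, find_prices, price_nearest_to and normalize_item are hypothetical and made up for this example.

```python
import re

# Assumption: prices look like "$450", "$ 450", "$1,200" or "$5.00".
# Other currencies and formats would need their own patterns.
PRICE_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{2}))?")


def find_prices(text):
    """Return a list of (char_offset, amount) for each dollar amount in text."""
    results = []
    for m in PRICE_RE.finditer(text):
        whole = m.group(1).replace(",", "")
        cents = m.group(2) or "00"
        results.append((m.start(), float(f"{whole}.{cents}")))
    return results


def price_nearest_to(text, item_pattern):
    """Proximity heuristic: the amount closest (in characters) to the item mention."""
    item = re.search(item_pattern, text, re.IGNORECASE)
    prices = find_prices(text)
    if item is None or not prices:
        return None
    return min(prices, key=lambda p: abs(p[0] - item.start()))[1]


def normalize_item(name):
    """Easy part of de-duplication only: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


ad = "iPhone 6 Plus 64GB for $450, comes with a case worth $15. Shipping $5.00"
print(price_nearest_to(ad, r"iphone\s*6\s*plus"))  # 450.0
print(normalize_item("iPhone  6 PLUS"))            # iphone 6 plus
```

The proximity heuristic is easy to fool: in an ad such as "Case $15 included. iPhone 6 plus, asking $1,200.00" the nearest amount to the item mention is the case price, so treat the result as a candidate rather than an answer.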

How much of this is an NLP problem, and what structure does the data have? Unless I
misunderstand your problem, extracting the data will be the hardest part, and it is not
something Mahout can help with.

On Aug 12, 2015, at 5:49 AM, David Kaplan <davkap92@gmail.com> wrote:

Hi all,
I hope someone can point me in the right direction.
I am very new to Mahout.
Here's my scenario:

I have written a system that uses scrapy to collect classifieds items from
multiple websites (phones, cars, antiques and many more). All the items are
then ingested into Solr, about 3 million entries.
This is the backend for my search engine.

I want to be able to extract meaningful information so that I can
calculate realistic average prices and so on. I need guidance, and perhaps
examples, on accurate outlier detection, categorization, etc. I am an extreme
beginner in machine learning, so I need to know if that is what I should be using.

Part of my challenge is the broad range of items and categories, different
levels of skew in the data, etc. For example, finding outliers in "iphone"
results when many of those results are cheap iPhone accessories.

Basically, it seems I need to cluster or classify, but I am not sure exactly
how to go about it, because I already have categories for 500K of the
entries, for example the category "Cell Phones & Accessories - Accessories".

And then there is actually connecting Mahout to Solr...

Many thanks!
David

