mahout-user mailing list archives

From David Kaplan <>
Subject Re: Mahout Clustering Help Please
Date Thu, 13 Aug 2015 13:53:01 GMT
Hi Pat,

Thanks for the reply. Yes, I think there are a lot of problems.

There are 4 data sources, and each uses different categorisation
conventions, some one-level, some multilevel,
so I basically picked one source that covers about 500K of the entries out of 3

I do have the prices. The data is separated in Solr, so I can extract
title, category and price.
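
With title, category and price available per entry, one possible way to flag
price outliers inside a single category is a median/MAD check. This is only a
sketch under assumptions: the prices below are made up for illustration, the
function name flag_outliers is hypothetical, and the 0.6745 constant with a
3.5 cutoff is a common rule of thumb for the modified z-score, not something
Mahout or Solr provides.

```python
from statistics import median

def flag_outliers(prices, threshold=3.5):
    """Return the prices whose modified z-score exceeds the threshold."""
    med = median(prices)
    # Median absolute deviation: a spread estimate that a few extreme
    # values (e.g. cheap accessories among phones) cannot easily distort.
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return []  # no spread to measure against
    return [p for p in prices if 0.6745 * abs(p - med) / mad > threshold]

# Made-up prices for illustration only: phone listings with one
# accessory-priced entry mixed in.
by_category = {
    "phones (hypothetical)": [4500.0, 4600.0, 4700.0, 4800.0, 4900.0, 500.0],
}
for category, prices in by_category.items():
    print(category, flag_outliers(prices))
```

Running the check per category rather than per search term is the point of
the sketch: an accessory price only looks extreme once it is compared against
phones, so the grouping matters more than the exact statistic.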

My confusion is in trying to work out classifier vs clustering. As I
understand it, clustering is for when you don't
have labelled data, but I do for some. Am I looking for a hybrid
classifier/clustering approach (k-means), or is just an SVM sufficient?

To make matters more complicated, there are categories and then
sub-categories, e.g. "Cell Phones & Accessories" => "Accessories".
I don't know if that means I have to train separate models?

Example data snippet:

"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 -
White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories -
Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking -
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0

Perhaps I'm overcomplicating the problem...
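
For reference, a minimal sketch of how the complete rows in the snippet above
could be parsed and split into a top-level category and a sub-category. It
assumes that " - " separates exactly two category levels, which may not hold
for every source; the name parse_rows is hypothetical, and the truncated
"Orange PLA" row is left out because its category and price are missing.

```python
import csv
import io

# Complete rows from the snippet above, with the email line-wrapping undone.
RAW = '''"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0
'''

def parse_rows(text):
    """Yield (title, top_category, sub_category, price) tuples."""
    for title, category, price in csv.reader(io.StringIO(text)):
        # Assumption: the first " - " separates the two category levels.
        top, _, sub = category.partition(" - ")
        yield title, top, sub, float(price)

rows = list(parse_rows(RAW))
for title, top, sub, price in rows:
    print(f"{top!r} / {sub!r}: {price}")
```

Splitting only the category field matters here, since titles such as
"... iPhone 5 - White" also contain " - " and should stay intact.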

Many thanks,

On Thu, Aug 13, 2015 at 3:35 AM, Pat Ferrel <> wrote:

> You have a lot of problems to solve here.
> 1) can you find the price? Is it in text or in structured data? If text
> you have an NLP problem. You can use regex for price.
> 2) how do you associate a price with the object? There may be several
> money amounts in the ad. Some do this with proximity, i.e. how many
> characters away from the item id the price is.
> 3) can you find the item id? Some say iphone, some iPhone, some "iPhone 6
> plus", and it gets worse for things with lots of numbers and modifiers in
> the name like "super whiz bang deLux 5G XLS". The right level of
> de-duplication vs fragmentation is a deep and hard problem.
> How much is an NLP problem and what structure does the data have? Unless I
> misunderstand your problem, extracting the data will be the hardest part
> and not something Mahout can help with.
> On Aug 12, 2015, at 5:49 AM, David Kaplan <> wrote:
> Hi all,
> Hope someone can please point me in the right direction,
> Very new to mahout..
> Here's my scenario:
> I have written a system that collects Classifieds items from multiple
> websites (phones, cars, antiques and many more) using Scrapy. All the items
> are then ingested into Solr, roughly 3 million entries.
> This is then the backend for my search engine.
> I want to be able to extract meaningful information to accurately
> calculate realistic price averages etc. I need guidance, perhaps examples, in
> accurate outlier detection, categorization etc. I'm an extreme beginner in
> machine learning, so I need to know if that's what I should be using.
> Part of my challenge is the broad range of items/categories, different
> levels of skewed data etc. e.g. finding outliers with "iphone" results when
> many of those are cheap iphone accessories.
> Basically it seems I need to cluster/classify, but I'm not sure exactly how to
> go about it, because I do already have the categories for 500K of the
> entries, example category "Cell Phones & Accessories - Accessories".
> And then actually connecting Mahout to Solr...
> Many thanks!
> David
