mahout-user mailing list archives

From Brian Dolan <buddha...@gmail.com>
Subject Re: Process UnStructured Data in Mahout for Clustering
Date Thu, 04 Dec 2014 16:54:53 GMT
My experience has been that it's best to leave the data processing to Python.  I strongly
suggest you rewrite your ETL there and let Mahout do only the clustering. The built-in
vectorization routines are fairly primitive.
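
As a rough illustration of doing the vectorization outside Mahout, here is a minimal TF-IDF sketch in plain Python. The function name `tfidf_vectors` and the exact weighting are assumptions made for the example, not anything Mahout prescribes, and converting the result into Mahout's input format is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors.

    docs: list of token lists.
    Returns (vocab, vectors), where vocab maps term -> column index and
    each vector is a dict of {column index: weight}.
    """
    # Stable vocabulary: terms sorted alphabetically.
    vocab = {term: i for i, term in enumerate(sorted({t for d in docs for t in d}))}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))

    vectors = []
    for d in docs:
        counts = Counter(d)
        vec = {}
        for term, count in counts.items():
            tf = count / len(d)
            # The +1 keeps terms that occur in every document from dropping to zero.
            idf = math.log(n_docs / df[term]) + 1.0
            vec[vocab[term]] = tf * idf
        vectors.append(vec)
    return vocab, vectors


docs = [["printer", "broken"], ["printer", "jam", "jam"]]
vocab, vectors = tfidf_vectors(docs)
```

Terms that are frequent in one document but rare across the corpus end up with the largest weights, which is usually what you want the clustering to see.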

Then I would wash the features (basically, set up your own list of stop words or phrases
and filter them out) before you let Mahout do anything.
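
A minimal sketch of that washing step, assuming plain English text. The stop lists and the `wash` function are hypothetical placeholders; you would fill the lists with whatever is noise in your own corpus.

```python
import re

# Hypothetical stop lists: replace with terms that are noise in your own data.
STOP_PHRASES = ["kind regards", "thanks in advance"]
STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of", "please", "hi"}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def wash(text, stop_words=STOP_WORDS, stop_phrases=STOP_PHRASES):
    """Lower-case the text, drop stop phrases, then drop stop words."""
    cleaned = text.lower()
    for phrase in stop_phrases:
        # Match the phrase on word boundaries, allowing any whitespace between its words.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned)
    # Tokenize on runs of letters/digits and filter out single stop words.
    return [tok for tok in TOKEN_RE.findall(cleaned) if tok not in stop_words]


tokens = wash("Hi, the printer is broken. Thanks in advance")
```

Phrases are removed before single words so that a multi-word stop phrase is still intact when it is matched.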

On Dec 4, 2014, at 8:38 AM, Shahid Shaikh <shaikhshahidg@gmail.com> wrote:

> Hey Donni, thanks, but I have used the configurations and obtained the
> clusters. The results are not promising enough. I was wondering whether there
> are any known techniques I can follow, specifically while generating vectors.
> 
> Thanks
> 
> On Thursday, December 4, 2014, Donni Khan <prince.donnii@googlemail.com>
> wrote:
>> Hi
>> It depends on the nature of the data you are clustering. If you have knowledge
>> about your data, you can interpret the results and you can also set the
>> correct parameters for the clustering algorithm, like the number of topics or
>> the number of clusters.
>> 
>> Cheers,
>> Donni
>> 
>> On Thu, Dec 4, 2014 at 2:38 PM, Shahid Shaikh <shaikhshahidg@gmail.com>
>> wrote:
>> 
>>> Hi All,
>>>   I have been trying Mahout clustering on unstructured data, i.e. data
>>> written by humans. I have tried Mahout clustering algorithms like
>>> K-means, Canopy + K-means, and LDA, but the results produced are not helpful.
>>> 
>>> I think the problem is with the way the data is written. Can someone please
>>> provide me some pointers on how to proceed with unstructured data for
>>> clustering?
>>> 
>>> 
>>> I have also written an analyzer that uses lower-case and stop-word filters.
>>> 
>>> thanks :)
>>> 
>>> 
>>> Regards,
>>> Shaikh Shahid G .
>>> +91 9503954781
>>> 
>> 
> 
> -- 
> Regards,
> Shaikh Shahid G .
> +91 9503954781

