mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: Cluster dumper crashes when run on a large dataset
Date Fri, 04 Nov 2011 06:21:47 GMT
Such big data would need to run on a Hadoop cluster.

Right now, I think there is no utility which can collect the data in the 
form you want. You will have to read it record by record, grouping the 
vectors that belong to the same cluster. It would be good if you could 
write the output to the file system incrementally, as that would get rid 
of the memory problem.
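
For illustration, here is a minimal sketch of that incremental approach. It 
assumes the clusteredPoints directory contains SequenceFiles keyed by cluster 
id (IntWritable) with WeightedVectorWritable values, which is what 
ClusterDumper reads; the class name, the local output directory, and the 
per-cluster file layout are made up for this example.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

/**
 * Sketch only: stream the clustered points one record at a time and append
 * each vector to a per-cluster text file on local disk, instead of building
 * the whole cluster-id -> points map in memory the way
 * ClusterDumper.readPoints() does.
 */
public class IncrementalClusterDump {

  public static void dump(Path pointsDir, String localOutDir) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);

    // Only one writer per cluster id is held in memory, never the vectors.
    // With a very large number of clusters you may need to close and reopen
    // writers in append mode instead of keeping them all open.
    Map<Integer, BufferedWriter> writers = new HashMap<Integer, BufferedWriter>();
    try {
      for (FileStatus status : fs.listStatus(pointsDir)) {
        String name = status.getPath().getName();
        if (name.startsWith("_") || name.startsWith(".")) {
          continue; // skip _SUCCESS, _logs and hidden files
        }
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        try {
          IntWritable clusterId = new IntWritable();
          WeightedVectorWritable point = new WeightedVectorWritable();
          while (reader.next(clusterId, point)) {
            BufferedWriter out = writers.get(clusterId.get());
            if (out == null) {
              out = new BufferedWriter(
                  new FileWriter(localOutDir + "/cluster-" + clusterId.get() + ".txt"));
              writers.put(clusterId.get(), out);
            }
            out.write(point.getVector().asFormatString());
            out.newLine();
          }
        } finally {
          reader.close();
        }
      }
    } finally {
      for (BufferedWriter out : writers.values()) {
        out.close();
      }
    }
  }
}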

Or, try CanopyDriver with clusterFilter > 0, which might reduce the number 
of clusters you get as output and, in turn, lower memory usage.
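
A hedged sketch of that CanopyDriver call follows. The exact signature 
differs between Mahout versions, so the buildClusters overload used here, 
the distance measure, the t1/t2 thresholds, and the clusterFilter value of 
10 are all assumptions to check against your version; the point is only 
that clusterFilter > 0 drops canopies containing fewer than that many 
points.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class CanopyWithClusterFilter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("testdata/vectors");   // assumed input path
    Path output = new Path("output/canopies");   // assumed output path

    // Assumed overload: t1/t2 and the clusterFilter value (10) are
    // placeholders to tune for your data. Canopies that collect fewer than
    // clusterFilter points are discarded, so fewer clusters reach the
    // dumper and memory usage goes down.
    CanopyDriver.buildClusters(conf, input, output,
        new EuclideanDistanceMeasure(), 3.0, 1.5, 10, false);
  }
}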

On 04-11-2011 11:43, gaurav redkar wrote:
> Actually I have to run the mean-shift algorithm on a large dataset for my
> project. The clusterdumper facility works on smaller datasets.
>
> But my project will mostly involve large-scale data (the size will likely
> extend to gigabytes), so I need to modify the clusterdumper facility to
> work on such a dataset. Also, the vectors are densely populated.
>
> I probably need to read each file from pointsDir one at a time while
> constructing the "result" map. Any pointers as to how I do it?
>
> Thanks
>
> On Fri, Nov 4, 2011 at 11:27 AM, Paritosh Ranjan<pranjan@xebia.com>  wrote:
>
>> Reducing the dimensionality (drastically; try fewer than 100 if the
>> functionality allows it) can be a solution.
>>
>> Which vector implementation are you using? If the vectors are sparsely
>> populated (i.e. have lots of uninitialized/unused dimensions), you can use
>> RandomAccessSparseVector or SequentialAccessSparseVector, which store only
>> the dimensions you actually use. This can also decrease memory
>> consumption.
>>
>>
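
As an aside on the sparse-vector suggestion quoted above (the vectors in 
this thread turned out to be dense, so this mainly illustrates the API 
being recommended): a minimal sketch comparing DenseVector with 
RandomAccessSparseVector at the cardinality discussed here (1000).

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseVsDense {
  public static void main(String[] args) {
    // Dense: storage for all 1000 dimensions is allocated up front.
    Vector dense = new DenseVector(1000);
    dense.assign(1.0);

    // Sparse: storage is allocated only for the dimensions actually set,
    // so mostly-empty vectors take far less memory.
    Vector sparse = new RandomAccessSparseVector(1000);
    sparse.setQuick(3, 1.0);
    sparse.setQuick(714, 2.5);

    System.out.println("dense entries:  " + dense.getNumNondefaultElements());
    System.out.println("sparse entries: " + sparse.getNumNondefaultElements());
  }
}
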
>> On 04-11-2011 11:19, gaurav redkar wrote:
>>
>>> Hi,
>>>
>>> Yes, Paritosh, I think the same. Actually, I am using a test dataset
>>> that has 5000 tuples with 1000 dimensions each. The thing is, there are
>>> too many files created in the pointsDir folder, and I think the program
>>> tries to open a path to all the files (i.e. read all the files into
>>> memory at once). Is my interpretation correct? Also, how do I go about
>>> fixing it?
>>>
>>> Thanks
>>>
>>>
>>>
>>> On Fri, Nov 4, 2011 at 11:03 AM, Paritosh Ranjan<pranjan@xebia.com>
>>>   wrote:
>>>
>>>> Reading points is keeping everything in memory, which might have
>>>> crashed it:
>>>>
>>>> pointList.add(record.getSecond());
>>>>
>>>>
>>>> Your dataset size is 40 MB, but the vectors might be too large. How
>>>> many dimensions does your Vector have?
>>>>
>>>>
>>>> On 04-11-2011 10:57, gaurav redkar wrote:
>>>>
>>>>> Hello,
>>>>> I am in a fix with the clusterdumper utility. The clusterdump utility
>>>>> crashes when it tries to output the clusters, throwing an out-of-memory
>>>>> exception: Java heap space.
>>>>>
>>>>> When I checked the error stack, it seems that the program crashed in
>>>>> the readPoints() function. I guess it is unable to build the "result"
>>>>> map. Any idea how I fix this?
>>>>>
>>>>> I am working on a dataset of size 40 MB. I tried increasing the heap
>>>>> space, but with no luck.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Gaurav
>>>>>
>>>>>
>>>>>
>>
>
>

