mahout-user mailing list archives

From gaurav redkar <gauravred...@gmail.com>
Subject Re: Cluster dumper crashes when run on a large dataset
Date Fri, 04 Nov 2011 06:13:03 GMT
Actually, I have to run the mean shift algorithm on a large dataset for my
project. The clusterdumper facility works on smaller data sets.

But my project will mostly involve large-scale data (sizes will often
extend to gigabytes), so I need to modify the clusterdumper facility to
work on such datasets. Also, the vectors are densely populated.

I probably need to read each file from pointsDir one at a time while
constructing the "result" map. Any pointers on how to do that?
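
A minimal sketch of what that could look like, assuming the points are stored
as (IntWritable, WeightedVectorWritable) sequence files the way the clustering
jobs write them; the class and method names below are only for illustration,
not Mahout's actual ClusterDumper code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

// Hypothetical helper: walks the part files in pointsDir one at a time and
// handles each record as it is read, instead of buffering every vector in
// one big in-memory map.
public class StreamingPointsReader {

  public static void readPoints(Path pointsDir, Configuration conf) throws IOException {
    FileSystem fs = pointsDir.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(pointsDir)) {
      Path part = status.getPath();
      if (part.getName().startsWith("_")) {
        continue; // skip _SUCCESS, _logs, etc.
      }
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
      try {
        IntWritable clusterId = new IntWritable();
        WeightedVectorWritable point = new WeightedVectorWritable();
        while (reader.next(clusterId, point)) {
          // Handle one point here (e.g. append it to the dump output for
          // clusterId.get()) rather than adding it to a map of all points.
        }
      } finally {
        reader.close();
      }
    }
  }
}

Processing each record as it is read keeps the memory footprint bounded by a
single vector rather than the whole points directory.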

Thanks

On Fri, Nov 4, 2011 at 11:27 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:

> Reducing the dimensionality (drastically; try fewer than 100 if the
> functionality allows this) can be a solution.
>
> Which vector implementation are you using? If the vectors are sparsely
> populated (have lots of uninitialized/unused dimensions), you can use
> RandomAccessSparseVector or SequentialAccessSparseVector, which will
> populate only the dimensions which you are using. This can also decrease
> memory consumption.
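
As a minimal sketch of that suggestion (assuming Mahout's
org.apache.mahout.math API; the cardinality and values here are made up for
illustration):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// A 1000-dimensional vector that only allocates storage for the
// dimensions that are actually set; untouched dimensions stay 0.0.
Vector v = new RandomAccessSparseVector(1000);
v.set(3, 0.5);
v.set(847, 1.25);

SequentialAccessSparseVector stores the same kind of sparse data in
index-ordered arrays, which iterates faster sequentially at the cost of slower
random writes.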
>
>
> On 04-11-2011 11:19, gaurav redkar wrote:
>
>> Hi,
>>
>> Yes, Paritosh, I think the same. Actually, I am using a test data set
>> that has 5000 tuples with 1000 dimensions each. The thing is, there are
>> too many files created in the pointsDir folder, and I think the program
>> tries to open a path to all the files (i.e. reads all the files into
>> memory at once). Is my interpretation correct? Also, how do I go about
>> fixing it?
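
(For a rough sense of scale, assuming 8 bytes per double: 5000 vectors x 1000
dimensions x 8 bytes is already about 40 MB of raw values, and the per-object
and Map overhead of holding every point in the "result" map at once inflates
that considerably, which fits the heap exhaustion described here.)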
>>
>> Thanks
>>
>>
>>
>> On Fri, Nov 4, 2011 at 11:03 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>
>>> Reading the points keeps everything in memory, which might be what
>>> crashed it:
>>> pointList.add(record.getSecond());
>>>
>>>
>>> Your dataset size is 40 MB, but the vectors might be too large. How many
>>> dimensions does your Vector have?
>>>
>>>
>>> On 04-11-2011 10:57, gaurav redkar wrote:
>>>
>>>> Hello,
>>>>
>>>> I am in a fix with the Clusterdumper utility. The clusterdump utility
>>>> crashes when it tries to output the clusters, throwing an out of memory
>>>> error: Java heap space.
>>>>
>>>> When I checked the error stack, it seems that the program crashed in the
>>>> readPoints() function. I guess it is unable to build the "result" map.
>>>> Any idea how I can fix this?
>>>>
>>>> I am working on a dataset of size 40 MB. I have tried increasing the
>>>> heap space, but with no luck.
>>>>
>>>> Thanks
>>>>
>>>> Gaurav
>>>>
>>>>
>>>>
>>>
>>
>>
>
>
