mahout-user mailing list archives

From Drew Farris <>
Subject Re: Alternative Naive Bayes Datastore?
Date Wed, 15 Sep 2010 12:58:28 GMT
Hi Dean,

Does jdbm only support Java-based serialization? In my experience,
Java's built-in serialization is generally an order of magnitude
slower and less space-efficient than equivalent hand-rolled
serialization, such as you'd find in implementations of the Writable
interface. That is precisely why you won't see Serializable used much
in Mahout. Perhaps you could use RandomAccessSparseVector combined
with VectorWritable to read from and write to a byte-array-backed
DataInput/DataOutput stream?
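
A minimal sketch of the byte-array idea, using only the JDK so it runs
without Mahout or Hadoop on the classpath. SparseVectorCodec and its
on-disk layout (cardinality, entry count, then index/value pairs) are
hypothetical stand-ins invented for illustration; they are not
Mahout's actual classes or format. With Mahout itself, the same
pattern would presumably go through VectorWritable's write(DataOutput)
and readFields(DataInput) methods instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical codec for a sparse vector; not a Mahout class. */
public class SparseVectorCodec {

    /** Writes cardinality, entry count, then (index, value) pairs. */
    public static byte[] toBytes(int cardinality, Map<Integer, Double> entries)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(cardinality);
            out.writeInt(entries.size());
            for (Map.Entry<Integer, Double> e : entries.entrySet()) {
                out.writeInt(e.getKey());
                out.writeDouble(e.getValue());
            }
        }
        return buffer.toByteArray();
    }

    /** Reads back the non-zero entries written by toBytes. */
    public static Map<Integer, Double> fromBytes(byte[] bytes) throws IOException {
        Map<Integer, Double> entries = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            in.readInt(); // cardinality, not needed in this sketch
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int index = in.readInt();
                entries.put(index, in.readDouble());
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Double> v = new TreeMap<>();
        v.put(3, 1.5);
        v.put(42, -2.0);
        byte[] bytes = toBytes(100, v);
        // 4 + 4 header bytes plus 2 entries of (4 + 8) bytes each
        System.out.println(bytes.length + " bytes, round trip ok: "
                + v.equals(fromBytes(bytes)));
    }
}
```

The resulting byte[] could then be stored as an opaque value in
whatever persistent store is in use, which sidesteps the need for the
vector class to implement Serializable.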

I suspect the HBase interactions might be a useful starting point for
seeing how the matrices are broken down into chunks of data that are
written to and read from a persistent store.

I'm definitely interested in how things proceed for you, especially if
there are ways the existing code can be improved for easier extension
in the future.


On Tue, Sep 14, 2010 at 11:38 AM, Dean Jones <> wrote:
> Hi Grant, Drew,
> Thanks for the responses. I had pretty much concluded that I would
> need to roll my own datastore, but thanks for the confirmation. I'm
> basing the implementation on the InMemoryBayesDatastore, but have hit
> a little bit of a snag because I'm trying to implement a jdbm-backed
> version of o.a.m.math.Matrix using
> o.a.m.math.RandomAccessSparseVector, but this class is not
> Serializable; it does have a default constructor with the comment "For
> serialization purposes only", which suggests that perhaps the
> intention was to allow its subclasses to be Serializable. Any thoughts
> on what the intention here was? Currently I've patched it to implement
> Serializable.
> Dean.
> On 14 September 2010 14:22, Drew Farris <> wrote:
>> Hi Dean,
>> InMemory and HBase are the only options present currently. It
>> shouldn't be too difficult to implement a new storage back-end. You
>> would need to consider the issue of both loading the data into your
>> persistence framework and reading it back out.
