spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From fommil <>
Subject [GitHub] incubator-spark pull request: [Proposal] Adding sparse data suppor...
Date Sun, 16 Feb 2014 21:58:12 GMT
Github user fommil commented on the pull request:
    @martinjaggi I believe you would be making a massive mistake by agreeing on a single serialisation
scheme for sparse vectors, unless that format is independent of the storage structure, e.g.
MatrixMarket, or if you're happy to pass implementation detail of the sparse format along
with the structure. BTW, solving the generic problem of building a sparse matrix from a data
stream is far from solved... some structures are just intrinsically slow to construct and
others require up-front hints to avoid waste during their construction.
    Most of the battle when doing sparse matrix calculations is choosing, or writing, an appropriate
data storage format that best suits the problem. Having a toolkit of generic sparse storage
formats is good, but orders of magnitude improvement can be obtained with a custom made format.
What works for you might be really bad for somebody else.
    There is no silver bullet sparse matrix format.
    MTJ has compressed row, compressed column and compressed diagonal formats (with some custom
sparse structures for special matrices) and a special doubly linked structure for unstructured
sparse matrices.
    Insertion is usually slow for structures that are optimised for calculations. I recently
created a compressed column structure holding a hand-crafted primitive linked list... optimised
for insertion speed, but potentially duplicating entries, and definitely not optimal for row
traversal (although column traversal would be reasonable). If insertion is to be performed
concurrently, then that again requires more thought.
    I don't know what you mean by sequential access because that depends on how you're doing
the sequential ordering: by row, column, diagonal, arrowhead? To optimise calculations, you
want the JIT to be able to treat everything as close as possible to aligned arrays to help
the CPU caches (which is by far the biggest bottleneck in linear algebra). Exotic data structures
can look great on paper, but the JIT just gets confused and all of a sudden you're into RAM
    Btw, 1% sparsity is actually quite dense when you're talking about really large matrices.
I think whatever benchmarks you're coming up with, they are probably going to be highly biased
to the kind of problems that you're solving. Creating a suite of benchmarks would be quite
the undertaking!
    Basically, I'm saying that Spark should remain flexible for future sparse formats and
shouldn't pick one at this early stage because it works well for logistic regression.

If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at or file a JIRA ticket with INFRA.

View raw message