mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Spark Mahout with a CLI?
Date Wed, 16 Apr 2014 18:15:16 GMT
Bug in the pseudo code below; it should use columnIds:

   val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1),
       hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
   RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")

On Apr 16, 2014, at 10:00 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

Great, and an excellent example is at hand. In it I will play the user and contributor role,
Sebastian and Dmitriy the committer/scientist role.

I have a web site that uses a Mahout+Solr recommender (the video recommender demo site). This
creates logfiles of the form:

   timestamp, userId, itemId, action
   timestamp1, userIdString1, itemIdString1, "view"
   timestamp2, userIdString2, itemIdString1, "like"
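
Purely as an illustration (the delimiter and the quoting of the action field are assumptions,
not something the current job defines), one of these lines could be parsed like this:

   // Illustrative sketch only: split one log line into its fields. The delimiter is a
   // parameter because the proposal below passes "\t" while the example above is
   // comma-separated; the real job filters on the action ("like" vs "view") to
   // accumulate the two input matrices.
   case class LogEvent(timestamp: String, userId: String, itemId: String, action: String)

   def parseLogLine(line: String, delimiter: String = ","): LogEvent = {
     val fields = line.split(delimiter).map(_.trim.stripPrefix("\"").stripSuffix("\""))
     LogEvent(fields(0), fields(1), fields(2), fields(3))
   }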

These are currently processed using the Solr-recommender example code and Hadoop Mahout. The
input is split and accumulated into two matrices, which could then be input to the new Spark
cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464):

   val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
       maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA))

What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java)
class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the
above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be
a drm SparseMatrix and two hashed dictionaries for row and column external Id <-> mahout
Id lookup.
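
To make this concrete, here is a rough sketch of what the proposed wrapper might look like.
Everything here is hypothetical (the class does not exist in Mahout yet) and the wrapped
matrix type is left generic rather than tied to a particular DRM class:

   // Hypothetical sketch of the proposed wrapper -- nothing here exists in Mahout yet.
   // M stands in for whatever DRM/sparse matrix type the Spark code actually uses.
   class HashedSparseMatrix[M](
       val matrix: M,                          // the wrapped sparse matrix / drm
       val rowDictionary: Map[String, Int],    // external row Id -> internal mahout ordinal
       val columnDictionary: Map[String, Int]) // external column Id -> internal mahout ordinal
   {
     // reverse lookups, used when writing text output with external Ids
     private lazy val rowIdByOrdinal = rowDictionary.map(_.swap)
     private lazy val columnIdByOrdinal = columnDictionary.map(_.swap)

     def rowIds(): Map[String, Int] = rowDictionary
     def columnIds(): Map[String, Int] = columnDictionary

     def rowIdFor(ordinal: Int): String = rowIdByOrdinal(ordinal)
     def columnIdFor(ordinal: Int): String = columnIdByOrdinal(ordinal)

     // the dictionaries also carry the true dimensions of the matrix
     def numRows: Int = rowDictionary.size
     def numColumns: Int = columnDictionary.size
   }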

The 'cooccurrences' call would be identical and the data it deals with would also be identical.
But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimensions'
lengths and are used to look up string Ids from internal mahout ordinal integer Ids. These could
be created with a helper function to read from logfiles:

   val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*",
       "\t", 1, 2, "like", "view")

Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA and hashedDrms(1) corresponds to drmB.
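
As a sketch only (readHashedSparseMatrices and its argument order are part of the proposal,
not an existing API), the dictionaries could be built while scanning the logfiles by handing
each new external Id the next free ordinal:

   // Illustrative only: assign internal ordinal Ints to external string Ids in the
   // order they are first seen while reading the logfiles.
   import scala.collection.mutable

   def buildDictionary(externalIds: Iterator[String]): Map[String, Int] = {
     val dict = mutable.LinkedHashMap.empty[String, Int]
     externalIds.foreach { id =>
       if (!dict.contains(id)) dict(id) = dict.size // next free ordinal
     }
     dict.toMap
   }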

When the output is written to a text file, a new HashedSparseMatrix will be created from
the cooccurrence indicator matrix and the original itemId dictionaries:

   val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1),
       hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
   RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output")

Here the two Id dictionaries are used to create output file(s) with external Ids.
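
A minimal sketch of that translation step, assuming the hypothetical rowIdFor/columnIdFor
reverse lookups from the wrapper sketch above:

   // Illustrative only: turn one non-zero cell of the indicator matrix into a
   // tab-separated text line with external Ids instead of internal ordinals.
   def cellToTextLine(hashed: HashedSparseMatrix[_], row: Int, col: Int, strength: Double): String =
     s"${hashed.rowIdFor(row)}\t${hashed.columnIdFor(col)}\t$strength"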

Since I already have to do this for the demo site using Hadoop Mahout, I'll have to create
a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my
scripting/web app language is not Scala, the format for the output needs to be text.

I think this addresses all the issues raised here. No unnecessary import/export. Dmitriy doesn't
need to write a CLI. Sebastian doesn't need to write a HashedSparseMatrix. The internal
calculations are done on RDDs and the drms are never written to disk. AND the logfiles can
be consumed directly, producing data that any language can consume, with external Ids
used and preserved.


BTW: in the MAHOUT-1464 example the drms are read in serially, single threaded, but written
out using Spark (unless I missed something). In the proposed impl both the read and the write
would be Sparkified.

BTW2: Since this is a CLI interface to Spark Mahout it can be scheduled using cron directly
with no additional processing pipeline and by people unfamiliar with Scala, the Spark shell,
or internal Mahout Ids. Just as is done now on the demo site but with a lot of non-Mahout
code.

BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise
we leave all of this wrapper code to be duplicated over and over again by users and expect
them to know too much about Spark Mahout internals.



On Apr 15, 2014, at 6:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.


On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> Which is why it’s an import/export issue.
> 
> On Apr 15, 2014, at 5:48 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <pat@occamsmachete.com>
> wrote:
> 
>> As to the statement "There is not, nor do i think there will be a way to
>> run this stuff with CLI": it seems unduly misleading. Really, does anyone
>> second this?
>> 
>> There will be Scala scripts to drive this stuff and yes even from the CLI.
>> Do you imagine that every Mahout USER will be a Scala + Mahout DSL
>> programmer? That may be fine for committers but users will be PHP devs, Ruby
>> devs, Python or Java devs, maybe even a few C# devs. I think you are
>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>> production work, they are production engineers moving into ML who want a
>> blackbox. They will need a language agnostic way to drive Mahout. Making
>> statements like this only confuse potential users and drive them away to no
>> purpose. I’m happy for the nascent Mahout-Scala shell, but it’s not in the
>> typical user’s world view.
>> 
> 
> Yes, ultimately there may need to be command line programs of various
> sorts, but the fact is, we need to make sure that we avoid files as the API
> for moving large amounts of data. That means that we have to have some way
> of controlling the persistence of in-memory objects and in many cases, that
> means that processing chains will not typically be integrated at the level
> of command line programs.
> 
> Dmitriy's comment about R is apropos.  You can put scripts together for
> various end-user purposes but you don't have a CLI for every R command.
> Nor for every Perl, Python or PHP command either.
> 
> To the extent we have in-memory persistence across the life-time of
> multiple driver programs, then a sort of CLI interface will be possible.  I
> know that h2o will do that, but I am not entirely clear on the life-time of
> RDDs in Spark relative to Mahout DSL programs.  Regardless of possibility,
> I don't expect a CLI interface to be the primary integration path for these
> new capabilities.
> 
> 


