flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] (FLINK-2094) Implement Word2Vec
Date Tue, 31 Jan 2017 08:12:44 GMT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"> 
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
        <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0"
/> <base href="https://issues.apache.org/jira" /> 
        <title>Message Title</title> 
    </head> 
    <body class="jira" style="color: #333; font-family: Arial, sans-serif; font-size: 14px;
line-height: 1.429"> 
        <table id="background-table" cellpadding="0" cellspacing="0" width="100%" style="border-collapse:
collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt; background-color: #f5f5f5; border-collapse:
collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt"> 
            <!-- header here --> 
            <tr> 
                <td id="header-pattern-container" style="padding: 0px; border-collapse:
collapse; padding: 10px 20px"> 
                    <table id="header-pattern" cellspacing="0" cellpadding="0" border="0"
style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt"> 
                        <tr> 
                            <td id="header-avatar-image-container" valign="top" style="padding:
0px; border-collapse: collapse; vertical-align: top; width: 32px; padding-right: 8px">
<img id="header-avatar-image" class="image_fix" src="cid:jira-generated-image-avatar-githubbot-089aaea0-3849-43b9-8590-8ec60d9e9c50"
height="32" width="32" border="0" style="border-radius: 3px; vertical-align: top" /> 
                            </td> 
                            <td id="header-text-container" valign="middle" style="padding:
0px; border-collapse: collapse; vertical-align: middle; font-family: Arial, sans-serif; font-size:
14px; line-height: 20px; mso-line-height-rule: exactly; mso-text-raise: 1px"> <a class="user-hover"
rel="githubbot" id="email_githubbot" href="https://issues.apache.org/jira/secure/ViewProfile.jspa?name=githubbot"
style="color:#3b73af;; color: #3b73af; text-decoration: none">ASF GitHub Bot</a>
<strong>commented</strong> on <a href="https://issues.apache.org/jira/browse/FLINK-2094"
style="color: #3b73af; text-decoration: none"><img src="cid:jira-generated-image-static-improvement-b9031356-7c6a-4292-8d21-66ad5910d4dc"
height="16" width="16" border="0" align="absmiddle" alt="Improvement" /> FLINK-2094</a>

                            </td> 
                        </tr> 
                    </table> 
                </td> 
            </tr> 
            <tr> 
                <td id="email-content-container" style="padding: 0px; border-collapse:
collapse; padding: 0 20px"> 
                    <table id="email-content-table" cellspacing="0" cellpadding="0" border="0"
width="100%" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt;
border-spacing: 0; border-collapse: separate"> 
                        <tr> 
                            <!-- there needs to be content in the cell for it to render
in some clients --> 
                            <td class="email-content-rounded-top mobile-expand" style="padding:
0px; border-collapse: collapse; color: #fff; padding: 0 15px 0 16px; height: 15px; background-color:
#fff; border-left: 1px solid #ccc; border-top: 1px solid #ccc; border-right: 1px solid #ccc;
border-bottom: 0; border-top-right-radius: 5px; border-top-left-radius: 5px; height: 10px;
line-height: 10px; padding: 0 15px 0 16px; mso-line-height-rule: exactly">
                                &nbsp;
                            </td> 
                        </tr> 
                        <tr> 
                            <td class="email-content-main mobile-expand " style="padding:
0px; border-collapse: collapse; border-left: 1px solid #ccc; border-right: 1px solid #ccc;
border-top: 0; border-bottom: 0; padding: 0 15px 0 16px; background-color: #fff"> 
                                <table class="page-title-pattern" cellspacing="0" cellpadding="0"
border="0" width="100%" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace:
0pt"> 
                                    <tr> 
                                        <td style="vertical-align: top;; padding: 0px;
border-collapse: collapse; padding-right: 5px; font-size: 20px; line-height: 30px; mso-line-height-rule:
exactly" class="page-title-pattern-header-container"> <span class="page-title-pattern-header"
style="font-family: Arial, sans-serif; padding: 0; font-size: 20px; line-height: 30px; mso-text-raise:
2px; mso-line-height-rule: exactly; vertical-align: middle"> <a href="https://issues.apache.org/jira/browse/FLINK-2094"
style="color: #3b73af; text-decoration: none">Re: Implement Word2Vec</a> </span>

                                        </td> 
                                    </tr> 
                                </table> 
                            </td> 
                        </tr> 
                        <tr> 
                            <td id="text-paragraph-pattern-top" class="email-content-main
mobile-expand  comment-top-pattern" style="padding: 0px; border-collapse: collapse; border-left:
1px solid #ccc; border-right: 1px solid #ccc; border-top: 0; border-bottom: 0; padding: 0
15px 0 16px; background-color: #fff; border-bottom: none; padding-bottom: 0"> 
                                <table class="text-paragraph-pattern" cellspacing="0" cellpadding="0"
border="0" width="100%" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace:
0pt; font-family: Arial, sans-serif; font-size: 14px; line-height: 20px; mso-line-height-rule:
exactly; mso-text-raise: 2px"> 
                                    <tr> 
                                        <td class="text-paragraph-pattern-container mobile-resize-text
" style="padding: 0px; border-collapse: collapse; padding: 0 0 10px 0"> 
                                            <p style="margin: 10px 0 0 0">Github user
kateri1 commented on a diff in the pull request:</p> 
                                            <p style="margin: 10px 0 0 0"> <a href="https://github.com/apache/flink/pull/2735#discussion_r98613727"
class="external-link" rel="nofollow" style="color: #3b73af; text-decoration: none">https://github.com/apache/flink/pull/2735#discussion_r98613727</a></p>

                                            <p style="margin: 10px 0 0 0"> — Diff:
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nlp/Word2Vec.scala —<br />
@@ -0,0 +1,243 @@<br /> +/*<br /> + * Licensed to the Apache Software Foundation
(ASF) under one<br /> + * or more contributor license agreements. See the NOTICE file<br
/> + * distributed with this work for additional information<br /> + * regarding
copyright ownership. The ASF licenses this file<br /> + * to you under the Apache License,
Version 2.0 (the<br /> + * &quot;License&quot;); you may not use this file except
in compliance<br /> + * with the License. You may obtain a copy of the License at<br
/> + *<br /> + * <a href="http://www.apache.org/licenses/LICENSE-2.0" class="external-link"
rel="nofollow" style="color: #3b73af; text-decoration: none">http://www.apache.org/licenses/LICENSE-2.0</a><br
/> + *<br /> + * Unless required by applicable law or agreed to in writing, software<br
/> + * distributed under the License is distributed on an &quot;AS IS&quot; BASIS,<br
/> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.<br />
+ * See the License for the specific language governing permissions and<br /> + * limitations
under the License.<br /> + */<br /> +<br /> +package org.apache.flink.ml.nlp<br
/> +<br /> +import org.apache.flink.api.scala._<br /> +import org.apache.flink.ml.common.</p>
{Parameter, ParameterMap} 
                                            <p style="margin: 10px 0 0 0"> +import org.apache.flink.ml.optimization.</p>
{Context, ContextEmbedder, HSMWeightMatrix} 
                                            <p style="margin: 10px 0 0 0"> +import org.apache.flink.ml.pipeline.</p>
{FitOperation, TransformDataSetOperation, Transformer} 
                                            <p style="margin: 10px 0 0 0"> +<br />
+/**<br /> + * Implements Word2Vec as a transformer on a DataSet[Iterable<span class="error">[String]</span>]<br
/> + *<br /> + * Calculates valuable vectorizations of individual words given<br
/> + * the context in which they appear<br /> + *<br /> + * @example<br
/> + * {{</p> { + * //constructed of 'sentences' - where each string in the iterable
is a word + * val stringsDS = DataSet[Iterable[String]] = ... + * val stringsDS2 = DataSet[Iterable[String]]
= ... + * + * val w2V = Word2Vec() + * .setIterations(5) + * .setTargetCount(10) + * .setSeed(500)
+ * + * //internalizes an initial weightSet + * w2V.fit(stringsDS) + * + * //note that the
same DS can be used to fit and optimize + * //the number of learned vectors is limited to the
vocab built in fit + * val wordVectors : DataSet[(String, Vector[Double])] = w2V.optimize(stringsDS2)
+ * } 
                                            <p style="margin: 10px 0 0 0">}}<br />
+ *<br /> + * =Parameters=<br /> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.WindowSize]</span>]<br
/> + * sets the size of window for skipGram formation: how far on either side of<br
/> + * a given word will we sample the context? (Default value: '''10''')<br /> +
*<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.Iterations]</span>]<br
/> + * sets the number of global iterations the training set is passed through - essentially
looping on<br /> + * whole set, leveraging flink's iteration operator (Default value:
'''10''')<br /> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.TargetCount]</span>]<br
/> + * sets the minimum number of occurrences of a given target value before that value
is<br /> + * excluded from vocabulary (e.g. if this parameter is set to '5', and a target<br
/> + * appears in the training set less than 5 times, it is not included in vocabulary)<br
/> + * (Default value: '''5''')<br /> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.VectorSize]</span>]<br
/> + * sets the length of each learned vector (Default value: '''100''')<br /> +
*<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.LearningRate]</span>]<br
/> + * sets the rate of descent during backpropagation - this value decays linearly with<br
/> + * individual training sets, determined by BatchSize (Default value: '''0.015''')<br
/> + *<br /> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.BatchSize]</span>]<br
/> + * sets the batch size of training sets - the input DataSet will be batched into<br
/> + * groups of this size for learning (Default value: '''1000''')<br /> + *<br
/> + * - [<span class="error">[org.apache.flink.ml.nlp.Word2Vec.Seed]</span>]<br
/> + * sets the seed for generating random vectors at initial weighting DataSet creation<br
/> + * (Default value: '''Some(scala.util.Random.nextLong)''')<br /> + */<br />
+class Word2Vec extends Transformer<span class="error">[Word2Vec]</span> {<br
/> + import Word2Vec._<br /> +<br /> + private <span class="error">[nlp]</span>
var wordVectors:<br /> + Option[DataSet[HSMWeightMatrix<span class="error">[String]</span>]]
= None<br /> +<br /> + def setIterations(iterations: Int): this.type = </p>
{ + parameters.add(Iterations, iterations) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setTargetCount(targetCount: Int): this.type = </p> { + parameters.add(TargetCount,
targetCount) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setVectorSize(vectorSize: Int): this.type = </p> { + parameters.add(VectorSize,
vectorSize) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setLearningRate(learningRate: Double): this.type = </p> { + parameters.add(LearningRate,
learningRate) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setWindowSize(windowSize: Int): this.type = </p> { + parameters.add(WindowSize,
windowSize) + this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setBatchSize(batchSize: Int): this.type = </p> { + parameters.add(BatchSize, batchSize)
+ this + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def setSeed(seed: Long): this.type = </p> { + parameters.add(Seed, seed) + this +
} 
                                            <p style="margin: 10px 0 0 0"> +<br />
+}<br /> +<br /> +object Word2Vec {<br /> + case object Iterations extends
Parameter<span class="error">[Int]</span> </p> { + val defaultValue = Some(10)
+ }<br /> +<br /> + case object TargetCount extends Parameter<span class="error">[Int]</span>
{ + val defaultValue = Some(5) + }<br /> +<br /> + case object VectorSize extends
Parameter<span class="error">[Int]</span> { + val defaultValue = Some(100) + }<br
/> +<br /> + case object LearningRate extends Parameter<span class="error">[Double]</span>
{ + val defaultValue = Some(0.015) + }<br /> +<br /> + case object WindowSize
extends Parameter<span class="error">[Int]</span> { + val defaultValue = Some(10)
+ } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ case object BatchSize extends Parameter<span class="error">[Int]</span> </p>
{ + val defaultValue = Some(1000) + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ case object Seed extends Parameter<span class="error">[Long]</span> </p>
{ + val defaultValue = Some(scala.util.Random.nextLong) + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ def apply(): Word2Vec = </p> { + new Word2Vec() + } 
                                            <p style="margin: 10px 0 0 0"> +<br />
+ /** [<span class="error">[FitOperation]</span>] which builds initial vocabulary
for Word2Vec context embedding<br /> + *<br /> + * @tparam T Subtype of Iterable<span
class="error">[String]</span><br /> + * @return<br /> + */<br />
+ implicit def learnWordVectors[T &lt;: Iterable<span class="error">[String]</span>]
= {<br /> + new FitOperation<span class="error">[Word2Vec, T]</span> {<br
/> + override def fit(<br /> + instance: Word2Vec,<br /> + fitParameters: ParameterMap,<br
/> + input: DataSet<span class="error">[T]</span>)<br /> + : Unit = {<br
/> + val resultingParameters = instance.parameters ++ fitParameters<br /> + <br
/> + val skipGrams = input<br /> + .flatMap(x =&gt;<br /> + x.zipWithIndex<br
/> + .map(z =&gt; </p> { + val window = (scala.math.random * 100 % resultingParameters(WindowSize)).toInt
+ Context[String]( + z._1, x.slice(z._2 - window, z._2) ++ x.slice(z._2 +1, z._2 + window))
+ } 
                                            <p style="margin: 10px 0 0 0">))<br />
+<br /> + val weights = new ContextEmbedder<span class="error">[String]</span><br
/> + .setIterations(resultingParameters(Iterations))<br /> + .setTargetCount(resultingParameters(TargetCount))<br
/> + .setVectorSize(resultingParameters(VectorSize))<br /> + .setLearningRate(resultingParameters(LearningRate))<br
/> + .setBatchSize(resultingParameters(BatchSize))<br /> + .setSeed(resultingParameters(Seed))<br
/> + .createInitialWeightsDS(instance.wordVectors, skipGrams)<br /> +<br />
+ instance.wordVectors = Some(weights)<br /> + }<br /> + }<br /> + }<br
/> +<br /> + /** [<span class="error">[TransformDataSetOperation]</span>]
for words to vectors<br /> + * form skipgrams from the input dataset and learn vectors
against<br /> + * the vocabulary constructed during the fit operation<br /> +
* returns a dataset of distinct words and their learned representations<br /> + *<br
/> + * @tparam T subtype of Iterable<span class="error">[String]</span><br
/> + * @return<br /> + */<br /> + implicit def words2Vecs[T &lt;: Iterable<span
class="error">[String]</span>] = {<br /> + new TransformDataSetOperation[Word2Vec,
T, (String, Vector<span class="error">[Double]</span>)] {<br /> + override
def transformDataSet(instance: Word2Vec,<br /> + transformParameters: ParameterMap,<br
/> + input: DataSet<span class="error">[T]</span>): DataSet[(String, Vector<span
class="error">[Double]</span>)] = {<br /> + val resultingParameters = instance.parameters
++ transformParameters<br /> + <br /> + instance.wordVectors match {<br />
+ case Some(vectors) =&gt;<br /> + val skipGrams = input<br /> + .flatMap(x
=&gt;<br /> + x.zipWithIndex<br /> + .map(z =&gt; </p> { + val window
= (scala.math.random * 100 % resultingParameters(WindowSize)).toInt + Context[String]( + z._1,
x.slice(z._2 - window, z._2) ++ x.slice(z._2 + 1, z._2 + window)) + } 
                                            <p style="margin: 10px 0 0 0">))<br />
+<br /> + val learnedVectors = new ContextEmbedder<span class="error">[String]</span><br
/> + .setIterations(resultingParameters(Iterations))<br /> + .setTargetCount(resultingParameters(TargetCount))<br
/> + .setVectorSize(resultingParameters(VectorSize))<br /> + .setLearningRate(resultingParameters(LearningRate))<br
/> + .setBatchSize(resultingParameters(BatchSize))<br /> + .setSeed(resultingParameters(Seed))<br
/> + .optimize(skipGrams, instance.wordVectors)<br /> +<br /> + learnedVectors<br
/> + .flatMap(_.fetchVectors)<br /> + case None =&gt;<br /> — End diff
–</p> 
                                            <p style="margin: 10px 0 0 0"> A transformation
can be performed many times over the same model, so do you think it is OK to throw an
exception for every incoming word set that needs encoding, just because the model was not
trained correctly once? Maybe we should return some trivial default value instead of doing
heavyweight exception processing. </p> 
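                                            <p style="margin: 10px 0 0 0"> The trade-off
raised in this comment can be pictured with a minimal sketch. It is not the code from the pull
request: SketchModel, transformOrThrow and transformOrEmpty are hypothetical names, and plain
Scala collections stand in for Flink's DataSet so that the snippet is self-contained. The strict
variant fails on every call while the model is unfitted; the lenient variant returns an empty
result, which is one possible reading of "trivial default value".</p>

```scala
// Hypothetical stand-in for the transformer in the diff; all names are illustrative.
final class SketchModel {
  // None until fit() has been called, mirroring the Option-typed field in the diff.
  private var vectors: Option[Map[String, Vector[Double]]] = None

  // Stores already-learned vectors; the real training step is out of scope here.
  def fit(learned: Map[String, Vector[Double]]): Unit = {
    vectors = Some(learned)
  }

  // Keeps only the words that are present in the vocabulary.
  private def lookup(
      vocab: Map[String, Vector[Double]],
      words: Seq[String]): Seq[(String, Vector[Double])] =
    words.flatMap(w => vocab.get(w).map(vec => (w, vec)))

  // Strict variant: every call on an unfitted model throws.
  def transformOrThrow(words: Seq[String]): Seq[(String, Vector[Double])] =
    vectors match {
      case Some(vocab) => lookup(vocab, words)
      case None => throw new IllegalStateException("model has not been fitted")
    }

  // Lenient variant: an unfitted model yields an empty result instead.
  def transformOrEmpty(words: Seq[String]): Seq[(String, Vector[Double])] =
    vectors match {
      case Some(vocab) => lookup(vocab, words)
      case None => Seq.empty
    }
}
```

                                            <p style="margin: 10px 0 0 0"> The lenient
variant keeps repeated calls cheap, but it also hides the fact that the model was never fitted,
so a caller would need some other way to notice that.</p>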
                                        </td> 
                                    </tr> 
                                </table> 
                            </td> 
                        </tr> 
                        <!-- there needs to be content in the cell for it to render in
some clients --> 
                        <tr> 
                            <td class="email-content-rounded-bottom mobile-expand" style="padding:
0px; border-collapse: collapse; color: #fff; padding: 0 15px 0 16px; height: 5px; line-height:
5px; background-color: #fff; border-top: 0; border-left: 1px solid #ccc; border-bottom: 1px
solid #ccc; border-right: 1px solid #ccc; border-bottom-right-radius: 5px; border-bottom-left-radius:
5px; mso-line-height-rule: exactly">
                                &nbsp;
                            </td> 
                        </tr> 
                    </table> 
                </td> 
            </tr> 
            <tr> 
                <td id="footer-pattern" style="padding: 0px; border-collapse: collapse;
padding: 12px 20px"> 
                    <table id="footer-pattern-container" cellspacing="0" cellpadding="0"
border="0" style="border-collapse: collapse; mso-table-lspace: 0pt; mso-table-rspace: 0pt">

                        <tr> 
                            <td id="footer-pattern-text" class="mobile-resize-text" width="100%"
style="padding: 0px; border-collapse: collapse; color: #999; font-size: 12px; line-height:
18px; font-family: Arial, sans-serif; mso-line-height-rule: exactly; mso-text-raise: 2px">
                                 This message was sent by Atlassian JIRA <span id="footer-build-information">(v6.3.15#6346-<span
title="dbc023dd75cecacf443c4b235f66124b15f5c5fe" data-commit-id="dbc023dd75cecacf443c4b235f66124b15f5c5fe}">sha1:dbc023d</span>)</span>

                            </td> 
                            <td id="footer-pattern-logo-desktop-container" valign="top"
style="padding: 0px; border-collapse: collapse; padding-left: 20px; vertical-align: top">

                                <table style="border-collapse: collapse; mso-table-lspace:
0pt; mso-table-rspace: 0pt"> 
                                    <tr> 
                                        <td id="footer-pattern-logo-desktop-padding" style="padding:
0px; border-collapse: collapse; padding-top: 3px"> <img id="footer-pattern-logo-desktop"
src="cid:jira-generated-image-static-footer-desktop-logo-17092124-896c-4f0f-a18f-1ad762e97ceb"
alt="Atlassian logo" title="Atlassian logo" width="169" height="36" class="image_fix" />

                                        </td> 
                                    </tr> 
                                </table> 
                            </td> 
                        </tr> 
                    </table> 
                </td> 
            </tr> 
        </table>   
    </body>
</html>