tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2720) A parser to output universal sentence encodings to text
Date Sun, 02 Sep 2018 22:58:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16601694#comment-16601694 ]

ASF GitHub Bot commented on TIKA-2720:
--------------------------------------

ThejanW commented on issue #248: Fix for TIKA-2720 [WIP]
URL: https://github.com/apache/tika/pull/248#issuecomment-417965702
 
 
   The sentences in the above comment are passed through the encoder, which outputs an array
of 512 floats for each sentence. I then calculate the cosine similarity between every pair of
these arrays; below are the best-matching sentence pairs with their cosine similarities.
   Each segment lists the two sentences followed by their cosine similarity. For example,
the first segment has the sentences "How old are you?" and "What is your age?" with a cosine
similarity of 0.8516871929168701, which is the highest; the list continues in descending order.
   
   ```
   How old are you? 
   What is your age?
   0.8516871929168701
   
   How old are you?
   How old did you turn?
   0.7483202219009399
   
   What is your age?
   How old did you turn?
   0.6784225106239319
   
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying flights, forming sinkholes
   Recently a lot of hurricanes have hit the US
   0.6395097374916077
   
   The Samsung Galaxy S10 has the potential to be the most exciting phone of 2019
   Android beats iOS in smartphone loyalty, study finds
   0.6229119300842285
   
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying flights, forming sinkholes
   News showed, violent floodwaters surging down main Streets
   0.6069092154502869
   
   How old are you?
   When is your birthday?
   0.5812650322914124
   
   What is your age?
   When is your birthday?
   0.5723845362663269
   
   Android beats iOS in smartphone loyalty, study finds
   Apple became the world’s first trillion-dollar public company
   0.5713004469871521
   
   Green tea contains bioactive compounds that improve health
   Is paleo better than keto?
   0.5498321652412415
   
   News showed, violent floodwaters surging down main Streets
   Recently a lot of hurricanes have hit the US
   0.534430205821991
   
   The Samsung Galaxy S10 has the potential to be the most exciting phone of 2019
   IPhone X includes a 5.8-inch edge-to-edge display which covers the entire front of the phone.
   0.5117762088775635
   
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying flights, forming sinkholes
   Multiple lines of scientific evidence show that the climate system is warming
   0.5018186569213867
   
   Android beats iOS in smartphone loyalty, study finds
   IPhone X includes a 5.8-inch edge-to-edge display which covers the entire front of the phone.
   0.4970431923866272
   
   Green tea contains bioactive compounds that improve health
   Yoga has been shown to help people reduce anxiety
   0.4776824116706848
   
   How old did you turn?
   When is your birthday?
   0.46567195653915405
   
   The Samsung Galaxy S10 has the potential to be the most exciting phone of 2019
   Apple became the world’s first trillion-dollar public company
   0.4522799849510193
   
   Recently a lot of hurricanes have hit the US
   Multiple lines of scientific evidence show that the climate system is warming
   0.4517837166786194
   
   With roads covered with slippery snow and ice, can challenge even the most experienced driver.
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying flights, forming sinkholes
   0.42890870571136475
   
   An ounce of prevention is worth a pound of cure
   Green tea contains bioactive compounds that improve health
   0.38761529326438904
   
   An ounce of prevention is worth a pound of cure
   Yoga has been shown to help people reduce anxiety
   0.38396507501602173
   
   News showed, violent floodwaters surging down main Streets
   Multiple lines of scientific evidence show that the climate system is warming
   0.3623693287372589
   
   IPhone X includes a 5.8-inch edge-to-edge display which covers the entire front of the phone.
   Apple became the world’s first trillion-dollar public company
   0.361715167760849
   
   With roads covered with slippery snow and ice, can challenge even the most experienced driver.
   News showed, violent floodwaters surging down main Streets
   0.35203033685684204
   
   Yoga has been shown to help people reduce anxiety
   Is paleo better than keto?
   0.34740278124809265
   ```
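
The pairwise ranking described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the helper names `cosine_similarity` and `rank_pairs` are hypothetical, and the toy 3-dimensional vectors stand in for the 512-float arrays the encoder would actually return.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pairs(embeddings):
    """Score every unordered sentence pair and sort by similarity, highest first.

    `embeddings` maps each sentence to its vector (assumed precomputed).
    """
    scored = [
        (s1, s2, cosine_similarity(embeddings[s1], embeddings[s2]))
        for s1, s2 in combinations(embeddings, 2)
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


# Toy stand-in vectors (assumption): real encoder output would be 512 floats each.
toy_embeddings = {
    "sentence A": [1.0, 0.0, 0.0],
    "sentence B": [1.0, 1.0, 0.0],
    "sentence C": [0.0, 0.0, 1.0],
}

# Print each segment in the same layout as the listing above:
# first sentence, second sentence, then the score.
for first, second, score in rank_pairs(toy_embeddings):
    print(first)
    print(second)
    print(score)
    print()
```

With n sentences this scores n(n-1)/2 pairs, which is fine for a handful of sentences like the ones above but grows quadratically for larger inputs.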

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> A parser to output universal sentence encodings to text
> -------------------------------------------------------
>
>                 Key: TIKA-2720
>                 URL: https://issues.apache.org/jira/browse/TIKA-2720
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-dl
>            Reporter: Thejan Wijesinghe
>            Priority: Major
>             Fix For: 2.0
>
>
> This parser encodes a text into high dimensional vectors that can be used for text classification,
> semantic similarity, clustering and other natural language tasks. The model is trained and
> optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.
> It is trained on a variety of data sources and a variety of tasks with the aim of dynamically
> accommodating a wide variety of natural language understanding tasks. The input is variable
> length English text and the output is a 512 dimensional vector.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
