lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Sun, 05 Dec 2010 15:40:13 GMT


Robert Muir commented on SOLR-1979:

bq. cause that distance measure is kind of an internal value, not very normalized and is bound to change in future versions of TIKA.

bq. we can make a new isReasonablyCertain() implementation taking into account the relative distance between first and second candidate languages...

I don't follow the logic: if it's not very normalized, then it seems like this approach doesn't
tell you anything. Language 1 could be uncertain and language 2 just completely uncertain, but
that tells you nothing. Isn't it like trying to determine whether a good Lucene search result
score is "certainly a hit", and not really the right way to go?

For example, consider the case where the language isn't supported by Tika at all (I don't see
a list of supported languages anywhere, by the way!).
It would be good for us to know that the detection is uncertain at all; how uncertain it is
relative to the next language is not very important.

I think it's also important that we be able to get this uncertainty (or whatever we call it)
in a way that is agnostic of the implementation.
For example, we should be able to somehow think of chaining detectors...
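
A minimal sketch of the chaining idea, under stated assumptions: the LanguageDetector
interface, the Detection class and detectChained() are hypothetical names made up for
illustration, not part of Tika or the attached patch. Each detector only reports a language
code and whether it is certain, so the chain never has to interpret an implementation-specific
score.

```java
import java.util.List;
import java.util.Optional;

final class ChainedDetection {

    /** Hypothetical result type: a language code plus the detector's own certainty flag. */
    static final class Detection {
        final String language;
        final boolean certain;

        Detection(String language, boolean certain) {
            this.language = language;
            this.certain = certain;
        }
    }

    /** Hypothetical implementation-agnostic contract; empty means "no guess at all". */
    interface LanguageDetector {
        Optional<Detection> detect(String text);
    }

    /** Asks each detector in order and returns the first answer that is marked certain. */
    static Optional<Detection> detectChained(List<LanguageDetector> chain, String text) {
        for (LanguageDetector detector : chain) {
            Optional<Detection> result = detector.detect(text);
            if (result.isPresent() && result.get().certain) {
                return result;
            }
        }
        // No detector was certain; the caller decides what to do (e.g. use a fallback).
        return Optional.empty();
    }
}
```

With something like this, a cheap detector can be placed first in the chain and a statistical
one after it, and an empty result tells the caller that nothing was certain.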

It's really important to "cheat" and not use heuristics for languages that don't need them.
For example, disregarding some strange theoretical/historical cases, you can simply look at
the Unicode properties of the characters in the document to determine that it's in the Greek
language, as it's basically the only modern language using the Greek alphabet.
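
A rough sketch of the script shortcut, using java.lang.Character.UnicodeScript from the
standard Java library. The class name ScriptShortcut, the helper names and the 0.9 threshold
are assumptions chosen for illustration; nothing here comes from Tika or the patch.

```java
final class ScriptShortcut {

    /** Fraction of the letters in text that belong to the given script (0.0 if no letters). */
    static double scriptFraction(String text, Character.UnicodeScript script) {
        int letters = 0;
        int matching = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, whitespace, combining marks
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == script) {
                matching++;
            }
        }
        return letters == 0 ? 0.0 : (double) matching / letters;
    }

    /**
     * Returns "el" when the letters are overwhelmingly Greek, otherwise null so that
     * a heuristic detector can take over. The 0.9 threshold is an arbitrary assumption.
     */
    static String detectByScript(String text) {
        return scriptFraction(text, Character.UnicodeScript.GREEK) >= 0.9 ? "el" : null;
    }
}
```

A null return means "no script-based answer", which leaves room for a statistical detector
on text in scripts shared by many languages, such as Latin or Cyrillic.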

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>                 Key: SOLR-1979
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch
> We need the ability to detect the language of some random text in order to act upon it, such as indexing the content into language-aware fields. Another use case is to be able to filter/facet on language on random unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform language identification, and output the ISO code for the detected language in the outputField. If no language was detected, the fallback language is used.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
