nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Closed: (NUTCH-894) Move statistical language identification from indexing to parsing step
Date Fri, 01 Oct 2010 18:30:33 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-894.
-------------------------------

      Assignee: Doğacan Güney  (was: Julien Nioche)
    Resolution: Fixed

Committed as of rev. 1003608.

> Move statistical language identification from indexing to parsing step
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-894
>                 URL: https://issues.apache.org/jira/browse/NUTCH-894
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0
>            Reporter: Julien Nioche
>            Assignee: Doğacan Güney
>             Fix For: 2.0
>
>         Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done as part of the indexing step,
> whereas the detection based on the HTTP header and HTML code is done during parsing.
> We could keep the same logic, i.e. do the statistical detection only if nothing has been
> found with the previous methods, but as part of the parsing. This would be useful for ParseFilters
> that need the language information, or for use with ScoringFilters, e.g. to focus the crawl
> on a set of languages.
> Since the statistical models have been ported to Tika, we should probably rely on them
> instead of maintaining our own.
> Any thoughts on this?
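
The fallback order described in the issue can be sketched roughly as follows. This is only an illustration, not the committed patch: the class name, the method names, and the stub detector are all hypothetical, and the statistical step is passed in as a plain function so that a Tika-backed identifier (or any other) could be plugged in at parse time.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from Nutch or Tika.
public class LanguageFallbackSketch {

    /** Returns the first non-blank declared language hint, trimmed and lower-cased. */
    static Optional<String> firstDeclared(String... hints) {
        for (String hint : hints) {
            if (hint != null && !hint.trim().isEmpty()) {
                return Optional.of(hint.trim().toLowerCase(Locale.ROOT));
            }
        }
        return Optional.empty();
    }

    /**
     * Resolves the language during parsing: the HTTP header is checked first,
     * then the HTML markup, and the statistical detector runs only when
     * neither declared a language.
     */
    static String detect(String httpHeaderLang,
                         String htmlLang,
                         String text,
                         Function<String, String> statisticalDetector) {
        return firstDeclared(httpHeaderLang, htmlLang)
                .orElseGet(() -> statisticalDetector.apply(text));
    }

    public static void main(String[] args) {
        // Stand-in for a real statistical model (assumption for the demo only).
        Function<String, String> stub = text -> text.contains(" le ") ? "fr" : "en";

        // A declared header wins; the detector is never invoked.
        System.out.println(detect("de", null, "Bonjour le monde", stub));
        // Nothing declared, so the statistical step decides.
        System.out.println(detect(null, " ", "Bonjour le monde", stub));
    }
}
```

Because the result is available as soon as parsing finishes, a ParseFilter or ScoringFilter could read it without waiting for the indexing step.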

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

