nutch-dev mailing list archives

From "Francesco Capponi (JIRA)" <>
Subject [jira] [Created] (NUTCH-2276) Tika Boilerpipe Parser in combo with RSS items doesn't work
Date Wed, 08 Jun 2016 22:07:21 GMT
Francesco Capponi created NUTCH-2276:

             Summary: Tika Boilerpipe Parser in combo with RSS items doesn't work
                 Key: NUTCH-2276
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.11, 1.12
         Environment: feed parser for RSS;
                      Tika parser with Boilerpipe (ArticleExtractor) for HTML
            Reporter: Francesco Capponi

Sometimes the text (description) of an RSS item is too short, or has other characteristics
that make Tika with Boilerpipe discard the entire text, resulting in an empty string.

This happens because the feed plugin selects a parser with the call:
      Parser parser = parserFactory.getParsers(contentType, link)[0];
Because the content is HTML, this call returns the Tika Boilerpipe article extractor.

Since the description text of an RSS item is, as far as I know, always HTML, instead of reading
the contentType from the response metadata we could set a dedicated MIME type for this specific case:
    String contentType = contentMeta.get(Response.CONTENT_TYPE);
 ->String contentType = "text/html-short";
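
To illustrate the idea, here is a minimal, self-contained sketch of how a dedicated MIME type could route feed item descriptions away from Boilerpipe. It does not use the Nutch API: the class FeedItemParserChoice, the map PARSER_BY_MIME, and the parser names "tika-boilerpipe" and "plain-html" are hypothetical stand-ins for the real MIME-type-to-parser configuration. Only the "text/html-short" value comes from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the parser lookup; not part of Nutch.
class FeedItemParserChoice {

    // Assumed stand-in for the MIME type -> parser configuration.
    // Both parser names are illustrative only.
    static final Map<String, String> PARSER_BY_MIME = new HashMap<>();
    static {
        PARSER_BY_MIME.put("text/html", "tika-boilerpipe");
        PARSER_BY_MIME.put("text/html-short", "plain-html");
    }

    // MIME type proposed in this issue for RSS item descriptions.
    static final String FEED_ITEM_MIME = "text/html-short";

    // Proposed behaviour: ignore the declared content type of the feed
    // response and always label item descriptions with the dedicated type.
    static String contentTypeForItem(String declaredContentType) {
        return FEED_ITEM_MIME;
    }

    // Looks up the parser registered for a MIME type.
    static String selectParser(String contentType) {
        String parser = PARSER_BY_MIME.get(contentType);
        if (parser == null) {
            throw new IllegalArgumentException("no parser for " + contentType);
        }
        return parser;
    }

    public static void main(String[] args) {
        String declared = "text/html";
        // Current behaviour: the declared type selects the Boilerpipe parser.
        System.out.println("current:  " + selectParser(declared));
        // Proposed behaviour: the dedicated type selects a different parser.
        System.out.println("proposed: " + selectParser(contentTypeForItem(declared)));
    }
}
```

For this to work in practice, the new MIME type would presumably also have to be mapped to a parser that does not apply Boilerpipe; that mapping is assumed here rather than shown.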

This message was sent by Atlassian JIRA
