nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
Date Mon, 14 May 2007 17:52:16 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495696 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

I am not sure I follow you, Andrzej. My patch already does something very similar in the
Fetchers. Actually, the only difference between our patches, w.r.t. the Fetcher code, is that
in your patch the parsing condition also includes a (content != null) check. Beyond that, our
code is pretty much the same. (I will send an updated patch that does that, btw.) Besides the
code change in the Fetchers, we also need to remove the redir != null condition for the
indexer to work correctly. See my comment above for a hopefully more understandable description.

Indexer has to read crawl_parse, because that is where ParseSegment pushes the fetch datums
for sub-URLs. So it is not related to the redirection issue. It is related to the "Oh man, I
forgot to include that line in my patch" issue :).

If reading crawl_parse turns out to be a big burden on the Indexer, perhaps we can make
ParseSegment push these datums to another file (crawl_late_fetch? Yeah, I know that name
sucks :). It would be awesome if Hadoop allowed us to reopen SequenceFiles to append data (so
we could have just pushed them to crawl_fetch). AFAIK, Hadoop doesn't have that yet.




> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch,
> NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch,
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch,
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff, patch.txt, redirect_and_index.patch
>
>
> Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can
> return multiple Parse objects, which will all be indexed separately. Advantage: no need to
> fetch all feed items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
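
The proposal quoted above can be pictured with a small, self-contained Java sketch. This is
not the code from any of the attached patches and does not use the real Nutch classes: Parse,
FeedItem, Parser and RssParser below are hypothetical stand-ins, and the assumption that each
feed item's link is used as the map key is mine. It only illustrates how a single parse call
on one fetched feed could yield one entry per item, each of which could then be indexed
separately.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiParseSketch {

    // Hypothetical stand-in for a parse result: only a title and the extracted text.
    public static final class Parse {
        public final String title;
        public final String text;

        public Parse(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // Hypothetical stand-in for one already-extracted item of an RSS feed.
    public static final class FeedItem {
        public final String link;
        public final String title;
        public final String description;

        public FeedItem(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    // Hypothetical parser contract: one fetched document in, a map of URL -> Parse out.
    public interface Parser {
        Map<String, Parse> parse(String url, List<FeedItem> items);
    }

    // Returns one Parse per feed item, keyed by the item's own link (an assumption).
    public static final class RssParser implements Parser {
        @Override
        public Map<String, Parse> parse(String feedUrl, List<FeedItem> items) {
            // LinkedHashMap keeps the items in feed order.
            Map<String, Parse> parses = new LinkedHashMap<>();
            for (FeedItem item : items) {
                parses.put(item.link, new Parse(item.title, item.description));
            }
            return parses;
        }
    }

    public static void main(String[] args) {
        List<FeedItem> items = List.of(
                new FeedItem("http://example.org/a", "First item", "Text of the first item"),
                new FeedItem("http://example.org/b", "Second item", "Text of the second item"));

        Map<String, Parse> parses = new RssParser().parse("http://example.org/feed.xml", items);

        // Each entry could be handed to the indexer as if it were its own document.
        for (Map.Entry<String, Parse> e : parses.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title);
        }
    }
}
```

In this sketch the feed is fetched once and the items never need a fetch of their own, which
is the advantage the issue description names. How the resulting entries reach the segment
data and the indexer is exactly what the comment above discusses and is not modelled here.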

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

