nutch-dev mailing list archives

From Vinci <vinci.w...@polyu.edu.hk>
Subject Re: [jira] Created: (NUTCH-624) Better parsed text
Date Tue, 01 Apr 2008 09:21:30 GMT

Hi,

Thank you for your feedback.
By default, the parsed text dumped by the readseg utility is just the page
text joined with spaces, which is not easy to process.
I need to process the text sentence by sentence. However, on most of the
pages I crawled, some sentences do not end with a full stop or comma (though
spaces can still appear within them)! That forces me to use complex regular
expressions to break the text into lines. So I filed this as an improvement
in JIRA, hoping that the default parser (or the readseg utility) can change
its default behaviour.
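
The requested behaviour can be sketched as follows. This is only an
illustration in Python using the standard html.parser module; it is not how
NekoHTML or the Nutch parser is implemented, and the BLOCK_TAGS set and the
names DelimitedTextExtractor, extract and split_units are assumptions made
up for this example.

```python
import re
from html.parser import HTMLParser

# Assumed, illustrative set of block-level tags (not taken from Nutch or NekoHTML).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "tr", "blockquote"}


class DelimitedTextExtractor(HTMLParser):
    """Collect page text, writing a tab after each inline element ends
    and a tab + newline after each block-level element ends."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Collapse runs of whitespace so only our own tabs/newlines act as delimiters.
        self.parts.append(re.sub(r"\s+", " ", data))

    def handle_endtag(self, tag):
        self.parts.append("\t\n" if tag in BLOCK_TAGS else "\t")


def extract(html):
    """Return the delimited text for an HTML fragment."""
    parser = DelimitedTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def split_units(text):
    """Split delimited text into one cleaned string per block-level element."""
    return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


if __name__ == "__main__":
    sample = ("<div><p>First headline without full stop</p>"
              "<p>Second <b>bold</b> item</p></div>")
    for unit in split_units(extract(sample)):
        print(unit)
```

With output like this, text that lacks a full stop can still be split at
block boundaries with a plain split on newline instead of a complex regular
expression.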

Thank you,
Vinci


ogjunk-nutch wrote:
> 
> Vinci,
> 
> Please use the mailing list to ask questions and discuss first, not JIRA. 
> Also, please include an example of what you are describing, if you can.
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> ----- Original Message ----
> From: Vinci (JIRA) <jira@apache.org>
> To: nutch-dev@lucene.apache.org
> Sent: Sunday, March 30, 2008 7:55:24 AM
> Subject: [jira] Created: (NUTCH-624) Better parsed text
> 
> Better parsed text
> ------------------
> 
>                  Key: NUTCH-624
>                  URL: https://issues.apache.org/jira/browse/NUTCH-624
>              Project: Nutch
>           Issue Type: Improvement
>             Reporter: Vinci
> 
> 
> I found that the text produced by the default parser, Neko, is not easy to
> process: it just adds a space at the end of each tag. Can Neko (or another
> parser) change this behaviour to
> 1. add a tab (at the end of an inline element)
> 2. add a tab + newline (at the end of a block-level element)
> instead of a space, so that we get better parsed text?
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.

-- 
View this message in context: http://www.nabble.com/-jira--Created%3A-%28NUTCH-624%29-Better-parsed-text-tp16381323p16416916.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

