nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1530) Umlauts (üäö) garbled when fetch and parse in separate calls (OK when fetcher.parse is true)
Date Wed, 13 Feb 2013 19:04:13 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577831#comment-13577831 ]

Lewis John McGibbney commented on NUTCH-1530:
---------------------------------------------

I would imagine that setting 

{code}
log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout
{code}

to

{code}
log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout
{code}

and then recompiling would make hadoop.log state explicitly which parser is being used.
                
> Umlauts (üäö) garbled when fetch and parse in separate calls (OK when fetcher.parse is true)
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1530
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1530
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.1
>         Environment: Using Cassandra-1.2.1 as data store.
>            Reporter: Edward Ackroyd
>             Fix For: 2.2
>
>
> When crawling http://www.spiegel.de (popular German news site) in separate fetch and
> parse calls (nutch fetch, then nutch parse, fetcher.parse=false) this lands in Cassandra (umlauts
> all garbled, for example '�' instead of 'ö'):
> [default@webpage] list p;
> RowKey: de.spiegel.www:http/
> => (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS Newsletter
> Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE NACHRICHTEN Home Politik
> Deutschland Ausland   Wirtschaft B�rse Verbraucher & Service Unternehmen & M�rkte
> Staat & Soziales Jobsuche Immowelt   Panorama Justiz Leute Gesellschaft Partnersuche
> Eurojackpot Tarifvergleiche   Sport Wintersport Fu�ball Bundesliga...
> However, when fetcher.parse=true and the fetch call does the parsing, the correct umlauts
> land in Cassandra:
> [default@webpage] list p;
> RowKey: de.spiegel.www:http/
> => (column=c, value=SPIEGEL ONLINE - Nachrichten Schlagzeilen Hilfe RSS Newsletter
> Mobil Wetter TV-Programm Dienstag, 12. Februar 2013 SPIEGEL ONLINE NACHRICHTEN Home Politik
> Deutschland Ausland   Wirtschaft Börse Verbraucher & Service Unternehmen & Märkte
> Staat & Soziales Jobsuche Immowelt   Panorama Justiz Leute Gesellschaft Partnersuche
> Eurojackpot Tarifvergleiche   Sport Wintersport Fußball Bundesliga...
> Seems the content is over-encoded when fetching/parsing in separate calls.
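
For illustration only, here is a minimal Java sketch of one way a '�' (U+FFFD) can appear in place of 'ö'. It assumes the page bytes are ISO-8859-1 and are decoded as UTF-8 somewhere between fetch and parse. That is an assumption, not a confirmed diagnosis of this issue, and it is a wrong-charset decode rather than the "over-encoding" suggested above. The class and method names are hypothetical and are not part of Nutch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch.
class UmlautDecodeSketch {

    // Decode raw bytes as UTF-8; malformed byte sequences are replaced with U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode raw bytes as ISO-8859-1, where every byte maps to exactly one character.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The word from the report, written with an escape so the
        // source file encoding does not affect the demonstration.
        byte[] raw = "B\u00f6rse".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(raw);   // assumed faulty path
        String right = decodeAsLatin1(raw); // assumed correct path

        System.out.println("contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("round trip ok:   " + right.equals("B\u00f6rse"));
    }
}
```

If the stored content shows the same replacement character, comparing the charset detected in the combined fetch-and-parse run with the one used in the separate parse run may be a reasonable next check.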
