nutch-dev mailing list archives

From "J. Gobel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'
Date Tue, 08 Jan 2013 08:28:13 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J. Gobel updated NUTCH-1511:
----------------------------

    Description: 
After applying the patch for the metadata parser (NUTCH-1478), I notice that the metadata field
is populated with the correct information just before the crawl ends. However, once the crawl
has completely finished, the metadata field is populated with 'garbage': _csh_ followed by
unprintable bytes.

I notice in my SQL log file that the scoring plugin overwrites the metadata field in a
final data insertion with '_csh_ \0\0\0\0'. When I remove 'scoring-opic' from the 'plugin.includes'
property in nutch-site.xml, the metadata field remains intact and readable.

MYSQL LOG FILE: I did a crawl of http://nutch.apache.org. Below are fragments of my MySQL
log file, showing only the moments when data is written to the METADATA field of the MySQL
table.


First insertion: here I suppose scoring-opic writes its information, _csh_ ?€\0\0\0

58 Query    INSERT INTO webpage (fetchInterval,fetchTime,id,markers,metadata,score )VALUES
(2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
_csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE fetchInterval=2592000,fetchTime=1357122976493,markers='
dist 0 _injmrk_ y\0',metadata='
_csh_ ?€\0\0\0',score=1.0
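
A note on the value written above: the bytes after _csh_ look less like garbage when read as a binary number. Assuming the log was rendered as Windows-1252 (so '?' is 0x3F and '€' is 0x80) and that the scoring plugin stores its value as a 4-byte big-endian IEEE 754 float (both are assumptions, not confirmed in this report), the first four bytes decode to 1.0, which matches the score=1.0 written in the same query. A minimal Python sketch; decode_cash is a hypothetical helper name:

```python
import struct

# Bytes as they appear in the log after the "_csh_" key.
# Assumption: the log is rendered as Windows-1252, so '?' = 0x3F and '€' = 0x80.
logged = "?€".encode("cp1252") + b"\x00\x00"


def decode_cash(raw: bytes) -> float:
    """Interpret four bytes as a big-endian IEEE 754 single-precision float."""
    return struct.unpack(">f", raw)[0]


initial_cash = decode_cash(logged)
print(logged.hex(), initial_cash)  # prints: 3f800000 1.0
```

Under the same assumptions, four zero bytes decode to 0.0.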


Second insertion: here the scraped metadata is inserted into the metadata field.

 81 Query    INSERT INTO webpage (id,markers,metadata,outlinks,parseStatus,signature,text,title
)VALUES ('org.apache.nutch:http/',



Final insertion: please note that here the metadata field is overwritten with _csh_ \0\0\0\0

90 Query    INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata )VALUES (1359714995075,'org.apache.nutch:http/','
0http://nutch.apache.org/
Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508 __prsmrk__*1357122982-1745626508
_gnmrk_*1357122982-1745626508 _ftcmrk_*1357122982-1745626508\0','
_csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 0http://nutch.apache.org/
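
The overwrite described above can be illustrated with a small simulation. This is only a sketch of the reported symptom, not Nutch or Gora code: it assumes the metadata column holds one serialized map and that an upsert replaces the column as a whole, so any key missing from the writer's map is lost. The in-memory table, the upsert helper, and the 'meta_description' key are hypothetical.

```python
import struct

# Hypothetical in-memory stand-in for the `webpage` table: id -> column dict.
table = {}


def upsert(row_id, **columns):
    """Mimic INSERT ... ON DUPLICATE KEY UPDATE: each listed column is replaced whole."""
    table.setdefault(row_id, {}).update(columns)


row = "org.apache.nutch:http/"

# Parse step: the metadata map holds the scoring key plus scraped metadata
# (the 'meta_description' key and its value are made up for illustration).
upsert(row, metadata={"_csh_": struct.pack(">f", 1.0),
                      "meta_description": b"Apache Nutch"})

# Final step: the writer's map contains only the scoring key, so the whole
# column is replaced and the scraped entry disappears.
upsert(row, metadata={"_csh_": struct.pack(">f", 0.0)})

final = table[row]["metadata"]
print(sorted(final))   # prints: ['_csh_']
print(final["_csh_"])  # prints: b'\x00\x00\x00\x00'
```

If this is what happens, merging the existing map with the new entries before writing would preserve the scraped keys; whether the storage layer actually behaves this way is not established by the log fragments.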

  was:
After applying the patch for the metadata parser (NUTCH-1478), I notice that the metadata field
is populated with the correct information just before the crawl ends. However, once the crawl
has completely finished, the metadata field is populated with 'garbage': _csh_ followed by
unprintable bytes.

I notice in my SQL log file that the scoring plugin is updating the metadata field with
'_csh_ \0\0\0\0'. When I remove 'scoring-opic' from the 'plugin.includes' property in
nutch-site.xml, the metadata field remains intact and readable.

MYSQL LOG FILE: I did a crawl of http://nutch.apache.org. Below are fragments of my MySQL
log file, showing only the moments when data is written to the METADATA field of the MySQL
table.


First insertion: here I suppose scoring-opic writes its information, _csh_ ?€\0\0\0

58 Query    INSERT INTO webpage (fetchInterval,fetchTime,id,markers,metadata,score )VALUES
(2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
_csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE fetchInterval=2592000,fetchTime=1357122976493,markers='
dist 0 _injmrk_ y\0',metadata='
_csh_ ?€\0\0\0',score=1.0


Second insertion: here the scraped metadata is inserted into the metadata field.

 81 Query    INSERT INTO webpage (id,markers,metadata,outlinks,parseStatus,signature,text,title
)VALUES ('org.apache.nutch:http/',



Final insertion: please note that here the metadata field is updated with _csh_ \0\0\0\0

90 Query    INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata )VALUES (1359714995075,'org.apache.nutch:http/','
0http://nutch.apache.org/
Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508 __prsmrk__*1357122982-1745626508
_gnmrk_*1357122982-1745626508 _ftcmrk_*1357122982-1745626508\0','
_csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 0http://nutch.apache.org/

    
> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
>                 Key: NUTCH-1511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1511
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, injector, storage
>    Affects Versions: 2.1
>         Environment: Ubuntu 12.04
>            Reporter: J. Gobel
>              Labels: metadata, mysql, nutch, scoring-opic

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
