nutch-dev mailing list archives

From Renato Javier Marroquín Mogrovejo (JIRA) <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1791) Null pointer exceptions with gora-cassandra-0.4
Date Mon, 03 Nov 2014 17:36:34 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194768#comment-14194768 ]

Renato Javier Marroquín Mogrovejo commented on NUTCH-1791:
----------------------------------------------------------

Hey [~lewismc], this is the data evolution problem we have been discussing lately. The main
problem I see here is that we are also making Nutch change the data schema that it uses. If
Nutch writes field A with type AA, and we later decide to write A with type BB, then of
course this kind of problem will arise.
Gora currently allows the reader's schema view, i.e. it tries to read what you tell it to read,
even though the stored data might have some other type. One solution is to use an older schema
(which will compile to a correct data bean). The other (union-specific) solution is to "try"
deserializing with the other types in the union. But this could lead to bad results as well:
the union field might declare types [null, string] while the data was actually written as integers.
Gora enforces the reader's schema view, but we need a way to support the writer's schema perspective
as well.
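
To make the union concern concrete, here is a minimal sketch in plain Java. It is not Gora or Avro code: the class and method names are hypothetical, and the 4-byte big-endian integer encoding is an assumption chosen only for illustration. It shows how "trying" the string branch of a [null, string] union on bytes that were written as an integer can succeed without any error and still return the wrong data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; it does not use Gora's or Avro's real serializers.
public class UnionGuessSketch {

    // Writer side: the field is stored as a 4-byte big-endian integer (assumed encoding).
    static byte[] writeAsInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Reader side: the reader schema declares [null, string], so after ruling out
    // null we "try" the string branch on whatever bytes are stored.
    static String readAsStringBranch(byte[] stored) {
        if (stored == null || stored.length == 0) {
            return null; // treated as the null branch of the union
        }
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stored = writeAsInt(0x61626364);      // the writer stored an integer
        String decoded = readAsStringBranch(stored); // the reader assumes a string
        System.out.println(decoded);                 // prints "abcd", with no error raised
    }
}
```

Nothing fails here, which is the point: without the writer's schema stored next to the data, the reader cannot tell a mistyped value from a valid one.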


> Null pointer exceptions with gora-cassandra-0.4
> -----------------------------------------------
>
>                 Key: NUTCH-1791
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1791
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator, storage
>    Affects Versions: 2.3
>         Environment: dsc-cassandra-2.0.2, dsc-cassandra-2.0.7
>            Reporter: Koen Smets
>             Fix For: 2.4
>
>
> The latest nutch-2.x source checkout fails to run with Cassandra 2.0.2 (and also Cassandra 2.0.7) as the storage backend, both in the normal Nutch operation cycle (inject, generate, fetch) and in the JUnit test {{TestGoraStorage}}:
> {code}
> 2014-06-03 11:24:23,495 INFO  connection.CassandraHostRetryService (CassandraHostRetryService.java:<init>(48))
- Downed Host Retry service started with queue size -1 and retry delay 10s
> 2014-06-03 11:24:23,535 INFO  service.JmxMonitor (JmxMonitor.java:registerMonitor(52))
- Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
> Exception in thread "main" java.lang.NullPointerException
> 	at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
> 	at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
> 	at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
> 	at org.apache.nutch.storage.TestGoraStorage.readWrite(TestGoraStorage.java:93)
> 	at org.apache.nutch.storage.TestGoraStorage.main(TestGoraStorage.java:230)
> {code}
> After injecting:
> {code}
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch inject urls
> InjectorJob: starting at 2014-06-03 11:55:11
> InjectorJob: Injecting urlDir: urls
> InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage
class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and filtering: 1
> Injector: finished at 2014-06-03 11:55:13, elapsed: 00:00:02
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
> WebTable statistics start
> Statistics for WebTable:
> min score:	1.0
> retry 0:	1
> jobs:	{db_stats-job_local1403358409_0001={jobID=job_local1403358409_0001, jobName=db_stats,
counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=97,
MAP_INPUT_RECORDS=1, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=12, MAP_OUTPUT_BYTES=53, COMMITTED_HEAP_BYTES=358612992,
CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=4, REDUCE_INPUT_RECORDS=6,
REDUCE_INPUT_GROUPS=6, COMBINE_OUTPUT_RECORDS=6, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=6,
VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4}, FileSystemCounters={FILE_BYTES_READ=974145,
FILE_BYTES_WRITTEN=1144369}, File Output Format Counters ={BYTES_WRITTEN=225}}}}
> max score:	1.0
> TOTAL urls:	1
> status 0 (null):	1
> avg score:	1.0
> WebTable statistics: done
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
> key:	http://example.com/
> baseUrl:	null
> status:	0 (null)
> fetchTime:	1401789311270
> prevFetchTime:	0
> fetchInterval:	2592000
> retriesSinceFetch:	0
> modifiedTime:	0
> prevModifiedTime:	0
> protocolStatus:	(null)
> parseStatus:	(null)
> title:	null
> score:	1.0
> markers:	org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
> reprUrl:	null
> metadata _csh_ : 	?�
> {code}
> After generating,
> {code}
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch generate -topN 1
> GeneratorJob: starting at 2014-06-03 11:55:38
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: true
> GeneratorJob: normalizing: true
> GeneratorJob: topN: 1
> GeneratorJob: finished at 2014-06-03 11:55:40, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1401789338-222512082 containing 1 URLs
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
> WebTable statistics start
> Statistics for WebTable:
> jobs:	{db_stats-job_local73029265_0001={jobID=job_local73029265_0001, jobName=db_stats,
counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6,
MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=358612992,
CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0,
VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=974054,
FILE_BYTES_WRITTEN=1144028}, File Output Format Counters ={BYTES_WRITTEN=98}}}}
> TOTAL urls:	0
> WebTable statistics: done
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
> WebTableReader: java.lang.NullPointerException
> 	at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
> 	at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
> 	at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
> 	at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:238)
> 	at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:494)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> 	at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:430)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
