nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Class Cast exception
Date Fri, 06 Jan 2006 20:39:14 GMT
Matt Zytaruk wrote:

> Here you go.
> java.lang.ClassCastException: java.util.ArrayList
>        at org.apache.nutch.parse.ParseData.write(
>        at org.apache.nutch.parse.ParseImpl.write(
>        at 
> org.apache.nutch.fetcher.FetcherOutput.write(
>        at 
>        at org.apache.nutch.mapred.MapTask$1.collect(
>        at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(
>        at 
> org.apache.nutch.fetcher.Fetcher$

Congratulations! You are the first person to actually use (and suffer 
from) the multiple values in ContentProperties... ;-)

It turns out that ParseData.write() uses its own method for writing out 
metadata instead of calling ContentProperties.write(). That works well as 
long as every property has a single value (stored as a String), but 
multiple values are stored in an ArrayList. ParseData accesses the raw 
values directly by virtue of using metadata.entrySet().iterator() and 
casts each one to String, so the cast fails on the ArrayList.
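
The failure is easy to reproduce outside Nutch. The sketch below is not Nutch code: MultiValueDemo, sample(), naiveWrite() and typeAwareWrite() are hypothetical names, a plain TreeMap stands in for ContentProperties (assuming single values are Strings and multiple values are ArrayLists, as described above), and DataOutputStream.writeUTF() stands in for UTF8.writeString().

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MultiValueDemo {

  // Builds a map with one single-valued and one multi-valued property.
  static Map<String, Object> sample() {
    Map<String, Object> metadata = new TreeMap<>();
    metadata.put("Content-Type", "text/html");
    List<String> cookies = new ArrayList<>();
    cookies.add("a=1");
    cookies.add("b=2");
    metadata.put("Set-Cookie", cookies);
    return metadata;
  }

  // Mimics the old loop: assumes every value is a String.
  static void naiveWrite(Map<String, Object> metadata, DataOutputStream out)
      throws IOException {
    out.writeInt(metadata.size());
    for (Map.Entry<String, Object> e : metadata.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF((String) e.getValue()); // fails when the value is a List
    }
  }

  // Hypothetical type-aware writer: a value count precedes the values.
  static void typeAwareWrite(Map<String, Object> metadata, DataOutputStream out)
      throws IOException {
    out.writeInt(metadata.size());
    for (Map.Entry<String, Object> e : metadata.entrySet()) {
      out.writeUTF(e.getKey());
      Object v = e.getValue();
      List<?> values = (v instanceof List) ? (List<?>) v : Collections.singletonList(v);
      out.writeInt(values.size());
      for (Object s : values) {
        out.writeUTF((String) s);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Map<String, Object> metadata = sample();
    try {
      naiveWrite(metadata, new DataOutputStream(new ByteArrayOutputStream()));
    } catch (ClassCastException ex) {
      System.out.println("naive write failed: " + ex);
    }
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    typeAwareWrite(metadata, new DataOutputStream(buf));
    System.out.println("type-aware write produced " + buf.size() + " bytes");
  }
}
```

The point is only that whoever owns the map has to own the serialization too, since it is the one place that knows which values are lists.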

The fix is easy: please replace the following lines in ParseData.write():

    out.writeInt(metadata.size());                // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
      Map.Entry e = (Map.Entry)i.next();
      UTF8.writeString(out, (String)e.getKey());
      UTF8.writeString(out, (String)e.getValue());
    }

with this:

    metadata.write(out);

and the same for reading the metadata field; replace in 
ParseData.readField() this:

    int propertyCount = in.readInt();             // read metadata
    metadata = new ContentProperties();
    for (int i = 0; i < propertyCount; i++) {
      metadata.put(UTF8.readString(in), UTF8.readString(in));
    }

with this:

    metadata = new ContentProperties();
    metadata.readFields(in);

Compile, deploy, test, report ... :-) Please note that this changes the 
on-disk segment format, so you won't be able to read the old segments 
with the new code. You may want to bump ParseData.VERSION and keep the 
old reading code around to handle older versions...
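
One way to do that is to write a version byte first and branch on it when reading. The sketch below is an illustration under assumptions, not the actual ParseData code: VersionedMeta, OLD_VERSION, CUR_VERSION and both on-disk layouts are made up for the example, and a Map of String lists stands in for ContentProperties.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class VersionedMeta {
  static final byte OLD_VERSION = 1; // assumed layout: one String value per key
  static final byte CUR_VERSION = 2; // assumed layout: counted list of values per key

  // Always writes the current layout, prefixed by its version byte.
  static void write(Map<String, List<String>> meta, DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);
    out.writeInt(meta.size());
    for (Map.Entry<String, List<String>> e : meta.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeInt(e.getValue().size());
      for (String v : e.getValue()) {
        out.writeUTF(v);
      }
    }
  }

  // Reads either layout, choosing the per-key value count by version.
  static Map<String, List<String>> read(DataInput in) throws IOException {
    byte version = in.readByte();
    if (version != OLD_VERSION && version != CUR_VERSION) {
      throw new IOException("unknown version: " + version);
    }
    Map<String, List<String>> meta = new TreeMap<>();
    int count = in.readInt();
    for (int i = 0; i < count; i++) {
      String key = in.readUTF();
      int n = (version == OLD_VERSION) ? 1 : in.readInt();
      List<String> values = new ArrayList<>();
      for (int j = 0; j < n; j++) {
        values.add(in.readUTF());
      }
      meta.put(key, values);
    }
    return meta;
  }

  public static void main(String[] args) throws IOException {
    // Simulate an old segment: version 1, one key, one value.
    ByteArrayOutputStream old = new ByteArrayOutputStream();
    DataOutputStream oldOut = new DataOutputStream(old);
    oldOut.writeByte(OLD_VERSION);
    oldOut.writeInt(1);
    oldOut.writeUTF("Content-Type");
    oldOut.writeUTF("text/html");
    System.out.println(read(new DataInputStream(new ByteArrayInputStream(old.toByteArray()))));

    // Round-trip a multi-valued entry through the current layout.
    Map<String, List<String>> meta = new TreeMap<>();
    meta.put("Set-Cookie", Arrays.asList("a=1", "b=2"));
    ByteArrayOutputStream cur = new ByteArrayOutputStream();
    write(meta, new DataOutputStream(cur));
    System.out.println(read(new DataInputStream(new ByteArrayInputStream(cur.toByteArray()))));
  }
}
```

Writing only the newest layout while reading both keeps old segments usable without carrying two writers.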

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
