lucene-java-user mailing list archives

From "Eric Isakson"
Subject RE: HTMLParser choking on Unicode
Date Tue, 08 Apr 2003 17:23:47 GMT
I'm using HTMLParser to parse Japanese content with no trouble. Be sure to set the encoding
when you read the HTML files. I use this method to get the Reader object:

    public BufferedReader getReader() throws IOException {
        InputStream in = getInputStream();
        // Decode the bytes with the content's charset, not the platform default.
        return new BufferedReader(new InputStreamReader(in, getCharset()));
    }

getInputStream() returns my input stream from either a FileInputStream(File) or a
JarFile.getInputStream(JarEntry).

getCharset() returns the encoding for the content's language. My object keeps track of the
language of the content, and in my application all the content for a given language is
required to use a specific encoding, so I keep a Hashtable of language to encoding. For
Japanese, we use shift_jis as the encoding and things are working.
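
As a rough illustration of the language-to-encoding lookup described above, here is a minimal
sketch. The class name LanguageReaders, the methods charsetFor and openReader, and the "en"
entry are all made up for this example; only the Japanese/shift_jis pairing comes from my
setup, and it uses a HashMap where my code uses a Hashtable.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class LanguageReaders {
    // Language code -> charset name. Only the Japanese entry is taken from the
    // message; the English entry is a hypothetical placeholder.
    private static final Map<String, String> ENCODINGS = new HashMap<>();
    static {
        ENCODINGS.put("ja", "Shift_JIS");
        ENCODINGS.put("en", "ISO-8859-1");
    }

    /** Returns the charset registered for a language, or throws if none is registered. */
    public static Charset charsetFor(String language) {
        String name = ENCODINGS.get(language);
        if (name == null) {
            throw new IllegalArgumentException("No encoding registered for " + language);
        }
        return Charset.forName(name);
    }

    /** Wraps a byte stream in a Reader that decodes with the language's charset. */
    public static BufferedReader openReader(InputStream in, String language) {
        return new BufferedReader(new InputStreamReader(in, charsetFor(language)));
    }

    public static void main(String[] args) throws Exception {
        // Encode three kanji as Shift_JIS bytes, then read them back through the Reader.
        String text = "\u65E5\u672C\u8A9E";
        byte[] bytes = text.getBytes(charsetFor("ja"));
        try (BufferedReader reader = openReader(new ByteArrayInputStream(bytes), "ja")) {
            System.out.println(text.equals(reader.readLine()));
        }
    }
}
```

Throwing on an unknown language is a design choice for the sketch: silently falling back to
the platform default encoding is what produces garbled input in the first place.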

If you don't know the encoding of your HTML file up front, you have to do some more work to
determine the encoding before you hand the Reader to HTMLParser.
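
One possible way to do that extra work, sketched under assumptions: read the first chunk of
the file as bytes, check for a byte order mark, then look for a charset=... declaration, and
only then fall back to a default. The CharsetSniffer class and its sniff method are
hypothetical names for this example, not part of HTMLParser or Lucene, and the approach is
a simple heuristic rather than a complete detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Finds charset=NAME anywhere in the sampled bytes, e.g. inside
    // <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">.
    // The name must start with a letter or digit so Charset.isSupported accepts it.
    private static final Pattern DECLARED_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:-]*)", Pattern.CASE_INSENSITIVE);

    /** Guesses the charset from the first bytes of an HTML file. */
    public static Charset sniff(byte[] head, Charset fallback) {
        // 1. Byte order marks (UTF-8, then UTF-16 big and little endian).
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        // 2. Declared charset. ISO-8859-1 maps every byte to one char, so the
        //    ASCII markup stays readable even when the body is in another encoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED_CHARSET.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        // 3. Nothing found: use the caller's default.
        return fallback;
    }

    public static void main(String[] args) {
        byte[] head = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=Shift_JIS\">"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head, StandardCharsets.ISO_8859_1));
    }
}
```

Once the charset is known, reopen the stream from the beginning and build the Reader with
that charset before handing it to HTMLParser.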

Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639

-----Original Message-----
From: mchaput
Sent: Tuesday, April 08, 2003 12:55 PM
Subject: HTMLParser choking on Unicode

When I try to index Japanese HTML files using HTMLParser, I just get "lexical errors" in every

   Parse Aborted: Lexical error at line 12, column 28.
   Encountered: "\u2030" (8240), after : ""

Is there something I have to do to make HTMLParser work with Unicode?

(I haven't done anything special with readers or encodings (don't really know much about it)...
is that the problem?)



Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7    |   (416) 874-8268
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter

To unsubscribe, e-mail:
For additional commands, e-mail: