lucene-java-user mailing list archives

From mchaput <>
Subject Re: HTMLParser choking on Unicode
Date Tue, 08 Apr 2003 18:04:45 GMT
Excellent! Thanks very much, Eric. Sorry to the list if this was too 
basic... I'm very new to the world of non-Latin LP.


Eric Isakson wrote:
> I'm using HTML Parser to parse Japanese content with no troubles. Be sure to set the
> encoding when you read the HTML files. I have a method I use to get the Reader object:
>     public BufferedReader getReader() throws IOException {
>         InputStream in = getInputStream();
>         return new BufferedReader(new InputStreamReader(in, getCharset()));
>     }
> getInputStream() gets my input stream from a FileInputStream(File) or a
> JarFile.getInputStream(JarEntry), and getCharset() returns the encoding name.
> My object keeps track of the language of the content, and in my application
> all the content for a given language is required to use a specific encoding, so
> I keep a Hashtable of language to encoding. For Japanese, we use shift_jis as
> the encoding and things are working
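
For anyone reading this in the archive, the approach Eric describes could be sketched roughly as below. This is only an illustration under assumptions, not his actual code: the class name CharsetReaderSketch, the helpers charsetFor() and openReader(), the language keys, and the "en" entry are all made up here. Only the Japanese-to-shift_jis mapping comes from his message. The key point is passing the charset name to InputStreamReader instead of relying on the platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class CharsetReaderSketch {
    // Hypothetical language-to-encoding table. Only the Japanese entry is
    // taken from the message above; the English entry is an assumption.
    private static final Map<String, String> ENCODINGS = new HashMap<>();
    static {
        ENCODINGS.put("ja", "Shift_JIS");
        ENCODINGS.put("en", "ISO-8859-1");
    }

    // Look up the encoding for a language, failing loudly rather than
    // silently falling back to the platform default charset.
    static String charsetFor(String language) {
        String name = ENCODINGS.get(language);
        if (name == null) {
            throw new IllegalArgumentException("No encoding registered for language: " + language);
        }
        return name;
    }

    // Wrap the raw byte stream in a Reader that decodes with the charset
    // registered for the given language.
    static BufferedReader openReader(InputStream in, String language) throws IOException {
        return new BufferedReader(new InputStreamReader(in, charsetFor(language)));
    }

    public static void main(String[] args) throws IOException {
        // Three Japanese characters written as Unicode escapes so the source
        // file itself stays ASCII; encode them as Shift_JIS bytes, then
        // decode them again through openReader().
        String text = "\u65e5\u672c\u8a9e";
        byte[] bytes = text.getBytes(Charset.forName("Shift_JIS"));
        try (BufferedReader reader = openReader(new ByteArrayInputStream(bytes), "ja")) {
            System.out.println(text.equals(reader.readLine()));
        }
    }
}
```

The resulting Reader is what would then be handed to the HTML parser, so the parser only ever sees decoded characters and never raw bytes.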

Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7    |   (416) 874-8268
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter
