tika-dev mailing list archives

From Nick Lothian <nloth...@educationau.edu.au>
Subject RE: Reading metadata without downloading entire file
Date Thu, 19 Feb 2009 00:23:49 GMT
ID3v2 would be great - it appears ID3v1 is widely used in music MP3 files, but not in Podcast
MP3s.

Anyway, if anyone is having a similar problem, here's some code that appears to work using
Apache HttpClient.

HTTP Range requests for MP3 metadata:

                HttpClient httpClient = new HttpClient();
                httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(10000);
                httpClient.getHttpConnectionManager().getParams().setSoTimeout(10000);

                String address = "http://address of mp3 file here";

                HttpMethod method = new HeadMethod();
                method.setURI(new URI(address,true));

                Header contentLengthHeader = null;
                Header acceptHeader = null;

                httpClient.executeMethod(method);
                try {
                        //System.out.println(Arrays.toString(method.getResponseHeaders()));
                        contentLengthHeader = method.getResponseHeader("Content-Length");
                        acceptHeader = method.getResponseHeader("Accept-Ranges");
                } finally {
                        method.releaseConnection();
                }

                if ((contentLengthHeader != null) && (acceptHeader != null)
                                && "bytes".equals(acceptHeader.getValue())) {
                        long contentLength = Long.parseLong(contentLengthHeader.getValue());
                        long metaDataStartRange = contentLength - 128;
                        if (metaDataStartRange > 0) {
                                method = new GetMethod();
                                method.setURI(new URI(address,true));
                                // Byte ranges are inclusive, so the last byte is contentLength - 1
                                method.addRequestHeader("Range", "bytes=" + metaDataStartRange
                                                + "-" + (contentLength - 1));
                                System.out.println(Arrays.toString(method.getRequestHeaders()));
                                httpClient.executeMethod(method);
                                try {
                                        Parser parser = new AutoDetectParser();

                                        Metadata metadata = new Metadata();
                                        metadata.set(Metadata.RESOURCE_NAME_KEY, address);
                                        InputStream stream = method.getResponseBodyAsStream();
                                        try {
                                                parser.parse(stream, new DefaultHandler(), metadata);
                                        } catch (Exception e) {
                                                e.printStackTrace();
                                        } finally {
                                                stream.close();
                                        }
                                        System.out.println(Arrays.toString(metadata.names()));
                                        System.out.println("Title: " + metadata.get("title"));
                                        System.out.println("Author: " + metadata.get("Author"));
                                } finally {
                                        method.releaseConnection();
                                }
                        }
                } else {
                        System.err.println("Range not supported. Headers were: ");
                        System.err.println(Arrays.toString(method.getResponseHeaders()));
                }
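
HTTP byte ranges are inclusive at both ends, so the last 128 bytes of a resource of length N are bytes N-128 through N-1. A minimal sketch of building that header value follows; the class and method names are hypothetical and not part of HttpClient or Tika. The suffix form ("bytes=-128") asks for the trailing bytes without needing the length first, though whether a given server honours it is an assumption to verify.

```java
// Sketch only: RangeSketch and its methods are hypothetical helper names.
class RangeSketch {

    /** Inclusive range covering the last n bytes of a resource of known length. */
    static String lastBytes(long contentLength, int n) {
        if (contentLength <= 0 || n <= 0) {
            throw new IllegalArgumentException("length and n must be positive");
        }
        long first = Math.max(0, contentLength - n); // clamp for files shorter than n
        long last = contentLength - 1;               // last valid byte offset
        return "bytes=" + first + "-" + last;
    }

    /** Suffix range: "the final n bytes", with no Content-Length lookup required. */
    static String suffix(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        return "bytes=-" + n;
    }

    public static void main(String[] args) {
        System.out.println(lastBytes(1000, 128)); // bytes=872-999
        System.out.println(suffix(128));          // bytes=-128
    }
}
```

Either string could be passed as the value of the "Range" request header in place of the hand-built concatenation.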


-----Original Message-----
From: Jonathan Koren [mailto:jonathan@soe.ucsc.edu]
Sent: Thursday, 19 February 2009 8:44 AM
To: tika-dev@lucene.apache.org
Subject: Re: Reading metadata without downloading entire file

id3v1 is exactly 128 bytes [ http://en.wikipedia.org/wiki/ID3#Layout ]
In my copious free time, I might add id3v2 support, unless of course
someone else does.
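
The 128-byte ID3v1 layout referenced above is simple enough to decode by hand: a 3-byte "TAG" marker, then fixed-width title (30), artist (30), album (30), year (4), comment (30) and genre (1) fields. What follows is a minimal sketch of that layout, not Tika's parser; the class and method names are hypothetical, and decoding the text as ISO-8859-1 is an assumption, since ID3v1 tags in the wild use various encodings.

```java
// Sketch only: Id3v1Sketch is a hypothetical name, not a Tika class.
class Id3v1Sketch {
    static final int TAG_LENGTH = 128;

    /** Reads a fixed-width field, stopping at the first NUL and trimming space padding. */
    static String field(byte[] tag, int offset, int length) {
        int end = offset;
        while (end < offset + length && tag[end] != 0) {
            end++;
        }
        // ISO-8859-1 is an assumption; real-world tags may use other encodings.
        return new String(tag, offset, end - offset,
                java.nio.charset.StandardCharsets.ISO_8859_1).trim();
    }

    /** Returns {title, artist, album, year}, or null if the block is not an ID3v1 tag. */
    static String[] parse(byte[] tag) {
        if (tag == null || tag.length != TAG_LENGTH
                || tag[0] != 'T' || tag[1] != 'A' || tag[2] != 'G') {
            return null;
        }
        return new String[] {
            field(tag, 3, 30),   // title
            field(tag, 33, 30),  // artist
            field(tag, 63, 30),  // album
            field(tag, 93, 4)    // year
        };
    }

    public static void main(String[] args) {
        byte[] tag = new byte[TAG_LENGTH];
        byte[] head = "TAGExample Title"
                .getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        System.arraycopy(head, 0, tag, 0, head.length);
        System.out.println("Title: " + parse(tag)[0]); // prints "Title: Example Title"
    }
}
```

The input is expected to be exactly the final 128 bytes of the file; anything that does not start with "TAG" is reported as having no ID3v1 tag.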

On Feb 18, 2009, at 2:04 PM, Nick Lothian wrote:

> Well that would explain it then!
>
> Has anyone had any experience with using http-range requests for the
> metadata? How many bytes from the end does the metadata start?
>
> Nick
>
> -----Original Message-----
> From: Jonathan Koren [mailto:jonathan@soe.ucsc.edu]
> Sent: Wednesday, 18 February 2009 5:30 PM
> To: tika-dev@lucene.apache.org
> Subject: Re: Reading metadata without downloading entire file
>
>
> You're closing the stream before the metadata arrives.
>
> Tika supports ID3v1 which is at the end of the file, not the
> beginning.
>
> On Feb 17, 2009, at 10:22 PM, Nick Lothian wrote:
>
>> I'm trying to get MP3 Metadata without downloading an entire MP3.
>>
>> I've set up a FilterInputStream which throws an
>> InterruptedIOException after a given amount of a file is downloaded.
>>
>> If I point this at an HTML page it works - I can get the title from
>> the metadata.
>>
>> If I point it at an MP3 file it doesn't give me any metadata at all
>> (except the Metadata.RESOURCE_NAME_KEY which I set), even if I set
>> the download length to be just less than the length of the file. If
>> I download the whole file it works.
>>
>> (JPGs don't seem to work either)
>>
>> Why is this so? My understanding was that Tika would work with
>> streams?
>
>
>
> --
> Jonathan Koren
> jonathan@soe.ucsc.edu
> http://www.soe.ucsc.edu/~jonathan/
>
>
>
> IMPORTANT: This e-mail, including any attachments, may contain
> private or confidential information. If you think you may not be the
> intended recipient, or if you have received this e-mail in error,
> please contact the sender immediately and delete all copies of this
> e-mail. If you are not the intended recipient, you must not
> reproduce any part of this e-mail or disclose its contents to any
> other party. This email represents the views of the individual
> sender, which do not necessarily reflect those of Education.au
> except where the sender expressly states otherwise. It is your
> responsibility to scan this email and any files transmitted with it
> for viruses or any other defects. education.au limited will not be
> liable for any loss, damage or consequence caused directly or
> indirectly by this email.

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/



