Apache Tika - A Content Extraction Framework

Apache Tika is a content extraction framework that can extract metadata (data about data) and text from many file types. Tika provides a general application programming interface that can be used to detect the content type of a document and to parse textual content and metadata from a wide range of document formats. Tika is built on many open source libraries, such as Apache PDFBox for PDF files, Apache POI for MS Office files, TagSoup for HTML files, ImageIO and metadata-extractor for image files (JPG, BMP, etc.), and Apache Commons Compress for compressed files and archives (zip, tar, gz, etc.).

Tika Usage

In the following example, I will show how to extract text content and metadata by using Tika. The Parser interface is the key concept of the Tika framework: it gives the user a single, simple method to call for parsing content. Tika ships various parser classes that wrap up the complexity of the different external libraries doing the actual text extraction. Users can also implement their own parser class.

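// stream is an InputStream, handler a ContentHandler,
// metadata a Metadata and context a ParseContext
try {
    parser.parse(stream, handler, metadata, context);
} catch (IOException | SAXException | TikaException e) {
    // handle or report the failure
} finally {
    stream.close();
}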

We can use the parser object associated with a format to parse an input stream from a document. If we already know the document format, we can pick the parser class directly, such as PDFParser for a PDF document. The best part, though, is that we can use AutoDetectParser and have it select the matching parser for us automatically, so there is no need to worry about the file type. With AutoDetectParser the user does not have to think about the different parsers at all; Tika takes care of that.

AutoDetectParser parser = new AutoDetectParser();
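To give a feel for what "auto detect" means, here is a minimal sketch of content type detection by leading "magic" bytes, using only the JDK. The class MagicSniffer and its detect method are hypothetical names made up for this illustration. They are not part of Tika, and Tika's real detection is far richer than this; the sketch only shows the idea of picking a type from the content instead of trusting the caller.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical, much-simplified detector for illustration; not a Tika class. */
class MagicSniffer {

    /** Guesses a media type from the first bytes of a file. */
    static String detect(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        // Unknown content falls back to the generic binary type.
        return "application/octet-stream";
    }

    /** Returns true when data begins with every byte of prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first few bytes of a file and pass them to MagicSniffer.detect(...). Once the type is known, an auto-detecting parser can hand the stream over to the parser registered for that type.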

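Using the Tika API

The example below needs the following imports:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;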

Example:

private static void parseDoc(final String resourceLocation) throws IOException,
        SAXException, TikaException {
    InputStream input = new FileInputStream(new File(resourceLocation));
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try {
        // For a known format, a specific parser can be used instead:
        //   ImageParser parser = new ImageParser();  // image files
        //   PDFParser parser = new PDFParser();      // PDF files
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(input, textHandler, metadata);
    } finally {
        // The parser does not close the stream, so the caller has to.
        input.close();
    }
    System.out.println("Tika Parser starts......\n");
    System.out.println("file name: " + resourceLocation);
    // Metadata key names can differ between Tika versions and file formats.
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
    System.out.println("content: " + textHandler.toString());
    System.out.println("Tika Parser stops......");
}

Using this example, we can parse any supported file by calling parseDoc() with the file name (including its location) as the parameter.

In the above example, I first create a FileInputStream for the document to parse. Then I use a Tika content handler called BodyContentHandler, which internally wraps a content handler decorator that turns XHTML into text. The decorator forms the plain text output from the SAX events that the parser emits. Next I instantiate an AutoDetectParser directly, call the parse method and close the stream. Calling the close method of the InputStream is required, since closing it is the user's responsibility, not the parser's.
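The idea of forming plain text from SAX events can be tried out with the JDK alone. The following is only a sketch under my own assumptions: BodyTextCollector and extractBodyText are hypothetical names invented for this illustration, and the code is not how BodyContentHandler is implemented. It simply collects the character events that arrive while the parser is inside the body element and ignores everything else.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a body text handler; not Tika's actual class. */
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep only the text that appears inside the body element.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Runs an XHTML string through the JDK SAX parser and returns the body text. */
    static String extractBodyText(String xhtml) {
        BodyTextCollector handler = new BodyTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XHTML input", e);
        }
        return handler.toString();
    }
}
```

For instance, passing a small XHTML page with a title in the head and two paragraphs in the body to BodyTextCollector.extractBodyText(...) returns only the paragraph text. The title is dropped because it sits outside the body.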

Tika provides some readymade ContentHandler implementations that can be useful while parsing content with Tika.

Finally, the metadata parameter works as both input and output: it provides additional data to the parser as input and returns the metadata extracted from the document. Examples of metadata include things like author name, number of pages, creation date, etc.

History of Apache Tika

 

2006 - Initial discussion started

2007 - Project started

2008 - Releases 0.1, 0.2

2009 - Releases 0.3, 0.4, 0.5

2010 - Releases 0.6, 0.7, 0.8

2011 - Releases 0.9, 0.10, 1.0

2012 to date - Release 1.1

Other bug fixes and improvements for all releases are listed in the CHANGES.txt file on the Apache site.

 
