Monday, May 10, 2010

Parsing and reading text from PDF files

One of my development teams was looking for a PDF parsing library. They essentially wanted to search and extract data from PDF files. At first, I thought that OCR was the only way to achieve this, but OCR is really only needed for scanned, image-only PDFs. When a PDF contains actual text, there are libraries available to help us :)

PDFBox : This seems to be the most popular library for extracting text out of PDF files. It is a Java library, but it can also be used from .NET through a wrapper built with IKVM.NET.
Simple examples using this library can be found here and here.
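
To give a feel for how little code the extraction takes, here is a minimal sketch. It assumes the PDFBox 2.x API, where PDFTextStripper lives in org.apache.pdfbox.text; in the 1.x releases it is in org.apache.pdfbox.util, and 3.x loads documents through Loader.loadPDF instead of PDDocument.load. The class name ExtractPdfText is made up for illustration, and the PDF path is expected as the first command-line argument.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical example class; assumes PDFBox 2.x is on the classpath.
public class ExtractPdfText {
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java ExtractPdfText <file.pdf>");
            return;
        }
        // PDDocument is Closeable, so try-with-resources releases the file.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Returns the text of all pages as one String.
            String text = stripper.getText(document);
            System.out.println(text);
        }
    }
}
```

This only works for PDFs that carry real text. A scanned document will come back empty or nearly empty, and that is the case where OCR is still needed.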

iText & iTextSharp : These libraries are very popular for PDF generation and can also be used for extracting text from PDF files. iText is the Java library and iTextSharp is its .NET port. A sample can be found here.

I have heard that OpenOffice.org also provides a Java API that can be used to create and manipulate PDF files, but I have not tried it yet.
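
Whichever library does the extraction, what comes back is a plain string, so the "search and extract data" part my team wanted is ordinary Java from there on. Below is a rough sketch using only the standard library. The InvoiceSearch class and the "INV-" number format are assumptions made up for this example, and the sample text stands in for whatever a PDF library would return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: searches already-extracted PDF text for invoice numbers.
final class InvoiceSearch {
    // Assumed format, for illustration only: "INV-" followed by digits.
    private static final Pattern INVOICE_NUMBER = Pattern.compile("\\bINV-\\d+\\b");

    // Returns every invoice number found, in order of appearance.
    static List<String> findInvoiceNumbers(String extractedText) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = INVOICE_NUMBER.matcher(extractedText);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the text a PDF library would hand back.
        String text = "Invoice INV-1001 was paid.\nInvoice INV-2002 is overdue.";
        System.out.println(findInvoiceNumbers(text)); // prints [INV-1001, INV-2002]
    }
}
```

Keeping the search in its own method means it can be tested against plain strings, without needing any PDF files.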