Tech Talk: Parsing and reading text from PDF files

Monday, May 10, 2010

Parsing and reading text from PDF files

One of my development teams was looking for a PDF parsing library. They essentially wanted to search for and extract data from PDF files. At first, I thought OCR was the only way to achieve this, but OCR is only needed when a PDF contains scanned images rather than embedded text. For ordinary text-based PDFs, there are libraries available to help us :)

PDFBox: This seems to be the most popular library for extracting text from PDF files. It is a Java library, but a .NET version is also available, built by running it through IKVM.NET.
Simple examples using this library can be found here and here.

iText & iTextSharp: These libraries (iTextSharp is the .NET port of the Java iText library) are very popular for PDF generation and can also be used for extracting text from PDF files. A sample can be found here.
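
Whichever library does the extraction, the "search and extract data" part usually boils down to running a pattern over the plain text it returns. Here is a minimal sketch of that step in Java. The `TextExtractor` interface, the `PdfTextSearch` class, the file name and the invoice-number pattern are all made up for illustration; they are not part of PDFBox or iText. With PDFBox 2.x, I would expect the extractor to be a thin wrapper around `PDDocument.load(...)` and `new PDFTextStripper().getText(document)`, but treat that as an assumption and check it against the version you actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfTextSearch {

    /** Hypothetical seam: any library call that turns a PDF path into plain text. */
    @FunctionalInterface
    public interface TextExtractor {
        String extract(String pdfPath) throws Exception;
    }

    /** Returns every match of the regex in the given text, in document order. */
    public static List<String> findAll(String text, String regex) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    /** Extracts the text of one PDF and searches it with the given regex. */
    public static List<String> search(TextExtractor extractor, String pdfPath, String regex)
            throws Exception {
        return findAll(extractor.extract(pdfPath), regex);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in extractor that returns canned text; a real one would call the PDF library.
        TextExtractor fake = path ->
                "Invoice INV-1001 dated 2010-05-10\nInvoice INV-1002 dated 2010-05-11";

        // Prints [INV-1001, INV-1002]
        System.out.println(search(fake, "invoices.pdf", "INV-\\d+"));
    }
}
```

Keeping the extractor behind a small interface means the search code does not care whether the text came from PDFBox, iText or something else, and it can be tested without any PDF files at all.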

I have heard that OpenOffice.org also provides a Java API that can be used to create and manipulate PDF files, but I have not tried it yet.
