Reading PDF files with C#
Recently I needed to grab some text values from a number of PDF files. Rather than manually opening each and every file, I knew there had to be an easier way.
After a quick search I found the solution: iTextSharp, an open source C# library that lets you do a host of awesome stuff with PDF files. It is a port of iText, which is a Java library. You can find more info about iText on their website at www.itextpdf.com. I knew this library was something else when I saw there was an entire book dedicated to it.
Manipulating and reading PDF files is no trivial task, but luckily for me the PDF files I needed to read were fairly straightforward, and I used the following code to return the contents of the file as one big string:
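using System.Text;
using iTextSharp.text.pdf;

private string ParsePdf(string filePath)
{
    StringBuilder text = new StringBuilder();
    PdfReader reader = new PdfReader(filePath);
    // Walk every page of the file; iTextSharp page numbers start at 1.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        byte[] streamBytes = reader.GetPageContent(page);
        PRTokeniser tokenizer = new PRTokeniser(streamBytes);
        // Keep only the string tokens from the page's content stream.
        while (tokenizer.NextToken())
        {
            if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
            {
                text.Append(tokenizer.StringValue);
            }
        }
    }
    reader.Close();
    return text.ToString();
}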
From there I used some string manipulation to grab the values I needed and perform some additional logic. Easy!
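To give a rough idea of what that kind of string manipulation can look like, here is a minimal sketch. The helper name ExtractBetween and the "Invoice No:" and "Date:" labels are made up for illustration; the real labels depend entirely on what your PDF files contain.

```csharp
using System;

public static class PdfTextHelper
{
    // Hypothetical helper: returns the trimmed text found between two labels,
    // or null if either label is missing from the input.
    public static string ExtractBetween(string text, string startLabel, string endLabel)
    {
        int start = text.IndexOf(startLabel, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }
        start += startLabel.Length;

        // Search for the end label only after the start label.
        int end = text.IndexOf(endLabel, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null;
        }

        return text.Substring(start, end - start).Trim();
    }
}
```

With something like this in place, a call such as PdfTextHelper.ExtractBetween(ParsePdf(path), "Invoice No:", "Date:") would pull out whatever sits between those two labels. Returning null for a missing label makes it easy to spot files that do not match the expected layout.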