Reading PDF files with C#



Recently I needed to grab some text values from a number of PDF files. Rather than open each file by hand, I figured there had to be an easier way.

After a quick search I found the solution: iTextSharp, an open source C# library that lets you do a host of awesome stuff with PDF files. It is a port of iText, a Java library. You can find more info about iText on their website at www.itextpdf.com. I knew this library was something special when I saw it had an entire book dedicated to it.

Manipulating and reading PDF files is no trivial task, but luckily the files I needed to read were fairly straightforward, and I used the following code to return the contents of the file as one big string:

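// Requires: using System.Text; using iTextSharp.text.pdf;
private string ParsePdf(string filePath)
{
    StringBuilder text = new StringBuilder();

    PdfReader reader = new PdfReader(filePath);
    try
    {
        // Page numbers are 1-based; walk every page, not just the first.
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            byte[] streamBytes = reader.GetPageContent(page);
            PRTokeniser tokenizer = new PRTokeniser(streamBytes);

            // Keep only the string tokens from the page's content stream.
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
                {
                    text.Append(tokenizer.StringValue);
                }
            }
        }
    }
    finally
    {
        // Release the underlying file handle even if parsing throws.
        reader.Close();
    }
    return text.ToString();
}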

From there I used some string manipulation to grab the values I needed and perform some additional logic. Easy!
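To give an idea of what that string manipulation can look like, here is a minimal sketch that pulls out the value sitting between two known labels. It is not the exact code I used: the class name PdfTextFields, the method ExtractBetween and the labels in the example are all made up for illustration, and it assumes the extracted text contains each label exactly as written.

```csharp
using System;

public static class PdfTextFields
{
    // Hypothetical helper: returns the text found between startLabel and
    // endLabel, or null when either label is missing from the input.
    public static string ExtractBetween(string text, string startLabel, string endLabel)
    {
        int start = text.IndexOf(startLabel, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }
        start += startLabel.Length;

        // Search for the closing label only after the opening one.
        int end = text.IndexOf(endLabel, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null;
        }

        return text.Substring(start, end - start).Trim();
    }
}
```

For example, with an extracted string such as "Invoice No: 10423Date: 2014-03-01" (made-up sample data), calling PdfTextFields.ExtractBetween(text, "Invoice No:", "Date:") returns "10423". Because the tokenizer output is concatenated without separators, anchoring on labels like this tends to be more dependable than splitting on whitespace.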
