In this post, we will look at extracting tabular data from PDF files.

The context for this post is a project we are working on for the real estate industry.

We are developing an application to serve realtors and help them in developing an interface to customers and also them to manage their customer contacts and CRM campaigns.

One of the best source for realtor information is the NAR website. NAR has published statewise monthly realtor membership dating back to 1908!

Unfortunately, the data is buried in PDF files which need to be extracted perform for the data analysis. This post demonstrates getting the information out using three different libraries, along with a couple of minor tips to ensure proper extraction of tabular data.

Using Camelot

Installation

conda install -c conda-forge camelot-py

code

import camelot

# read the data
tables = camelot.read_pdf('./data/2018-membership-count-by-state.pdf')

# inspect the tables
tables

# output of first table
tables[0].df

# read ALL tables
tables = camelot.read_pdf('./data/2018-membership-count-by-state.pdf', pages='all')

# print the last table ;-)
tables[10].df

Using tabula

Installation

pip install tabula-py

code

import tabula

df = tabula.read_pdf('./data/2018-membership-count-by-state.pdf')

# the above code didn't work well ... need to get rid of headers!

# credit: https://github.com/chezou/tabula-py/issues/38

# read first page
df = tabula.read_pdf('./data/2018-membership-count-by-state.pdf', pandas_options={'header': None})

# read all pages
df = tabula.read_pdf('./data/2018-membership-count-by-state.pdf', pandas_options={'header': None}, pages='all')

# show the last table

import pandas as pd
pd.DataFrame(df[10])

Resources