Extracting data from PDF files using Python (Camelot and Tabula)

In this post, we will look at extracting tabular data from PDF files.
The context for this post is a project we are working on for the real estate industry.
We are developing an application to serve realtors and help them in developing an interface to customers and also them to manage their customer contacts and CRM campaigns.
One of the best source for realtor information is the NAR website. NAR has published statewise monthly realtor membership dating back to 1908!
Unfortunately, the data is buried in PDF files which need to be extracted perform for the data analysis. This post demonstrates getting the information out using three different libraries, along with a couple of minor tips to ensure proper extraction of tabular data.
Using Camelot
Installation
conda install -c conda-forge camelot-py
code
import camelot
# read the data
tables = camelot.read_pdf('./data/2018-membership-count-by-state.pdf')
# inspect the tables
tables
# output of first table
tables[0].df
# read ALL tables
tables = camelot.read_pdf('./data/2018-membership-count-by-state.pdf', pages='all')
# print the last table ;-)
tables[10].df
Using tabula
Installation
pip install tabula-py
code
import tabula
df = tabula.read_pdf('./data/2018-membership-count-by-state.pdf')
# the above code didn't work well ... need to get rid of headers!
# credit: https://github.com/chezou/tabula-py/issues/38
# read first page
df = tabula.read_pdf('./data/2018-membership-count-by-state.pdf', pandas_options={'header': None})
# read all pages
df = tabula.read_pdf('./data/2018-membership-count-by-state.pdf', pandas_options={'header': None}, pages='all')
# show the last table
import pandas as pd
pd.DataFrame(df[10])
Resources
- 3 ways to scrape tables from PDFs with Python - Open Source Automation
- Extracting Tabular Data from PDF Documents with Py... - Alteryx Community
- How to extract table as text from the PDF using Python? - Stack Overflow
- Python for Pdf - Umer Farooq - Medium
