Pulling Data from a Database with Pandas

Pulling data from a database is one of the most fundamental tasks many of us face, whether you're a software engineer, analyst, data scientist, or someone just learning the ins and outs of coding.

Pandas is one of the most well-known Python modules, with comprehensive functionality for data manipulation, and it's a gold standard for working with data in Python. With a little bit of setup, Pandas also has built-in functionality for pulling data from a database directly into a DataFrame! It's a tool that belongs in EVERYONE's toolbelt!

This particular functionality in Pandas is pandas.read_sql(). It takes multiple arguments, but the only required ones are a query and a connection object. That's it. Pass those two things and you can pull data from a database directly into a Pandas DataFrame!
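
To give a quick preview before we get into the setup, a call looks something like the sketch below. It uses an in-memory SQLite database (which Pandas supports out of the box) purely as a stand-in, since we haven't built a real connection yet:

import sqlite3
import pandas as pd

#a throwaway in-memory SQLite database, just to show the call signature
con = sqlite3.connect(":memory:")
#pass a query and a connection, get back a DataFrame
df = pd.read_sql("select 1 as example_column", con)
print(df)

The rest of this post covers building a real query and connection.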

In this post, I'm going to walk through two ways of connecting to a database and pulling data into Python.

Requirements

Below are the high-level requirements. Next to the corresponding modules are the associated bash commands to install them:

  • Python3

  • Pandas python -m pip install pandas

  • SQLAlchemy python -m pip install sqlalchemy

  • Python database module for your respective provider

    • Microsoft SQL Server - pyodbc
      python -m pip install pyodbc

    • Postgresql - psycopg2 (or psycopg2-binary if you have issues installing)
      python -m pip install psycopg2

    • MySQL - mysql-connector-python
      python -m pip install mysql-connector-python

    • The modules above cover the most common providers. If you use a different database provider, do some research to find its corresponding database module

The Use Case

For this post, I'm utilizing Microsoft SQL Server and I'm leveraging Microsoft's free practice database for the fictional company AdventureWorks.

I'm going to pull the names and addresses of every employee with a birthday in the current month. However, I have some additional automation steps afterward, so I need this data in Python.

There are two relatively easy ways to tackle this: one is with SQLAlchemy, and the other is using a Python database module directly. Both methods will use the query below:

select
    p.FirstName
    , p.LastName 
    , e.BirthDate
    , a.AddressLine1 
    , a.AddressLine2 
    , a.PostalCode 
from HumanResources.Employee e 
    inner join Person.Person p on e.BusinessEntityID = p.BusinessEntityID 
    inner join Person.BusinessEntityAddress bea on e.BusinessEntityID = bea.BusinessEntityID 
    inner join Person.Address a on bea.AddressID = a.AddressID 
where datepart(month,birthdate) = datepart(month,getdate())

Method 1: SQLAlchemy

SQLAlchemy is an incredibly popular and powerful Python module and is the recommended method for connecting to a database and pulling data with Pandas. There are a variety of in-built features that I won't necessarily cover in this post, but feel free to refer to the link in the resources section if you'd like to learn more.

To pull data with SQLAlchemy, we first need to import the modules we need to use:

import pandas as pd
import sqlalchemy

To use SQLAlchemy, you need to have the corresponding database module installed. So for this example, I already have pyodbc installed.

To connect with SQLAlchemy, we have to create a url for SQLAlchemy to connect with. It adheres to the following pattern:

  • {drivername}://{username}:{password}@{host}:{port}/{database}
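
For example, with the demo values used later in this post (and no explicit port), the finished url would look roughly like this:

  • mssql+pyodbc://demo.user:Demo.Password@linux-server-home/adventureworks?driver=ODBC+Driver+18+for+SQL+Server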

If that seems a bit intimidating, we can utilize SQLAlchemy's nifty URL helper to build our url from its individual elements. For our connection to our database, we'll use the sqlalchemy.URL.create() method to create the connection url:

#create our connection url
connection_url = sqlalchemy.URL.create(
                     drivername="mssql+pyodbc"
                     ,username="demo.user"
                     ,password="Demo.Password"
                     ,host="linux-server-home"
                     ,database="adventureworks"
                     ,query={"driver":"ODBC Driver 18 for SQL Server"}
                     )

Aside from drivername and query, these arguments should be pretty straightforward. A quick explanation of each below:

  • drivername - this is specific for the database provider and database module you are using - you may have to do some research to determine the right combination for you - if you're struggling to find the right one, check the SQLAlchemy dialects page

    • Microsoft SQL Server mssql+pyodbc

    • Postgresql postgresql+psycopg2

    • MySQL mysql+mysqlconnector

  • username - this is the username of the login you are using

  • password - this is the password of the login you are using

  • host - this is the database host name (DNS) or the IP address of the server

  • database - this is the database you will connect to

  • query - this just adds additional connection details to the generated url - the arguments you need may vary based on your database provider and driver

    • Specifically with this example, to connect to Microsoft SQL Server via pyodbc, you need to pass the driver, "ODBC Driver 18 for SQL Server"

Note: If you're using pyodbc and don't know what drivers you have, you can use pyodbc.drivers() to get a list of drivers. Just look for a 'SQL Server' driver. If you don't have any ODBC drivers, make sure to install them from Microsoft's website.
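
For example, something like this will print every ODBC driver pyodbc can see on your machine:

import pyodbc

#list the ODBC drivers installed on this machine
for driver_name in pyodbc.drivers():
    print(driver_name)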

Once we have our connection_url, we will create our SQLAlchemy engine using sqlalchemy.create_engine() and passing the connection_url. We will utilize this engine to create our connection to the database.

engine = sqlalchemy.create_engine(connection_url)

Once we have the engine, we have to convert our raw query to a SQLAlchemy friendly version. We will utilize sqlalchemy.text() for this.

#our raw query
raw_query = """
select
    p.FirstName
    , p.LastName 
    , e.BirthDate
    , a.AddressLine1 
    , a.AddressLine2 
    , a.PostalCode 
from HumanResources.Employee e 
    inner join Person.Person p on e.BusinessEntityID = p.BusinessEntityID 
    inner join Person.BusinessEntityAddress bea on e.BusinessEntityID = bea.BusinessEntityID 
    inner join Person.Address a on bea.AddressID = a.AddressID 
where datepart(month,birthdate) = datepart(month,getdate())
"""
#create sqlalchemy friendly version of our query
sqlalchemy_query = sqlalchemy.text(raw_query)

Once we've converted our query, we're ready to connect and pull the data into a Pandas DataFrame! We will just do the following:

  • Create a connection with engine.connect()

  • Pull the data into a DataFrame with pandas.read_sql() by passing the query and connection

#connect with the engine
with engine.connect() as con:
    #pull the data with pandas into a dataframe
    df = pd.read_sql(sqlalchemy_query,con)

That's it! If everything works, we should now have our DataFrame pulled into Python, and it's ready for us to start using!
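
If you want a quick sanity check on what came back, continuing from the block above:

#peek at the number of rows/columns and the first few records
print(df.shape)
print(df.head())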

Method 2: Database Module

While SQLAlchemy is the recommended way to pull data from a database with Pandas, there is an alternative if for whatever reason SQLAlchemy doesn't work for you: you can use the database module directly.

Python has a set of standards for modules that work with databases (DBAPI 2.0). If you're curious about these standards, see the resource at the bottom of the post. What this means for us is that any module following these standards has similar functionality and can be used relatively interchangeably. It also means we can pass the module's connection directly to pandas.read_sql().
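
To make that concrete, here's what the shared DBAPI 2.0 pattern looks like using sqlite3 from the standard library, purely as an illustration - any compliant module (pyodbc, psycopg2, mysql-connector-python) follows the same connect/cursor/execute flow, only the arguments to connect() differ:

import sqlite3

#sqlite3 is itself a DBAPI 2.0 module, so this pattern carries over to other providers
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("select 1 as example_column")
print(cur.fetchall())
con.close()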

Note: If you don't use SQLAlchemy, Pandas will likely throw a warning. Just ensure the results match your expectations, and you should be fine.
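
If you've verified your results and want to silence that warning, a minimal sketch using the standard warnings module is below. Note this hides every UserWarning raised inside the block, so keep the scope tight:

import warnings

#suppress the UserWarning pandas raises for plain DBAPI connections
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    df = pd.read_sql(raw_query, con) #raw_query and con defined as in the example further below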

Even though the functionality is standardized, connecting to each database can look a little different. Refer to your module's documentation for how to establish the connection.

Pulling the data using the database module is going to be very similar, except we won't have to do as much setup as we did with SQLAlchemy.

We'll import our required modules:

import pyodbc
import pandas as pd

Then we'll set up our connection details in a dictionary. It will look very similar to the SQLAlchemy arguments we passed to create our url:

#our connection details
connection_details = {
    "server":"linux-server-home"
    ,"database":"adventureworks"
    ,"user":"demo.user"
    ,"password":"Demo.Password"
    ,"driver":"{ODBC Driver 18 for SQL Server}"
    }

When we use a database module directly, we can pass our raw query as-is, without the conversion we had to do with SQLAlchemy.

Finally, similarly to SQLAlchemy, we'll do the following:

  • Create a connection by passing the dictionary to pyodbc.connect()

    • The ** in front of the connection details unpacks all of the dictionary's keys and values as keyword arguments to the pyodbc.connect() method (see the short equivalence sketch after this list)
  • Pull the data into a DataFrame with pandas.read_sql() by passing the query and connection
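
As a quick aside on the ** unpacking, the two calls below are equivalent (same demo values as above - pyodbc translates the user/password keywords into the corresponding connection string attributes):

#unpacking the dictionary...
con = pyodbc.connect(**connection_details)
#...is the same as spelling out each keyword argument by hand
con = pyodbc.connect(server="linux-server-home", database="adventureworks",
                     user="demo.user", password="Demo.Password",
                     driver="{ODBC Driver 18 for SQL Server}")

With that in mind, here's the full example: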

#our raw query
raw_query = """
select
    p.FirstName
    , p.LastName 
    , e.BirthDate
    , a.AddressLine1 
    , a.AddressLine2 
    , a.PostalCode 
from HumanResources.Employee e 
    inner join Person.Person p on e.BusinessEntityID = p.BusinessEntityID 
    inner join Person.BusinessEntityAddress bea on e.BusinessEntityID = bea.BusinessEntityID 
    inner join Person.Address a on bea.AddressID = a.AddressID 
where datepart(month,birthdate) = datepart(month,getdate())
"""
#connecting with our connection details
with pyodbc.connect(**connection_details) as con:
    #pull the data with pandas into a dataframe
    df = pd.read_sql(raw_query,con)

That's it! Our results should be identical to the SQLAlchemy method, but we utilized the database module directly.

Conclusion

Almost everyone works with data, and Pandas makes it incredibly simple to pull data from a database via the built-in pandas.read_sql(), utilizing either SQLAlchemy or the module for your corresponding database provider.

From there you can utilize the Pandas DataFrame however you need to for whatever use case you have!

Thanks for reading, and I hope you found this tutorial helpful!

Resources

Python DBAPI 2.0

SQLAlchemy

Full Solutions

SQLAlchemy

import pandas as pd
import sqlalchemy
#create our connection url
connection_url = sqlalchemy.URL.create(
                     drivername="mssql+pyodbc"
                     ,username="demo.user"
                     ,password="Demo.Password"
                     ,host="linux-server-home"
                     ,database="adventureworks"
                     ,query={"driver":"ODBC Driver 18 for SQL Server"}
                     )
engine = sqlalchemy.create_engine(connection_url)
#our raw query
raw_query = """
select
    p.FirstName
    , p.LastName 
    , e.BirthDate
    , a.AddressLine1 
    , a.AddressLine2 
    , a.PostalCode 
from HumanResources.Employee e 
    inner join Person.Person p on e.BusinessEntityID = p.BusinessEntityID 
    inner join Person.BusinessEntityAddress bea on e.BusinessEntityID = bea.BusinessEntityID 
    inner join Person.Address a on bea.AddressID = a.AddressID 
where datepart(month,birthdate) = datepart(month,getdate())
"""
#create sqlalchemy friendly version
sqlalchemy_query = sqlalchemy.text(raw_query)
#connect with the engine
with engine.connect() as con:
    #pull the data with pandas into a dataframe
    df = pd.read_sql(sqlalchemy_query,con)

Database Module

import pyodbc
import pandas as pd

#our connection details
connection_details = {
    "server":"linux-server-home"
    ,"database":"adventureworks"
    ,"user":"demo.user"
    ,"password":"Demo.Password"
    ,"driver":"{ODBC Driver 18 for SQL Server}"
    }
#our raw query
raw_query = """
select
    p.FirstName
    , p.LastName 
    , e.BirthDate
    , a.AddressLine1 
    , a.AddressLine2 
    , a.PostalCode 
from HumanResources.Employee e 
    inner join Person.Person p on e.BusinessEntityID = p.BusinessEntityID 
    inner join Person.BusinessEntityAddress bea on e.BusinessEntityID = bea.BusinessEntityID 
    inner join Person.Address a on bea.AddressID = a.AddressID 
where datepart(month,birthdate) = datepart(month,getdate())
"""
#connecting with our connection details
with pyodbc.connect(**connection_details) as con:
    #pull the data with pandas into a dataframe
    df = pd.read_sql(raw_query,con)