Read a large file with Pandas and import it into a MySQL database

Pandas is an open-source data analysis library for Python. In this post I am going to use it to read a large file, around 7 GB in size, extract and filter the data, and save the result to a MySQL database.

Of course, the pandas library alone is not enough. We also need to establish a MySQL connection through Python and then insert the filtered data into the database.

Required packages

Let’s start

First of all, we need to import the required libraries: pandas, mysql-connector and sqlalchemy.

import pandas as pd
import mysql.connector
from sqlalchemy import types, create_engine

I am not going into detail about create_engine or types here. If you want to learn more, please follow the links, which will take you to the relevant SQLAlchemy documentation.
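Notice that types is imported but not actually used in the steps below. It becomes useful when you want to control the MySQL column types that pandas creates, instead of letting pandas guess them. Here is a minimal sketch; the column names match the examples later in this post, and the chosen types are just assumptions for illustration:

import pandas as pd
from sqlalchemy import types

# Map DataFrame columns to explicit SQL column types so pandas
# does not have to infer them when creating the table.
# The types below are assumptions; pick what fits your data.
dtype_mapping = {
    'column_one': types.VARCHAR(length=255),
    'column_two': types.Integer(),
}

# Later, pass the mapping to to_sql():
# df.to_sql(name='database_table_name', con=engine,
#           if_exists='replace', index=False, dtype=dtype_mapping)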

Setup MySQL connection with Python

Now let’s establish a database connection so that we can insert the filtered data directly into the database.

# MySQL Connection
MYSQL_USER 		= 'root'
MYSQL_PASSWORD 	= 'root'
MYSQL_HOST_IP 	= '127.0.0.1'
MYSQL_PORT		= 3306
MYSQL_DATABASE	= 'database_name'

engine = create_engine(f'mysql+mysqlconnector://{MYSQL_USER}:{MYSQL_PASSWORD}@{MYSQL_HOST_IP}:{MYSQL_PORT}/{MYSQL_DATABASE}', echo=False)
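Note that an f-string is used here because MYSQL_PORT is an integer; plain string concatenation with + would raise a TypeError. Before moving on, it can also be worth checking that the connection actually works. Here is a small optional sanity check, assuming the MySQL server is running and the credentials above are valid:

from sqlalchemy import text

# Open a connection and run a trivial query to confirm that the
# credentials, host and port are correct.
with engine.connect() as connection:
    result = connection.execute(text("SELECT 1"))
    print(result.scalar())  # prints 1 if the connection works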

Read large file with Pandas

A slightly challenging part is reading the large file and inserting the data into the database. We can’t do it all at once, because loading everything into memory takes a lot of time and we can’t insert billions of records in a single statement.

To handle this kind of situation, we have to read the file in chunks and insert each chunk as we go.

pd.read_csv("large_file.txt", usecols=['column_one', 'column_two'], sep=" ")

Let’s understand the above code first. I have used the read_csv() method to read the space-separated TXT file, keeping only the two columns we need. Now let’s add the chunksize parameter to read_csv() and insert each chunk with the to_sql() method.

chunksize = 500

idx = 1
for df in pd.read_csv("large_file.txt", chunksize=chunksize, usecols=['column_one', 'column_two'], sep=" "):

    # Replace the table on the first chunk, then append every chunk after it.
    if idx == 1:
        exists = 'replace'
    else:
        exists = 'append'

    df.to_sql(name='database_table_name', con=engine, if_exists=exists, index=False, chunksize=chunksize)

    print(str(chunksize * idx) + " rows processed")
    idx = idx + 1
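At the start of this post I mentioned filtering the data before saving it, but the loop above inserts every row as-is. If you need to filter, you can do it on each chunk before calling to_sql(). Here is a sketch that reuses the engine and chunksize defined above; the filter condition itself is just a hypothetical example, so replace it with whatever your data requires:

for df in pd.read_csv("large_file.txt", chunksize=chunksize, usecols=['column_one', 'column_two'], sep=" "):
    # Keep only the rows we care about. The condition below is an
    # illustration only; adjust it to your own data.
    filtered = df[df['column_two'] > 0]

    filtered.to_sql(name='database_table_name', con=engine, if_exists='append', index=False, chunksize=chunksize)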

Complete Code

Please find the complete code below.

import pandas as pd
import mysql.connector
from sqlalchemy import types, create_engine


# MySQL Connection
MYSQL_USER 		= 'root'
MYSQL_PASSWORD 	= 'root'
MYSQL_HOST_IP 	= '127.0.0.1'
MYSQL_PORT		= 3306
MYSQL_DATABASE	= 'database_name'

engine = create_engine(f'mysql+mysqlconnector://{MYSQL_USER}:{MYSQL_PASSWORD}@{MYSQL_HOST_IP}:{MYSQL_PORT}/{MYSQL_DATABASE}', echo=False)


chunksize = 500

idx = 1
for df in pd.read_csv("large_file.txt", chunksize=chunksize, usecols=['column_one', 'column_two'], sep=" "):

    # Replace the table on the first chunk, then append every chunk after it.
    if idx == 1:
        exists = 'replace'
    else:
        exists = 'append'

    df.to_sql(name='database_table_name', con=engine, if_exists=exists, index=False, chunksize=chunksize)

    print(str(chunksize * idx) + " rows processed")
    idx = idx + 1

This is a basic approach to handling a large file with the Pandas library and inserting the data directly into a MySQL database. Of course, the code could be improved further. Please feel free to add your thoughts in the comment section.