Pandas is an open-source data analysis library for Python. In this post, I am going to use it to read a large file, around 7 GB in size, extract and filter the data, and save the result to a MySQL database.
Of course, the Pandas library alone is not enough: we also need to establish a MySQL connection from Python and then insert the filtered data into the database.
Required packages
Let’s start
First of all, we need to import the required libraries:
import pandas as pd
import mysql.connector
from sqlalchemy import types, create_engine
I am not going into detail about create_engine or types here; if you want to know more, please refer to the SQLAlchemy documentation.
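As a quick illustration of why types is imported: it comes into play later when calling to_sql(), letting you pin down the SQL column types instead of letting Pandas guess them. A minimal sketch, with placeholder types (not part of the original code):
# Hypothetical mapping from column names to SQL types, passed later as to_sql(dtype=...)
dtype_map = {
    'column_one': types.VARCHAR(length=255),
    'column_two': types.Float(),
}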
Set up a MySQL connection with Python
Now let's establish a database connection so that we can insert the filtered data directly into the database.
# MySQL Connection
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_HOST_IP = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_DATABASE = 'database_name'
# An f-string handles the integer port; plain '+' concatenation would raise a TypeError
engine = create_engine(f'mysql+mysqlconnector://{MYSQL_USER}:{MYSQL_PASSWORD}@{MYSQL_HOST_IP}:{MYSQL_PORT}/{MYSQL_DATABASE}', echo=False)
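Before moving on, it is worth a quick sanity check that the engine can actually reach the database. A minimal sketch (assumes SQLAlchemy 1.4 or newer, where raw SQL must be wrapped in text()):
from sqlalchemy import text
# Prints 1 if the connection and credentials are working
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())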
Read a large file with Pandas
The slightly challenging part is reading the large file and inserting its data into the database. We can't do it all at once: loading the whole 7 GB file into memory is slow (if it fits at all), and we can't insert billions of records in a single statement.
To handle this, we read the file in chunks and insert each chunk into the database as it is read.
pd.read_csv("large_file.txt", usecols=['column_one', 'column_two'], sep=" ")
Let's understand the above code first. I have used the read_csv() method to read the TXT file: usecols limits the read to the two columns we need, and sep=" " tells Pandas the fields are space-separated. Now let's add the chunksize parameter to read_csv() and insert each chunk with the to_sql() method.
chunksize = 500
idx = 1
for df in pd.read_csv("large_file.txt", chunksize=chunksize, usecols=['column_one', 'column_two'], sep=" "):
    # Replace the table on the first chunk, then append subsequent chunks
    if idx == 1:
        exists = 'replace'
    else:
        exists = 'append'
    df.to_sql(name='database_table_name', con=engine, if_exists=exists, index=False, chunksize=chunksize)
    print(str(chunksize * idx) + " rows processed")
    idx = idx + 1
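Note that the goal stated at the beginning includes filtering the data; a filter step slots naturally into this loop, right before to_sql(). A hypothetical example (the condition is a placeholder for whatever filter your data actually needs):
# Hypothetical filter inside the loop: keep only rows where column_two is positive
df = df[df['column_two'] > 0]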
Complete Code
Here is the complete code.
import pandas as pd
import mysql.connector
from sqlalchemy import types, create_engine

# MySQL Connection
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_HOST_IP = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_DATABASE = 'geolocation'
# An f-string handles the integer port; plain '+' concatenation would raise a TypeError
engine = create_engine(f'mysql+mysqlconnector://{MYSQL_USER}:{MYSQL_PASSWORD}@{MYSQL_HOST_IP}:{MYSQL_PORT}/{MYSQL_DATABASE}', echo=False)

chunksize = 500
idx = 1
for df in pd.read_csv("large_file.txt", chunksize=chunksize, usecols=['column_one', 'column_two'], sep=" "):
    # Replace the table on the first chunk, then append subsequent chunks
    if idx == 1:
        exists = 'replace'
    else:
        exists = 'append'
    df.to_sql(name='database_table_name', con=engine, if_exists=exists, index=False, chunksize=chunksize)
    print(str(chunksize * idx) + " rows processed")
    idx = idx + 1
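As one possible improvement, to_sql() accepts method='multi', which makes Pandas batch several rows into each INSERT statement; depending on the driver, this can speed up the load noticeably:
# Possible speed-up: batch rows into multi-row INSERT statements
df.to_sql(name='database_table_name', con=engine, if_exists=exists,
          index=False, chunksize=chunksize, method='multi')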
This is the basic approach to handling a large file with the Pandas library and inserting the data directly into a MySQL database. Of course, the code could be improved further. Please feel free to add your thoughts in the comment section.