Introduction
In the world of data analysis, extracting useful information from tabular data can be a difficult task. Conventional approaches typically require manual exploration and analysis of data, which can be requires a significant amount of effort, time, or workforce to complete.
The emergence of advanced language models such as ChatGPT has introduced a promising and innovative approach to extracting useful information from tabular data. OpenAI announced an official API for ChatGPT which is powered by gpt-3.5-turbo, OpenAI’s most advanced language model.
This blog describes you the process of extracting useful information from tabular data using ChatGPT API.
Dataset
For all illustrations in this post, We will be utilizing the following data. These data are taken from the following: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
Note that we have only taken into account 30 records from the dataset.
Please see the data provided below, which will be used for the purpose of this blog.
gender | race | parentallevelofeducation | lunch | testpreparationcourse | mathscore | readingscore | writingscore |
female | group B | bachelor | standard | none | 72 | 72 | 74 |
female | group C | some college | standard | completed | 69 | 90 | 88 |
female | group B | master | standard | none | 92 | 95 | 93 |
male | group A | associate | free/reduced | none | 47 | 57 | 44 |
male | group C | some college | standard | none | 76 | 78 | 75 |
female | group B | associate | standard | none | 71 | 83 | 78 |
female | group B | some college | standard | completed | 88 | 95 | 92 |
male | group B | some college | free/reduced | none | 40 | 43 | 39 |
male | group D | high school | free/reduced | completed | 64 | 64 | 67 |
female | group B | high school | free/reduced | none | 38 | 60 | 50 |
male | group C | associate | standard | none | 58 | 54 | 52 |
male | group D | associate | standard | none | 40 | 52 | 43 |
female | group B | high school | standard | none | 65 | 81 | 73 |
male | group A | some college | standard | completed | 78 | 72 | 70 |
female | group A | master | standard | none | 50 | 53 | 58 |
female | group C | some high school | standard | none | 69 | 75 | 78 |
male | group C | high school | standard | none | 88 | 89 | 86 |
female | group B | some high school | free/reduced | none | 18 | 32 | 28 |
male | group C | master | free/reduced | completed | 46 | 42 | 46 |
female | group C | associate | free/reduced | none | 54 | 58 | 61 |
male | group D | high school | standard | none | 66 | 69 | 63 |
female | group B | some college | free/reduced | completed | 65 | 75 | 70 |
male | group D | some college | standard | none | 44 | 54 | 53 |
female | group C | some high school | standard | none | 69 | 73 | 73 |
male | group D | bachelor | free/reduced | completed | 74 | 71 | 80 |
male | group A | master | free/reduced | none | 73 | 74 | 72 |
male | group B | some college | standard | none | 69 | 54 | 55 |
female | group C | bachelor | standard | none | 67 | 69 | 75 |
male | group C | high school | standard | none | 70 | 70 | 65 |
Data Extraction using ChatGPT API
ChatGPT relies solely on natural language processing (NLP) techniques to understand and extract information from tabular data. It can analyze the text-based input provided by the user, interpret the query, and generate a response based on the content of the tabular data.
Here’s an example of how you can utilize the ChatGPT API to extract information from tabular data:
Step 1: Prepare Input
We have stored our tabular data in a CSV file, you can read the CSV file using “Pandas” Python library and pass the data to the ChatGPT API for information extraction.
import pandas as pd
read_csv=pd.read_csv("Student.csv")
Step 2: Use the ChatGPT API
Before we begin utilizing the ChatGPT API, please make sure that you have installed OpenAI Python library in your system.
pip install openai
You can extract information by providing the tabular data and input text to the ChatGPT API. This can be done by reading the tabular data from a CSV file, preparing the input for the API, and passing it along with the input text. The API will then extract the relevant information from the data and provide it in the response.
import openai
import pandas as pd
openai.api_key = ''
read_csv = pd.read_csv("Student.csv")
input_text='''What is the average math score for male students?'''
prompt = """Please regard the following data:\n {}. Answer the following question and please return only value: {}""".format(read_csv, input_text)
request = openai.ChatCompletion.create(
model="gpt-3.5-turbo-0301",
messages=[
{"role": "user", "content": prompt},
]
)
result = request['choices'][0]['message']['content']
print("ChatGPT Response=>",result)
The following is the response that was received from the ChatGPT API:
Now let’s review the responses to a few questions based on the extracted information.
Question | ChatGPT Response | Analysis |
What is the gender of the student who scored 72 in math? | female | Correct |
Which race/ethnicity group does the student belong to who scored the highest in reading? | group C | Incorrect It should be “group B” |
Did the student who scored 47 in math complete the test preparation course? | No | Correct |
What is the average math score for male students? | 65.4 | Incorrect It should be “62.2” |
What is the writing score for the student who belongs to group C race/ethnicity and has a master’s degree? | The writing score for the specified student is not given in the data. | Incorrect It should be “46” |
We observed that ChatGPT is incapable of performing aggregations, such as summing or averaging the table entries, and occasionally struggles to respond to basic questions.
You can add your table and test it right away using ChatGPT playground without using the API. Please visit the below link
https://chat.openai.com/
SQL-based Data Extraction from Database using ChatGPT API
We can now utilize ChatGPT again, but this time not for direct table analysis. Instead, we can use ChatGPT to generate SQL statements for a database that contains the data. SQL is capable of handling filtering, sorting, aggregation, and summation logic, which can help in performing the required calculations accurately.
We utilized SQLite as the database engine and employed the sqlite3 Python library to interact with it.
Step 1: Create SQLite database and table
Here’s a Python code that creates a database and a table in SQLite:
import sqlite3
# Connect to SQLite database (this will create a new database file if it doesn't exist)
conn = sqlite3.connect("chatgpt.db")
cursor = conn.cursor()
# Create a table
cursor.execute("""
CREATE TABLE IF NOT EXISTS student (
gender TEXT,
race TEXT,
parentallevelofeducation TEXT,
lunch TEXT,
testpreparationcourse TEXT,
mathscore INTEGER,
readingscore INTEGER,
writingscore INTEGER
)
""")
# Commit the transaction and close the connection
conn.commit()
conn.close()
Step 2: Adding Data to a Database
Here’s a Python code that read data from CSV file using pandas and insert it into a SQLite database:
import sqlite3
import pandas as pd
df=pd.read_csv("Student.csv")
# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')
# Insert DataFrame into SQLite database
df.to_sql('student', conn, if_exists='replace', index=False)
# Close database connection
conn.close()
Step 3: Use ChatGPT API
You can extract information by providing the database table name, its corresponding columns, and input text to the ChatGPT API. ChatGPT API will generate an SQL query from the given input text and then use that query to retrieve data from the database.
Here’s an example of how you can extract information using the ChatGPT API by providing the database table name, its corresponding columns, and input text:
import sqlite3
import openai
# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')
cursor = conn.cursor()
openai.api_key = ''
# Function to get table columns from SQLite database
def get_table_columns(table_name):
cursor.execute("PRAGMA table_info({})".format(table_name))
columns = cursor.fetchall()
print(columns)
return [column[1] for column in columns]
# Function to generate SQL query from input text using ChatGPT
def generate_sql_query(table_name,text,columns):
prompt = """You are a ChatGPT language model that can generate SQL queries. Please provide a natural language input text, and I will generate the corresponding SQL query for you.The table name is {} and corresponding columns are {}.\nInput: {}\nSQL Query:""".format(table_name,columns,text)
print(prompt)
request = openai.ChatCompletion.create(
model="gpt-3.5-turbo-0301",
messages=[
{"role": "user", "content": prompt},
]
)
sql_query = request['choices'][0]['message']['content']
return sql_query
# Function to execute SQL query on SQLite database
def execute_sql_query(query):
cursor.execute(query)
result = cursor.fetchall()
return result
text="What is the average math score for male students?"
table_name = 'student'
columns = get_table_columns(table_name)
sql_query=generate_sql_query(table_name,text,columns)
print("Generated SQL query: ",sql_query)
if sql_query:
result=execute_sql_query(sql_query)
print("ChatGPT Response=>",result)
# Close database connection
cursor.close()
conn.close()
After sending the input text to the ChatGPT API, you will receive a response containing the generated SQL query. You can then use this query to retrieve the desired data from the database.
The response of this approach looks as below:
Question | SQL query generated by ChatGPT | Answer | Analysis |
What is the gender of the student who scored 72 in math? | SELECT gender FROM student WHERE mathscore=72; | female | Correct |
Which race/ethnicity group does the student belong to who scored the highest in reading? | SELECT race AS “Race/Ethnicity”, MAX(readingscore) AS “Highest Reading Score” FROM student GROUP BY race ORDER BY “Highest Reading Score” DESC LIMIT 1; | group B | Correct |
Did the student who scored 47 in math complete the test preparation course? | SELECT testpreparationcourse FROM student WHERE mathscore = 47 | None | Correct |
What is the average math score for male students? | SELECT AVG(mathscore) FROM student WHERE gender = ‘male’ | 62.2 | Correct |
What is the writing score for the student who belongs to group C race/ethnicity and has a masters degree? | SELECT writingscore FROM student WHERE race = ‘group C’ AND parentallevelofeducation = ‘master’ | 46 | Correct |
ChatGPT without SQL relies solely on NLP techniques to understand and generate responses based on tabular data, which may result in incorrect responses. However, when ChatGPT is combined with SQL capabilities, it can perform more advanced and flexible interactions with tabular data by directly executing SQL queries. This allows for more accurate and precise retrieval of information from databases, enhancing the overall capabilities of ChatGPT in dealing with tabular data.
Ready to harness the power of ChatGPT for your chatbot or NLP application? Partner with our expert developers to bring your dream product to life. Connect with us at letstalk@pragnakalp.com or share your requirements here to get started on your journey to success!