Enhancing Productivity with AI: A Step-by-Step Guide to Making ChatGPT APIs Analyse Your Excel File

Santhosh Kumar Setty
6 min read · Jun 15, 2023



Introduction

In an era where data is abundant, harnessing its power can fuel smarter decision-making and productivity enhancement. This is where AI, particularly language models like ChatGPT, comes into play. Imagine being able to ask natural language questions about your data and receiving clear, intelligent responses. No complex SQL queries, no elaborate report generation — just you, your data, and AI.

In this article, I am not focusing on ready-made plugins or on using ChatGPT to write Excel formulas. Instead, I am going to explain how to use Python and OpenAI's APIs directly to upload CSVs and analyse data in ways that would otherwise require extensive research.

Why Use ChatGPT to Analyse Excel or CSV Files?

Simple tasks such as calculating column averages or looking up values don't necessitate the aid of ChatGPT. AI should be employed only when the manual effort required is substantial and time is of the essence. The goal is to make data more accessible and easier to understand. Instead of manually combing through rows and columns or relying on hard-coded formulas, you can use AI to process your data intelligently. ChatGPT can provide insights, identify trends, and even generate reports. The marriage of AI and data analysis can take your productivity to the next level. To demonstrate this, I will analyse a publicly available Netflix dataset to assess the quality of its film content, showcasing the potential of combining AI and data analysis for insightful research.

How To Use ChatGPT to Analyse Your CSV Files?

Step 1: Prepare Your Data

Start by organizing your CSV file. Ensure the data is clean, without any incomplete or inaccurate entries. Give meaningful names to your columns, which will act as data labels for your AI assistant. In this example, I am using a publicly available Netflix dataset from Kaggle. Download the data and save the file in .csv format. Using this data, I am going to identify the distribution of good versus mediocre content on Netflix. The dataset contains basic information such as show type, title, director, cast, genre, and description.

Step 2: Install Python

Python is favored for data analysis due to its simplicity, extensive data-centric libraries, and seamless integration with AI and machine learning tools. In our case, we mainly use Python to connect to OpenAI's APIs.

Refer to this step-by-step guide to install Python (and access it through a Jupyter notebook) and get started.

Step 3: Importing Necessary Libraries

Import the necessary Python libraries. If they are not installed yet, run pip install openai pandas in a terminal first. Then, in a new Jupyter notebook cell, run:

import openai
import pandas as pd

Here, two Python libraries are imported.

  • openai is the official OpenAI library that provides the Python interface to the OpenAI API. It allows the program to interact with the GPT-3 model.
  • pandas is a library offering data manipulation and analysis tools. It’s used in this script to handle the CSV data.

Step 4: OpenAI API Key Configuration

openai.api_key = 'ENTER YOUR API KEY HERE'

This line sets the API key for OpenAI, which is necessary for making API calls to OpenAI’s servers.
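
As a side note, hardcoding a key in a script is risky if the file is ever shared. A safer pattern, assuming you have set an OPENAI_API_KEY environment variable, is to read it at runtime:

import os

# Read the key from the environment instead of hardcoding it
# (assumes OPENAI_API_KEY has been set in your shell or notebook).
openai.api_key = os.environ.get("OPENAI_API_KEY")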

Step 5: Load the .csv file into Python

Use the pandas library to read the .csv file.

file_path = r'C:\XXXX\YYYY\Documents\ZZZZ\netflix.csv'
df = pd.read_csv(file_path)

The script reads a CSV file from a given path using the pd.read_csv function, and stores the result in a pandas DataFrame.
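
Before going further, it is worth a quick sanity check that the file loaded as expected. A minimal sketch, assuming the column names of the Kaggle Netflix dataset:

print(df.shape)                   # number of rows and columns
print(df.columns.tolist())        # column labels
print(df['type'].value_counts())  # counts of Movies vs TV Shows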

Step 6: Filtering the Data

df_movies = df[df['type'] == 'Movie']

Here, the code filters the DataFrame down to movies by keeping only the rows where 'type' equals 'Movie'.
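
One caveat: assigning a new column to this filtered slice later (as Step 10 does) can trigger pandas' SettingWithCopyWarning. Taking an explicit copy avoids that:

# Explicit copy so that later column assignments do not warn
# about writing to a view of the original DataFrame.
df_movies = df[df['type'] == 'Movie'].copy()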

Step 7: Initialize Lists for Movie Titles and Predicted Categories

movie_titles = []
predicted_categories = []

Two lists are initialized here. One is for storing movie titles and the other for storing the categories predicted by the OpenAI API.

Step 8: Batching

batch_size = 500
# Ceiling division: avoids an extra, empty batch when the
# number of movies is an exact multiple of batch_size.
num_batches = (len(df_movies) + batch_size - 1) // batch_size

The code uses batching to process the data and make predictions. This lets the script handle larger datasets without exceeding the OpenAI API's rate limits or output length limits. For example, if df_movies held 1,250 rows, a batch size of 500 would yield three batches of 500, 500, and 250 titles. Keeping the number of calls down also matters because calling the OpenAI APIs many times can be quite expensive.

Step 9: Batch Processing and Predicting Categories

The script enters a loop to process the data in batches. For each batch, it gets the titles of the movies, sets up a prompt for the GPT-3 model based on those titles, uses the OpenAI API to generate predicted categories, and extracts those categories from the API response. Feel free to use other models, as listed in the API reference documentation.

9.1 Setting up Start and End Indices

# Everything in sub-steps 9.1 to 9.5 runs inside this loop.
for batch_num in range(num_batches):
    start_idx = batch_num * batch_size
    end_idx = (batch_num + 1) * batch_size

Here, the script sets up the start and end indices for each batch of data, which determine which movies are included in each batch. Since the end index of a slice is exclusive, batch 0 covers rows 0 to 499, batch 1 covers rows 500 to 999, and so on.

9.2 Getting Batch Titles

batch_titles = df_movies.iloc[start_idx:end_idx]['title'].tolist()
movie_titles.extend(batch_titles)

In this step, the script selects the titles of the movies in the current batch, converts them to a list, and then extends the movie_titles list with these batch titles.

9.3 Setting up the Prompt

prompt = "\n".join([
    f"Classify the movie '{title}' into the categories: Very Good, Good, Average, Bad, or Very Bad based on its rating."
    for title in batch_titles
])

This step sets up the prompt for the GPT-3 model. For each movie title in the batch, it generates a statement asking the model to classify the movie into one of the five categories.
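
For instance, with two hypothetical titles the generated prompt would read:

Classify the movie 'Movie A' into the categories: Very Good, Good, Average, Bad, or Very Bad based on its rating.
Classify the movie 'Movie B' into the categories: Very Good, Good, Average, Bad, or Very Bad based on its rating.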

9.4 Using OpenAI API to Generate Predicted Categories

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=len(batch_titles) * 5,
    temperature=0.3,
    n=len(batch_titles),
    stop=None,
)

This part is where the OpenAI GPT-3 model is actually called.

  • engine specifies the GPT-3 model variant to use. In this case, it’s “text-davinci-003”.
  • prompt passes the generated prompt to the model.
  • max_tokens is the maximum length of the output, set to five times the number of titles in the batch.
  • temperature controls the randomness of the model’s output. Lower values (like 0.3 here) make the output more deterministic, while higher values make it more random.
  • n sets how many separate completions the model generates for the prompt; the script sets it to the number of titles in the batch.
  • stop can be used to set stopping conditions for the model’s text generation, but here it’s not used.
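
Because rate limits are one of the reasons for batching in the first place, it can help to wrap this call in a simple retry with backoff. A minimal sketch, assuming the pre-1.0 openai Python library, in which rate-limit failures raise openai.error.RateLimitError:

import time

# Retry the call up to three times, backing off between attempts.
for attempt in range(3):
    try:
        response = openai.Completion.create(
            engine="text-davinci-003",
            prompt=prompt,
            max_tokens=len(batch_titles) * 5,
            temperature=0.3,
            n=len(batch_titles),
            stop=None,
        )
        break
    except openai.error.RateLimitError:
        if attempt == 2:
            raise  # give up after the third failure
        time.sleep(2 ** attempt)  # wait 1s, then 2s, before retrying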

9.5 Extracting Predicted Categories from the API Response

predicted_batch_categories = [choice.text.strip() for choice in response.choices]
predicted_categories.extend(predicted_batch_categories)

This step extracts the predicted categories from the API response and then extends the ‘predicted_categories’ list with these batch categories.

So, that completes the step-by-step breakdown of this Python script using OpenAI GPT-3 for movie categorization. By applying machine learning to movie categorization, this script offers a novel approach to organizing and understanding movie data.
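
For reference, here is the whole loop assembled in one place, matching sub-steps 9.1 through 9.5:

for batch_num in range(num_batches):
    # 9.1: slice boundaries for this batch
    start_idx = batch_num * batch_size
    end_idx = (batch_num + 1) * batch_size

    # 9.2: titles of the movies in this batch
    batch_titles = df_movies.iloc[start_idx:end_idx]['title'].tolist()
    movie_titles.extend(batch_titles)

    # 9.3: one classification instruction per title, joined into one prompt
    prompt = "\n".join([
        f"Classify the movie '{title}' into the categories: Very Good, Good, Average, Bad, or Very Bad based on its rating."
        for title in batch_titles
    ])

    # 9.4: call the model
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=len(batch_titles) * 5,
        temperature=0.3,
        n=len(batch_titles),
        stop=None,
    )

    # 9.5: collect the predicted categories
    predicted_batch_categories = [choice.text.strip() for choice in response.choices]
    predicted_categories.extend(predicted_batch_categories)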

Step 10: Adding Predicted Categories to DataFrame

df_movies['predicted_category'] = predicted_categories[:len(df_movies)]

After the loop finishes processing all batches, the script adds the predicted categories to the DataFrame. The [:len(df_movies)] slice trims any extra predictions so the new column matches the number of rows.

Step 11: Printing the Result

print(df_movies[['title', 'predicted_category']])

Finally, the script prints the DataFrame with movie titles and their predicted categories.

The resulting dataset now holds each movie title alongside its predicted rating category, which is the result we want.
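
If you prefer to stay in Python, you can export the classified data to a new CSV or compute the category distribution directly with pandas (the output filename below is just an example):

# Save the titles and their predicted categories for further analysis.
df_movies[['title', 'predicted_category']].to_csv('netflix_classified.csv', index=False)

# Summarise how many movies fall into each category.
print(df_movies['predicted_category'].value_counts())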

When I exported the resulting dataset and did a simple distribution analysis in Excel, a clear pattern emerged.

The results indicate that a significant number of Netflix’s movie offerings tend to exhibit a more moderate quality in terms of content.

This analysis sheds light on the immense power of OpenAI’s models in driving data-driven research. Imagine the laborious task of manually obtaining rating data or making API calls to platforms like IMDb for every single movie. However, by leveraging OpenAI’s cutting-edge language models, researchers can unlock a world of possibilities. With a well-defined prompt, a wealth of information can be effortlessly extracted, empowering researchers to derive insightful conclusions and explore endless avenues of combining AI with data analysis.

The fusion of AI and data analysis holds tremendous potential, revolutionizing the way we approach research and opening up new horizons for discovery.

Disclosure: The content in this article was aided by artificial intelligence tools. The image within this piece was rendered by Midjourney, a platform leveraging AI for image generation. Segments of the shared code were devised using ChatGPT, an advanced AI language model from OpenAI. For ensuring grammatical accuracy, I utilized the AI-driven tool Grammarly.

