PROBLEM STATEMENT¶
In this notebook, we explore the potential of OpenAI's GPT-4 Large Language Model (LLM) for intent classification in conversational AI.
As businesses increasingly turn to chatbots and virtual assistants, the demand for robust Natural Language Understanding (NLU) systems is more critical than ever.
Traditional NLU systems require extensive resources and large, labeled datasets, presenting significant challenges.
The introduction of LLMs such as GPT-4 offers a way to streamline this process. Although LLMs can produce inaccurate responses ("hallucinations"), restricting them to intent classification, with replies drawn from a predefined, compliant set, limits that risk while still improving how we handle customer interactions.
Importing Necessary Libraries and Dependencies¶
In [1]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.float_format', lambda x: '%.2f' % x) # To suppress scientific notation in numerical displays
# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')
Loading the Data¶
In [2]:
from google.colab import drive
drive.mount('/content/drive')
path_to_file = '/content/drive/My Drive/ce/telco/'
Mounted at /content/drive
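If this notebook is run outside of Colab, the Drive mount above will fail. A minimal fallback sketch, assuming the Excel file has been copied to a local ./data/ folder (the folder name is an assumption):
In [ ]:
import os
# Fall back to a hypothetical local data folder when Colab's drive module is unavailable
try:
    from google.colab import drive
    drive.mount('/content/drive')
    path_to_file = '/content/drive/My Drive/ce/telco/'
except ModuleNotFoundError:
    path_to_file = './data/'  # assumed local folder containing the Excel file
print('Data folder exists:', os.path.exists(path_to_file))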
In [3]:
df = pd.read_excel(path_to_file + 'intent_classification_results.xlsx')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   assistant                 300 non-null    object
 1   user_response             300 non-null    object
 2   actual_intent             300 non-null    object
 3   openai_intent             300 non-null    object
 4   classification            300 non-null    object
 5   openai_completion_tokens  300 non-null    int64
 6   openai_completion_cost    300 non-null    float64
 7   openai_prompt_tokens      300 non-null    int64
 8   openai_prompt_cost        300 non-null    float64
 9   openai_total_tokens       300 non-null    int64
 10  openai_total_cost         300 non-null    float64
dtypes: float64(3), int64(3), object(5)
memory usage: 25.9+ KB
Data Overview¶
View the first and last 5 rows of the dataset¶
In [ ]:
pd.set_option('display.float_format', lambda x: '%.5f' % x) # To suppress scientific notation in numerical displays
df.head()
In [ ]:
df.tail()
View the shape of the dataset¶
In [6]:
df.shape
Out[6]:
(300, 11)
Data types of the columns¶
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   assistant                 300 non-null    object
 1   user_response             300 non-null    object
 2   actual_intent             300 non-null    object
 3   openai_intent             300 non-null    object
 4   classification            300 non-null    object
 5   openai_completion_tokens  300 non-null    int64
 6   openai_completion_cost    300 non-null    float64
 7   openai_prompt_tokens      300 non-null    int64
 8   openai_prompt_cost        300 non-null    float64
 9   openai_total_tokens       300 non-null    int64
 10  openai_total_cost         300 non-null    float64
dtypes: float64(3), int64(3), object(5)
memory usage: 25.9+ KB
Statistical Analysis¶
In [8]:
df.describe().T
Out[8]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| openai_completion_tokens | 300.00000 | 6.43000 | 0.86463 | 5.00000 | 6.00000 | 6.00000 | 7.00000 | 8.00000 |
| openai_completion_cost | 300.00000 | 0.00019 | 0.00003 | 0.00015 | 0.00018 | 0.00018 | 0.00021 | 0.00024 |
| openai_prompt_tokens | 300.00000 | 684.86333 | 18.48505 | 644.00000 | 672.00000 | 684.00000 | 696.00000 | 775.00000 |
| openai_prompt_cost | 300.00000 | 0.00685 | 0.00018 | 0.00644 | 0.00672 | 0.00684 | 0.00696 | 0.00775 |
| openai_total_tokens | 300.00000 | 691.29333 | 18.56167 | 650.00000 | 678.00000 | 690.00000 | 702.25000 | 782.00000 |
| openai_total_cost | 300.00000 | 0.00704 | 0.00019 | 0.00662 | 0.00691 | 0.00703 | 0.00717 | 0.00796 |
Unique data¶
In [9]:
df.nunique()
Out[9]:
assistant                   200
user_response               297
actual_intent                15
openai_intent                15
classification                2
openai_completion_tokens      4
openai_completion_cost        4
openai_prompt_tokens         78
openai_prompt_cost           78
openai_total_tokens          77
openai_total_cost           114
dtype: int64
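Since only 297 of the 300 user responses are unique, a quick sketch to surface the few repeated responses and how they were labeled:
In [ ]:
# Show rows whose user_response text appears more than once
dupes = df[df.duplicated('user_response', keep=False)]
dupes[['user_response', 'actual_intent', 'openai_intent']].sort_values('user_response')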
Missing Values¶
In [10]:
df.isnull().sum()
Out[10]:
assistant                   0
user_response               0
actual_intent               0
openai_intent               0
classification              0
openai_completion_tokens    0
openai_completion_cost      0
openai_prompt_tokens        0
openai_prompt_cost          0
openai_total_tokens         0
openai_total_cost           0
dtype: int64
In [11]:
# Find the number of missing values (isna is an alias of isnull)
df.isna().sum()
Out[11]:
assistant                   0
user_response               0
actual_intent               0
openai_intent               0
classification              0
openai_completion_tokens    0
openai_completion_cost      0
openai_prompt_tokens        0
openai_prompt_cost          0
openai_total_tokens         0
openai_total_cost           0
dtype: int64
Exploratory Data Analysis¶
(1) What is the total cost of the intent classification?¶
In [12]:
print(f"Total cost for {df['openai_total_cost'].count()} API calls is ${df['openai_total_cost'].sum():.5f}")
Total cost for 300 API calls is $2.11246
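For budgeting, the same cost columns can also be summarized per call and per 1K tokens. A short sketch, assuming the cost columns are denominated in USD:
In [ ]:
# Average cost per classified utterance
print(f"Mean cost per call: ${df['openai_total_cost'].mean():.5f}")
# Implied per-1K-token rates, back-computed from the logged token counts and costs
prompt_rate = 1000 * df['openai_prompt_cost'].sum() / df['openai_prompt_tokens'].sum()
completion_rate = 1000 * df['openai_completion_cost'].sum() / df['openai_completion_tokens'].sum()
print(f"Implied prompt rate: ${prompt_rate:.4f} per 1K tokens")
print(f"Implied completion rate: ${completion_rate:.4f} per 1K tokens")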
(2) Plot Functions¶
In [13]:
def labeled_barplot(data, feature, perc=False, n=None, order=True):
"""
Barplot with labels at the top, with an option to sort by frequency or category
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
order: if True, sort based on frequency (y); if False, sort based on category (x)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
# Determine the order of categories
if order:
order = data[feature].value_counts().index[:n] # Sort by frequency
else:
order = sorted(data[feature].unique())[:n] # Sort by category
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=order,
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
(3) Distribution of actual intents¶
In [14]:
labeled_barplot(df, 'actual_intent', perc=False)
In [15]:
# Actual Intent Unique Values
df['actual_intent'].value_counts()
Out[15]:
actual_intent
confirmation                32
reschedule                  28
tech call before arrival    22
wrong person                22
wrong time                  21
out of scope                20
call request                19
already rescheduled         19
where is the tech           19
issue fixed                 18
issue not fixed             18
cancellation                17
contact details provided    16
stop communications         15
different time slot         14
Name: count, dtype: int64
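To judge class balance at a glance, the same counts can be expressed as percentages; a one-line sketch:
In [ ]:
# Share of each intent in the evaluation set, as percentages
(df['actual_intent'].value_counts(normalize=True) * 100).round(1)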
(4) List of distinct (unique) intents¶
In [16]:
df['actual_intent'].unique()
Out[16]:
array(['call request', 'out of scope', 'wrong time', 'tech call before arrival', 'confirmation', 'contact details provided', 'cancellation', 'issue fixed', 'different time slot', 'reschedule', 'issue not fixed', 'already rescheduled', 'stop communications', 'wrong person', 'where is the tech'], dtype=object)
In [17]:
labels_all = ['call request', 'out of scope', 'wrong time',
'tech call before arrival', 'confirmation',
'contact details provided', 'cancellation', 'issue fixed',
'different time slot', 'reschedule', 'issue not fixed',
'already rescheduled', 'stop communications', 'wrong person',
'where is the tech']
In [18]:
sorted_labels_all = sorted(labels_all)
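Given the hallucination concern raised in the problem statement, it is worth verifying that the model only ever returned intents from the predefined list. A minimal sketch:
In [ ]:
# Any predicted intents outside the predefined label set?
unexpected = set(df['openai_intent'].unique()) - set(labels_all)
print(unexpected if unexpected else 'All predictions fall within the 15 predefined intents.')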
(5) Distribution of predicted intents¶
In [19]:
labeled_barplot(df, 'openai_intent', perc=False)
Overall Performance of the AI Model¶
In [20]:
# Metric functions are already imported from sklearn.metrics at the top of the notebook
def model_performance_classification_sklearn(pred, target):
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred, average='macro') # to compute Recall
precision = precision_score(target, pred, average='macro') # to compute Precision
f1 = f1_score(target, pred, average='macro') # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
In [21]:
pd.set_option('display.float_format', lambda x: '%.2f' % x) # To suppress scientific notation in numerical displays
# Calculate Confusion Matrix
conf_mat = confusion_matrix(df['actual_intent'], df['openai_intent'], labels=sorted_labels_all)
# Initialize ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat, display_labels=sorted_labels_all)
# Plotting
fig, ax = plt.subplots(figsize=(16, 10)) # Set figure size
disp.plot(cmap='YlGnBu', values_format='d', ax=ax)
# Enhancements
ax.set_title('Confusion Matrix', fontsize=16)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12) # Adjusted font size
plt.xticks(fontsize=8, rotation=90) # Adjusted font size and rotation
plt.yticks(fontsize=8)
plt.show()
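To read the matrix numerically, the non-zero off-diagonal cells can be listed as the most frequent confusions. A sketch reusing conf_mat and sorted_labels_all from the cell above:
In [ ]:
# List non-zero off-diagonal cells (actual -> predicted confusions), most frequent first
confusions = [
    (sorted_labels_all[i], sorted_labels_all[j], int(conf_mat[i, j]))
    for i in range(len(sorted_labels_all))
    for j in range(len(sorted_labels_all))
    if i != j and conf_mat[i, j] > 0
]
pd.DataFrame(confusions, columns=['actual', 'predicted', 'count']).sort_values('count', ascending=False)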
In [22]:
pd.set_option('display.float_format', lambda x: '%.2f' % x) # To suppress scientific notation in numerical displays
model_performance_classification_sklearn(df['openai_intent'],df['actual_intent'])
Out[22]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.94 | 0.93 | 0.95 | 0.94 |
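As a cross-check on the per-intent breakdown in the next section, sklearn's classification_report yields precision, recall, F1, and support for every intent from the same two columns. A short sketch:
In [ ]:
from sklearn.metrics import classification_report
# Per-intent precision, recall, F1, and support in one call
print(classification_report(df['actual_intent'], df['openai_intent'], labels=sorted_labels_all, zero_division=0))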
AI Performance per Intent¶
In [23]:
# Function to calculate and print metrics
def calculate_metrics(group):
accuracy = accuracy_score(group['actual_intent'], group['openai_intent'])
recall = recall_score(group['actual_intent'], group['openai_intent'], average='weighted', zero_division=0)
precision = precision_score(group['actual_intent'], group['openai_intent'], average='weighted', zero_division=0)
f1 = f1_score(group['actual_intent'], group['openai_intent'], average='weighted', zero_division=0)
return pd.Series({'accuracy':accuracy,'Recall': recall,'Precision': precision, 'F1 Score': f1})
# Group by actual intent and calculate metrics for each group
metrics = df.groupby('actual_intent').apply(calculate_metrics)
metrics
Out[23]:
| actual_intent | accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| already rescheduled | 1.00 | 1.00 | 1.00 | 1.00 |
| call request | 0.95 | 0.95 | 1.00 | 0.97 |
| cancellation | 0.94 | 0.94 | 1.00 | 0.97 |
| confirmation | 1.00 | 1.00 | 1.00 | 1.00 |
| contact details provided | 0.75 | 0.75 | 1.00 | 0.86 |
| different time slot | 0.93 | 0.93 | 1.00 | 0.96 |
| issue fixed | 1.00 | 1.00 | 1.00 | 1.00 |
| issue not fixed | 0.89 | 0.89 | 1.00 | 0.94 |
| out of scope | 0.90 | 0.90 | 1.00 | 0.95 |
| reschedule | 0.96 | 0.96 | 1.00 | 0.98 |
| stop communications | 0.93 | 0.93 | 1.00 | 0.97 |
| tech call before arrival | 0.91 | 0.91 | 1.00 | 0.95 |
| where is the tech | 0.89 | 0.89 | 1.00 | 0.94 |
| wrong person | 0.95 | 0.95 | 1.00 | 0.98 |
| wrong time | 1.00 | 1.00 | 1.00 | 1.00 |
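Note that because each group above contains only one true intent, the weighted Precision is trivially 1.00 (no other true class can appear inside a group) and Recall coincides with accuracy; the classification_report sketch above gives the cross-group view. For error analysis, the misclassified turns can be pulled out directly; a minimal sketch using the existing columns:
In [ ]:
# Inspect the rows the model got wrong (the classification column also flags these)
errors = df.loc[df['actual_intent'] != df['openai_intent'], ['user_response', 'actual_intent', 'openai_intent']]
print(f'{len(errors)} misclassified responses out of {len(df)}')
errors.head(10)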