Overview¶
In this project, I analyze the internal structure of a dataset containing various customers' annual spending amounts (reported in monetary units) across diverse product categories.
One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.
The dataset for this project can be found on the UCI Machine Learning Repository.
For the purposes of this project, the features 'Channel' and 'Region' are excluded from the analysis, with the focus instead on the six product categories recorded for customers.
This project was submitted as part of the requirements for the Machine Learning Engineer Nanodegree from Udacity.
Loading Datasets¶
We run the code block below to load the wholesale customers dataset, along with a few of the necessary Python libraries required for this project. The size of the dataset is reported.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualizations code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")
Data Exploration¶
In this section, I explore the data through visualizations and code to understand how each feature is related to the others. We observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset.
We run the code block below to observe a statistical description of the dataset. We note that the dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'.
We consider what each category represents in terms of products that can be purchased.
# Display a description of the dataset
display(data.describe())
Selecting Samples¶
To get a better understanding of the customers and how their data will transform through the analysis, we select a few sample data points and explore them in more detail.
In the code block below, we add three indices to the indices list, which represent the customers to track.
We try different sets of samples until obtaining customers that vary significantly from one another.
# TODO: Select three indices to sample from the dataset
indices = [309, 216, 22]
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
Assessment 1¶
Let us consider the total purchase cost of each product category and the statistical description of the dataset above for the chosen sample customers.
We want to answer the following question:
- What kind of establishment (customer) could each of the three samples we've chosen represent?
Examples of establishments include places like markets, cafes, delis, wholesale retailers, among many others.
We use the mean spending values for reference to compare the samples. The mean values are as follows:
- Fresh: 12000.2977
- Milk: 5796.2
- Grocery: 7951.27
- Frozen: 3071.9
- Detergents_Paper: 2881.4
- Delicatessen: 1524.8
With this information, we want to know the following:
How do the samples compare?
Does this help in driving our insights into what kind of establishments they might be?
#Generates a bar plot per sample,
#indicating the ratio (Annual Spending per Product) / (Mean Spending per Product) = ASP/MSP.
#Helps identifying spending behaviors per product.
#J.E.Rolon
stats = data.describe()
cols = list(data.columns)  # product categories (defined here so this cell runs on its own)
ratios = []
for index in range(len(samples)):
    rlist = []
    for feat in cols:
        # Ratio of this sample's spending to the dataset mean for the category
        rvalue = float(samples[feat][index])/float(stats[feat]['mean'])
        rlist.append(rvalue)
    ratios.append(rlist)
groups = cols
n_groups = len(groups)
ind = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8
plt.figure(1, figsize=(16, 6))
nrws = 1
ncol = 3
# One subplot per sample customer
for m in range(len(ratios)):
    plt.subplot(nrws, ncol, m+1)
    plt.bar(ind, ratios[m], bar_width, alpha=opacity, color='b', label=None)
    plt.xlabel('Product Categories')
    plt.ylabel('Annual Spending Ratios (Threshold = 1.0)')
    plt.title('Sample {}'.format(m))
    plt.xticks(ind, groups, rotation='vertical')
    plt.legend(frameon=False, loc='upper right', fontsize='small')
plt.tight_layout()
plt.show()
Analysis:
To help answer the above questions, I generated a bar plot per sample, indicating the ratio (Annual Spending per Product) / (Mean Spending per Product) = ASP/MSP.
Bar heights > 1.0 indicate values above average. Bar heights < 1.0 indicate values below average.
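The same ASP/MSP ratios can also be computed without explicit loops. The following is a minimal sketch, assuming a small made-up spending table; the values and the names toy_data, toy_samples and toy_ratios are hypothetical and not taken from the wholesale dataset. It relies on pandas aligning the division on column labels.

```python
import pandas as pd

# Hypothetical spending table (not the wholesale data), one row per customer
toy_data = pd.DataFrame({
    'Fresh': [1000.0, 3000.0, 2000.0, 6000.0],
    'Milk':  [4000.0, 1000.0, 2000.0, 1000.0],
})

# Pick two customers to inspect, mirroring the use of `indices` in the notebook
toy_samples = toy_data.loc[[0, 3]].reset_index(drop=True)

# ASP/MSP: each sample's spending divided by the column mean.
# Dividing a DataFrame by a Series aligns on column labels.
toy_ratios = toy_samples / toy_data.mean()

print(toy_ratios)
```

With the notebook's own variables, the equivalent expression would be samples / data.mean().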
Customer 0 (Cafe): The first bar plot above shows that this customer spends above average mostly on "Milk" products (~3.56 MSP), followed by "Detergents_Paper" (~2.37 MSP) and "Grocery" (~1.7 MSP). In my opinion, this customer could represent a cafe.
Customer 1 (Mini Market): The middle bar plot above shows that this customer spends well above average on "Detergents_Paper" (~4.61 MSP), "Grocery" (4.58 MSP) and "Milk" (~2.86 MSP). In my opinion, this customer could represent a mini market.
Customer 2 (Deli Restaurant): The third bar plot, on the right, shows that this customer spends well above average on "Frozen" products (~3.06 MSP), "Delicatessen" (~2.84 MSP) and "Fresh" products (~2.60 MSP). Given the above, this customer is likely a deli restaurant.
Feature Relevance¶
One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products?
We can make the above determination by training a supervised regression learning algorithm on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.
In the code block below, we implement the following:
- Assigning new_data a copy of the data with a selected feature removed, using the DataFrame.drop function.
- Using sklearn.cross_validation.train_test_split to split the dataset into training and testing sets, with the removed feature as the target label, a test_size of 0.25, and a fixed random_state.
- Importing a decision tree regressor, setting a random_state, and fitting the learner to the training data.
- Reporting the prediction score on the testing set using the regressor's score function.
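# train_test_split lives in sklearn.model_selection from scikit-learn 0.18 onward;
# fall back to the older sklearn.cross_validation module otherwise
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split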
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
frozen_data = data['Frozen']
features = data.drop('Frozen', axis = 1)
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(features, frozen_data, test_size = 0.25, random_state = 0)
# TODO: Create a decision tree regressor and fit it to the training set
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor(random_state=0)
learner = reg.fit(X_train, y_train)
y_predict = learner.predict(X_test)
# TODO: Report the score of the prediction using the testing set
from sklearn.metrics import r2_score
score = r2_score(y_test, y_predict)
print("R2 Prediction Score: {}".format(score))
#Attempts to predict each feature
#using a Decision Tree Regressor
#J.E. Rolon
cols = list(data.columns)
for feat in cols:
    # Use the current feature as the target and the rest as predictors
    feat_data = data[feat]
    features = data.drop(feat, axis = 1)
    X_train, X_test, y_train, y_test = train_test_split(features,
        feat_data, test_size = 0.25, random_state = 0)
    reg = DecisionTreeRegressor(random_state=0)
    learner = reg.fit(X_train, y_train)
    y_predict = learner.predict(X_test)
    score = r2_score(y_test, y_predict)
    print("R2 prediction score for feature {} : {}".format(feat, score))
Assessment 2¶
- Select a feature for prediction
- Report the prediction score
- Assess whether the selected feature is necessary for identifying customers' spending habits
Analysis:
We use the coefficient of determination, R^2, which is at most 1, with 1 being a perfect fit and 0 corresponding to a model that does no better than always predicting the mean of the target. A negative R^2 implies the model performs worse than that constant baseline, i.e. it fails to fit the data. If we get a low score for a particular feature, that leads us to believe that the feature is hard to predict from the other features, which makes it an important feature to keep when considering relevance.
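For reference, the coefficient of determination is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below uses made-up numbers (the helper r2_by_hand and all variable names are illustrative) to check this formula against sklearn.metrics.r2_score and to show how a poor model yields a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.0, 2.0, 3.0, 4.0]  # perfect fit          -> R^2 = 1
mean_pred = [2.5, 2.5, 2.5, 2.5]  # always the mean      -> R^2 = 0
bad_pred = [4.0, 3.0, 2.0, 1.0]   # worse than the mean  -> R^2 = 1 - 20/5 = -3

for name, pred in [('good', good_pred), ('mean', mean_pred), ('bad', bad_pred)]:
    print("{}: by hand = {:.2f}, sklearn = {:.2f}".format(
        name, r2_by_hand(y_true, pred), r2_score(y_true, pred)))
```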
Let us attempt to predict the "Frozen" feature.
The prediction R^2 score was 0.2538. This feature is important to consider because it is hard to predict from the remaining features.
We also predict the other features by following the same procedure (see the results of the code above). In particular, "Detergents_Paper" and "Grocery" had moderately good R2 scores, which means that these features depend to some extent on the other features.
Visualizing Customer Spending Distributions¶
To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data.
If the feature we attempt to predict is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others.
Conversely, if we believe the selected feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.
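The link between correlation and predictability described above can be illustrated on synthetic data. This is only a sketch under stated assumptions: the table toy and its columns a, b and c are invented for the illustration, and the import assumes scikit-learn 0.18 or later (where train_test_split lives in sklearn.model_selection). Column b is built from a plus a little noise, while c is independent of both.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n = 400
a = rng.normal(size=n)
toy = pd.DataFrame({
    'a': a,
    'b': 2.0 * a + rng.normal(scale=0.1, size=n),  # strongly tied to 'a'
    'c': rng.normal(size=n),                       # independent of 'a' and 'b'
})

# Predict each column from the remaining ones and record the held-out R^2
toy_scores = {}
for feat in toy.columns:
    X = toy.drop(feat, axis=1)
    y = toy[feat]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    toy_scores[feat] = reg.score(X_test, y_test)
    print("{}: R^2 = {:.2f}".format(feat, toy_scores[feat]))
```

In this toy setting the correlated columns a and b should score close to 1, while the independent column c should score poorly, which is the pattern the scatter matrix is meant to reveal visually.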
We run the code block below to produce a scatter matrix.
# Produce a scatter matrix for each pair of features in the data
# (pandas >= 0.20 exposes this function as pd.plotting.scatter_matrix)
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
Assessment 3¶
We use the scatter matrix as a reference to discuss the distribution of the dataset, specifically its normality, its outliers, and the large number of data points near 0, among other properties.
We want to answer the following questions:
Are there any pairs of spending features which exhibit some degree of correlation?
Does this confirm or deny our suspicions about the relevance of the feature we attempted to predict?
How is the data for those features distributed?
Analysis:
We examine whether the data are normally distributed and where most of the data points lie.
To gain further insight, we can compute the feature correlations with corr() (for example, data.corr()) and visualize the resulting correlation values as a heatmap.
Testing for normality for each of the feature distributions: Visual inspection indicates that all feature distributions as given are positively skewed and do not exhibit normality.
We can use SciPy's normaltest function (scipy.stats.normaltest) to test whether each feature's sample distribution differs from a normal distribution.
With this function we test the null hypothesis that a sample comes from a normal distribution. We fail to reject this hypothesis if the p-value >= 0.05, in which case the data are compatible with normality (although this does not prove it). We reject this hypothesis if the p-value < 0.05 and conclude, at the 5% significance level, that the distribution is not normal. See results below:
#Ans to Assessment 3
#This code cell implements normality
#tests for each feature distribution
#J.E. Rolon
from scipy.stats import normaltest
cols = list(data.columns)
for feat in cols:
    # Null hypothesis: the feature values come from a normal distribution
    statistic, p = normaltest(data[feat].values)
    if p >= 0.05:
        print("The feature {} likely has a normal distribution".format(feat))
    else:
        print("The feature {} likely doesn't have a normal distribution".format(feat))
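The visual impression of positive skew can also be quantified. The sketch below is an illustration on synthetic data only: toy_spending is a hypothetical log-normal sample (its parameters and size are arbitrary choices), not one of the wholesale features. It reports the sample skewness with scipy.stats.skew alongside the normaltest p-value.

```python
import numpy as np
from scipy.stats import normaltest, skew

rng = np.random.RandomState(0)
# Hypothetical right-skewed "spending" sample (log-normal), not the real data
toy_spending = rng.lognormal(mean=8.0, sigma=1.0, size=500)

toy_skew = skew(toy_spending)            # > 0 means a longer right tail
statistic, p = normaltest(toy_spending)  # null hypothesis: sample is normal

print("skewness = {:.2f}".format(toy_skew))
print("normaltest p-value = {:.3g}".format(p))
if p < 0.05:
    print("Reject normality at the 5% significance level")
else:
    print("Cannot reject normality at the 5% significance level")
```

Applied to the notebook's data, the same two calls could be run on data[feat].values for each feature.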
#Ans to Question 3
#This code cell generates a heat map
#of the correlation matrix between features
#J.E. Rolon
import matplotlib.pyplot as plt
import seaborn as sns
cols = list(data.columns)
corr_matrix = np.corrcoef(data[cols].values.T)
plt.figure(1, figsize=(10, 9))
sns.set(font_scale=1.5)
heat_map = sns.heatmap(corr_matrix, cbar=True, annot=True, square=True, fmt='.2f',
annot_kws = {'size': 15}, yticklabels=cols, xticklabels=cols)
plt.xticks(rotation='vertical')
plt.yticks(rotation='horizontal')
plt.tight_layout()
plt.show()
Testing for correlations among features:
A heat map representation of the feature-feature correlation matrix is shown above. Its matrix elements are the Pearson's product-moment correlation coefficients, r, which measure the linear dependence between pairs of features.
Two features have a perfect positive correlation if r = 1, no linear correlation if r = 0, and a perfect negative correlation if r = -1.
According to this map, we can attempt to rank features according to their inter-correlations as follows:
Tier 1:
- "Grocery" and "Detergents_Paper" are highly correlated (r = 0.92)
- "Grocery" and "Milk" are highly correlated (r = 0.73)
- "Milk" and "Detergents_Paper" are moderately correlated (r = 0.66)
Tier 2:
- "Milk" and "Delicatessen" are weakly-to-moderately correlated (r = 0.41)
- "Delicatessen" and "Frozen" are weakly-to-moderately correlated (r = 0.39)
- "Frozen" and "Fresh" are weakly-to-moderately correlated (r = 0.35)
The remaining feature pairings are either weakly correlated or nearly uncorrelated.
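To make the coefficient r used in this ranking concrete, the sketch below computes Pearson's r directly from its definition, r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)), on small made-up arrays, and compares it with np.corrcoef. The helper pearson_r and the example arrays are illustrative only.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_sym = np.array([1.0, 0.0, 2.0, 0.0, 1.0])  # symmetric pattern, no linear trend

r_pos = pearson_r(x, 2.0 * x + 1.0)    # exact increasing line
r_neg = pearson_r(x, -3.0 * x + 10.0)  # exact decreasing line
r_zero = pearson_r(x, y_sym)

print("increasing line: r = {:.2f}".format(r_pos))
print("decreasing line: r = {:.2f}".format(r_neg))
print("symmetric pattern: r = {:.2f}".format(r_zero))
print("np.corrcoef agrees: {}".format(
    np.isclose(r_neg, np.corrcoef(x, -3.0 * x + 10.0)[0, 1])))
```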
Regarding our previous expectations about feature relevance, we can see that "Grocery" and "Detergents_Paper" are the most highly correlated pair of features. This confirms our suspicion, based on the R2 scores discussed in Assessment 2, that they depend on other features. Observe that "Frozen" was only weakly correlated with the other features (at most r=0.39, with "Delicatessen"), which confirms our previous suspicion that it is an important feature to consider. "Milk", in contrast, is correlated with "Grocery" (r=0.73) and "Detergents_Paper" (r=0.66), so it is less independent of the other features than "Frozen".