
Customer Segmentation Analysis via Unsupervised Learning

Juan E. Rolon, 2017.

Overview

In this project, I analyze a dataset containing various customers' annual spending amounts (reported in monetary units) across diverse product categories, with the goal of uncovering the dataset's internal structure.

One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository.

For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

This project was submitted as part of the requirements for the Machine Learning Engineer Nanodegree from Udacity.

Loading Datasets

We run the code block below to load the wholesale customers dataset, along with the Python libraries required for this project. The size of the dataset is reported.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

In this section, I explore the data through visualizations and code to understand how each feature is related to the others. We observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset.

We run the code block below to observe a statistical description of the dataset. We note that the dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'.

We consider what each category represents in terms of products that can be purchased.

In [2]:
# Display a description of the dataset
display(data.describe())
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000

Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, we select a few sample data points and explore them in more detail.

In the code block below, we add three indices to the indices list which will represent the customers to track.

We try different sets of samples until we obtain customers that vary significantly from one another.

In [71]:
# TODO: Select three indices to sample from the dataset
indices = [309, 216, 22]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print "Chosen samples of wholesale customers dataset:"
display(samples)
Chosen samples of wholesale customers dataset:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 918 20655 13567 1465 6846 806
1 2532 16599 36486 179 13308 674
2 31276 1917 4469 9408 2381 4334

Assessment 1

Let us consider the total purchase cost of each product category and the statistical description of the dataset above for the chosen customer samples.

We want to answer the following question:

  • What kind of establishment (customer) could each of the three samples we've chosen represent?

Examples of establishments include places like markets, cafes, delis, wholesale retailers, among many others.

We use the mean spending values for reference to compare the samples. The mean values are as follows:

  • Fresh: 12000.30
  • Milk: 5796.27
  • Grocery: 7951.28
  • Frozen: 3071.93
  • Detergents_Paper: 2881.49
  • Delicatessen: 1524.87

With this information, we want to know the following:

  • How do the samples compare?

  • Does this help in driving our insights into what kind of establishments they might be?

In [95]:
#Generates a bar plot per sample,
#indicating the ratio (Annual Spending per Product) / (Mean Spending per Product) = ASP/MSP.
#Helps identify spending behaviors per product.
#J.E.Rolon

cols = list(data.columns)  # product category names (also used in later cells)
stats = data.describe()
ratios = []
for index in range(len(samples)):
    rlist = []
    for feat in cols:
        rvalue = float(samples[feat][index])/float(stats[feat]['mean'])
        rlist.append(rvalue)
    ratios.append(rlist)
groups = cols
n_groups = len(groups)
ind = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8

plt.figure(1, figsize=(16, 6))
nrws = 1
ncol = 3
for m in range(len(ratios)):
    plt.subplot(nrws, ncol, m+1)
    plt.bar(ind, ratios[m], bar_width, alpha=opacity, color='b', label=None)
    plt.xlabel('Product Categories')
    plt.ylabel('Annual Spending Ratios (Threshold = 1.0)')
    plt.title('Sample {}'.format(m))
    plt.xticks(ind, groups, rotation='vertical')
    plt.legend(frameon=False, loc='upper right', fontsize='small')
plt.tight_layout()
plt.show()

Analysis:

To help answer the above questions, I generated a bar plot per sample, indicating the ratio (Annual Spending per Product) / (Mean Spending per Product) = ASP/MSP.

Bar heights > 1.0 indicate values above average. Bar heights < 1.0 indicate values below average.
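
The same ratios can also be computed in a single vectorized step with pandas broadcasting. The following is a minimal sketch, assuming data and samples are defined as in the cells above:

# Sketch: per-sample spending ratios relative to the dataset means
# (the same ASP/MSP values plotted above).
ratio_df = samples / data.mean()
display(ratio_df.round(2))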

Customer 0 (Cafe):
The first bar plot above shows that this customer is spending above average mostly on "Milk" products (~3.56 MSP), followed by "Detergents_Paper" (~2.37 MSP) and "Grocery" (~1.7 MSP). In my opinion, this customer can represent a Cafe.

Customer 1 (Mini Market): The middle bar plot above shows that this customer is spending mostly above average on "Grocery" (4.58 MSP), "Detergents_Paper" (~4.61 MSP) and "Milk" (~2.86 MSP). In my opinion, this customer would represent a Mini Market.

Customer 2 (Deli Restaurant): The third bar plot on the right shows that this customer is spending mostly above average on "Frozen" products (~3.06 MSP), "Fresh" products (~2.60 MSP) and "Delicatessen" (~2.84 MSP). Given the above, this customer is likely a Deli Restaurant.

Feature Relevance

One interesting thought to consider is whether one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products?

We can make the above determination by training a supervised regression learning algorithm on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In the code block below, we implement the following:

  • Assigning new_data a copy of the data by removing a selected feature using the DataFrame.drop function.
  • Using sklearn.cross_validation.train_test_split to split the dataset into training and testing sets.
  • Using the removed feature as the target label. Setting a test_size of 0.25 and set a random_state.
  • Importing a decision tree regressor, setting a random_state, and fitting the learner to the training data.
  • Reporting the prediction score on the testing set using the coefficient of determination (R^2).

In [5]:
from sklearn.cross_validation import train_test_split

# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
frozen_data = data['Frozen']
features = data.drop('Frozen', axis = 1)

# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.
X_train, X_test, y_train, y_test = train_test_split(features, frozen_data, test_size = 0.25, random_state = 0)

# TODO: Create a decision tree regressor and fit it to the training set
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor(random_state=0)
learner = reg.fit(X_train, y_train) 
y_predict = learner.predict(X_test)


# TODO: Report the score of the prediction using the testing set
from sklearn.metrics import r2_score
score = r2_score(y_test, y_predict)
print "R2 Prediction Score: {}".format(score)
R2 Prediction Score: 0.253973446697
In [8]:
#Attempts to predict each feature
#using a Decision Tree Regressor
#J.E. Rolon

cols = list(data.columns)
for feat in cols:
    feat_data = data[feat]
    features = data.drop(feat, axis = 1)
    X_train, X_test, y_train, y_test = train_test_split(features, 
                   feat_data, test_size = 0.25, random_state = 0)

    reg = DecisionTreeRegressor(random_state=0)
    learner = reg.fit(X_train, y_train) 
    y_predict = learner.predict(X_test)
    score = r2_score(y_test, y_predict)
    print "R2 prediction score for feature {} : {}".format(feat, score)
R2 prediction score for feature Fresh : -0.252469807688
R2 prediction score for feature Milk : 0.365725292736
R2 prediction score for feature Grocery : 0.602801978878
R2 prediction score for feature Frozen : 0.253973446697
R2 prediction score for feature Detergents_Paper : 0.728655181254
R2 prediction score for feature Delicatessen : -11.6636871594

Assessment 2

  • Select a feature for prediction
  • Report the prediction score
  • Assess whether the selected feature is necessary for identifying customers' spending habits

Analysis: We use the coefficient of determination, R^2, which equals 1 for a perfect fit; a negative R^2 implies the model fails to fit the data. If we get a low score for a particular feature, that leads us to believe the feature is hard to predict using the other features, thereby making it an important feature to consider when assessing relevance.
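
As a quick illustration with hypothetical numbers, R^2 equals 1 for a perfect fit, 0 for simply predicting the mean of the targets, and turns negative when the predictions fit worse than the mean:

# Sketch: behavior of the R^2 score on toy (hypothetical) values
from sklearn.metrics import r2_score
y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0  (perfect fit)
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (predicting the mean)
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)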

Analysis: Let us attempt predicting the "Frozen" feature.

The prediction R^2 score was: 0.2538. This feature is important to consider as it is hard to predict using the remaining features.

We also predict the other features by following the same procedure (see the results of the code above). In particular, we can see that "Detergents_Paper" and "Grocery" had moderately good R^2 scores, which means that these features can be predicted reasonably well from the remaining features.

Visualizing Customer Spending Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data.

If the feature we attempt to predict is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others.

Conversely, if we believe the selected feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.

We run the code block below to produce a scatter matrix.

In [3]:
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
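Note: pandas.scatter_matrix is deprecated in newer pandas releases in favor of pandas.plotting.scatter_matrix. A minimal sketch of the equivalent call would be:

# Sketch: same scatter matrix using the newer pandas.plotting API
from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.3, figsize=(14, 8), diagonal='kde');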

Assessment 3

We use the scatter matrix as a reference to discuss the distribution of the dataset, specifically the normality, the presence of outliers, and the large number of data points near 0, among other characteristics.

We want to answer the following questions:

  • Are there any pairs of spending features which exhibit some degree of correlation?

  • Does this confirm or deny our suspicions about the relevance of the feature we attempted to predict?

  • How is the data for those features distributed?

Analysis:

We test the data for normality, i.e., whether each feature is normally distributed. We also want to know where most of the data points lie.

We can use corr() to obtain the feature correlations and then visualize them with a heatmap (the data fed into the heatmap would be the correlation values, e.g. data.corr()) to gain further insight.

Testing for normality for each of the feature distributions: Visual inspection indicates that all feature distributions as given are positively skewed and do not exhibit normality.

We can use SciPy's Normaltest (scipy.stats.normaltest) to test whether each feature sample distribution differs from a normal distribution.

With this function we test the null hypothesis that a sample comes from a normal distribution. We fail to reject this hypothesis if the p-value >= 0.05, and we reject it if the p-value < 0.05. In the latter case, we conclude with 95% confidence that the distribution is not normal. See results below:

In [6]:
#Ans to Assessment 3
#This code cell implements normality
#tests for each feature distribution
#J.E. Rolon

from scipy.stats import normaltest
cols = list(data.columns)
for feat in cols:
    statistic, p = normaltest(data[feat].values)
    if p >= 0.05:
        print "The feature {} likely has a normal distribution".format(feat)
    else:
        print "The feature {} likely doesn't have a normal distribution".format(feat)
The feature Fresh likely doesn't have a normal distribution
The feature Milk likely doesn't have a normal distribution
The feature Grocery likely doesn't have a normal distribution
The feature Frozen likely doesn't have a normal distribution
The feature Detergents_Paper likely doesn't have a normal distribution
The feature Delicatessen likely doesn't have a normal distribution
In [15]:
#Ans to Assessment 3
#This code cell generates a heat map
#of the correlation matrix between features
#J.E. Rolon

import matplotlib.pyplot as plt
import seaborn as sns

cols = list(data.columns)
corr_matrix = np.corrcoef(data[cols].values.T)

plt.figure(1, figsize=(10, 9))
sns.set(font_scale=1.5)
heat_map = sns.heatmap(corr_matrix, cbar=True, annot=True, square=True, fmt='.2f',
               annot_kws = {'size': 15}, yticklabels=cols, xticklabels=cols)
plt.xticks(rotation='vertical')
plt.yticks(rotation='horizontal')
plt.tight_layout()
plt.show()

Testing for correlations among features:

A heat map representation of the feature-feature correlation matrix is shown above. Its matrix elements are the Pearson's product-moment correlation coefficients, r, which measure the linear dependence between pairs of features.

Two features have a perfect positive correlation if r = 1, no correlation if r = 0, and a perfect negative correlation if r = -1.
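
As mentioned earlier, the same matrix can be obtained directly from pandas. A minimal sketch, assuming data is the DataFrame loaded above:

# Sketch: Pearson correlation matrix straight from pandas;
# numerically equivalent to the np.corrcoef call in the cell above.
corr_df = data.corr()
display(corr_df.round(2))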

According to this map, we can attempt to rank features according to their inter-correlations as follows:

Tier 1:
  • "Grocery" and "Detergents_Paper" are highly correlated (r = 0.92)
  • "Grocery" and "Milk" are highly correlated (r = 0.73)
  • "Milk" and "Detergents_Paper" are moderately correlated (r = 0.66)

Tier 2:
  • "Milk" and "Delicatessen" are weakly-to-moderately correlated (r = 0.41)
  • "Delicatessen" and "Frozen" are weakly-to-moderately correlated (r = 0.39)
  • "Frozen" and "Fresh" are weakly-to-moderately correlated (r = 0.35)

The remaining feature pairings are either weakly correlated or nearly uncorrelated.

Regarding our previous expectations about feature relevance, we can see that "Grocery" and "Detergents_Paper" are amongst the most highly correlated features. This confirms our suspicion of their dependence on the other features, in agreement with the R^2 scores discussed in Assessment 2. Observe also that "Frozen" and "Fresh" were only weakly correlated to the other features, which confirms our previous suspicion that these are important features to consider.

CONTINUE TO PART 2
