PCA Biplot¶
A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case `Dimension 1` and `Dimension 2`).
In addition, the biplot shows the projection of the original features along the components. A biplot aids in the interpretation of the reduced dimensions of the data, and helps discover relationships between the principal components and the original features.
We run the code cell below to generate a biplot of the reduced-dimension data.
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
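For reference, the red feature arrows that `vs.biplot` overlays can be reproduced with plain matplotlib from `pca.components_`. The sketch below runs on synthetic log-normal data; the names `toy_cols`, `pca_toy` and `arrow_scale` are illustrative, not part of the project code.

```python
# Sketch of a biplot with plain matplotlib (synthetic stand-in data).
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
toy_cols = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']
toy_data = pd.DataFrame(rng.lognormal(mean=8, sigma=1, size=(100, 6)), columns=toy_cols)
log_toy = np.log(toy_data)

pca_toy = PCA(n_components=2).fit(log_toy)
toy_scores = pca_toy.transform(log_toy)

fig, ax = plt.subplots()
ax.scatter(toy_scores[:, 0], toy_scores[:, 1], alpha=0.5)

# Each original feature is drawn as an arrow along its loadings
# on the two retained components.
arrow_scale = 3.0  # purely visual scaling of the arrows
for i, feat in enumerate(toy_cols):
    dx = arrow_scale * pca_toy.components_[0, i]
    dy = arrow_scale * pca_toy.components_[1, i]
    ax.arrow(0, 0, dx, dy, color='r', head_width=0.05)
    ax.text(1.1 * dx, 1.1 * dy, feat, color='r')
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
fig.savefig('biplot_sketch.png')
```

The arrow for a feature points in the direction along which that feature increases in the reduced space, which is what makes the interpretation in the next observation possible.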
Observation¶
Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the lower-right corner of the figure will likely correspond to a customer that spends a lot on `'Milk'`, `'Grocery'` and `'Detergents_Paper'`, but not so much on the other product categories.
From the biplot, we need to determine which of the original features are most strongly correlated with the first component, and separately with the second component.
We need to assess whether these observations agree with the pca_results plot obtained earlier.
Analysis:
The biplot indicates that "Detergents_Paper", "Grocery" and "Milk" are most strongly correlated with the first component (PC1), while "Frozen", "Fresh" and "Delicatessen" are most strongly correlated with the second component (PC2). Indeed, these results agree with our PCA analysis discussed in Question 5.
Clustering¶
In this step we evaluate a K-Means clustering algorithm and a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data.
With these procedures we recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.
Assessment 6¶
At this point we need to answer the following questions:
What are the advantages to using a K-Means clustering algorithm?
What are the advantages to using a Gaussian Mixture Model clustering algorithm?
Given the observations on the wholesale customer data, which of the two algorithms yields better results?
Analysis:
Advantages of K-Means clustering
1. Given enough time, K-Means will always converge (caution: sometimes only to a local minimum).
2. It scales very well with the number of samples.
3. It is computationally very efficient compared to other clustering algorithms.
4. It is easy to implement and to understand.
5. It is used across a large range of applications in many different fields (popular).
Advantages of Gaussian Mixture clustering
1. It represents a generalization of K-Means.
2. It incorporates information about the covariance structure of the data.
3. Its expectation-maximization fit is the fastest algorithm for learning mixture models.
4. It will not bias the means towards zero.
5. It will not bias the cluster sizes to have specific structures.
In my opinion, given the data observations, K-Means will be sufficient for the following reasons:
1. The variance of the data seems to be controlled by a small number of PCs and latent features, which suggests a model of moderate complexity.
2. The dimensionality of the data seems appropriate for K-Means.
3. We have pre-processed the data by feature scaling and have removed outliers that caused extreme asymmetrical deviations from sphericity (we don't expect the clusters to be perfectly spherical, though).
4. Although we don't expect perfectly symmetrical clusters, we do expect approximately well-defined boundaries between clusters, since different spending behaviors should provide some degree of discrimination.
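Since only K-Means is exercised in the cells that follow, a minimal sketch of the Gaussian Mixture alternative may be useful for comparison. Synthetic blobs stand in for `reduced_data`, and the names (`X_toy`, `gmm_preds`, ...) are illustrative:

```python
# Hedged sketch: a Gaussian Mixture Model fit the same way the
# K-Means cells below fit clusters. Synthetic blobs stand in for
# reduced_data, which is not reproduced here.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

X_toy, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.5, random_state=0)

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X_toy)
gmm_preds = gmm.predict(X_toy)        # hard assignments, comparable to K-Means
gmm_probs = gmm.predict_proba(X_toy)  # soft memberships, not available in K-Means
gmm_centers = gmm.means_              # analogue of cluster_centers_

sil = silhouette_score(X_toy, gmm_preds)
print("GMM silhouette =", round(sil, 4))
```

The soft memberships from `predict_proba` are exactly the covariance-aware generalization of K-Means mentioned in the advantages above.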
Cluster Generation¶
Depending on the problem, the number of clusters expected to appear in the data may already be known.
When the number of clusters is not known a priori, there is no guarantee that a given number of clusters will yield the best segments of the data, since it is unclear what structure exists in the data — if any.
In any case, we can quantify the "goodness" of a clustering by calculating the silhouette coefficient of each data point.
The silhouette coefficient for a data point measures how similar it is to its assigned cluster, from -1 (dissimilar) to 1 (similar); it is computed as s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points in the nearest other cluster. The mean silhouette coefficient over all points provides a simple scoring method for a given clustering.
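As a quick sanity check of this definition, the coefficient for a single point can be computed by hand and compared against `sklearn.metrics.silhouette_samples`. The four points and two clusters below are made-up toy data:

```python
# Toy check of s = (b - a) / max(a, b) against sklearn, for one point.
import numpy as np
from sklearn.metrics import silhouette_samples

toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
toy_labels = np.array([0, 0, 1, 1])

i = 0
same = [j for j in range(len(toy_X)) if toy_labels[j] == toy_labels[i] and j != i]
other = [j for j in range(len(toy_X)) if toy_labels[j] != toy_labels[i]]
a = np.mean([np.linalg.norm(toy_X[i] - toy_X[j]) for j in same])   # mean intra-cluster distance
b = np.mean([np.linalg.norm(toy_X[i] - toy_X[j]) for j in other])  # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(toy_X, toy_labels)[i]
print(round(s_manual, 4), round(s_sklearn, 4))
```

With only two clusters, the nearest other cluster is simply the opposite one, which keeps the hand computation short.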
In the code block below, we implement the following:
- Fit a clustering algorithm to the `reduced_data` and assign it to `clusterer`.
- Predict the cluster for each data point in `reduced_data` using `clusterer.predict` and assign them to `preds`.
- Find the cluster centers using the algorithm's respective attribute and assign them to `centers`.
- Predict the cluster for each sample data point in `pca_samples` and assign them to `sample_preds`.
- Import `sklearn.metrics.silhouette_score` and calculate the silhouette score of `reduced_data` against `preds`.
- Assign the silhouette score to `score` and print the result.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

print("K-Means Silhouette Scoring Tests:\n")
for kn in range(2, 9):
    # Apply the clustering algorithm to the reduced data
    clm = KMeans(n_clusters=kn, random_state=0)
    clm.fit(reduced_data)
    # Predict the cluster for each data point
    preds = clm.predict(reduced_data)
    # Find the cluster centers
    centers = clm.cluster_centers_
    # Predict the cluster for each transformed sample data point
    sample_preds = clm.predict(pca_samples)
    # Calculate the mean silhouette coefficient for the chosen number of clusters
    score = silhouette_score(reduced_data, preds)
    print("Number of clusters = {}, Score = {}".format(kn, np.round(score, 4)))
Assessment 7¶
- When performing K-Means, it is important to report the silhouette score for several numbers of clusters and to determine which number of clusters yields the best silhouette score.
Analysis: According to the results shown above, 2 clusters yield the best silhouette score.
Cluster Visualization¶
After determining the optimal number of clusters as input to the clustering algorithm, according to the selected scoring metric, we can proceed to visualize the results by executing the code block below.
# Continuation: answer to Question 7.
# I added this piece of code to get the correct update of the
# K-Means-processed data; it is just the particular case of the
# loop implementation above for the best-scoring cluster count.
# J.E. Rolon
clm = KMeans(n_clusters=2, random_state=0)
clm.fit(reduced_data)
preds = clm.predict(reduced_data)
centers = clm.cluster_centers_
sample_preds = clm.predict(pca_samples)
score = silhouette_score(reduced_data, preds)
print("Number of clusters = {}, Score = {}".format(2, np.round(score, 4)))
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)
Data Recovery¶
Each cluster present in the visualization above has a central point. These centers (or means) are not actual data points from the dataset, but rather the averages of all the data points predicted to be in the respective clusters.
When creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.
In the code block below, we implement the following:
- Apply the inverse transform to `centers` using `pca.inverse_transform` and assign the new centers to `log_centers`.
- Apply the inverse function of `np.log` to `log_centers` using `np.exp` and assign the true centers to `true_centers`.
# TODO: Inverse transform the centers
log_centers = pca.inverse_transform(centers)
# TODO: Exponentiate the centers
true_centers = np.exp(log_centers)
# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
Assessment 8¶
In the following we consider the total purchase cost of each product category for the representative data points above, and reference the statistical description of the dataset at the beginning of our analysis. We look specifically at the mean values for the various feature points.
Given the above, we want to answer the following question:
What set of establishments could each of the customer segments represent?
Analysis:
A customer who is assigned to `'Cluster X'` should best identify with the establishments represented by the feature set of `'Segment X'`. We want to know what each segment represents in terms of the values of the chosen feature points, and we reference these values against the mean values to get some perspective on what kind of establishment each segment represents.
# Ans to Question 8
# I generated this code cell to produce a bar plot per segment for
# the ratio (Annual Spending per Product) / (Mean Spending per Product) = ASP/MSP
# J.E. Rolon
cols = list(data.columns)
stats = data.describe()
ratios = []
for index in range(len(true_centers)):
    rlist = []
    for feat in cols:
        rvalue = float(true_centers[feat][index]) / float(stats[feat]['mean'])
        rlist.append(rvalue)
    ratios.append(rlist)
    print(np.round(rlist, 2))
groups = cols
n_groups = len(groups)
ind = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8
plt.figure(1, figsize=(16, 6))
nrws = 1
ncol = 3
for m in range(len(ratios)):
    plt.subplot(nrws, ncol, m + 1)
    plt.bar(ind, ratios[m], bar_width, alpha=opacity, color='g', label=None)
    plt.xlabel('Product Categories')
    plt.ylabel('Annual Spending Ratios (Threshold = 1.0)')
    plt.title('Segment {}'.format(m))
    plt.xticks(ind, groups, rotation='vertical')
    plt.legend(frameon=False, loc='upper right', fontsize='small')
plt.tight_layout()
plt.show()
Analysis: To aid in the answer of this question I generated a bar plot per segment, indicating the ratio (Annual Spending per Product) / (Mean Spending per Product) = ASP/MSP.
Bar heights > 1.0 indicate values above average. Bar heights < 1.0 indicate values below average.
Segment 0 (Deli Restaurant): This customer spends mostly on "Fresh" products (~0.79 MSP), "Frozen" products (~ 0.72 MSP) and "Delicatessen" (~0.51 MSP). When taking into account our results in Question 1, this segment is closer to Deli Restaurant type of establishment.
Segment 1 (Mini Market): This customer spends mostly on "Detergents_Paper" products (~1.54 MSP), "Grocery" products (~ 1.45 MSP) and "Milk" (~1.34 MSP). When taking into account our results in Question 1, this segment is closer to a Mini Market type of establishment.
Assessment 9¶
In this step we want to determine which customer segment best represents each sample point. Furthermore, we want to know whether the predictions for each sample point are consistent with this representation.
By executing the code block below we find which cluster each sample point is predicted to be.
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
Analysis:
- Sample point 0 is best represented by Segment 1.
- Sample point 1 is best represented by Segment 1.
- Sample point 2 is best represented by Segment 0.
The predictions for each sample are consistent with this result and with the results analyzed in Question 8. Note that although Sample 0 ("Cafe") was assigned to Cluster 1 along with Sample 1 ("Mini Market"), it is clear from the cluster visualization that "Cafe" lies closer to "Deli Restaurant" (located in Cluster 0) than to "Mini Market".
Unraveling Additional Structures Present in the Data¶
In this section we investigate useful ways to make use of the clustered data.
Firstly, we consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme.
Secondly, we consider how giving a label to each customer (which segment that customer belongs to) can provide for additional features about the customer data.
Finally, we compare the customer segments to a hidden variable present in the data, to see whether the clustering identified certain relationships.
A/B Testing¶
We want to run an A/B test to determine the effects of making small changes to the products or services being analyzed.
The A/B test will help us determine whether making that change will affect the customers positively or negatively.
In the case being analyzed, the wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively.
Given the above we want to answer the following question:
- How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?
Analysis:
We have seen that, atomistically, not all customers are alike. However, clustering helps identify customer segments that represent customers whose similar spending behaviors are correlated due to some underlying dynamics. With this information, we can estimate in advance which changes to the delivery schedule can benefit specific customer segments while negatively affecting others.
In the context of A/B testing, let us define:
- A : Delivery service remains the same (5d/week)
- B : Delivery service changes to (3d/week)
Let us define a conversion as the event in which a customer reacts positively. The decision to switch the service from A to B for a specific segment will depend on which delivery version produces the maximum conversion rate.
Using segmentation prior to conducting an A/B test can help us concentrate on the features that could yield the most conversions. The segmentation structure (clustering pattern) can identify which features are associated with the customers who reacted positively. Therefore, we gain a better understanding of the types of customer segments we should prioritize for the A/B tests.
In my opinion, the A/B test should be directed primarily to the customers with the largest spending volume, e.g. Segment 1 in our case (markets, retailers, ...), as this segment represents customers that depend critically on product supply chains and are most likely to be affected logistically and financially by delivery changes. Conversely, Segment 0 (cafes, bakeries, ...) will be impacted less, leading to a greater conversion rate (positive feedback) if a reduction in weekly deliveries (B) happens to benefit them logistically or financially.
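A per-segment conversion comparison like the one described above could be quantified as follows. This is only a sketch: the group split, the conversion counts and the helper `two_proportion_ztest` are hypothetical placeholders, not project data.

```python
# Sketch of a per-segment A/B comparison via a two-proportion z-test.
# All counts below are hypothetical placeholders, not project data.
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """z-statistic for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical: within one segment, customers are split into an A group
# (5 d/week) and a B group (3 d/week), and positive reactions are counted.
z = two_proportion_ztest(conv_a=30, n_a=100, conv_b=45, n_b=100)
print("z =", round(z, 3))  # |z| > 1.96 would be significant at the 5% level
```

Running the same test separately per segment is what lets the distributor roll out version B only to the segments whose conversion rate genuinely improves.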
Assessment 11¶
Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), we can consider 'customer segment' as an engineered feature for the data.
Now, the wholesale distributor recently acquired ten new customers and each provided estimates for anticipated annual spending of each product category. Knowing these estimates, the wholesale distributor wants to classify each new customer to a customer segment to determine the most appropriate delivery service.
In other words, the wholesale distributor wants to know how to label the new customers using only their estimated product spending and the customer segment data.
Analysis:
Once the cluster membership is available for each data point in the dataset (either for the original data or for the cleaned-up data, free from extreme outliers), the in-house data analytics team of the wholesale distributor can use this membership as a label. They can build the training dataset as the join between the original cleaned-up data (features) and the prediction data (labels) resulting from K-Means.
The following code snippet illustrates a simplified procedure. It trains a decision tree classifier on the cleaned-up original data joined with the labels generated by K-Means, and tests the classifier on a portion of the joined dataset as well as on the original three sample data points.
# Ans to Question 11
# I added this code cell to train a supervised learner on the original
# data and the labels generated by K-Means. It also shows the accuracy
# and F-scores on a test data subset.
# J.E. Rolon

# Recover the original cleaned-up data points from the log-transformed version
true_good_data = np.exp(good_data)

# Split this data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(true_good_data, preds, test_size=0.25, random_state=0)

# Import the classifier and the desired performance-metric modules
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import fbeta_score
from sklearn.metrics import accuracy_score

# Instantiate the classifier
dt = DecisionTreeClassifier(random_state=0)

# Train the classifier
learner = dt.fit(X_train, y_train)

# Generate predictions for the training and test datasets
predictions_train = learner.predict(X_train)
predictions_test = learner.predict(X_test)

# Compute the desired performance metrics for the predictions above
acc_train_score = accuracy_score(y_train, predictions_train)
acc_test_score = accuracy_score(y_test, predictions_test)
f_train_score = fbeta_score(y_train, predictions_train, beta=0.5)
f_test_score = fbeta_score(y_test, predictions_test, beta=0.5)

# Print the performance metrics for the test-set predictions
print("Classification Accuracy = {}, F-Score = {}".format(acc_test_score, f_test_score))
print("")

# Test the model on the sample data previously defined and used in K-Means
predictions_samples = learner.predict(samples)

# Decision-tree predictions should be consistent with the K-Means predictions
print("The segment membership for the original sample data points: {}".format(predictions_samples))
As shown by our results above, the segment membership predicted by the decision tree classifier agrees with our answer to Question 9.
Moreover, we can simulate 10 additional sample points using a separate Gaussian sample generator for each feature. Let us test the DT classifier on these 10 additional points and visualize them embedded in the scatter plot in reduced space (see below).
# Ans to Question 11
# I added this code cell to generate 10 sample data points using a
# Gaussian sample generator for each feature.
# J.E. Rolon
cols = list(data.columns)
stats_good_data = true_good_data.describe()
simdata = []
# Simulate a Gaussian process for each feature.
# The Gaussian parameters are set by the descriptive mean and std.
# Keep only positive values (a crude step, but it works as long as
# it doesn't create outliers).
for feat in cols:
    mu = stats_good_data[feat]['mean']
    sigma = stats_good_data[feat]['std']
    rs = list(np.abs(np.random.normal(mu, sigma, 10)))
    simdata.append(rs)
frame = []
for i in range(10):
    tmp = []
    for s in simdata:
        tmp.append(s[i])
    frame.append(tmp)
# Build a pandas dataframe from the Gaussian-generated samples
simdata_df = pd.DataFrame.from_records(frame, columns=cols)
print("Simulated data samples in full feature space")
display(simdata_df)
# Ans to Question 11
# I added this code cell to transform the simulated data points above,
# feed them to the PCA, and predict their segment membership using the
# DT classifier. It also creates a visualization as before.
# J.E. Rolon

# Transform the data and feed it to the PCA
log_simdata = np.log(simdata_df)
pca2 = PCA(n_components=2)
pca2.fit(good_data)
pca2_simdata = pca2.transform(log_simdata)

# Display the results of the clustering from the implementation
vs.cluster_results(reduced_data, preds, centers, pca2_simdata)

# Test the classifier model on the simulated data
predictions_simdata = learner.predict(simdata_df)

# Decision-tree predictions should be consistent with the K-Means predictions
print("Segment membership for the simulated data in reduced space:")
dfx = pd.DataFrame(np.round(pca2_simdata, 4), columns=['Dimension 1', 'Dimension 2'])
dfx = dfx.assign(Segment=pd.DataFrame(predictions_simdata).values)
display(dfx)
As shown by the visualization and classification results above, the decision tree predictions for the membership of the 10 Gaussian-generated points agree with the K-Means predictions in two dimensions.
Visualizing Underlying Distributions¶
In this step we want to reintroduce the `'Channel'` feature to the dataset, and discover whether an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.
We run the code block below to see how each data point is labeled either `'HoReCa'` (Hotel/Restaurant/Cafe) or `'Retail'` in the reduced space.
# Display the clustering results based on 'Channel' data
vs.channel_results(reduced_data, outliers, pca_samples)
Assessment 12¶
From our analysis perspective we want to answer the following questions:
How well do the clustering algorithm and the chosen number of clusters compare to the underlying distribution of Hotel/Restaurant/Cafe customers versus Retailer customers?
Are there customer segments that would be classified as purely 'Retailers' or 'Hotels/Restaurants/Cafes' by this distribution?
Are these classifications consistent with the previous definition of the customer segments?
Analysis:
There is an overall consistency between the previous customer segmentation results and those presented here. The only variation arises from considering data points located at overlapping regions between the two clusters.
I ran the previous analysis with the full dataset and found that the first two dimensions still provide most of the explained variance. Moreover, the silhouette scores changed slightly when considering the full set of underlying features and only two clusters in the K-means algorithm.
I also implemented a Gaussian mixture model instead of K-Means, and found some improvement in the silhouette score for a third cluster, indicating that a Gaussian mixture model would be slightly better than K-Means in the present situation.
There are regions within the clusters that contain data points that can be classified as purely 'Retailers' or 'Hotels/Restaurants/Cafes', but as mentioned earlier we should be careful with this assertion in regions where there is considerable overlap between the clusters.