
Customer Segmentation Analysis via Unsupervised Learning¶

Part 2¶

Juan E. Rolon, 2017.¶

Data Preprocessing¶

In this step, we preprocess the data to create a better representation of customers by scaling the data and detecting (and optionally removing) outliers.

Preprocessing data is often a critical step in ensuring that the results we obtain from our analysis are significant and meaningful.

Feature Scaling¶

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which calculates the power transformation of the data that best reduces skewness. A simpler approach, which works well in most cases, is to apply the natural logarithm.
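
As an aside, a Box-Cox transformation can be fitted per feature with scipy.stats.boxcox. The sketch below is illustrative only and assumes scipy is available; the notebook itself uses the simpler natural-logarithm scaling implemented in the next cell.

# Illustrative sketch only -- the analysis below uses np.log instead.
# Box-Cox requires strictly positive values, which holds for this spending data.
from scipy.stats import boxcox

boxcox_data = data.copy()
fitted_lambdas = {}
for feature in data.columns:
    transformed, lam = boxcox(data[feature])  # lam is the fitted power parameter
    boxcox_data[feature] = transformed
    fitted_lambdas[feature] = lam
print(fitted_lambdas)  # a lambda close to 0 indicates the fit is close to a log transform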

In the code block below, we implement the following:

  • Assigning a copy of the data to log_data after applying logarithmic scaling. We use the np.log function for this.
  • Assigning a copy of the sample data to log_samples after applying logarithmic scaling. We again use np.log.
In [73]:
#sns.reset_orig()
# TODO: Scale the data using the natural logarithm
log_data = np.log(data)

# TODO: Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Observation¶

After applying a natural logarithm scaling to the data, the distribution of each feature appears much more normal. For any pairs of features previously identified as being correlated, we observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
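
As a quick sanity check (a sketch only; the specific pair 'Grocery' and 'Detergents_Paper' is an assumed example of a pair flagged as correlated earlier), the Pearson correlation can be compared before and after the log scaling:

# Sketch: compare the correlation of an assumed feature pair before and after log scaling.
print("Raw correlation:        {:.3f}".format(data['Grocery'].corr(data['Detergents_Paper'])))
print("Log-scaled correlation: {:.3f}".format(log_data['Grocery'].corr(log_data['Detergents_Paper'])))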

We run the code below to see how the sample data has changed after having the natural logarithm applied to it.

In [74]:
# Display the log-transformed sample data
display(log_samples)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 6.822197 9.935713 9.515396 7.289611 8.831420 6.692084
1 7.836765 9.717098 10.504684 5.187386 9.496121 6.513230
2 10.350606 7.558517 8.404920 9.149316 7.775276 8.374246

Outlier Detection¶

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis.

The presence of outliers can often skew results that take these data points into consideration. Here, we use Tukey's method for identifying outliers:

  • An outlier step is calculated as 1.5 times the interquartile range (IQR).

  • A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

In the code block below, we implement the following:

  • Assigning the value of the 25th percentile for the given feature to Q1. We use np.percentile for this.

  • Assigning the value of the 75th percentile for the given feature to Q3. We again use np.percentile.

  • Assigning the calculation of an outlier step for the given feature to step.

  • Optionally we remove data points from the dataset by adding indices to the outliers list.

After this implementation, the dataset will be stored in the variable good_data.

In [13]:
import collections

# For each feature find the data points with extreme high or low values
outliers  = []
for feature in log_data.keys():
    
    # TODO: Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    
    # TODO: Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    
    # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = 1.5 * (Q3-Q1)
    
    # Display the outliers
    print "Data points considered outliers for the feature '{}':".format(feature)
    feat_outliers = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    outliers += list(feat_outliers.index.values)
    display(feat_outliers)
    
# OPTIONAL: Select the indices for data points we want to remove
common_outliers = [item for item, count in collections.Counter(outliers).items() if count > 1]
print 'Outlier data indexes common to features: {}'.format(common_outliers)

# Remove all unique outlier indices identified above (see Assessment 4 for the rationale)
outliers = list(np.unique(np.asarray(outliers)))
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
Data points considered outliers for the feature 'Fresh':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
Data points considered outliers for the feature 'Milk':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275
Data points considered outliers for the feature 'Grocery':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
Data points considered outliers for the feature 'Frozen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244
Data points considered outliers for the feature 'Detergents_Paper':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101
Data points considered outliers for the feature 'Delicatessen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918
Outlier data indexes common to features: [128, 154, 65, 66, 75]
In [12]:
#To help answering questions 3 and 4
#Generates box plots for each feature data
#before and after removal of outliers
#J.E. Rolon

#sns.reset_orig()
plt.figure(1, figsize=(16, 8))

plt.subplot(1, 2, 1)
plt.title('Box plot scaled data - includes all outliers')
log_data.boxplot(showfliers=True)
plt.ylim(0,15)

plt.subplot(1, 2, 2)
plt.title('Box plot scaled data - most outliers removed')
good_data.boxplot(showfliers=True)
plt.ylim(0,15)

plt.tight_layout()
plt.show()
In [49]:
#Display sample of common outlier data
outlier_sample = pd.DataFrame(data.loc[common_outliers], columns = data.keys())
print "Sample data of common outliers:"
display(outlier_sample)
Sample data of common outliers:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
128 140 8847 3823 142 1062 3
154 622 55 137 75 7 8
65 85 20959 45828 36 24231 1423
66 9 1534 7417 175 3468 27
75 20398 1137 3 4407 3 975

Assessment 4¶

We want to answer the following questions:

  • Are there any data points considered outliers for more than one feature based on the definition above?

  • Should these data points be removed from the dataset?

Analysis:

When data points are outliers in multiple categories, we need to think about why that may be and whether they warrant removal.

We also need to assess how K-means is affected by outliers and whether this plays a role in the analysis.

As shown in the table above, there are five data points, with indexes [128, 154, 65, 66, 75], that are outliers for more than one feature.

We can spot several of the outliers appearing below the bottom whiskers of all box-whisker plots in the preceding figure; these points align along the same horizontal line.

These common outliers are extreme, lying far beyond the IQR for most features. They should be removed, as they can skew and mislead the training process, producing longer training times, less accurate models and ultimately poorer results.

As for the remaining outliers, we may remove them as well, except perhaps for those at the boundary of the IQR interval, which could be kept as part of the observations. The final decision should come after experimenting with whether their inclusion or removal improves algorithm performance.

Now, in clustering algorithms such as K-means, outliers do not belong to any of the clusters and are typically defined as points of non-agglomerative behavior. That is, the neighborhoods of outliers are generally sparse compared to those of points in clusters, and the distance of an outlier to the nearest cluster is considerably larger than the distances among points within bona fide clusters.

This indicates that outliers, because of their larger distances from other points, tend not to merge with other points; they agglomerate far more slowly than points in actual clusters.

In K-means, this behavior would cause the algorithm, in trying to reduce the sum of squared distances from each point to its assigned cluster centroid, to sometimes choose the outliers themselves as centroids and place the remaining centroids somewhere in the middle of the rest of the data. As a consequence, the resulting clustering pattern would not be representative of the underlying distribution, but an anomalous pattern owing to the presence of outliers.
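
This sensitivity is easy to reproduce on synthetic data. The sketch below is illustrative only (it is not part of the customer dataset analysis): it fits K-means with and without a single extreme point and compares the resulting centroids.

# Illustrative sketch: K-means sensitivity to a single extreme outlier (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
clean = np.vstack([rng.normal(0, 0.5, (50, 2)),     # cluster near (0, 0)
                   rng.normal(5, 0.5, (50, 2))])    # cluster near (5, 5)
with_outlier = np.vstack([clean, [[50.0, 50.0]]])   # add one extreme point

km_clean = KMeans(n_clusters=2, random_state=0).fit(clean)
km_dirty = KMeans(n_clusters=2, random_state=0).fit(with_outlier)

print("Centroids without the outlier:\n{}".format(km_clean.cluster_centers_))
print("Centroids with the outlier:\n{}".format(km_dirty.cluster_centers_))
# With the outlier present, one centroid is pulled toward the extreme point
# and the other drifts toward the middle of the remaining data.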

Feature Transformation¶

We use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

Implementation: PCA¶

Now that the data has been scaled to a more normal distribution and has had the necessary outliers removed, we can apply PCA to good_data to discover which dimensions of the data best maximize the variance of the features involved.

In addition to finding these dimensions, PCA yields the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone.

Furthermore, a component (dimension) from PCA can be considered a new "feature", which in fact is a linear combination of the original features present in the data.

In the code block below, we implement the following:

  • Importing sklearn.decomposition.PCA and assigning the results of fitting PCA in six dimensions with good_data to pca.
  • Applying a PCA transformation of log_samples using pca.transform, and assign the results to pca_samples.
In [75]:
from sklearn.decomposition import PCA
# TODO: Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA(n_components=data.shape[1])
pca.fit(good_data)

# TODO: Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
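
To make the linear-combination point above concrete, the optional sketch below (not part of the original notebook) verifies that a PCA score is simply the dot product of the mean-centered features with the component loadings:

# Optional check: a PCA score is a linear combination of the mean-centered features.
first_point = good_data.values[0]
manual_scores = np.dot(pca.components_, first_point - pca.mean_)
print(manual_scores)
print(pca.transform(good_data)[0])  # should match the manually computed scores
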
In [19]:
#I added this code cell to complement the analysis based on the plots above
#J.E. Rolon

print "Cumulative sum of of explained variance by dimension:"
print pca_results['Explained Variance'].cumsum()
print""
print "PCA detailed results:"
print pca_results
Cumulative sum of explained variance by dimension:
Dimension 1    0.4993
Dimension 2    0.7252
Dimension 3    0.8301
Dimension 4    0.9279
Dimension 5    0.9767
Dimension 6    1.0000
Name: Explained Variance, dtype: float64

PCA detailed results:
             Explained Variance   Fresh    Milk  Grocery  Frozen  \
Dimension 1              0.4993 -0.0976  0.4109   0.4511 -0.1280   
Dimension 2              0.2259  0.6008  0.1370   0.0852  0.6300   
Dimension 3              0.1049 -0.7452  0.1544  -0.0204  0.2670   
Dimension 4              0.0978  0.2667  0.1375   0.0710 -0.7133   
Dimension 5              0.0488  0.0114  0.7083   0.3168  0.0671   
Dimension 6              0.0233 -0.0543 -0.5177   0.8267  0.0471   

             Detergents_Paper  Delicatessen  
Dimension 1            0.7595        0.1579  
Dimension 2           -0.0376        0.4634  
Dimension 3           -0.2349        0.5422  
Dimension 4           -0.3157        0.5445  
Dimension 5           -0.4729       -0.4120  
Dimension 6           -0.2080       -0.0094  
In [34]:
#I added this code cell to generate a pair of heat maps
#illustrating feature loadings correlation matrix (left map)
#and variance percentages by feature (right map)
#J.E. Rolon

import seaborn as sns
cols = list(pca_results.columns)[1:]
rows = list(pca_results.index)
pca_matrix = pca.components_
pca_squared_matrix = np.square(pca.components_)

plt.figure(1, figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.set(font_scale=1.2)
heat_map = sns.heatmap(pca_matrix, cbar=True, annot=True, square=True, fmt='.4f',
                       annot_kws = {'size': 12}, yticklabels=rows, xticklabels=cols)
plt.title("Feature loadings correlation matrix")
plt.xticks(rotation='vertical')
plt.yticks(rotation='horizontal')

plt.subplot(1, 2, 2)
sns.set(font_scale=1.2)
heat_map = sns.heatmap(pca_squared_matrix, cmap="YlGnBu", cbar=True, annot=True, square=True, fmt='.4f',
                       annot_kws={'size': 12}, yticklabels=rows, xticklabels=cols)
plt.title("Variance percentages explained by feature")
plt.xticks(rotation='vertical')
plt.yticks(rotation='horizontal')

plt.tight_layout()
plt.show()

Assessment 5¶

In this assessment, we answer the following questions:

  • How much variance in the data is explained in total by the first and second principal components?

  • How much variance in the data is explained by the first four principal components?

Analysis:

With the visualization provided above, we can assess each dimension and its cumulative explained variance. We can also determine which features are well represented by each dimension (in terms of both positive and negative weights).

A positive increase in a specific dimension corresponds with an increase of the positive-weighted features and a decrease of the negative-weighted features. The rate of increase or decrease is based on the individual feature weights.
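
As a concrete (hypothetical) illustration of this statement, using the loadings reported above: increasing log "Detergents_Paper" spending by one unit, with the other features held fixed, shifts a customer's score along each dimension by the corresponding weight.

# Sketch: effect of a unit change in one log-scaled feature on the PCA scores.
# 'Detergents_Paper' is chosen only as an illustrative example.
unit_change = np.zeros(good_data.shape[1])
unit_change[list(good_data.columns).index('Detergents_Paper')] = 1.0
print(np.dot(pca.components_, unit_change))  # ~ +0.76 on Dimension 1, ~ -0.04 on Dimension 2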

The total explained variance contributed by the first and second principal components is 0.7252, or 72.52%.

The total explained variance contributed by the first four principal components is 0.9279, or 92.79%.

According to the analysis below, the first four dimensions indicate that spending behavior is dominated by the features with the highest correlations to these dimensions:

  • "Detergents_Paper"
  • "Frozen"
  • "Fresh"
  • "Delicatessen"

The above features appear to be the most important features relevant to spending behavior.

Let us now discuss the results of the bar plots and heat maps given above:

Dimension 1 (PC1)

~49.93% of the dataset variance lies along this axis, the largest single contribution, making it the first principal component (PC1). This contribution is further split among the individual features. As shown in the bar plots and the pair of heat maps above, "Detergents_Paper" is the feature with the highest correlation (0.7595) with PC1 and explains 57.69% of the variance along PC1.

"Milk" and "Grocery" have moderate correlations with PC1 (0.41 and 0.45, respectively), explaining ~16.8% and ~20.3% of the variance along PC1.

"Frozen" and "Fresh" are weakly and negatively correlated with PC1 (-0.12 and -0.09, respectively), while "Delicatessen" is weakly correlated with PC1 (0.16). Each of these explains only about 1-2% of the variance along PC1.

Dimension 2 (PC2)

~22.59% of the dataset variance lies along this axis; it corresponds to PC2. Here "Frozen" has the highest correlation with PC2 (0.63), followed by "Fresh" (0.60) and "Delicatessen" (0.46). These features explain ~39.7%, ~36% and ~21.5% of the variance along PC2, respectively. Interestingly, in this group "Detergents_Paper" is nearly uncorrelated with this axis (corr. coeff. ~ -0.03), explaining just ~0.14% of the variance along it.

Dimension 3 (PC3)

~10.49% of the dataset variance lies along this axis; it corresponds to PC3. Here, "Fresh" has the largest (and negative) correlation with PC3 (-0.745), explaining ~55.5% of the variance along PC3. "Delicatessen" is positively correlated with PC3 (0.54) and explains ~29.4% of the variance along PC3. "Frozen" is weakly to moderately and positively correlated with PC3 (~0.27) and explains ~7% of the variance along PC3. The remaining features are weakly correlated with PC3 and explain little of the variance.

Dimension 4 (PC4)

~9.78% of the dataset variance lies along this axis; it corresponds to PC4. Here, "Frozen" has the largest (negative) correlation with PC4 (-0.71) and explains ~50.8% of the variance along this axis. "Delicatessen" correlates positively with PC4 (0.54) and explains ~29.6% of the variance along this axis, while "Fresh" is weakly to moderately and positively correlated with PC4 (~0.27) and explains ~7.1% of the variance along it. The remaining features are weakly correlated with PC4 and explain little of the variance.

Observation¶

We run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions.

We assess the numerical values for the first four dimensions of the sample points and consider whether they are consistent with our initial interpretation of the sample points.

In [76]:
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
Dimension 1 Dimension 2 Dimension 3 Dimension 4 Dimension 5 Dimension 6
0 2.9993 -1.1418 1.2547 -0.8408 0.6775 -0.3489
1 4.0025 -1.9102 -0.3696 0.6622 0.4659 0.2915
2 0.4026 2.5486 -0.0626 -0.3826 -1.3865 0.0638

Dimensionality Reduction¶

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained.

Because of the above, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem.

Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
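
A minimal sketch of how this choice could be automated using the cumulative explained variance ratio (the 90% threshold below is an arbitrary example, not a value used in this analysis):

# Sketch: choose the smallest number of components reaching a target cumulative variance.
full_pca = PCA(n_components=good_data.shape[1]).fit(good_data)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.90)) + 1  # 4 dimensions for this dataset (0.9279 cumulative)
print("Components needed for 90% of the variance: {}".format(n_needed))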

In the code block below, we implement the following:

  • Assigning the results of fitting PCA in two dimensions with good_data to pca.

  • Applying a PCA transformation of good_data using pca.transform, and assign the results to reduced_data.

  • Applying a PCA transformation of log_samples using pca.transform, and assign the results to pca_samples.
In [78]:
# TODO: Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2)
pca.fit(good_data)

# TODO: Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)

# TODO: Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

Observation¶

We run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions.

We observe that the values for the first two dimensions remain unchanged when compared to the PCA transformation in six dimensions.
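
This can be verified numerically. The sketch below assumes the six-dimensional sample scores from the earlier cell were saved in a separate variable (pca_samples_6d is a hypothetical name, since pca_samples is overwritten by the two-dimensional fit):

# Sketch: the first two columns of the 6-D projection should equal the 2-D projection.
# pca_samples_6d is a hypothetical copy saved from the six-component transform above.
print(np.allclose(pca_samples_6d[:, :2], pca_samples))  # expected: True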

In [79]:
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
Dimension 1 Dimension 2
0 2.9993 -1.1418
1 4.0025 -1.9102
2 0.4026 2.5486

CONTINUE TO PART 3
