## Data Preprocessing¶

In this step, we preprocess the data to create a better representation of customers by scaling the data and detecting (and optionally removing) outliers.

Preprocessing data is often a critical step in ensuring that the results we obtain from our analysis are significant and meaningful.

### Feature Scaling¶

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach, which works well in most cases, is applying the natural logarithm.
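As an illustrative sketch (using synthetic lognormal data, not the customer dataset), the two approaches can be compared via `scipy.stats`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)   # right-skewed, positive values

log_scaled = np.log(skewed)                   # simple natural-log scaling
boxcox_scaled, lmbda = stats.boxcox(skewed)   # Box-Cox fits the best power transform

print("raw skewness:    ", stats.skew(skewed))
print("log skewness:    ", stats.skew(log_scaled))
print("Box-Cox skewness:", stats.skew(boxcox_scaled), "(lambda = {:.3f})".format(lmbda))
```

For lognormal data the fitted Box-Cox exponent lands near zero, where the transformation reduces to the natural logarithm — which is why the simpler `np.log` is often sufficient.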

In the code block below, we implement the following:

- Assign a copy of the data to `log_data` after applying logarithmic scaling, using the `np.log` function.
- Assign a copy of the sample data to `log_samples` after applying logarithmic scaling, again using `np.log`.

```
# TODO: Scale the data using the natural logarithm
log_data = np.log(data)

# TODO: Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
```

### Observation¶

After applying a natural logarithm scaling to the data, the distribution of each feature appears much more normal. For any pairs of features previously identified as correlated, we observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
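A quick way to check whether a correlation survives the transformation is to compare `corr` before and after `np.log`. The sketch below uses synthetic stand-ins for two correlated spending features (the column names are borrowed from the dataset, but the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two correlated, right-skewed spending features
rng = np.random.default_rng(1)
base = rng.lognormal(mean=4.0, sigma=0.8, size=500)
df = pd.DataFrame({
    "Grocery": base,
    "Detergents_Paper": base * rng.lognormal(mean=0.0, sigma=0.3, size=500),
})

print("correlation before log:", df["Grocery"].corr(df["Detergents_Paper"]))
log_df = np.log(df)
print("correlation after log: ", log_df["Grocery"].corr(log_df["Detergents_Paper"]))
```

A multiplicative relationship between features becomes additive after the logarithm, so a genuine association typically persists (and often strengthens) in the transformed data.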

We run the code below to see how the sample data has changed after having the natural logarithm applied to it.

```
# Display the log-transformed sample data
display(log_samples)
```

### Outlier Detection¶

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis.

The presence of outliers can often skew results which take these data points into consideration. Here, we use Tukey's method for identifying outliers:

- An *outlier step* is calculated as 1.5 times the interquartile range (IQR).
- A data point with a feature value that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

In the code block below, we implement the following:

- Assign the value of the 25th percentile for the given feature to `Q1`, using `np.percentile`.
- Assign the value of the 75th percentile for the given feature to `Q3`, again using `np.percentile`.
- Assign the calculation of an outlier step for the given feature to `step`.
- Optionally remove data points from the dataset by adding indices to the `outliers` list.

After this implementation, the dataset will be stored in the variable `good_data`.

```
import collections

# For each feature, find the data points with extreme high or low values
outliers = []
for feature in log_data.keys():
    # TODO: Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    # TODO: Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = 1.5 * (Q3 - Q1)
    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    feat_outliers = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    outliers += list(feat_outliers.index.values)
    display(feat_outliers)

# OPTIONAL: Select the indices for data points we want to remove
common_outliers = [item for item, count in collections.Counter(outliers).items() if count > 1]
print('Outlier data indexes common to features: {}'.format(common_outliers))

# Remove the outliers, if any were specified
outliers = list(np.unique(np.asarray(outliers)))
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
```

```
# To help answer questions 3 and 4:
# generates box plots for each feature
# before and after removal of outliers
# J.E. Rolon
plt.figure(1, figsize=(16, 8))
plt.subplot(1, 2, 1)
plt.title('Box plot scaled data - includes all outliers')
log_data.boxplot(showfliers=True)
plt.ylim(0, 15)
plt.subplot(1, 2, 2)
plt.title('Box plot scaled data - most outliers removed')
good_data.boxplot(showfliers=True)
plt.ylim(0, 15)
plt.tight_layout()
plt.show()
```

```
# Display sample of common outlier data
outlier_sample = pd.DataFrame(data.loc[common_outliers], columns = data.keys())
print("Sample data of common outliers:")
display(outlier_sample)
```

### Assessment 4¶

We want to answer the following questions:

- Are there any data points considered outliers for more than one feature based on the definition above?
- Should these data points be removed from the dataset?

When data points are outliers in multiple categories, we need to consider why that may be and whether they warrant removal. We also need to assess how k-means is affected by outliers and whether this plays a factor in the analysis.

**Analysis:**
As shown in the table above, there are five data points with indexes [128, 154, 65, 66, 75] which are outliers common to more than one feature.

We can spot several of the outliers appearing below the bottom whiskers of all box-whisker plots in the preceding figure; these points align along the same horizontal line.

These **common outliers** are extreme, lying far beyond the IQR of most features. **They definitely need to be removed**, as they can skew and mislead the training process, producing longer training times, less accurate models, and ultimately poorer results.

As for the remaining outliers, we may as well remove them, except perhaps for those at the boundary of the IQR interval, which could be kept as part of the observations. The final decision should come after experimenting with whether their inclusion or removal improves algorithm performance.

Now, in clustering algorithms such as K-means, outliers do not belong to any of the clusters and are typically defined to be points of non-agglomerative behavior. That is, the neighborhoods of outliers are generally sparse compared to points in clusters, and the distance of an outlier to the nearest cluster is comparatively higher than the distances among points in bonafide clusters themselves.

The above indicates that outliers, due to their larger distances from other points, tend to merge with other points at a much slower rate than points in actual clusters do.

In K-means, the above behavior would cause the algorithm, in trying to reduce the sum of squared distances from each point to its assigned cluster centroid, to sometimes choose the outliers themselves as centroids and place the other centroids somewhere in the middle of the remaining data. As a consequence, the resulting clustering pattern would not be representative of the underlying distribution, but an anomalous pattern owing to the presence of outliers.
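This centroid-dragging effect is easy to demonstrate on toy data (the cluster locations and outlier position below are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight, well-separated clusters plus one extreme outlier
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(50, 2))
outlier = np.array([[50.0, 50.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# The single outlier captures a centroid of its own, while the two real
# clusters are merged under the remaining centroid near (0.5, 0.5)
print(km.cluster_centers_)
```

Because a lone far-away point contributes an enormous squared distance, the minimum-SSE solution dedicates one centroid to it, and the two genuine clusters are no longer separated.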

## Feature Transformation¶

We use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

### Implementation: PCA¶

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can apply PCA to `good_data` to discover which dimensions best maximize the variance of the features involved.

In addition to finding these dimensions, PCA yields the *explained variance ratio* of each dimension — how much variance within the data is explained by that dimension alone.

Furthermore, a component (dimension) from PCA can be considered a new "feature", which in fact is a linear combination of the original features present in the data.
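That a component score really is such a linear combination can be verified directly: scikit-learn's `transform` is equivalent to projecting the mean-centered data onto the component loadings. A sketch on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # random stand-in for a 4-feature dataset

pca = PCA(n_components=4).fit(X)
scores = pca.transform(X)

# Each component score is a linear combination of the mean-centered original
# features, weighted by that component's loadings (rows of pca.components_)
manual_scores = (X - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual_scores))  # -> True
```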

In the code block below, we implement the following:

- Import `sklearn.decomposition.PCA` and assign the results of fitting PCA in six dimensions with `good_data` to `pca`.
- Apply a PCA transformation of `log_samples` using `pca.transform`, and assign the results to `pca_samples`.

```
from sklearn.decomposition import PCA
# TODO: Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA(n_components=data.shape[1])
pca.fit(good_data)
# TODO: Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
```

```
# Complements the analysis based on the plots above
# J.E. Rolon
print("Cumulative sum of explained variance by dimension:")
print(pca_results['Explained Variance'].cumsum())
print("")
print("PCA detailed results:")
print(pca_results)
```

```
#I added this code cell to generate a pair of heat maps
#illustrating feature loadings correlation matrix (left map)
#and variance percentages by feature (right map)
#J.E. Rolon
import seaborn as sns
cols = list(pca_results.columns)[1:]
rows = list(pca_results.index)
pca_matrix = pca.components_
pca_squared_matrix = np.square(pca.components_)
plt.figure(1, figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.set(font_scale=1.2)
heat_map = sns.heatmap(pca_matrix, cbar=True, annot=True, square=True, fmt='.4f',
annot_kws = {'size': 12}, yticklabels=rows, xticklabels=cols)
plt.title("Feature loadings correlation matrix")
plt.xticks(rotation='vertical')
plt.yticks(rotation='horizontal')
plt.subplot(1, 2, 2)
sns.set(font_scale=1.2)
heat_map = sns.heatmap(pca_squared_matrix, cmap="YlGnBu", cbar=True, annot=True, square=True, fmt='.4f',
annot_kws={'size': 12}, yticklabels=rows, xticklabels=cols)
plt.title("Variance percentages explained by feature")
plt.xticks(rotation='vertical')
plt.yticks(rotation='horizontal')
plt.tight_layout()
plt.show()
```

### Assessment 5¶

In this assessment, we answer the following questions:

- How much variance in the data is explained **in total** by the first and second principal components?
- How much variance in the data is explained by the first four principal components?

**Analysis:**

With the visualization provided above, we can assess each dimension and its cumulative explained variance. We can also determine which features are well represented by each dimension (both in terms of positive and negative explained variance).

A positive increase in a specific dimension corresponds with an *increase* of the *positive-weighted* features and a *decrease* of the *negative-weighted* features. The rate of increase or decrease is based on the individual feature weights.

**Analysis:**
The total explained variance contributed by the first and second principal components is 0.7252 or 72.52%.

The total explained variance contributed by the first four principal components is 0.9279 or 92.79%.

According to the analysis below, the first four dimensions indicate that spending behavior is dominated by the features with the highest correlations to these dimensions:

- "Detergents_Paper"
- "Frozen"
- "Fresh"
- "Delicatessen"

The above features appear to be the most important features relevant to spending behavior.

Let us now discuss the results of the bar plots and heat maps given above:

**Dimension 1 (PC1)**

~49.93% of the dataset variance lies along this axis which is the largest contribution to the variance, making it the first principal component (PC1). This contribution is further split in variations among the several features. As shown in the bar plots and the pair of heat maps above, *"Detergents_Paper"* is the feature that has the highest correlation (0.7595) with PC1 and explains 57.69% of the variance along PC1.

"Milk" and "Grocery" have moderate correlations with PC1 (0.41, 0.45, respectively) with ~16.8% and ~20.3% explained var. percentages along PC1.

"Frozen" and "Fresh" are weakly and negatively correlated (-0.12, -0.09, respectively) to PC1, while "Delicatessen" is weakly correlated with PC1. They explain each just about ~1% of variance along PC1.

**Dimension 2 (PC2)**

~22.59% of the dataset variance lies along this axis; it corresponds to PC2. Here *"Frozen"* has the highest correlation with PC2 (0.63), followed by "Fresh" (0.60) and "Delicatessen" (0.46). These features explain ~39.7%, ~36%, and ~21.5% of the variance along PC2, respectively. Interestingly, in this group "Detergents_Paper" is nearly uncorrelated with this axis (corr. coeff. ~ -0.03), explaining just ~0.14% of the variance along it.

**Dimension 3 (PC3)**

~10.49% of the dataset variance lies along this axis; it corresponds to PC3. Here, *"Fresh"* has the largest (and negative) correlation with PC3 (-0.745), explaining ~55.5% of the variance along PC3. "Delicatessen" is positively correlated (0.54) with PC3 and explains ~29.4% of the variance along PC3. "Frozen" is weakly-moderately and negatively correlated (~-0.26) with PC3 and explains ~7% of the variance along PC3. The remaining features are weakly correlated with PC3 and explain little of the variance.

**Dimension 4 (PC4)**

~9.7% of the dataset variance lies along this axis; it corresponds to PC4. Here, *"Frozen"* has the largest (negative) correlation with PC4 (-0.71) and explains 50.8% of the variance along this axis. "Delicatessen" correlates positively with PC4 (0.54) and explains 29.6% of the variance along this axis, while "Fresh" is weakly-moderately and positively correlated (~0.26) with PC4 and explains 7.1% of the variance. The remaining features are weakly correlated with PC4 and explain little of the variance.

### Observation¶

We run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions.

We assess the numerical value for the first four dimensions of the sample points, and consider whether this is consistent with the initial interpretation of the sample points.

```
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
```

### Dimensionality Reduction¶

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained.

Because of the above, the *cumulative explained variance ratio* is extremely important for knowing how many dimensions are necessary for the problem.

Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
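As a sketch of both ideas on synthetic low-rank data (the mixing setup below is made up), the cumulative ratio can be computed with `np.cumsum`, and scikit-learn can also select the number of components for a target ratio directly when `n_components` is a float between 0 and 1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic six-feature data in which two latent directions carry
# almost all of the variance
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("cumulative explained variance:", np.round(cumulative, 4))

# scikit-learn can choose the dimensionality for a target ratio: a float
# n_components keeps just enough components to reach that fraction
pca_90 = PCA(n_components=0.90).fit(X)
print("components needed for 90% variance:", pca_90.n_components_)
```

Reading off where the cumulative curve crosses the desired threshold is exactly the decision we make for the wholesale data below.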

In the code block below, we implement the following:

- Assign the results of fitting PCA in two dimensions with `good_data` to `pca`.
- Apply a PCA transformation of `good_data` using `pca.transform`, and assign the results to `reduced_data`.
- Apply a PCA transformation of `log_samples` using `pca.transform`, and assign the results to `pca_samples`.

```
# TODO: Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2)
pca.fit(good_data)
# TODO: Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)
# TODO: Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
```

### Observation¶

We run the code below to determine how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions.

We observe that the values for the first two dimensions remain unchanged compared to a PCA transformation in six dimensions.

```
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
```
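This consistency is a general property of PCA: the leading components of a full fit coincide with those of a lower-dimensional fit on the same data. A sketch on random stand-in data (comparing absolute values to sidestep any per-component sign ambiguity between fits):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))           # random stand-in for the six-feature data

full = PCA(n_components=6).fit(X)
two = PCA(n_components=2).fit(X)

# The first two columns of the six-dimensional projection coincide with the
# two-dimensional projection of the same data
a = full.transform(X)[:, :2]
b = two.transform(X)
print(np.allclose(np.abs(a), np.abs(b)))
```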