Machine Learning with PySpark - Feature Selection using PCC


Machine Learning with PySpark

Feature Selection using Pearson correlation coefficient


For the most part, this post will be based on my previous post about linear regression with spark. However, we'll be adding a very important step to our model building, that is feature selection.

In this post, we'll focus on selecting the features based on their correlation with the label data. To achieve that, we'll be using Pearson correlation coefficient.

Again, we'll be using Databrick's notebook, so make sure you follow the steps 1 through 7 from the previous blog.

  1. On top of the previous imports, we need some extra libraries.
from import Correlation
import pandas as pd
  1. Now we should be ready to prepare our data for correlation checking.
# prepare the data
features = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"]
lr_data ="energy_output").alias("label"), *features).dropna()

columns = lr_data.columns
  1. Once again the beautiful part of Spark pipelines. We will put the whole data set into one vector and then we will scale all numerical values.
vector = VectorAssembler(inputCols=columns, outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

stages = [vector, scaler]

pipe = Pipeline(stages=stages)

# we'll be using this data frame
data_for_correlation ="scaled_features")

Remember that each stage's output should be the input for the next stage. The column scaled_features will contain all the data that we need for our correlation evaluation.

  1. The correlation step
correlation = Correlation.corr(data_for_correlation, "scaled_features", "pearson").collect()[0][0].toArray()

 # rename _1, _2 ... columns to their original name
df = pd.DataFrame(correlation)
df["features"] = pd.Series(columns)

 # let's see the results
display(spark.createDataFrame(df, schema=columns))

To display this nicely, chose a pivot table from under cell and customize it as shown below.


The label was our energy_output column. As once can see, there is a strong negative correlation between the label and temperature and exhaust_vacuum. And a positive correlation with the ambient_pressure and relative_humidity.

The closer to 0 (from either + or -) the correlation value, the less significant that feature is. The interpretation of the exact values is up for grabs, but normally, values between -0.2 and +0.2 are showing no correlation.

It goes without saying that domain expertise is needed when evaluating which features to keep and which features to eliminate based on this method. In this case, it kind'of makes sense that the colder the temperature, the more energy output required.

From this point on, the regular ML procedure can be followed (or step 9 and onward from the previous blog).

For the following features features = ["temperature", "exhaust_vacuum", "ambient_pressure"], our model will have an R^2 value of 0.918

While for the following features features = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"], the R^2 value will be 0.929, which is slightly better. So we chose to keep all the features in this case.

[+] Useful links
  • [Machine Learning with PySpark - Linear Regression](
  • [Feature Selection](

  • > ==Disclaimer==: This is by no means an original work it is merely meant to serve as a compilation of thoughts, code snippets and teachings from different sources.