Machine Learning with PySpark
Feature Selection using Pearson correlation coefficient
- At the minimum, a Community Edition account with Databricks.
- Once the above is done, configure the cluster's Databricks Runtime Version to
- The Combined Cycle Power Plant data set from the UC Irvine Machine Learning Repository.
- Read my previous post, since we build on it.
In this post, we'll focus on selecting features based on their correlation with the label data. To achieve that, we'll be using the Pearson correlation coefficient.
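As a quick refresher, Pearson's r is the covariance of two variables divided by the product of their standard deviations, and it always falls between -1 and +1. Here is a toy illustration in plain Python (not part of the pipeline, just to fix the intuition):

import math

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 8.0, 6.0, 4.0]  # y decreases as x increases

mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
print(cov / (std_x * std_y))  # -1.0: a perfect negative correlation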
- On top of the previous imports, we need some extra libraries.
from pyspark.ml.stat import Correlation
import pandas as pd
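For reference, these are the imports carried over from the previous post that the snippets below rely on (assuming you followed that setup):

from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler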
- Now we should be ready to prepare our data for correlation checking.
# prepare the data
features = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"]
lr_data = data.select(col("energy_output").alias("label"), *features).dropna()
columns = lr_data.columns
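The data frame named data comes from the previous post. If you're starting fresh, a sketch like the following would recreate it; the file path is an assumption based on a typical Databricks upload, and it presumes the columns were already renamed (temperature, exhaust_vacuum, and so on) as in the previous post:

# hypothetical path: adjust to wherever you uploaded the data set
data = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/FileStore/tables/power_plant.csv"))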
- Once again, the beautiful part of Spark pipelines: we will put the whole data set into one vector and then scale all numerical values.
vector = VectorAssembler(inputCols=columns, outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
stages = [vector, scaler]
pipe = Pipeline(stages=stages)
# we'll be using this data frame
data_for_correlation = pipe.fit(lr_data).transform(lr_data).select("scaled_features")
Remember that each stage's output should be the input for the next stage. The scaled_features column will contain all the data that we need for our correlation evaluation.
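As an optional sanity check, you can peek at a few rows of the resulting data frame:

# each row holds a single dense vector with the scaled label and features
data_for_correlation.show(3, truncate=False)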
- The correlation step
# compute the Pearson correlation matrix on the scaled features
correlation = Correlation.corr(data_for_correlation, "scaled_features", "pearson").collect()[0][0].toArray()
# label the matrix with the original column names instead of the default _1, _2, ...
df = pd.DataFrame(correlation, columns=columns)
df["features"] = pd.Series(columns)
# let's see the results
display(spark.createDataFrame(df))
To display this nicely, choose a pivot table from the plot options under the result cell and customize it.
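If you prefer a plain textual view over the pivot table, a small pandas snippet like this (just a convenience on top of the df built above) pulls out the label's row of the matrix:

# correlations of every feature against the label, strongest negatives first
label_corr = df.set_index("features").loc["label"].drop("label")
print(label_corr.sort_values())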
The label was our energy_output column. As one can see, there is a strong negative correlation between the label and exhaust_vacuum, and a positive correlation with ambient_pressure and relative_humidity. The closer the correlation value is to 0 (from either the positive or the negative side), the less significant that feature is. The interpretation of the exact values is up for grabs, but normally, values between -0.2 and +0.2 show no correlation.
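As a sketch of turning that rule of thumb into code (the 0.2 cutoff is just the heuristic above, and label_corr is the pandas Series from the earlier snippet):

# keep only the features whose correlation with the label clears the threshold
threshold = 0.2
selected = [f for f in features if abs(label_corr[f]) > threshold]
print(selected)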
It goes without saying that domain expertise is needed when evaluating which features to keep and which to eliminate based on this method. In this case, it kind of makes sense that the colder the temperature, the more energy output required.
From this point on, the regular ML procedure can be followed (or step 9 and onward from the previous blog).
For the following features features = ["temperature", "exhaust_vacuum", "ambient_pressure"], our model will have an R^2 value of 0.918.
While for the following features features = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"], the R^2 value will be 0.929, which is slightly better. So we chose to keep all the features in this case.
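If you want to reproduce those numbers yourself, here is a condensed sketch of that procedure; the train/test split ratio and seed are assumptions, so your exact values may differ slightly:

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

def r2_for(feature_subset):
    # assemble and scale only the chosen features, then fit a plain linear regression
    assembler = VectorAssembler(inputCols=feature_subset, outputCol="features")
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
    lr = LinearRegression(featuresCol="scaled_features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, scaler, lr])
    train, test = lr_data.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)
    evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
    return evaluator.evaluate(model.transform(test))

print(r2_for(["temperature", "exhaust_vacuum", "ambient_pressure"]))
print(r2_for(features))  # all four features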
> Disclaimer: This is by no means an original work; it is merely meant to serve as a compilation of thoughts, code snippets, and teachings from different sources.