Machine Learning with PySpark
Feature Ranking using Random Forest Regressor
- At minimum, a Community Edition account with Databricks.
- Once the above is done, configure the cluster settings, choosing a suitable Databricks Runtime Version.
- Combined Cycle Power Plant data set from UC Irvine site
- Read my previous posts on feature selection and linear regression, since this one builds on both.
This post builds on the previous two posts about machine learning with PySpark (see the links above).
Feature ranking resembles feature selection to some extent: by ordering features from the most influential (the one that explains the most variability in the model) to the least influential, one can choose to discard the latter without impacting the final result too much.
Once again, when you eliminate features from your dataset using one or both of these methods, it is important to apply domain knowledge and not rely solely on the statistical and/or mathematical results.
With that disclaimer in mind, we'll look at how to rank features using a Random Forest Regressor and PySpark. The dataset is the same one used in the previous two posts (see the links above).
We'll be using a Databricks notebook, and steps 1 through 7 from my first blog post on machine learning with PySpark are the same.
- We have to import some extra libraries
```python
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
import pandas as pd
```
- And prepare the data
```python
features = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"]
lr_data = data.select(col("energy_output").alias("label"), *features)
lr_data.printSchema()
```
- Next we'll prepare the data pipeline for our random forest regressor
```python
vector = VectorAssembler(inputCols=features, outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
rfr = RandomForestRegressor(labelCol="label", featuresCol="scaled_features")
stages = [vector, scaler, rfr]
pipe = Pipeline(stages=stages)
```
These are all numerical features, so the `StandardScaler` and the `VectorAssembler` are enough to build our pipeline. However, if we had categorical data in our dataset, we would also have to use a `OneHotEncoder`. We'll see an example of that in a future post.
- Now we can build a parameter grid and decide on the `R^2` metric to evaluate our random forest results. Again, domain knowledge is important when we choose the evaluation metric; sometimes `MAE` (mean absolute error) may produce better results.
```python
estimatorParam = ParamGridBuilder() \
    .addGrid(rfr.maxDepth, [4, 6, 8]) \
    .addGrid(rfr.maxBins, [5, 10, 20, 40]) \
    .addGrid(rfr.impurity, ["variance"]) \
    .build()

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
```
- The next step is to put everything together and run the cross-validation to find the best model among all the models from the grid.
```python
crossval = CrossValidator(estimator=pipe,
                          estimatorParamMaps=estimatorParam,
                          evaluator=evaluator,
                          numFolds=3)
cvmodel = crossval.fit(lr_data)
```
This part will take a bit of time (probably 5 minutes or so), since the cross-validator has to fit and evaluate a model for every parameter combination in the grid, on every fold.
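To see why, we can count the fits implied by the grid above; a quick back-of-the-envelope check:

```python
# Values taken from the ParamGridBuilder above.
max_depth_values = [4, 6, 8]
max_bins_values = [5, 10, 20, 40]
impurity_values = ["variance"]
num_folds = 3

# One model per combination of grid values, trained once per fold.
grid_size = len(max_depth_values) * len(max_bins_values) * len(impurity_values)
total_fits = grid_size * num_folds
print(grid_size, total_fits)  # 12 combinations, 36 fits
```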
- We are done. We can now ask the best model to show us which features are the most important.
```python
model = pd.DataFrame(cvmodel.bestModel.stages[-1].featureImportances.toArray(), columns=["values"])
features_col = pd.Series(features)
model["features"] = features_col
model
```
The output of the above code will look something like:
The feature importance values should sum to 1: obviously, all features combined cannot account for more than 100% of the variability. If they do not add up to 1, something is wrong in the pipeline.
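That check is easy to run on the extracted importances; a sketch with illustrative values (made up here, shaped like this dataset's four features):

```python
# Illustrative importance vector for the four features; real values come
# from cvmodel.bestModel.stages[-1].featureImportances.toArray().
importances = [0.73, 0.18, 0.05, 0.04]

total = sum(importances)
# Allow for floating-point rounding when comparing against 1.
assert abs(total - 1.0) < 1e-6, "importances should sum to 1"
```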
- Let's use the embedded plotting feature of Databricks notebook to show in a nicer way the ranking of the features.
```python
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

new_schema = StructType([
    StructField("values", DoubleType(), False),
    StructField("features", StringType(), False)
])
feature_importance = spark.createDataFrame(model, schema=new_schema)
display(feature_importance.orderBy("values", ascending=False))
```
This is what we get. Choose Bar plot from the drop-down options.
`temperature` is the feature with the highest rank; it explains about 73% of the variability of the model. Next is `exhaust_vacuum` with 18%, and so on.
It is up to the domain expert to decide whether the remaining features should be included in the model, since combined they account for less than 10%.
From this point onward, follow the steps 9 to 13 from the first post on machine learning with PySpark.
If you play a bit with the features, you can get results that look like the table below, where:

- T = temperature
- V = exhaust_vacuum
- AP = ambient_pressure
- RH = relative_humidity
Obviously we want to keep the first two features (`temperature` and `exhaust_vacuum`) in the model, because they are ranked the highest. Also, from our previous post on feature selection, we can see that they have the highest correlation with the label too.
As for the rest of the features, we can try to remove/add them and see how our model changes.
This is a small dataset with very few features, so we can keep all of them. But when we have hundreds of features and lots of data points, it makes sense to use feature selection and/or feature ranking to slim down our analysis a bit. Oh, and don't forget the domain knowledge.
> ==Disclaimer==: This is by no means an original work; it is merely meant to serve as a compilation of thoughts, code snippets, and teachings from different sources.