Machine Learning with Spark (3/5) - model validation

Validating a model

Prerequisites:

  • Spark 2.0.0 or higher, preferably with pre-built Hadoop. [Download link](https://spark.apache.org/downloads.html)
  • Scala 2.11.8 or higher. [Download link](https://www.scala-lang.org/download/)

This is a generic how-to on model validation with Spark.

The following tutorial will be performed entirely in the spark-shell, although it is perfectly possible to wrap everything up in a function and run it as a compiled object (see this Scala tutorial).

This is a short post that builds on top of Part 1/5 Machine Learning with Spark, so I'll skip the data-loading part. We assume that our data is in a dataframe called df, already in a format of two columns representing the label and the features, as shown below.

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
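
If your data is not in this shape yet, here is a minimal sketch of one way to get there. Everything in it is an assumption for illustration: a hypothetical CSV file data.csv with a numeric label column and numeric feature columns x1, x2, x3.

import org.apache.spark.ml.feature.VectorAssembler

// hypothetical input: a CSV with a numeric "label" column
// and numeric feature columns x1, x2, x3
val raw = spark.read.
	option("header", "true").
	option("inferSchema", "true").
	csv("data.csv")

// pack the feature columns into a single vector column
val assembler = new VectorAssembler().
	setInputCols(Array("x1", "x2", "x3")).
	setOutputCol("features")

val df = assembler.transform(raw).select("label", "features")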

Here are the packages that we will use throughout the analysis:

import org.apache.spark.ml.evaluation.{RegressionEvaluator => RE}
import org.apache.spark.ml.regression.{LinearRegression => LR}
import org.apache.spark.ml.tuning.{ParamGridBuilder => PGB, TrainValidationSplit => TVS}

Now let's split our dataframe into training and testing subsets and instantiate a linear regression model.

val Array(training, test) = df.randomSplit(Array(.7, .3), seed = 196)
val lr = new LR()

The linear regression model has lots of parameters that we can set and tune. To retrieve their current settings, run the following command; you will see something similar to the output below.

scala> lr.extractParamMap
res105: org.apache.spark.ml.param.ParamMap =
{
        linReg_db39bbba502d-elasticNetParam: 0.0,
        linReg_db39bbba502d-featuresCol: features,
        linReg_db39bbba502d-fitIntercept: true,
        linReg_db39bbba502d-labelCol: label,
        linReg_db39bbba502d-maxIter: 100,
        linReg_db39bbba502d-predictionCol: prediction,
        linReg_db39bbba502d-regParam: 0.0,
        linReg_db39bbba502d-solver: auto,
        linReg_db39bbba502d-standardization: true,
        linReg_db39bbba502d-tol: 1.0E-6
}

To see a brief explanation of each of the parameters above, run lr.explainParams.
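
If you only want to adjust a handful of these by hand, the setters on the estimator can be chained directly (the values below are arbitrary examples, not recommendations):

// each setter mutates the estimator in place and returns it
lr.setMaxIter(50).
	setRegParam(.1).
	setFitIntercept(true)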

All of these parameters can be fine-tuned so that our model improves its predictions. This is where the ParamGridBuilder library comes into play. For more on hyperparameter tuning, check this link.

Let's build a grid of parameters.

val gridParams = new PGB().
	addGrid(lr.regParam, Array(.1, .2, .01, .02)).
	addGrid(lr.fitIntercept).
	addGrid(lr.elasticNetParam, Array(.0, .5, .9)).
	addGrid(lr.maxIter, Array(10, 20, 30)).
	addGrid(lr.tol, Array(.1, .2, .3)).
	addGrid(lr.solver, Array("l-bfgs")).
	build()
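
Note that the grid is the Cartesian product of all the values above: since addGrid(lr.fitIntercept) with no explicit list expands to both true and false, we end up with 4 × 2 × 3 × 3 × 3 × 1 = 216 candidate parameter combinations.

// the grid is an Array[ParamMap], one entry per combination
gridParams.length	// 216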

Next we will use the TrainValidationSplit library for tuning. This will evaluate each combination of parameters once and return the best model. The drawback of this approach is that, unless the dataset is sufficiently large, the results may not be as reliable as those from the CrossValidator method (sketched below).
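
For reference, here is a minimal sketch of that k-fold alternative; the choice of 3 folds is arbitrary:

import org.apache.spark.ml.tuning.{CrossValidator => CV}

// evaluates every parameter combination once per fold instead of once overall
val crossValidator = new CV().
	setEstimator(lr).
	setEvaluator(new RE().setMetricName("r2")).
	setEstimatorParamMaps(gridParams).
	setNumFolds(3)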

TrainValidationSplit will be fit on the training subset we created earlier; by setting .setTrainRatio(.75), 75% of that data will be used for training and 25% for validation, while the test subset remains held out for the final evaluation.

The estimator used in this particular case is the linear regression; the evaluator is a RegressionEvaluator with r2 as the metric. Feel free to play with these values and methods (see the Regression Evaluator APIs and the Model Tuning APIs).

val trainValidationSplit = new TVS().
	setEstimator(lr).
	setEvaluator(new RE().setMetricName("r2")).
	setEstimatorParamMaps(gridParams).
	setTrainRatio(.75)
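
As a side note, if you happen to run Spark 2.3 or later (the prerequisites above only require 2.0), candidate models can also be evaluated in parallel:

// evaluate up to 4 candidate models at a time (available since Spark 2.3)
trainValidationSplit.setParallelism(4)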

Now let's fit the model and then run it against the test data set.

val model = trainValidationSplit.fit(training)

val results = model.transform(test)

results.
	select("features", "label", "prediction").
	show(10)

And the winner of the best-model title is ...

val best = model.bestModel

If you are interested in extracting the values of its parameters, run best.extractParamMap().
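
Since bestModel comes back typed as a generic Model, casting it to a LinearRegressionModel exposes the fitted coefficients and intercept; you can also line up each parameter combination in the grid with its validation metric. A small sketch:

import org.apache.spark.ml.regression.LinearRegressionModel

val bestLR = best.asInstanceOf[LinearRegressionModel]
println(s"coefficients: ${bestLR.coefficients}, intercept: ${bestLR.intercept}")

// r2 on the validation split for every candidate in the grid
gridParams.zip(model.validationMetrics).foreach {
	case (params, metric) => println(s"$metric <- $params")
}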

To check the values of RMSE, MSE, MAE and R2:

val eval = new RE().setLabelCol("label").setPredictionCol("prediction")
println(s"RMSE: ${eval.setMetricName("rmse").evaluate(results)}")
println(s"MSE: ${eval.setMetricName("mse").evaluate(results)}")
println(s"MAE: ${eval.setMetricName("mae").evaluate(results)}")
println(s"R2: ${eval.setMetricName("r2").evaluate(results)}")

And to wrap up, some useful links:


  • [Download Spark](https://spark.apache.org/downloads.html)
  • [Machine Learning Guide](http://spark.apache.org/docs/latest/ml-guide.html)
  • [Introduction to statistical learning](http://www-bcf.usc.edu/~gareth/ISL/)
  • [Part 1/5 Machine Learning with Spark](https://blog.epigno.systems/2017/01/01/machine-learning-with-spark-1/)
  • [Part 2/5 Machine Learning with Spark](https://blog.epigno.systems/2017/01/03/machine-learning-with-spark-2/)
  • [ML Tuning: model selection and hyperparameter tuning](https://spark.apache.org/docs/latest/ml-tuning.html)

> ==Disclaimer==: This is by no means an original work; it is merely meant to serve as a compilation of thoughts, code snippets and teachings from different sources.