Machine Learning with Spark (4/5) - unsupervised learning

Prerequisites:

  • Spark 2.0.0 or higher, preferably with pre-built Hadoop. Download link
  • Scala 2.11.8 or higher. Download link

This will be a brief introduction to Spark and KMeans clustering. KMeans is a very popular data mining method which, in short, aims at partitioning the observations (data points) into k clusters, each observation belonging to the cluster with the nearest mean. This link explains how KMeans clustering works.

To work with KMeans in Spark, we will need to import the following libraries.

import org.apache.spark.ml.clustering.{KMeans => KM}
import org.apache.spark.ml.feature.{VectorAssembler => VA}
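If you don't already have a data set loaded, a minimal sketch of getting one into a DataFrame could look like the following. The file name `data.csv` is a placeholder; any CSV made up of numeric columns will do:

```scala
import org.apache.spark.sql.SparkSession

// Start (or reuse) a local Spark session.
val spark = SparkSession.builder()
  .appName("KMeansExample")
  .master("local[*]")
  .getOrCreate()

// Read a CSV with a header row, letting Spark infer the column types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")
```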

Assuming that our data set is loaded into a dataframe called df, next we will create a features vector using the VectorAssembler and prepare the data. Since this is unsupervised learning, we have no label column, only features.

val assembler = new VA()
  .setInputCols(df.columns)
  .setOutputCol("features")
val data = assembler.transform(df).select("features")

Next, we need to instantiate KM and choose the number of clusters, in this case three. Setting the seed is optional and is used only for reproducibility purposes.

val kmeans = new KM().setK(3).setSeed(196L)

There is no universally right or wrong number of clusters to divide your data set into. However, a couple of methods can help you decide on an appropriate number; one of the most used is the Elbow method.
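As a rough sketch of the Elbow method (assuming the `data` DataFrame with the "features" column built above), you could compute the clustering cost for a range of k values and look for the point where the decrease flattens out:

```scala
// Fit a model for each candidate k and print its cost (within-set
// sum of squared errors). The "elbow" is where adding more clusters
// stops reducing the cost significantly.
for (k <- 2 to 10) {
  val model = new KM().setK(k).setSeed(196L).fit(data)
  println(s"k = $k, cost = ${model.computeCost(data)}")
}
```

Plotting cost against k makes the elbow easier to spot than reading the raw numbers.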

Next, we will fit the model to the data and then evaluate the clustering by computing the within-set sum of squared errors.

val model = kmeans.fit(data)

val sse = model.computeCost(data)
println(s"Within Set Sum of Squared Errors ${sse}")

To check the model's parameters and their values, you can run model.extractParamMap. Other metrics are available via model.summary.

To print out the cluster centers: model.clusterCenters.foreach(println).
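To see which cluster each observation was assigned to, transform the data with the fitted model; the cluster index lands in a `prediction` column:

```scala
// Each row of `data` gets a "prediction" column holding its
// cluster index (0, 1 or 2 for k = 3).
val predictions = model.transform(data)
predictions.show(5)
```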

Finally, some useful links:


  • [Download Spark](https://spark.apache.org/downloads.html)
  • [Machine Learning Guide](http://spark.apache.org/docs/latest/ml-guide.html)
  • [15 hours of machine learning videos](https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/)
  • [Part 1/5 Machine Learning with Spark](https://blog.epigno.systems/2017/01/01/machine-learning-with-spark-1/)
  • [Part 2/5 Machine Learning with Spark](https://blog.epigno.systems/2017/01/03/machine-learning-with-spark-2/)
  • [Part 3/5 Machine Learning with Spark](https://blog.epigno.systems/2017/01/03/machine-learning-with-spark-3/)
  • [Clustering with Spark](https://spark.apache.org/docs/latest/ml-clustering.html)

> **Disclaimer**: This is by no means an original work; it is merely meant to serve as a compilation of thoughts, code snippets, and teachings from different sources.