K-Means algorithm in TensorFlow
In everyday life, we often group objects according to a certain similarity, from clothes in the closet to food on the shelves in the supermarket. Google also groups all the documents it contains in its huge database by keywords.
The goal of grouping is to simplify the way we relate to objects so that, for example, it is easier to find them when needed. In technical jargon, this grouping activity is called clustering. In machine learning, clustering is one of the most important techniques for grouping unlabeled objects.
Clustering can be performed using various methodologies. One of the most common is based on the k-means algorithm which will be discussed in detail next.
k-means is an algorithm very often used in practice. One of its most recent applications concerns the identification of the locations where criminal activities are most likely to occur [1].
The k-means algorithm has the advantage of generally being computationally light and fast. However, the critical points concern the choice of the number of clusters, which is often made by trial and error by minimizing a criterion of total variation [2]. Furthermore, a random choice of centroids can produce different, and suboptimal, clustering results for different initial random choices.
K-means algorithm: Theory
K-means is an unsupervised learning algorithm that determines a predetermined number of clusters in a data set, i.e., groups that divide objects according to a relationship of similarity between them. The number of clusters is chosen a priori before the algorithm is run.
The similarity relationship between the elements of the dataset is expressed by defining a centroid for each cluster. The centroid can coincide with an element of the dataset or be a “virtual” element.
In the image below, a two-dimensional example is shown in which the elements of the dataset are represented by red dots, while the centroids for a possible division into three clusters are depicted by the three stars colored blue, red and green.
K-means is a three-step algorithm:
- Initialization: the number of clusters K and K randomly arranged initial centroids are defined. The choice of the centroids also identifies a first division into clusters obtained by associating each point of the dataset to the closest centroid.
- Assignment of the dataset elements to the clusters: the cluster to which each data point is assigned is chosen by identifying the centroid closest to the data point itself. In other words, for a data point x, the centroid c* is chosen as

c* = argmin_{c ∈ S} d(c, x),

where c is the generic centroid, S is the set of centroids and d(c, x) is the distance between c and x. In the above equation, the points of the dataset and the centroids are considered as multidimensional vectors with M dimensions, i.e., x = (x₁, x₂, …, x_M). A common choice, which will also be considered in the following subsection, consists in choosing for d the Euclidean distance, that is

d(c, x) = √((c₁ − x₁)² + (c₂ − x₂)² + … + (c_M − x_M)²).
- Centroids update: after the previous step, it is likely that the clusters have changed, as data points will have been added to or removed from them. Consequently, the position of each centroid is recalculated as the average of all the data points assigned to its cluster, i.e.

cᵢ = (1/|Sᵢ|) Σ_{x ∈ Sᵢ} x,

where Sᵢ represents the i-th cluster and |Sᵢ| the number of its elements.
After initialization, the algorithm repeats steps 2 and 3 until the stop condition is reached. Among the possible stop conditions, we can consider a maximum number of iterations or the fact that the centroids, and therefore the clusters, do not change, thus reaching convergence.
After the theory, let's see a practical application of the k-means algorithm in TensorFlow 2.0.
K-means algorithm: Practice
Now is the time to discuss simple implementations of the k-means algorithm in TensorFlow 2.0. We will present both a customized version of k-means and the native version in TensorFlow 2.0, setting the number of clusters equal to the actual number of clusters in the data. Let's start by discussing the customized version first and then move on to the native one.
Customized k-means
As usual, the first step is to define the imports. In addition to the usual numpy, tensorflow and matplotlib, for this case we will also need sklearn.datasets to generate random clusters.
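A minimal sketch of the imports, assuming sklearn.datasets is aliased as skd to match the skd.make_blobs call used later:

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import sklearn.datasets as skd  # provides make_blobs for random clusters
```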
Immediately after the imports, we define the algorithm parameters, i.e., the number of features, namely, the number of dimensions of each dataset element, the number of clusters, the dataset sample number and the number of iterations:
The number of iterations will be used for the native implementation of the approach.
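The parameters could be set, for instance, as follows; the specific values are illustrative, with the variable names matching those used in the rest of the text:

```python
numFeatures = 2     # number of dimensions of each dataset element
numClusters = 3     # number of clusters, chosen a priori
numSamples  = 1000  # number of dataset samples
numIter     = 10    # iterations, used by the native implementation
```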
Now is the time to generate a random dataset. This is done first of all by defining the centroids around which the clusters will have to develop:
Following the definition of the centroids, the dataset is generated thanks to the skd.make_blobs routine. Note that the number of elements of the dataset, the centroids and the degree of dispersion of the clusters around the centroids must be provided as inputs to the routine. The skd.make_blobs routine outputs the samples and the labels of their corresponding clusters.
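The two steps above — defining the centroids and generating the blobs around them — could be sketched as follows; the centroid positions and the dispersion value are hypothetical:

```python
import numpy as np
import sklearn.datasets as skd

numSamples = 1000

# Centroids around which the clusters will develop (illustrative values)
centers = np.array([[-2., -2.], [0., 0.], [2., 2.]])

# make_blobs needs the number of samples, the centroids and the cluster
# dispersion; it returns the samples and the labels of their clusters
samples, labels = skd.make_blobs(n_samples=numSamples, centers=centers,
                                 cluster_std=0.5, random_state=0)
```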
Once the dataset is created, the centroids, labels and dataset samples are cast as TensorFlow tensors, while a numpy copy of the samples is kept:
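The casting step might look like this (samplesNp is the retained numpy copy; the dataset-generation lines are repeated so the snippet is self-contained):

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf

centers = np.array([[-2., -2.], [0., 0.], [2., 2.]])
samplesNp, labelsNp = skd.make_blobs(n_samples=1000, centers=centers,
                                     cluster_std=0.5, random_state=0)

# Cast to TensorFlow tensors, keeping the numpy copy for later plotting
samples   = tf.convert_to_tensor(samplesNp, dtype=tf.float32)
labels    = tf.convert_to_tensor(labelsNp)
centroids = tf.convert_to_tensor(centers, dtype=tf.float32)
```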
We now present the main loop for the customized approach. The stopping rule is activated when the centroids at the current step have changed, in percentage terms, only marginally with respect to the previous step. For this reason, the variable that evaluates the percentage deviation, updateNorm, is initialized to infinity. The iterations are handled by a while loop which stops as soon as updateNorm becomes smaller than 10⁻⁴.
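Under the assumptions above, the main loop could be sketched as follows. The helper functions are given here in compact form so that the snippet runs on its own (the article discusses each of them in turn); the percentage deviation is computed here as the ratio of the absolute centroid change to the absolute centroid magnitude, which is one reasonable reading of the stopping rule:

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf

tf.random.set_seed(0)
numFeatures, numClusters, numSamples = 2, 3, 1000

# Synthetic dataset (hypothetical centroids and dispersion)
centers = np.array([[-2., -2.], [0., 0.], [2., 2.]])
samplesNp, _ = skd.make_blobs(n_samples=numSamples, centers=centers,
                              cluster_std=0.5, random_state=0)
samples = tf.convert_to_tensor(samplesNp, dtype=tf.float32)

def initializeCentroids(samples, numClusters, numSamples):
    # Step 1: random dataset elements as initial centroids
    randomIndices = tf.random.shuffle(tf.range(0, numSamples))
    centroidIndices = tf.slice(randomIndices, begin=[0], size=[numClusters])
    return tf.gather(samples, centroidIndices)

def assign2NearestCentroid(samples, centroids):
    # Step 2: index of the nearest centroid for each sample
    distances = tf.reduce_sum(
        tf.square(tf.expand_dims(samples, 0) - tf.expand_dims(centroids, 1)), 2)
    return tf.argmin(distances, 0)

def updateCentroids(samples, nearestIndices, numClusters):
    # Step 3: new centroid = mean of the points assigned to each cluster
    partitions = tf.dynamic_partition(
        samples, tf.cast(nearestIndices, tf.int32), numClusters)
    return tf.concat([tf.expand_dims(tf.reduce_mean(p, 0), 0)
                      for p in partitions], 0)

centroids = initializeCentroids(samples, numClusters, numSamples)
updateNorm = np.inf          # percentage deviation, initialized to infinity
updateNormHistory = []       # progress of the iterations
while updateNorm > 1e-4:
    nearestIndices = assign2NearestCentroid(samples, centroids)
    newCentroids = updateCentroids(samples, nearestIndices, numClusters)
    updateNorm = float(tf.reduce_sum(tf.abs(newCentroids - centroids)) /
                       tf.reduce_sum(tf.abs(centroids)))
    updateNormHistory.append(updateNorm)
    centroids = newCentroids
```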
The initializeCentroids function performs step #1 of the k-means algorithm and is shown below. The purpose is to initialize the centroids by randomly taking elements of the dataset. For this reason, a random permutation of the indices from 0 to numSamples-1 (tf.range(0, numSamples)) is initially generated via tf.random.shuffle. Next, a slice of size numClusters is extracted from the beginning of randomIndices by using tf.slice, forming centroidIndices. Finally, the elements of the dataset corresponding to centroidIndices are collected in initialCentroids via tf.gather.
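A sketch consistent with the description above (the signature is an assumption; the companion array tracking the percentage variation, mentioned next, is omitted here for brevity):

```python
import tensorflow as tf

def initializeCentroids(samples, numClusters, numSamples):
    # Random permutation of the indices 0 .. numSamples-1
    randomIndices = tf.random.shuffle(tf.range(0, numSamples))
    # Slice of numClusters indices from the beginning of randomIndices
    centroidIndices = tf.slice(randomIndices, begin=[0], size=[numClusters])
    # Gather the corresponding dataset elements as the initial centroids
    initialCentroids = tf.gather(samples, centroidIndices)
    return initialCentroids
```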
In addition to the centroids, an array is also initialized which will contain, iteration after iteration, the value of the percentage difference between the centroids at the current step and the previous one with the aim of showing, at the end of the loop, the progress of the iterations.
The assignment of the elements of the dataset to the clusters is done thanks to the assign2NearestCentroid function shown below. The tf.expand_dims function is used to expand the dimensions of both samples and centroids. In other words, samples is a tensor of dimensions (numSamples, numFeatures), while expandedSamples has dimensions (1, numSamples, numFeatures). Similarly, centroids is a tensor of dimensions (numClusters, numFeatures), while expandedCentroids has dimensions (numClusters, 1, numFeatures). In this way, an implicit meshgrid is defined for the subsequent evaluation of the squared distances between the dataset points and the centroids: the distances tensor has dimensions (numClusters, numSamples). The indices of the cluster closest to each sample can be deduced by determining the positions of the minima of distances along the columns using tf.argmin(distances, 0).
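A sketch matching the shapes discussed above:

```python
import tensorflow as tf

def assign2NearestCentroid(samples, centroids):
    # samples: (numSamples, numFeatures) -> (1, numSamples, numFeatures)
    expandedSamples = tf.expand_dims(samples, 0)
    # centroids: (numClusters, numFeatures) -> (numClusters, 1, numFeatures)
    expandedCentroids = tf.expand_dims(centroids, 1)
    # Broadcasting acts as an implicit meshgrid: distances has shape
    # (numClusters, numSamples)
    distances = tf.reduce_sum(
        tf.square(expandedSamples - expandedCentroids), 2)
    # Position of the minima along the columns = nearest centroid per sample
    return tf.argmin(distances, 0)
```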
The updateCentroids routine implements the third step of the k-means approach, namely the update of the centroids. Specifically, after casting the nearestIndices to integers, these are used to partition the dataset into a number numClusters of partitions. The result is a list of tensors stored in partitions. The new centroids are computed by reducing each tensor in the partitions list along the columns and then collecting the results by tf.concat.
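A sketch of this step, using tf.dynamic_partition, which matches the partition-into-a-list behavior described above:

```python
import tensorflow as tf

def updateCentroids(samples, nearestIndices, numClusters):
    # dynamic_partition requires int32 indices
    nearestIndices = tf.cast(nearestIndices, tf.int32)
    # List of numClusters tensors, one per cluster
    partitions = tf.dynamic_partition(samples, nearestIndices, numClusters)
    # New centroid of each cluster: mean of its points, then concatenate
    return tf.concat(
        [tf.expand_dims(tf.reduce_mean(partition, 0), 0)
         for partition in partitions], 0)
```

Note that a cluster left empty during an iteration would yield a NaN mean; with well-separated clusters and data-point initialization this is unlikely, but a production version should guard against it.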
The last relevant function of the customized approach is the plotClusters function. This function exploits the rainbow colormap by fetching numCentroids “equally spaced” colors between 0 and 1. Each color is then used to plot a different cluster. A possible example of a result is shown in the following figure:
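A sketch of plotClusters consistent with the description above, assuming the cluster assignments are available as a numpy integer array (the signature is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

def plotClusters(samplesNp, nearestIndices, centroids, numCentroids):
    # numCentroids "equally spaced" colors between 0 and 1 from rainbow
    colors = plt.cm.rainbow(np.linspace(0, 1, numCentroids))
    for i in range(numCentroids):
        # Plot each cluster with its own color
        cluster = samplesNp[nearestIndices == i]
        plt.scatter(cluster[:, 0], cluster[:, 1], color=colors[i], s=10)
    # Mark the centroids with black stars
    plt.scatter(centroids[:, 0], centroids[:, 1], color='black',
                marker='*', s=200)
    plt.show()
```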
A possible trend of the percentage variation of the centroids with respect to the previous step is:
This completes the description of the customized k-means approach. Let’s now look at the native one in TensorFlow.
Native TensorFlow k-means
In the native TensorFlow case, the first step is to define a clustering estimator. In the example shown, mini batching is disabled.
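Following the pattern of the TensorFlow documentation for the KMeans estimator, the definition could look like this (use_mini_batch=False disables mini batching):

```python
import tensorflow as tf

numClusters = 3  # chosen a priori, as in the customized case

# Native k-means clustering estimator from the compat.v1 estimator API
kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=numClusters, use_mini_batch=False)
```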
The next step is to define a training loop of numIter iterations.
At each iteration, the actual training is performed by kmeans.train, which requires as input a function, input_fn in this case, returning the dataset on which to operate. The input_fn function, in this case, returns the entire dataset, cast to a single-precision tensor, a number num_epochs=1 of times. After the training phase, the updated cluster centers are retrieved from the estimator.
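A self-contained sketch of the training loop, again following the documented estimator example (dataset values and iteration count are illustrative):

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf

numClusters, numIter = 3, 10
centers = np.array([[-2., -2.], [0., 0.], [2., 2.]])
samplesNp, _ = skd.make_blobs(n_samples=1000, centers=centers,
                              cluster_std=0.5, random_state=0)

# Returns the whole dataset, cast to single precision, num_epochs=1 times
def input_fn():
    return tf.compat.v1.train.limit_epochs(
        tf.convert_to_tensor(samplesNp, dtype=tf.float32), num_epochs=1)

kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=numClusters, use_mini_batch=False)

# Training loop: one kmeans.train call per iteration, then refresh centers
for _ in range(numIter):
    kmeans.train(input_fn)
    clusterCenters = kmeans.cluster_centers()
```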
Once the training loop is finished, a list of labels is created identifying the cluster to which each element of the dataset belongs.
Finally, the cluster center associated with each dataset element is listed, and the resulting clusters are plotted.
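These last two steps use kmeans.predict_cluster_index, as in the documented estimator example; a self-contained sketch (only the first few associations are printed to keep the output short):

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf
import matplotlib.pyplot as plt

numClusters = 3
centers = np.array([[-2., -2.], [0., 0.], [2., 2.]])
samplesNp, _ = skd.make_blobs(n_samples=1000, centers=centers,
                              cluster_std=0.5, random_state=0)

def input_fn():
    return tf.compat.v1.train.limit_epochs(
        tf.convert_to_tensor(samplesNp, dtype=tf.float32), num_epochs=1)

kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=numClusters, use_mini_batch=False)
for _ in range(10):
    kmeans.train(input_fn)
clusterCenters = kmeans.cluster_centers()

# Label of the cluster each dataset element belongs to
clusterIndices = np.array(list(kmeans.predict_cluster_index(input_fn)))

# Cluster center associated with each dataset element
for i, point in enumerate(samplesNp[:5]):
    print('point', point, 'is in cluster', clusterIndices[i],
          'centered at', clusterCenters[clusterIndices[i]])

# Plot the resulting clusters and their centers
colors = plt.cm.rainbow(np.linspace(0, 1, numClusters))
for i in range(numClusters):
    cluster = samplesNp[clusterIndices == i]
    plt.scatter(cluster[:, 0], cluster[:, 1], color=colors[i], s=10)
plt.scatter(clusterCenters[:, 0], clusterCenters[:, 1],
            color='black', marker='*', s=200)
plt.show()
```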
References
[1] V. Jain, Y. Sharma, A. Bhatia, V. Arora, “Crime prediction using k-means algorithm,” Global Res. Dev. J. for Eng., vol. 2, no. 5, pp. 206–209, Apr. 2017.
[2] P. Pareek, “All about ‘k-means’ clustering,” Medium.