# K-Means algorithm in TensorFlow

In everyday life, we often group objects according to a certain similarity, from clothes in the closet to food on the shelves in the supermarket. Google also groups all the documents it contains in its huge database by keywords.

The goal of grouping is to simplify the way we relate to objects, making it easier, for example, to find them when needed. In technical jargon, this grouping activity is called *clustering*. In machine learning, clustering is one of the most important techniques for grouping unlabeled objects.

Clustering can be performed using various methodologies. One of the most common is based on the *k-means algorithm* which will be discussed in detail next.

k-means is an algorithm very often used in practice. One of its most recent applications concerns the identification of locations where criminal activities are most likely to occur [1].

The k-means algorithm has the advantage of generally being computationally light and fast. Its critical points, however, concern the choice of the number of clusters, which is often done by trial and error by minimizing a total-variation criterion [2]. Furthermore, a random choice of the initial centroids can produce different, and suboptimal, clustering results for different initial random choices.

# K-means algorithm: Theory

K-means is an unsupervised learning algorithm that partitions a data set into a predetermined number of clusters, i.e., groups that divide objects according to a similarity relationship between them. The number of clusters is chosen a priori, before the algorithm is run.

The similarity relationship between the elements of the dataset is expressed by defining a centroid for each cluster. The centroid can coincide with an element of the dataset or be a “virtual” element.

In the image below, a two-dimensional example is shown in which the elements of the dataset are represented by red dots, while the centroids for a possible division into three clusters are depicted by the three stars colored blue, red and green.

K-means is a three-step algorithm:

1. *Initialization*: the number of clusters *K* is fixed and *K* randomly arranged initial centroids are defined. The choice of the centroids also identifies a first division into clusters, obtained by associating each point of the dataset with the closest centroid.

2. *Assignment of the dataset elements to the clusters*: the cluster to which each data point is assigned is chosen by identifying the centroid closest to the data point itself. In other words, for a data point *x*, the centroid *c\** is chosen as

*c\** = argmin*c*∈*S* *d*(*c*, *x*),

where *c* is the generic centroid, *S* is the set of centroids and *d*(*c*, *x*) is the distance between *c* and *x*. In the above equation, the points of the dataset and the centroids are considered as multidimensional vectors with *M* dimensions, i.e., *x* = (*x*1, *x*2, …, *x*M). A common choice, which will also be considered in the following subsection, consists in choosing for *d* the Euclidean distance, that is

*d*(*c*, *x*) = √((*c*1 − *x*1)² + (*c*2 − *x*2)² + … + (*c*M − *x*M)²).

3. *Centroids update*: after the previous step, it is likely that the clusters have changed, as data points will have been assigned to or removed from the previous ones. Consequently, the position of each centroid is recalculated as the average of all the data points assigned to its cluster, i.e.

*c*i = (1 / |*S*i|) Σ*x*∈*S*i *x*,

where *S*i represents the *i*-th cluster and |*S*i| the number of its elements.

After initialization, the algorithm repeats steps 2 and 3 until the stop condition is reached. Among the possible stop conditions, we can consider a maximum number of iterations or the fact that the centroids, and therefore the clusters, do not change, thus reaching convergence.

After the theory, let us see a practical application of the k-means algorithm in TensorFlow 2.0.

# K-means algorithm: Practice

Now it is time to discuss simple implementations of the k-means algorithm in TensorFlow 2.0. We will present both a customized version of k-means and the native version available in TensorFlow 2.0, choosing the number of clusters equal to the actual number of clusters used to generate the data. Let us start by discussing the customized version first and then move on to the native one.

## Customized k-means

As usual, the first step is to define the imports. In addition to the usual `numpy`, `tensorflow` and `matplotlib`, for this case we will also need `sklearn.datasets` to generate random clusters.
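Based on the routines used later (e.g., `skd.make_blobs`), the import block can be sketched as follows; the `skd` alias is an assumption consistent with the rest of the text:

```python
import numpy as np                # numerical arrays
import tensorflow as tf           # k-means implementations
import matplotlib.pyplot as plt   # plotting the clusters
import sklearn.datasets as skd    # random cluster generation (make_blobs)
```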

Immediately after the imports, we define the algorithm parameters, i.e., the number of features (the number of dimensions of each dataset element), the number of clusters, the number of dataset samples and the number of iterations:
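A possible parameter definition, with illustrative values and the variable names used later in the text, is:

```python
numFeatures = 2     # dimensions of each dataset element
numClusters = 3     # number of clusters, chosen a priori
numSamples = 1000   # number of dataset samples
numIter = 8         # iterations (used by the native implementation)
```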

The number of iterations will be used for the native implementation of the approach.

Now is the time to generate a random dataset. This is done first of all by defining the centroids around which the clusters will have to develop:
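For instance, with two features and three clusters, the reference centroids could be defined as follows (the coordinates are illustrative):

```python
import numpy as np

# Centroids around which the random clusters will develop
trueCentroids = np.array([[0.0, 0.0],
                          [3.0, 3.0],
                          [0.0, 3.0]])
```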

Following the definition of the centroids, the dataset is generated thanks to the `skd.make_blobs` routine as:
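A sketch of the dataset generation; the dispersion value `cluster_std=0.6` and the fixed `random_state` are assumptions:

```python
import numpy as np
import sklearn.datasets as skd

trueCentroids = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])

# n_samples, centers and cluster_std are the dataset size, the centroids
# and the dispersion of the clusters around the centroids, respectively
samplesNp, labelsNp = skd.make_blobs(n_samples=1000,
                                     centers=trueCentroids,
                                     cluster_std=0.6,
                                     random_state=0)
```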

Note that the number of elements of the dataset, the centroids and the degree of dispersion of the clusters around the centroids must be provided as inputs to the routine. The `skd.make_blobs` routine outputs the samples and the labels of their corresponding clusters.

Once the dataset is created, the centroids, labels and dataset samples are cast as TensorFlow tensors, while a `numpy` copy of the samples is kept:
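The casting step can be sketched as follows (variable names are assumptions consistent with the text; the data generation is repeated so the snippet is self-contained):

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf

trueCentroidsNp = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])
samplesNp, labelsNp = skd.make_blobs(n_samples=1000, centers=trueCentroidsNp,
                                     cluster_std=0.6, random_state=0)

# Cast to TensorFlow tensors...
samples = tf.constant(samplesNp, dtype=tf.float32)
labels = tf.constant(labelsNp)
trueCentroids = tf.constant(trueCentroidsNp, dtype=tf.float32)
# ...while keeping a numpy copy of the samples for later plotting
samplesNpCopy = samplesNp.copy()
```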

We now present the main loop for the customized approach:
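A self-contained sketch of the whole customized loop; the helper functions `initializeCentroids`, `assign2NearestCentroid` and `updateCentroids`, discussed next, are included here in compact form, the 10⁻⁴ threshold follows the text, and the data-generation values are assumptions:

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf

numClusters, numSamples = 3, 1000
trueCentroids = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])
samplesNp, _ = skd.make_blobs(n_samples=numSamples, centers=trueCentroids,
                              cluster_std=0.6, random_state=0)
samples = tf.constant(samplesNp, dtype=tf.float32)

def initializeCentroids(samples, numClusters, numSamples):
    # step 1: random dataset elements as initial centroids
    randomIndices = tf.random.shuffle(tf.range(0, numSamples))
    centroidIndices = tf.slice(randomIndices, [0], [numClusters])
    return tf.gather(samples, centroidIndices)

def assign2NearestCentroid(samples, centroids):
    # step 2: index of the closest centroid for each sample
    distances = tf.reduce_sum(
        tf.square(tf.expand_dims(samples, 0) - tf.expand_dims(centroids, 1)),
        axis=2)
    return tf.argmin(distances, 0)

def updateCentroids(samples, nearestIndices, numClusters):
    # step 3: new centroids as per-cluster averages
    # (note: an empty cluster would yield a NaN mean; with well-separated
    # blobs and centroids picked from the data this is unlikely)
    partitions = tf.dynamic_partition(samples,
                                      tf.cast(nearestIndices, tf.int32),
                                      numClusters)
    return tf.concat([tf.reduce_mean(p, axis=0, keepdims=True)
                      for p in partitions], axis=0)

updateNorm = np.inf      # percentage deviation between successive centroids
updateNormHistory = []   # progress of the iterations, for later inspection
centroids = initializeCentroids(samples, numClusters, numSamples)
while updateNorm > 1e-4:
    nearestIndices = assign2NearestCentroid(samples, centroids)
    newCentroids = updateCentroids(samples, nearestIndices, numClusters)
    updateNorm = (tf.norm(newCentroids - centroids) /
                  tf.norm(centroids)).numpy()
    updateNormHistory.append(updateNorm)
    centroids = newCentroids
```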

The stopping rule is activated when the centroids at the current step have not changed much, in percentage terms, compared to the previous step. For this reason, the variable that evaluates the percentage deviation, `updateNorm`, is initialized to infinity. The iterations are handled by a `while` loop which stops when `updateNorm` becomes smaller than 10⁻⁴.

The `initializeCentroids` function performs step #1 of the k-means algorithm and is shown below:
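A sketch of the function, following the `tf.random.shuffle`/`tf.slice`/`tf.gather` sequence described in the text:

```python
import tensorflow as tf

def initializeCentroids(samples, numClusters, numSamples):
    # shuffle the sample indices 0 .. numSamples-1...
    randomIndices = tf.random.shuffle(tf.range(0, numSamples))
    # ...extract a slice of numClusters indices from the beginning...
    centroidIndices = tf.slice(randomIndices, begin=[0], size=[numClusters])
    # ...and gather the corresponding dataset elements as initial centroids
    initialCentroids = tf.gather(samples, centroidIndices)
    return initialCentroids
```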

The purpose is to initialize the centroids by randomly picking elements of the dataset. For this reason, a random vector of indices from `0` to `numSamples-1` is first generated by applying `tf.random.shuffle` to `tf.range(0, numSamples)`. Next, a slice of size `numClusters` is extracted from the beginning of `randomIndices` by using `tf.slice`, forming `centroidIndices`. Finally, the elements of the dataset corresponding to `centroidIndices` are collected into `initialCentroids` via `tf.gather`.

In addition to the centroids, an array is also initialized which will contain, iteration after iteration, the value of the percentage difference between the centroids at the current step and the previous one with the aim of showing, at the end of the loop, the progress of the iterations.

The assignment of the elements of the dataset to the clusters is done thanks to the `assign2NearestCentroid` function shown below:
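A sketch of the function, implementing the broadcasting scheme described next in the text:

```python
import tensorflow as tf

def assign2NearestCentroid(samples, centroids):
    # (numSamples, numFeatures) -> (1, numSamples, numFeatures)
    expandedSamples = tf.expand_dims(samples, 0)
    # (numClusters, numFeatures) -> (numClusters, 1, numFeatures)
    expandedCentroids = tf.expand_dims(centroids, 1)
    # broadcasting yields squared distances of shape (numClusters, numSamples)
    distances = tf.reduce_sum(tf.square(expandedSamples - expandedCentroids),
                              axis=2)
    # index of the closest centroid for each sample (minima along the columns)
    nearestIndices = tf.argmin(distances, 0)
    return nearestIndices
```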

The `tf.expand_dims` function is used to expand the dimensions of both `samples` and `centroids`. In other words, `samples` is a tensor of dimensions `(numSamples, numFeatures)`, while `expandedSamples` has dimensions `(1, numSamples, numFeatures)`. Similarly, `centroids` is a tensor of dimensions `(numClusters, numFeatures)`, while `expandedCentroids` has dimensions `(numClusters, 1, numFeatures)`. In this way, an implicit `meshgrid` is defined for the subsequent evaluation of the squared distances between the dataset points and the centroids. In other words, the `distances` tensor has dimensions `(numClusters, numSamples)`. The indices of the cluster closest to each sample can be deduced by determining the positions of the minima of `distances` along the columns using `tf.argmin(distances, 0)`.

The `updateCentroids` routine implements the third step of the k-means approach, namely the centroids update:
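A sketch of the routine; the use of `tf.dynamic_partition` is an assumption consistent with the `list`-of-tensors description below:

```python
import tensorflow as tf

def updateCentroids(samples, nearestIndices, numClusters):
    # cast the cluster indices to integers, as required by the partitioning
    nearestIndices = tf.cast(nearestIndices, tf.int32)
    # split the dataset into a list of numClusters tensors
    partitions = tf.dynamic_partition(samples, nearestIndices, numClusters)
    # average each partition along the columns and stack the results
    newCentroids = tf.concat([tf.reduce_mean(partition, axis=0, keepdims=True)
                              for partition in partitions], axis=0)
    return newCentroids
```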

Specifically, after casting `nearestIndices` to integers, these are used to partition the dataset into `numClusters` partitions. The result is a `list` of tensors stored in `partitions`. The new centroids are computed by averaging each tensor in the `partitions` list along the columns and then collecting the results via `tf.concat`.

The last relevant function of the customized approach is the `plotClusters` function.
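A possible sketch of `plotClusters`; the marker sizes and the star marker for the centroids are assumptions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to display the figure
import matplotlib.pyplot as plt

def plotClusters(samplesNp, nearestIndices, centroidsNp):
    numCentroids = centroidsNp.shape[0]
    # numCentroids "equally spaced" colors from the rainbow colormap
    colors = plt.cm.rainbow(np.linspace(0.0, 1.0, numCentroids))
    for i in range(numCentroids):
        # plot each cluster with its own color
        clusterSamples = samplesNp[np.asarray(nearestIndices) == i]
        plt.scatter(clusterSamples[:, 0], clusterSamples[:, 1],
                    color=colors[i], s=8)
    # mark the centroids with stars
    plt.scatter(centroidsNp[:, 0], centroidsNp[:, 1],
                marker="*", s=160, color="k")
    plt.show()
```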

This function exploits the `rainbow` colormap by fetching `numCentroids` “equally spaced” colors between `0` and `1`. Each color is then used to plot a different cluster. A possible example of the result is shown in the following figure:

A possible trend of the percentage variation of the centroids with respect to the previous step is:

This completes the description of the customized k-means approach. Let’s now look at the native one in TensorFlow.

## Native TensorFlow k-means

In the native TensorFlow case, the first step is to define a clustering estimator such as
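In TensorFlow 2.x, the clustering estimator is available through the `compat.v1` module; a possible definition, with the number of clusters from the parameters above, is:

```python
import tensorflow as tf

numClusters = 3

# Native k-means estimator; use_mini_batch=False processes the full
# dataset at each training step (mini batching disabled)
kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=numClusters,
    use_mini_batch=False)
```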

In the example shown, mini batching is disabled.

The next step is to define a training loop of `numIter` iterations as
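A self-contained sketch of the training loop; the dataset generation and `input_fn` are repeated here so that the snippet runs on its own, and `numIter` and the dataset values are illustrative:

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf

numClusters, numIter = 3, 8
trueCentroids = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])
samplesNp, _ = skd.make_blobs(n_samples=1000, centers=trueCentroids,
                              cluster_std=0.6, random_state=0)

def input_fn():
    # the whole dataset, in single precision, for one epoch
    return tf.compat.v1.train.limit_epochs(
        tf.convert_to_tensor(samplesNp, dtype=tf.float32), num_epochs=1)

kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=numClusters, use_mini_batch=False)

for it in range(numIter):
    # one pass of training over the whole dataset...
    kmeans.train(input_fn)
    # ...after which the cluster centers are updated
    clusterCenters = kmeans.cluster_centers()
```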

At each iteration, the actual training is performed by `kmeans.train`, which requires as input a function returning the dataset on which to operate, `input_fn` in this case. The `input_fn` function is defined as follows:
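A possible `input_fn`, returning the whole dataset cast to single precision for `num_epochs=1` (the placeholder data stands in for the dataset generated earlier):

```python
import numpy as np
import tensorflow as tf

samplesNp = np.random.rand(1000, 2)  # placeholder; use the generated dataset

def input_fn():
    # entire dataset as a single-precision tensor, served num_epochs=1 times
    return tf.compat.v1.train.limit_epochs(
        tf.convert_to_tensor(samplesNp, dtype=tf.float32), num_epochs=1)
```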

and returns the entire dataset, cast to a single-precision tensor, a number `num_epochs=1` of times. After the training phase, the cluster centers are updated as

Once the training loop is finished, a list of labels is created, corresponding to the clusters to which each element of the dataset belongs, by

Finally, the cluster centers, for each dataset element, are listed and plotted by:
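Putting the native pieces together, the cluster index of each element and the corresponding centers can be listed and plotted roughly as follows (a self-contained sketch; the plotting details are assumptions):

```python
import numpy as np
import sklearn.datasets as skd
import tensorflow as tf
import matplotlib
matplotlib.use("Agg")  # drop this line to display the figure interactively
import matplotlib.pyplot as plt

numClusters = 3
trueCentroids = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 3.0]])
samplesNp, _ = skd.make_blobs(n_samples=1000, centers=trueCentroids,
                              cluster_std=0.6, random_state=0)

def input_fn():
    return tf.compat.v1.train.limit_epochs(
        tf.convert_to_tensor(samplesNp, dtype=tf.float32), num_epochs=1)

kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=numClusters, use_mini_batch=False)
for _ in range(5):
    kmeans.train(input_fn)
clusterCenters = kmeans.cluster_centers()

# list of the cluster index of each dataset element
clusterIndices = np.array(list(kmeans.predict_cluster_index(input_fn)))

# plot each cluster with a different color, centers as stars
colors = plt.cm.rainbow(np.linspace(0.0, 1.0, numClusters))
for i in range(numClusters):
    clusterSamples = samplesNp[clusterIndices == i]
    plt.scatter(clusterSamples[:, 0], clusterSamples[:, 1],
                color=colors[i], s=8)
plt.scatter(clusterCenters[:, 0], clusterCenters[:, 1],
            marker="*", s=160, color="k")
plt.show()
```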

# References

[1] V. Jain, Y. Sharma, A. Bhatia, V. Arora, “Crime prediction using k-means algorithm,” *Global Res. Dev. J. for Eng.*, vol. 2, no. 5, pp. 206–209, Apr. 2017.

[2] P. Pareek, “All about ‘k-means’ clustering,” Medium.