K-Means algorithm in TensorFlow

Vitality Learning
Sep 26, 2022

In everyday life, we often group objects according to a certain similarity, from clothes in the closet to food on the shelves in the supermarket. Google also groups all the documents it contains in its huge database by keywords.

The goal of grouping is to simplify the way we relate to objects so that, for example, it is easier to find them when needed. In technical jargon, this grouping activity is called clustering. In machine learning, clustering is one of the most important techniques for grouping unlabeled objects.

Clustering can be performed using various methodologies. One of the most common is based on the k-means algorithm, which will be discussed in detail next.

k-means is an algorithm very often used in practice. One of its most recent applications concerns the identification of locations where criminal activities are most likely [1].

The k-means algorithm has the advantage of generally being computationally light and fast. Its critical points, however, concern the choice of the number of clusters, which is often done by trial and error by minimizing a criterion of total variation [2]. Furthermore, a random choice of the initial centroids can produce different, and suboptimal, clustering results for different initial random choices.

K-means algorithm: Theory

K-means is an unsupervised learning algorithm that partitions a data set into a predetermined number of clusters, i.e., groups of objects formed according to a relationship of similarity between them. The number of clusters is chosen a priori, before the algorithm is run.

The similarity relationship between the elements of the dataset is expressed by defining a centroid for each cluster. The centroid can coincide with an element of the dataset or be a “virtual” element.
In the image below, a two-dimensional example is shown in which the elements of the dataset are represented by red dots, while the centroids for a possible division into three clusters are depicted by the three stars colored blue, red and green.

Illustrating the clustering process.

K-means is a three-step algorithm:

  • Initialization: the number of clusters K and K randomly arranged initial centroids are defined. The choice of the centroids also identifies a first division into clusters obtained by associating each point of the dataset to the closest centroid.
  • Assigning the dataset elements to the clusters: the cluster to which each data point is assigned is chosen by identifying the centroid closest to the data point itself. In other words, for a data point x, the centroid is chosen as

\hat{c} = \arg\min_{c \in S} d(c, x),

where c is the generic centroid, S is the set of centroids and d(c, x) is the distance between c and x. In the above equation, the points of the dataset and the centroids are considered as multidimensional vectors with M dimensions, i.e., x = (x_1, x_2, …, x_M). A common choice, which will also be considered in the following subsection, consists in choosing for d the Euclidean distance, that is

d(c, x) = \sqrt{\sum_{m=1}^{M} (c_m - x_m)^2}.

  • Centroids update: after the previous step, new clusters are likely to have formed, as data points will have been assigned to or removed from the previous ones. Consequently, the position of each centroid is recalculated as the average of all the data points assigned to the corresponding cluster, i.e.,

c_i = \frac{1}{|S_i|} \sum_{x \in S_i} x,

where S_i represents the i-th cluster and |S_i| the number of its elements.

After initialization, the algorithm repeats steps 2 and 3 until the stop condition is reached. Among the possible stop conditions, we can consider a maximum number of iterations or the fact that the centroids, and therefore the clusters, do not change, thus reaching convergence.

After the theory, let us see a practical application of the k-means algorithm in TensorFlow 2.0.

K-means algorithm: Practice

Now is the time to discuss simple implementations of the k-means algorithm in TensorFlow 2.0. We will present both a customized version of k-means and the native version in TensorFlow 2.0, choosing the number of clusters equal to the actual number of clusters used to generate the data. Let's start by discussing the customized version first and then move on to the native one.

Customized k-means

As usual, the first step is to define the imports. In addition to the usual numpy, tensorflow and matplotlib, for this case we will also need sklearn.datasets to generate random clusters:
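A minimal version of the imports, with skd the alias used below for sklearn.datasets, is the following sketch:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
# alias skd is used for sklearn.datasets, which provides make_blobs
import sklearn.datasets as skd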

Immediately after the imports, we define the algorithm parameters, i.e., the number of features, namely, the number of dimensions of each dataset element, the number of clusters, the dataset sample number and the number of iterations:
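A possible parameter block is the following (the specific values are only an illustrative assumption):

numFeatures = 2       # number of dimensions of each dataset element
numClusters = 3       # number of clusters to determine
numSamples  = 1000    # number of dataset samples
numIter     = 10      # number of iterations (used by the native implementation)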

The number of iterations will be used for the native implementation of the approach.

Now is the time to generate a random dataset. This is done first of all by defining the centroids around which the clusters will have to develop:
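For instance (the coordinates below are an arbitrary illustrative choice):

# reference centroids around which the synthetic clusters will develop
trueCentroids = np.array([[-2., -2.], [0., 0.], [2., 2.]])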

Following the definition of the centroids, the dataset is generated thanks to the skd.make_blobs routine as:
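A sketch of the call (cluster_std and random_state are illustrative assumptions):

# n_samples sets the dataset size, centers the reference centroids,
# cluster_std the dispersion of the clusters around the centroids
samples, labels = skd.make_blobs(n_samples=numSamples, centers=trueCentroids,
                                 cluster_std=0.5, random_state=0)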

Note that the number of elements of the dataset, the centroids and the degree of dispersion of the clusters around the centroids must be provided as inputs to the routine. The skd.make_blobs routine outputs the samples and the labels of the clusters they belong to.

Once the dataset is created, the centroids, labels and dataset samples are cast as TensorFlow tensors, while a numpy copy of the samples is kept:
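One possibility is:

samplesNumpy = samples.copy()   # numpy copy of the samples, kept for later plotting
samples      = tf.convert_to_tensor(samples, dtype=tf.float32)
labels       = tf.convert_to_tensor(labels)
centroids    = tf.convert_to_tensor(trueCentroids, dtype=tf.float32)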

We now present the main loop for the customized approach:
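A minimal sketch of the loop, built around the helper functions defined below, is:

# step #1: random initial centroids drawn from the dataset
centroids = initializeCentroids(samples)
# history of the percentage variation of the centroids, shown at the end of the loop
updateNormHistory = []
# percentage deviation between consecutive centroid sets, initialized to infinity
updateNorm = np.inf

while updateNorm > 1.e-4:
    # step #2: assign each sample to the nearest centroid
    nearestIndices = assign2NearestCentroid(samples, centroids)
    # step #3: recompute the centroids as cluster averages
    newCentroids = updateCentroids(samples, nearestIndices)
    # percentage variation with respect to the previous step
    updateNorm = (tf.norm(newCentroids - centroids) / tf.norm(centroids)).numpy()
    updateNormHistory.append(updateNorm)
    centroids = newCentroids

plotClusters(samplesNumpy, nearestIndices, centroids, numClusters)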

The stopping rule is activated when the centroids at the current step have not changed much, in percentage, with respect to the previous step. For this reason, the variable that evaluates the percentage deviation, updateNorm, is initialized to infinity. The iterations are handled by a while loop which stops when updateNorm becomes smaller than 10⁻⁴.

The initializeCentroids function performs step #1 of the k-means algorithm and is shown below:
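A sketch consistent with the following description (numSamples and numClusters are the globals defined above) is:

def initializeCentroids(samples):
    # random permutation of the indices 0 ... numSamples-1
    randomIndices = tf.random.shuffle(tf.range(0, numSamples))
    # slice of numClusters indices starting from the beginning of randomIndices
    centroidIndices = tf.slice(randomIndices, begin=[0], size=[numClusters])
    # the corresponding dataset elements become the initial centroids
    initialCentroids = tf.gather(samples, centroidIndices)
    return initialCentroids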

The purpose is to initialize the centroids by randomly taking elements of the dataset. For this reason, a random vector of indices from 0 to numSamples-1 (tf.range(0, numSamples)) is first generated via tf.random.shuffle. Next, a slice of numClusters size is extracted, starting from the beginning of randomIndices, by using tf.slice, forming centroidIndices. Finally, the elements of the dataset corresponding to centroidIndices are collected in initialCentroids via tf.gather.

In addition to the centroids, an array is also initialized which will contain, iteration after iteration, the percentage difference between the centroids at the current step and those at the previous one; at the end of the loop, it is used to show the progress of the iterations.

The assignment of the elements of the dataset to the clusters is done thanks to the assign2NearestCentroid function shown below:
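A sketch consistent with the following description is:

def assign2NearestCentroid(samples, centroids):
    # (numSamples, numFeatures) -> (1, numSamples, numFeatures)
    expandedSamples = tf.expand_dims(samples, 0)
    # (numClusters, numFeatures) -> (numClusters, 1, numFeatures)
    expandedCentroids = tf.expand_dims(centroids, 1)
    # broadcasting acts as an implicit meshgrid:
    # distances has dimensions (numClusters, numSamples)
    distances = tf.reduce_sum(tf.square(expandedSamples - expandedCentroids), 2)
    # index of the closest centroid for each sample
    nearestIndices = tf.argmin(distances, 0)
    return nearestIndices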

The tf.expand_dims function is used to expand the dimensions of both samples and centroids. In other words, samples is a tensor of dimensions (numSamples, numFeatures), while expandedSamples has dimensions (1, numSamples, numFeatures). Similarly, centroids is a tensor of dimensions (numClusters, numFeatures), while expandedCentroids has dimensions (numClusters, 1, numFeatures). In this way, an implicit meshgrid is defined for the subsequent evaluation of the squared distances between the dataset points and the centroids; the dimensions of the distances tensor are thus (numClusters, numSamples). The indices of the cluster closest to each sample can be deduced by determining the position of the minima of distances along the columns using tf.argmin(distances, 0).

The updateCentroids routine implements the third step of the k-means approach, namely the update of the centroids:
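A sketch consistent with the following description is:

def updateCentroids(samples, nearestIndices):
    # tf.dynamic_partition requires integer partition indices
    nearestIndices = tf.cast(nearestIndices, tf.int32)
    # list of numClusters tensors, one per cluster
    partitions = tf.dynamic_partition(samples, nearestIndices, numClusters)
    # each new centroid is the mean of its partition along the columns
    newCentroids = tf.concat(
        [tf.reduce_mean(partition, 0, keepdims=True) for partition in partitions], 0)
    return newCentroids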

Specifically, after casting the nearestIndices to integers, these are used to partition the dataset into numClusters partitions. The result is a list of tensors stored in partitions. The new centroids are computed by reducing each tensor in the partitions list along the columns and then collecting the results by tf.concat.

The last relevant function of the customized approach is the plotClusters function:
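A possible implementation (the exact signature is an assumption) is:

def plotClusters(samplesNumpy, nearestIndices, centroids, numCentroids):
    nearestIndices = np.asarray(nearestIndices)
    centroids = np.asarray(centroids)
    # numCentroids "equally spaced" colors between 0 and 1 from the rainbow colormap
    colors = plt.cm.rainbow(np.linspace(0, 1, numCentroids))
    for i in range(numCentroids):
        clusterSamples = samplesNumpy[nearestIndices == i]
        plt.scatter(clusterSamples[:, 0], clusterSamples[:, 1], color=colors[i], s=10)
    # crosses mark the cluster centroids
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', color='k', s=80)
    plt.show()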

This function exploits the rainbow colormap by fetching numCentroids “equally spaced” colors between 0 and 1. Each color is then used to plot a different cluster. A possible example of a result is shown in the following figure:

k-means result example. Dots are cluster points, crosses are cluster centroids.

A possible trend of the percentage variation of the centroids with respect to the previous step is:

k-means result example. Percentage variation of the centroids with respect to the previous step.

This completes the description of the customized k-means approach. Let’s now look at the native one in TensorFlow.

Native TensorFlow k-means

In the native TensorFlow case, the first step is to define a clustering estimator such as
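In TensorFlow 2.x, this estimator is available through the compat module; a minimal definition is:

kmeans = tf.compat.v1.estimator.experimental.KMeans(
    num_clusters=numClusters, use_mini_batch=False)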

In the example shown, mini-batching is disabled via use_mini_batch=False.

The next step is to define a train loop of numIter iterations as
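A minimal sketch is:

for _ in range(numIter):
    kmeans.train(input_fn)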

At each iteration, the actual training is performed by kmeans.train, which requires as input a function, input_fn here, returning the dataset on which to operate. The input_fn function is defined, in this case, as follows:
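A minimal version, following the standard usage of this estimator, is:

def input_fn():
    # deliver the whole dataset, cast to single precision, num_epochs=1 times
    return tf.compat.v1.train.limit_epochs(
        tf.convert_to_tensor(samplesNumpy, dtype=tf.float32), num_epochs=1)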

The function returns the entire dataset, cast to a single-precision tensor, a number num_epochs=1 of times. After the training phase, the cluster centers are updated as:
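clusterCenters = kmeans.cluster_centers()

This retrieval can also be placed inside the training loop, so that the centers are monitored iteration after iteration.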

Once the training loop is finished, a list of labels is created, corresponding to the clusters to which each element of the dataset belongs, by:
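The estimator exposes the predict_cluster_index method for this purpose:

clusterIndices = list(kmeans.predict_cluster_index(input_fn))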

Finally, the cluster center corresponding to each dataset element is listed, and the results are plotted by:
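A sketch of this final step is:

for i, point in enumerate(samplesNumpy):
    clusterIndex = clusterIndices[i]
    center = clusterCenters[clusterIndex]
    print('point:', point, 'is in cluster', clusterIndex, 'centered at', center)

plt.scatter(samplesNumpy[:, 0], samplesNumpy[:, 1],
            c=np.array(clusterIndices), s=10, cmap='rainbow')
plt.scatter(clusterCenters[:, 0], clusterCenters[:, 1], marker='x', color='k', s=80)
plt.show()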

References

[1] V. Jain, Y. Sharma, A. Bhatia, V. Arora, “Crime prediction using k-means algorithm,” Global Res. Dev. J. for Eng., vol. 2, no. 5, pp. 206–209, Apr. 2017.

[2] P. Pareek, “All about ‘k-means’ clustering,” Medium.
