Support Vector Machines with TensorFlow

Vitality Learning
Oct 8, 2022 · 7 min read

The Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression purposes. It is popular in applications such as natural language processing, speech and image recognition and computer vision.

The SVM algorithm is remarkably effective in binary classification problems, although it can also be used for multiclass classification. For this reason, in this section we will see the SVM algorithm at work on a binary classification problem, mainly discussing how such an algorithm works.

Support Vector Machine: Theory

For a binary classification problem, and in the simplest case, the SVM is based on the idea of determining a hyperplane that divides the dataset into two classes, as shown in the following figure:

Linear separation of dataset points.

To understand the working principle, the figure below illustrates hyperplanes in the cases of only two (left) and three (right) features. In the case of two features, a hyperplane is represented by a line, while, in the case of three features, it is represented by a plane.

Hyperplanes.

A hyperplane that linearly separates a dataset may not exist. In this case, it becomes necessary to use a non-linear mapping to embed the training dataset in a space of higher dimension. For example, it may become necessary to pass from two to three dimensions to make the data linearly separable. In the following, we will consider the simple case in which the data can already be separated linearly.

The SVM algorithm aims at determining the hyperplane that best divides the dataset into classes. “Best” means that, if more than one separating hyperplane exists, the algorithm looks for the one with the largest margin with respect to the support vectors, in order to improve classification accuracy on new observations. By support vectors we mean the points of the dataset closest to the dividing hyperplane, see figure below. The margin, instead, is the distance between the support vectors of the two different classes closest to the hyperplane, as shown in the following figure:

Margin between support vectors.

The hyperplane cuts this distance in half. The maximization of the margin is related to the idea that the farther the dataset points are from the hyperplane, the more likely it is that they have been correctly classified. New test data are then classified according to the region in which they fall with respect to the determined hyperplane.

Going a little further into the mathematics of the problem, the equation of a generic hyperplane is

w · x + w0 = 0

where x is the independent variable in the feature space, w is the weight vector and w0 is the bias.

Referring to the above figure, we assume that the red points belong to class “1”, while the green points belong to class “-1”. The figure shows a case in which a hyperplane separating the two classes has already been determined. In this case, one can choose the weights in such a way that, for the points of class “1”, we have

w · xn + w0 ≥ 1 (Constraint SVM #1)

while, for the points of the class “-1”, we have

w · xn + w0 ≤ -1 (Constraint SVM #2)

where the xn are the dataset points. It can be shown that the distance between the two hyperplanes

w · x + w0 = 1

and

w · x + w0 = -1

is precisely

2 / ||w||.

This means that, by minimizing the weight vector norm, we obtain the optimal hyperplane, that is, the one with maximum margin.

However, minimizing

||w||

will not suffice to obtain the desired hyperplane: the minimization must be carried out taking into account the two SVM constraints above. These two constraints can be rewritten as a single inequality,

yn (w · xn + w0) ≥ 1 (Constraint SVM)

where yn = 1 if the n-th element belongs to class “1”, otherwise yn = -1. It is possible to show that the problem of minimizing ||w|| with the above constraint can be addressed by minimizing the functional:

Φ(w, w0) = (1/N) Σn max(0, 1 − yn (w · xn + w0)) + α ||w||² (SVM functional)

where N is the number of training samples, the max(0, ·) term penalizes violations of the constraint, and α is a parameter that balances the need to maximize the margin with the need to satisfy the above constraint.

Once we have optimized Φ and determined the optimal values wopt and w0opt, it is possible to determine the separating hyperplane. For a two-dimensional problem, i.e., if w has only two components (which is the case considered in the example of the following Subsections), the mentioned hyperplane becomes a straight line. Taking into account that w represents the normal to the hyperplane, its equation becomes

x2 = m x1 + q (Separation line)

where

m = -w1 / w2 (Angular coefficient)

and

q = -w0 / w2 (Intercept)

with w1 and w2 the two components of the optimal weight vector wopt and w0 the optimal bias w0opt.

Now that we have identified the functional to be optimized, we show how the SVM algorithm can be implemented using TensorFlow. In particular, we will use the so-called Iris dataset. A few words about this dataset are now in order.

The Iris dataset

The Iris dataset is a multivariate dataset introduced by Ronald Fisher in 1936. It consists of 150 instances of 3 Iris species: Iris setosa, Iris virginica and Iris versicolor. The four considered variables are sepal length, sepal width, petal length and petal width. The classes of the dataset elements can be 0 in the case of Iris setosa, 1 in the case of Iris versicolor or 2 in the case of Iris virginica.

Let’s now look at the developed example.

Support Vector Machine: Practice

The objective of the example is to classify, using only the sepal length and petal width features, the Iris species as setosa or not-setosa.

Apart from standard imports, we have:
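A minimal sketch of the imports could be the following (matplotlib is assumed here for the final plots):

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn import datasets  # provides the Iris dataset used below
```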

We will indeed use the Iris dataset provided by the sklearn library.

The following parameters are moreover defined:
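A possible choice is sketched below; the numerical values are only indicative, while alpha, batchSize and numIter are the names used in the rest of the code:

```python
alpha     = 0.01   # weight of the norm term in the SVM functional
batchSize = 100    # number of training samples used at each iteration
numIter   = 500    # number of training iterations
```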

where alpha is the weight appearing in the SVM functional above, and the training will be performed on batches of batchSize elements for numIter iterations.

It is now time to load the dataset
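This can be done, for instance, with sklearn's loader:

```python
# the Iris dataset: 150 samples, 4 features, 3 classes
iris = datasets.load_iris()
```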

and to extract the two features of interest
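for example, by keeping columns 0 and 3 of iris.data (sepal length and petal width, respectively):

```python
# keep only sepal length (column 0) and petal width (column 3)
xDataset = np.array([[x[0], x[3]] for x in iris.data])
# label 1 for Iris setosa (target == 0), -1 otherwise
yDataset = np.array([1. if y == 0 else -1. for y in iris.target])
```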

In this way, xDataset will contain only the values of the sepal length and petal width, while yDataset will take the value 1 if the Iris belongs to the setosa class, and -1 otherwise.

A portion for training and one for performance verification must be extracted from the entire dataset. We decide to use 90% of the dataset for training and the remaining 10% for performance verification. To do this, we use
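numpy's random sampling without replacement, as in the following sketch (trainIndices is the name referenced below):

```python
# 90% of the indices, drawn without repetition
trainIndices = np.random.choice(len(xDataset), round(len(xDataset) * 0.9), replace=False)
```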

which generates a set of round(len(xDataset)*0.9) indices between 0 and len(xDataset) without repetition. These indices will be used to address the dataset samples employed for training. Next, we compute the remaining indices, to be used for testing, as
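a set difference, for example:

```python
# indices not used for training are kept for testing
testIndices = np.array(list(set(range(len(xDataset))) - set(trainIndices)))
```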

In the above instruction, set transforms the complete index list range(len(xDataset)) as well as the training index list trainIndices into sets of unordered elements, and the two sets are subtracted to obtain the test indices. Finally, features and labels for the training and testing datasets are extracted:
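This can be done, for instance, by fancy indexing:

```python
xTrain = xDataset[trainIndices]
yTrain = yDataset[trainIndices]
xTest  = xDataset[testIndices]
yTest  = yDataset[testIndices]
```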

The rest of the code is similar to what we have done previously with other approaches. In particular, an initializer and the variables that must contain the unknowns are defined:
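For example, assuming TensorFlow 2 eager execution and a normal random initializer:

```python
initializer = tf.random_normal_initializer()
# w collects the two weights and w0 the bias of the separating hyperplane
w  = tf.Variable(initializer(shape=(2, 1), dtype=tf.float32))
w0 = tf.Variable(initializer(shape=(1, 1), dtype=tf.float32))
```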

Next, the cost function is defined as
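shown in the following sketch, which assumes the hinge-loss formulation of the functional Φ introduced above:

```python
def costFunction(x, y, w, w0):
    # model output w . x + w0 for every sample of the batch
    modelOutput = tf.add(tf.matmul(x, w), w0)
    # hinge term penalizing violations of y_n (w . x_n + w0) >= 1
    hingeLoss = tf.reduce_mean(tf.maximum(0., 1. - y * modelOutput))
    # norm term promoting a large margin, weighted by alpha
    normTerm = tf.reduce_sum(tf.square(w))
    return hingeLoss + alpha * normTerm
```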

The costFunction function calculates the SVM cost functional.

The definitions of the optimizer and of the optimizationStep are similar to those listed in Linear Regression with TensorFlow and are not repeated here. Let’s just give some details on the training loop that we report below:
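A sketch of the loop is the following, where optimizationStep is assumed to apply one optimizer update to w and w0 and to return the current value of costFunction, mirroring the linear regression case, and where costHistory and accuracyHistory are illustrative names used to store the quantities plotted at the end:

```python
costHistory     = []
accuracyHistory = []
for iteration in range(numIter):
    # pick batchSize random training samples, with indices in [0, len(xTrain))
    indexBatch = np.random.choice(len(xTrain), size=batchSize)
    xBatch = tf.cast(xTrain[indexBatch], tf.float32)
    yBatch = tf.cast(yTrain[indexBatch].reshape(-1, 1), tf.float32)
    # one optimization step on the current mini-batch
    currentCost = optimizationStep(xBatch, yBatch, w, w0)
    costHistory.append(float(currentCost))
    # classification accuracy on the testing dataset at the current iteration
    accuracyHistory.append(predictionAccuracy(xTest, yTest, w, w0))
```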

As can be seen, the training is performed with a fixed number of iterations equal to numIter and, at each step, a subset of the training dataset is selected using the indexBatch indices. These indices are batchSize in number and are generated in the interval [0, len(xTrain)). Furthermore, at each step, the predictionAccuracy function is invoked:
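A possible implementation of this function is:

```python
def predictionAccuracy(x, y, w, w0):
    # evaluate the model w . x + w0 for every sample
    modelOutput = tf.add(tf.matmul(tf.cast(x, tf.float32), w), w0)
    # the predicted class is the sign of the model output
    prediction = tf.squeeze(tf.sign(modelOutput))
    # fraction of samples whose predicted sign equals the label
    correct = tf.equal(prediction, tf.cast(y, tf.float32))
    return float(tf.reduce_mean(tf.cast(correct, tf.float32)))
```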

This function first evaluates the model for each element of the testing dataset, that is, it computes

w · x + w0

and then takes its sign and compares it with the labels of the testing dataset, counting the number of times equality occurs. Using the tf.sign() function makes it possible to check whether Constraint SVM #1 or Constraint SVM #2 is met.

Once the training is finished, the angular coefficient and the intercept of the separation line are calculated by
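applying the expressions of m and q derived in the theory section, for example (the variable names slope and intercept are illustrative):

```python
# optimal weights and bias after training
[[w1], [w2]] = w.numpy()
[[b]]        = w0.numpy()
slope     = -w1 / w2   # angular coefficient m = -w1 / w2
intercept = -b  / w2   # intercept q = -w0 / w2
```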

and then the separation line is calculated for all the dataset points
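for instance as (again, the names are illustrative):

```python
# sepal length (first feature) of every dataset point
sepalLength = [d[0] for d in xDataset]
# corresponding points of the separation line x2 = m * x1 + q
separationLine = [slope * s + intercept for s in sepalLength]
```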

In order to plot the results, the data are separated into the two classes setosa and not-setosa
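e.g., by list comprehensions on the labels:

```python
setosaX    = [d[0] for i, d in enumerate(xDataset) if yDataset[i] ==  1]
setosaY    = [d[1] for i, d in enumerate(xDataset) if yDataset[i] ==  1]
notSetosaX = [d[0] for i, d in enumerate(xDataset) if yDataset[i] == -1]
notSetosaY = [d[1] for i, d in enumerate(xDataset) if yDataset[i] == -1]
```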

and the results are displayed by
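a possible matplotlib rendering, such as:

```python
plt.plot(setosaX, setosaY, 'o', label='Setosa')
plt.plot(notSetosaX, notSetosaY, 'x', label='Not setosa')
plt.plot(sepalLength, separationLine, 'r-', label='Separation line')
plt.xlabel('Sepal length')
plt.ylabel('Petal width')
plt.legend(loc='lower right')
plt.show()
```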

As can be seen from the following figure, the separation between setosa and not setosa species is satisfactory

Setosa/not-setosa SVM separation result.

Finally, the following figures show the trend, as the training iterations go on, of the cost functional and accuracy, respectively:

SVM cost functional against training step.
SVM accuracy against training step.

As can be seen, as training proceeds, the cost functional value decreases, while accuracy improves.
