## linear classifier

Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form

$$\underset{\mathbf{w}}{\arg\min}\; R(\mathbf{w}) + C\sum_{i=1}^{N} L(y_i, \mathbf{w}^{\mathsf{T}}\mathbf{x}_i)$$

where $\mathbf{w}$ is the vector of classifier parameters, $L$ is the loss function, $R$ is a regularization function, and $C$ is a scalar constant that controls the trade-off between the two.

Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function $R$ is convex, then the above is a convex problem. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent, and Newton methods.
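As a concrete toy illustration of this optimization problem, the NumPy sketch below minimizes an L2-regularized hinge loss with plain stochastic (sub)gradient descent on synthetic data; the dataset, step size, and regularization strength are all illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.zeros(2)
b = 0.0
lam = 0.01   # strength of the L2 regularizer R(w) = lam/2 * ||w||^2
lr = 0.1     # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):
        margin = y[i] * (X[i] @ w + b)
        # Subgradient of hinge loss max(0, 1 - margin) plus the L2 penalty.
        if margin < 1:
            w -= lr * (lam * w - y[i] * X[i])
            b -= lr * (-y[i])
        else:
            w -= lr * lam * w

accuracy = np.mean(np.sign(X @ w + b) == y)
```

Because the hinge loss is not differentiable at margin = 1, the update uses a subgradient, which is exactly what "(stochastic) gradient descent" amounts to for linear SVMs.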

## easy tensorflow - linear classifier

In this tutorial, we'll create a simple linear classifier in TensorFlow. We will implement this model for classifying images of hand-written digits from the so-called MNIST data-set. The structure of the network is presented in the following figure.

Technically, in a linear model we will use the simplest function to predict the label $\mathbf{y_i}$ of the image $\mathbf{x_i}$. We'll do so by using a linear mapping like $f(\mathbf{x_i}, \mathbf{W}, \mathbf{b})=\mathbf{W}\mathbf{x_i}+\mathbf{b}$ where $\mathbf{W}$ and $\mathbf{b}$ are called weight matrix and bias vector respectively.

For this tutorial we use the MNIST dataset, a dataset of handwritten digits. If you are into machine learning, you might have heard of this dataset by now. MNIST is something of a benchmark dataset for deep learning and is easily accessible through TensorFlow.

The dataset contains $55,000$ examples for training, $5,000$ examples for validation and $10,000$ examples for testing. The digits have been size-normalized and centered in a fixed-size image ($28\times28$ pixels) with values from $0$ to $1$. For simplicity, each image has been flattened and converted to a 1-D numpy array of $784$ features ($28\times28$).

Here, we specify the dimensions of the images, which will be used in several places in the code below. Defining these variables makes it easier (compared with using hard-coded numbers throughout the code) to modify them later. Ideally these would be inferred from the data that has been read, but here we will just write the numbers.
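For instance (the variable names img_size_flat and num_classes follow the tutorial's naming convention; the values are just the MNIST numbers quoted above):

```python
img_h = img_w = 28              # MNIST images are 28 x 28 pixels
img_size_flat = img_h * img_w   # 784 features once each image is flattened
num_classes = 10                # one class per digit 0..9
```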

In this section, we'll write the function which automatically loads the MNIST data and returns it in our desired shape and format. If you want to learn more about loading your data, you may read our How to Load Your Data in TensorFlow tutorial, which explains all the available methods to load your own data, no matter how big it is.

Here, we'll simply write a function (load_data) which has two modes: train (which loads the training and validation images and their corresponding labels) and test (which loads the test images and their corresponding labels). You can replace this function to use your own dataset.

randomize: randomizes the order of images and their labels. This is important to make sure that the input images are presented in a completely random order. Moreover, at the beginning of each epoch, we re-randomize the order of the data samples to make sure that the trained model is not sensitive to the order of the data.
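A minimal version of such a randomize helper might look like this (a sketch; the tutorial's own implementation may differ in details):

```python
import numpy as np

def randomize(x, y):
    """Shuffle images and labels with one shared permutation,
    so each image keeps its own label."""
    permutation = np.random.permutation(y.shape[0])
    return x[permutation], y[permutation]
```

Using a single permutation for both arrays is the key point: shuffling them independently would destroy the image-label pairing.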

As you can see, the x_train and x_valid arrays contain $55,000$ and $5,000$ flattened images (each of size $28\times28=784$ values). y_train and y_valid contain the corresponding labels of the images in the training and validation sets respectively.

Based on the dimensions of the arrays, for each image, we have 10 values as its label. Why? This technique is called One-Hot Encoding. This means the labels have been converted from a single number to a vector whose length equals the number of possible classes. All elements of the vector are zero except for the $i^{th}$ element, which is one, indicating that the class is $i$. For example, the One-Hot encoded labels for the first 5 images in the validation set are:
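One-hot encoding itself is a one-liner in NumPy; this hypothetical helper shows the idea (label 3 with 10 classes becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer class labels into one-hot encoded rows."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded
```

np.argmax recovers the original integer label from each one-hot row.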

Here, we have about $55,000$ images in our training set. It takes a long time to calculate the gradient of the model using all these images. We therefore use Stochastic Gradient Descent which only uses a small batch of images in each iteration of the optimizer. Let's define some of the terms usually used in this context:
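These terms can be pinned down in code; batch_size and epochs below are illustrative values, not the tutorial's prescribed settings:

```python
num_train = 55000   # training examples in MNIST
batch_size = 100    # images used per optimizer step (illustrative)
epochs = 10         # full passes over the training set (illustrative)

# iteration: one optimizer step on a single batch
# epoch: one full pass over the whole training set
iterations_per_epoch = num_train // batch_size

def get_next_batch(x, y, start, end):
    """Slice one mini-batch out of the (already shuffled) data."""
    return x[start:end], y[start:end]
```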

As explained (and also illustrated in Fig. 1), we need to define two variables $\mathbf{W}$ and $\mathbf{b}$ to construct our linear model. These are generally called model parameters and, as explained in our Tensor Types tutorial, we use TensorFlow Variables of proper size and initialization to define them. The following functions are written to be later used for generating the weight and bias variables of the desired shape:
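In the TensorFlow code these helpers would wrap tf.Variable with a truncated-normal and a constant initializer. The NumPy stand-ins below only mimic the shapes and initialization (clipping is used as a rough approximation of truncation, which actually resamples), so the roles of the two functions are clear:

```python
import numpy as np

def weight_variable(shape, stddev=0.01):
    """Stand-in for tf.Variable(tf.truncated_normal(shape, stddev=0.01)):
    small random weights, clipped at two standard deviations."""
    w = np.random.normal(0.0, stddev, size=shape)
    return np.clip(w, -2 * stddev, 2 * stddev)

def bias_variable(shape):
    """Stand-in for a zero-initialized bias variable."""
    return np.zeros(shape)

W = weight_variable([784, 10])   # img_size_flat x num_classes
b = bias_variable([10])
```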

First we need to define the proper tensors to feed the input values into our model. As explained in the Tensor Types tutorial, a placeholder variable is the suitable choice for the input images and corresponding labels. This allows us to change the inputs (images and labels) to the TensorFlow graph.

Placeholder x is defined for the images; its data-type is set to float32 and the shape is set to [None, img_size_flat], where None means that the tensor may hold an arbitrary number of images with each image being a vector of length img_size_flat.

Next we have y which is the placeholder variable for the true labels associated with the images that were input in the placeholder variable x. The shape of this placeholder variable is [None, num_classes] which means it may hold an arbitrary number of labels and each label is a vector of length num_classes which is $10$ in this case.

After creating the proper input, we have to pass it to our model. Since we have a linear classifier, we will have output_logits $= \mathbf{W}\mathbf{x} + \mathbf{b}$, and we will use tf.nn.softmax to normalize the output_logits to values between $0$ and $1$ that can be interpreted as class probabilities (predictions) for the samples.
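What tf.nn.softmax computes can be mirrored in NumPy; a sketch (the stabilizing max-subtraction does not change the result, it only avoids overflow):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, stabilized by subtracting the row maximum."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# For a batch x of shape [batch, 784], with W and b as before:
#   output_logits = x @ W + b          # shape [batch, 10]
#   y_pred = softmax(output_logits)    # each row sums to 1
```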

After creating the network, we have to calculate the loss and optimize it. Also, to evaluate our model, we have to calculate the correct_prediction and accuracy. We will also define cls_prediction to visualize our results.

### 4.3. Define the loss function, optimizer, accuracy, and predicted class
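The quantities named in this section (the cross-entropy loss, correct_prediction, accuracy, and cls_prediction) can be sketched in NumPy as follows; a TensorFlow graph version would typically use tf.nn.softmax_cross_entropy_with_logits, tf.equal, and tf.argmax, so this is just the underlying arithmetic:

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

def evaluate(y_true, y_prob):
    """Predicted class index per sample, and accuracy against one-hot labels."""
    cls_prediction = np.argmax(y_prob, axis=1)
    correct_prediction = cls_prediction == np.argmax(y_true, axis=1)
    return cls_prediction, correct_prediction.mean()
```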

Another way to evaluate the model is to visualize the input and the model results and compare them with the true label of the input. This is advantageous in numerous ways. For example, even if you get a decent accuracy, when you plot the results, you might see that all the samples have been classified into one class. Another example: when you plot, you can get a rough idea of which examples your model failed on. Let's define the helper functions to plot some correct and misclassified examples.

## linear classifier - an overview | sciencedirect topics

where $\mathbf{w}$ is a vector of feature weights and $g$ is a monotonically increasing function. For example, in logistic regression, $g$ is the logit function, and in SVM, it is the sign function with label space $Y=\{-1,+1\}$. A plausible NCM (nonconformity measure) is based on the distance of an example $i$ from the separating hyperplane:

where the weight vector $\mathbf{w}$ has been computed using an inductive learner $l$, such as SVM. That is, $\mathbf{w}=l(z_1,\ldots,z_n)$. A version of this NCM was originally derived for logistic regression by [365]. This NCM then yields

During the first run of a linear classifier (i), the accuracy factor is determined as 0.775362318841, and later in a linear classifier (ii), the accuracy factor is determined as 0.847826086957. So, on average, the mean accuracy of the analysis for the SVM linear classifier is 0.811594202899, i.e., approximately 81% accuracy, whereas the accuracy for emotion prediction through CNN is approximately 0.89754. This accuracy could still be improved by increasing the number of hidden layers in the neural network, thus converging to a finer classification of emotion categories.

In some cases, instead of working directly with $P(\omega_i|\mathbf{x})$ it is more convenient to use $g_i(\mathbf{x}) = f(P(\omega_i|\mathbf{x}))$, where $f(\cdot)$ is a monotonically increasing function. The $g_i(\cdot)$'s are known as discriminant functions. In this case, the Bayes decision rule becomes: assign $\mathbf{x}$ to class $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$.

Monotonicity does not change the points where maxima occur, and the resulting partition of the feature space remains the same. The contiguous regions in the feature space, $R_i$, that correspond to different classes are separated by continuous surfaces, known as decision surfaces. If regions $R_i$ and $R_j$, associated with classes $\omega_i$ and $\omega_j$, respectively, are contiguous, then it is easy to see from the respective definitions that the corresponding decision surface is described by $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$.

So far, we have approached the classification problem via Bayesian probabilistic arguments, and the goal was to minimize the classification error probability. However, as we will soon see, not all problems are well suited for such an approach. For example, in many cases the involved pdfs are complicated and their estimation is not an easy task. In these cases it may be preferable to compute decision surfaces directly by means of alternative costs, and this will be our focus in subsequent sections. Such approaches give rise to discriminant functions and decision surfaces, which are entities with no (necessary) relation to Bayesian classification, and they are, in general, suboptimal with respect to Bayesian classifiers.

$i = 1,2,\ldots,M$, where $\boldsymbol{\mu}_i = E[\mathbf{x}|\mathbf{x}\in\omega_i]$ is the mean vector for the class $\omega_i$, $\Sigma_i = E[(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^T]$ is the $l\times l$ covariance matrix for class $\omega_i$, and $|\Sigma_i|$ is the determinant of $\Sigma_i$. It is clear that only the vectors $\mathbf{x}\in\omega_i$ contribute to $\Sigma_i$. For the special one-dimensional case, $l = 1$, the above becomes our familiar Gaussian pdf, i.e.,

where $\sigma^2$ is the variance around the mean $\mu$. Our objective is to design a Bayesian classifier taking into account that each $p(\mathbf{x}|\omega_i)$ is a normal distribution. Having in mind the Bayesian rule given in Eq. (3), we define the discriminant functions

Assuming that the covariance matrices for all classes are equal to each other, i.e., $\Sigma_i = \Sigma$, $i = 1,2,\ldots,M$, the terms $-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x}$ and $c_i$ are the same in all $g_i$'s for all classes, thus they can be omitted. In this case, $g_i(\mathbf{x})$ becomes a linear function of $\mathbf{x}$, and it can be rewritten as

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \tag{11}$$

where

$$\mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i \tag{12}$$

and

$$w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i). \tag{13}$$

In this case, the decision surfaces $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ are hyperplanes and the Bayesian classifier is a linear classifier.

Assume, in addition, that all classes are equiprobable, i.e., $P(\omega_i) = 1/M$, $i = 1,2,\ldots,M$. Then Eq. (9) becomes

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) \equiv -\frac{1}{2}d_m^2. \tag{14}$$

The quantity $d_m$ on the right-hand side of the above equation, without the minus sign, is called the Mahalanobis distance. Thus, in this case, instead of searching for the class with the maximum $g_i(\mathbf{x})$, we can equivalently search for the class for which the Mahalanobis distance between the respective mean vector and the input vector $\mathbf{x}$ is minimum.

In addition to the above assumptions, assume that $\Sigma = \sigma^2 I$, where $I$ is the $l\times l$ identity matrix. In this case, Eq. (14) becomes

$$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2}(\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i), \tag{15}$$

or, eliminating the factor $\frac{1}{2\sigma^2}$, which is common to all $g_i$'s, we obtain

$$g_i(\mathbf{x}) = -(\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i) \equiv -d^2. \tag{16}$$

Clearly, searching for the class with the maximum gi(x) is equivalent to searching for the class for which the Euclidean distance, d, between the respective mean vector and the input vector x becomes minimum.
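The minimum-distance rule just described can be sketched directly in Python (a toy illustration; the class means and test points below are made up):

```python
import numpy as np

def min_distance_classify(x, means, cov=None):
    """Assign x to the class whose mean vector is closest.

    cov=None  -> squared Euclidean distance (the Sigma = sigma^2 I case);
    cov given -> squared Mahalanobis distance with shared covariance Sigma."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    if cov is None:
        d2 = [float((x - m) @ (x - m)) for m in means]
    else:
        cov_inv = np.linalg.inv(cov)
        d2 = [float((x - m) @ cov_inv @ (x - m)) for m in means]
    return int(np.argmin(d2))
```

Maximizing $g_i(\mathbf{x})$ and minimizing the corresponding squared distance are equivalent, which is exactly what the argmin implements.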

In summary, in the case where each $p(\mathbf{x}|\omega_i)$ is a normal distribution, the $g_i(\mathbf{x})$'s are quadratic functions of $\mathbf{x}$ and the Bayesian classifier partitions the feature space via decision surfaces which are quadrics. If, in addition, all the classes have equal covariance matrices, the $g_i(\mathbf{x})$'s become linear functions of $\mathbf{x}$ and the decision surfaces are hyperplanes. In this latter case, the Bayesian classifier becomes very simple. A pattern (represented by $\mathbf{x}$) is assigned to the class whose mean vector $\boldsymbol{\mu}_i$ is closest to $\mathbf{x}$. The distance measure, in the feature space, can be either the Mahalanobis or the Euclidean distance, depending on the form of the covariance matrix.

Note that, given $y_n$, $\mathbf{x}_n$, and the margin parameter, (8.36) defines a halfspace (Example 8.1); this is the reason that we used "$\geq$" rather than a strict inequality. In other words, all $\boldsymbol{\theta}$'s which satisfy the desired inequality (8.36) lie in this halfspace. Since each pair $(y_n,\mathbf{x}_n)$, $n=1,2,\ldots,N$, defines a single halfspace, our goal now becomes that of trying to find a point at the intersection of all these halfspaces. This intersection is guaranteed to be nonempty if the classes are linearly separable. Fig. 8.20 illustrates the concept. The more realistic case of nonlinearly separable classes will be treated in Chapter 11, where a mapping to a high-dimensional (kernel) space makes the probability of two classes being linearly separable tend to 1 as the dimensionality of the kernel space goes to infinity.

whose graph is shown in Fig. 8.21. Thus, choosing the halfspace as the closed convex set to represent $(y_n,\mathbf{x}_n)$ is equivalent to selecting the zero level set of the hinge loss, adjusted for the point $(y_n,\mathbf{x}_n)$.

Remarks 8.4

- In addition to the two applications typical of the machine learning point of view, POCS has been applied in a number of other applications; see, for example, [7,24,87,89] for further reading.
- If the involved sets do not intersect, that is, $\bigcap_{k=1}^{K} C_k = \emptyset$, then it has been shown [25] that the parallel version of POCS in (8.27) converges to a point whose weighted squared distance from each one of the convex sets (defined as the distance of the point from its respective projection) is minimized.
- Attempts to generalize the theory to nonconvex sets have also been made (for example, [87] and more recently in the context of sparse modeling in [83]).
- When $C := \bigcap_{k=1}^{K} C_k \neq \emptyset$, we say that the problem is feasible and the intersection $C$ is known as the feasibility set. The closed convex sets $C_k$, $k=1,2,\ldots,K$, are sometimes called the property sets, for obvious reasons. In both previous examples, namely, regression and classification, we commented that the involved property sets resulted as the 0-level sets of a loss function $L$. Hence, assuming that the problem is feasible (the cases of bounded noise in regression and linearly separable classes in classification), any solution in the feasible set $C$ will also be a minimizer of the respective loss functions in (8.33) and (8.37). Thus, although optimization did not enter into our discussion, there can be an optimizing flavor in the POCS method. Moreover, note that in this case, the loss functions need not be differentiable, and the techniques we discussed in the previous chapters are not applicable. We will return to this issue in Section 8.10.


In the previous chapter we dealt with the design of linear classifiers described by linear discriminant functions (hyperplanes) g(x). In the simple two-class case, we saw that the perceptron algorithm computes the weights of the linear function g(x), provided that the classes are linearly separable. For nonlinearly separable classes, linear classifiers were optimally designed, for example, by minimizing the squared error. In this chapter we will deal with problems that are not linearly separable and for which the design of a linear classifier, even in an optimal way, does not lead to satisfactory performance. The design of nonlinear classifiers emerges now as an inescapable necessity.

We have already discussed that the VC dimension of a linear classifier in the l-dimensional space is l + 1. However, hyperplanes that are constrained to leave the maximum margin between the classes may have a smaller VC dimension.

That is, the capacity of the classifier can be controlled independently of the dimensionality of the feature space. This is very interesting indeed. It basically states that the capacity of a classifier may not, necessarily, be related to the number of unknown parameters! This is a more general result. To emphasize it further, note that it is possible to construct a classifier with only one free parameter, yet with infinite VC dimension; see, for example, [Burg 98]. Let us now consider a sequence of bounds

If the classes are separable, then the empirical error is zero. Minimizing the norm $\|\mathbf{w}\|$ is equivalent to minimizing the VC dimension (to be fair, the upper bound of the VC dimension). Thus, we can conclude that the design of an SVM classifier follows the spirit of the SRM (structural risk minimization) principle. Hence, keeping the VC dimension minimum suggests that we can expect support vector machines to exhibit good generalization performance. More on these issues can be found in [Vapn 98, Burg 98].

The essence of all formulas and discussion in this section is that the generalization performance and accuracy of a classifier depend heavily on two parameters: the VC dimension and the number of the available feature vectors used for the training. The VC dimension may or may not be related to the number of free parameters describing the classifier. For example, in the case of the perceptron linear classifier the VC dimension coincides with the number of free parameters. However, one can construct nonlinear classifiers whose VC dimension can be either lower or higher than the number of free parameters [Vapn 98, p. 159]. The design methodology of the SVM allows one to play with the VC dimension (by minimizing $\|\mathbf{w}\|$, Eq. (5.66)), leading to good generalization performance, although the design may be carried out in a high- (even infinite-) dimensional space.

Digging this fertile ground in a slightly different direction, using tools from the PAC theory of learning one can derive a number of distribution-free and dimension-free bounds. These bounds bring to the surface a key property underlying the SVM design: that of the maximum margin (SVMs are just one example of this larger family of classifiers, which are designed with an effort to maximize the margin the training points leave from the corresponding decision surface). (See also the discussion at the end of Chapter 4.) Although a more detailed treatment of this topic is beyond the scope of this book, we will provide two related bounds that reinforce this, at first surprising, property of the emancipation of the generalization performance from the feature space dimensionality.

Assume that all available feature vectors lie within a sphere of radius $R$ (i.e., $\|\mathbf{x}\| \leq R$). Let, also, the classifier be a linear one, normalized so that $\|\mathbf{w}\| = 1$, designed using $N$ randomly chosen training vectors. If the resulting classifier has a margin of $2\gamma$ (according to the margin definition in Section 3.7.1) and all training vectors lie outside the margin, the corresponding true error probability (generalization error) is no more than

where $c$ is a constant, and this bound holds true with probability at least $1-\delta$. Thus, by adjusting the margin to be maximum, as the SVM does, we improve the bound, and this can be carried out even in an infinite-dimensional space if the adopted kernel so dictates [Bart 99, Cris 00]. This result is logical. If the margin is large on a set of randomly chosen training points, this implies a classifier with large confidence, thus leading with high probability to good performance.

The bound given previously was derived under the assumption that all training points are correctly classified. Furthermore, the margin constraint implies that for all training points $y_i f(\mathbf{x}_i) \geq \gamma$, where $f(\mathbf{x})$ denotes the linear classifier (the decision is taken according to $\mathrm{sign}(f(\mathbf{x}))$). A very interesting related bound refers to the more realistic case, where some of the training points are misclassified. Let $k$ be the number of points with $y_i f(\mathbf{x}_i) < \gamma$. (The product $y f(\mathbf{x})$ is also known as the functional margin of the pair $(y, \mathbf{x})$ with respect to classifier $f(\mathbf{x})$.) Obviously, this also allows for negative values of the product. It can be shown that with probability at least $1-\delta$ the true error probability is upper bounded by ([Bart 99, Cris 00])

Another bound relates the error performance of the SVM classifier with the number of support vectors. It can be shown [Bart 99] that if N is the number of training vectors and Ns the number of support vectors, the corresponding true error probability is bounded by

where $e$ is the base of the natural logarithm and the bound holds true with a probability of at least $1-\delta$. Note that this bound is also independent of the dimension of the feature space where the design takes place. The bound increases with $N_s$, and this must make a user who has designed an SVM that results in a relatively large number (with respect to $N$) of support vectors cautious and suspicious about the performance of the resulting SVM classifier.

The previous three bounds indicate that the error performance is controlled by both $N_s$ and $\gamma$. In practice, one may end up, for example, with a large number of support vectors and at the same time with a large margin. In such a case, the error performance could be assessed, with high confidence, depending on which of the two bounds has the lower value.

In Section 4.6 we introduced the perceptron algorithm for learning a linear classifier. It turns out that the kernel trick can also be used to upgrade this algorithm to learn nonlinear decision boundaries. To see this, we first revisit the linear case. The perceptron algorithm repeatedly iterates through the training data instance by instance and updates the weight vector every time one of these instances is misclassified based on the weights learned so far. The weight vector is updated simply by adding or subtracting the instance's attribute values to or from it. This means that the final weight vector is just the sum of the instances that have been misclassified. The perceptron makes its predictions based on whether

Here, $a^{(j)}$ is the $j$th misclassified training instance, $a_i^{(j)}$ its $i$th attribute value, and $y^{(j)}$ its class value (either $+1$ or $-1$). To implement this we no longer keep track of an explicit weight vector: we simply store the instances that have been misclassified so far and use the above expression to make a prediction.

It looks like we have gained nothing; in fact, the algorithm is much slower because it iterates through all misclassified training instances every time a prediction is made. However, closer inspection of this formula reveals that it can be expressed in terms of dot products between instances. First, swap the summation signs to yield

This rings a bell! A similar expression for support vector machines enabled the use of kernels. Indeed, we can apply exactly the same trick here and use a kernel function instead of the dot product. Writing this function as $K(\cdot,\cdot)$ gives

In this way the perceptron algorithm can learn a nonlinear classifier simply by keeping track of the instances that have been misclassified during the training process and using this expression to form each prediction.
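A compact NumPy sketch of this idea, trained on the classic XOR problem with an RBF kernel (the dataset, kernel choice, and function names are illustrative, not from the text):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: an implicit dot product in a high-dim space."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def kernel_perceptron_train(X, y, kernel, epochs=20):
    """Store the misclassified instances; predict via a kernelized sum."""
    mistakes = []                          # pairs (instance, label)
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            score = sum(yj * kernel(xj, xi) for xj, yj in mistakes)
            if np.sign(score) != yi:       # sign(0) counts as a mistake
                mistakes.append((xi, yi))
                errors += 1
        if errors == 0:                    # converged on the training set
            break
    return mistakes

def kernel_perceptron_predict(mistakes, kernel, x):
    score = sum(yj * kernel(xj, x) for xj, yj in mistakes)
    return 1 if score > 0 else -1

# XOR: not linearly separable in the input space, but learnable with a kernel.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])
model = kernel_perceptron_train(X, y, rbf_kernel)
```

No linear classifier in the input plane can separate these four points, yet the kernelized sum over stored mistakes fits them after a couple of passes.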

If a separating hyperplane exists in the high-dimensional space implicitly created by the kernel function, this algorithm will learn one. However, it won't learn the maximum-margin hyperplane found by a support vector machine classifier. This means that classification performance is usually worse. On the plus side, the algorithm is easy to implement and supports incremental learning.

This classifier is called the kernel perceptron. It turns out that all sorts of algorithms for learning linear models can be upgraded by applying the kernel trick in a similar fashion. For example, logistic regression can be turned into kernel logistic regression. As we saw above, the same applies to regression problems: linear regression can also be upgraded using kernels. Again, a drawback of these advanced methods for linear and logistic regression (if they are done in a straightforward manner) is that the solution is not sparse: every training instance contributes to the solution vector. In support vector machines and the kernel perceptron, only some of the training instances affect the solution, and this can make a big difference to computational efficiency.

The solution vector found by the perceptron algorithm depends greatly on the order in which the instances are encountered. One way to make the algorithm more stable is to use all the weight vectors encountered during learning, not just the final one, letting them vote on a prediction. Each weight vector contributes a certain number of votes. Intuitively, the correctness of a weight vector can be measured roughly as the number of successive trials after its inception in which it correctly classified subsequent instances and thus didn't have to be changed. This measure can be used as the number of votes given to the weight vector, giving an algorithm known as the voted perceptron that performs almost as well as a support vector machine. (Note that, as mentioned earlier, the various weight vectors in the voted perceptron don't need to be stored explicitly, and the kernel trick can be applied here too.)

Complex real-world problems have nonlinear structure, thus making linear classifiers inappropriate for use. SVMs can be easily transformed into nonlinear learners (Boser et al., 1992). This is done by mapping the attribute vectors $\mathbf{x}_i$ into a high-dimensional feature space $X$ using a nonlinear mapping $\Phi(\mathbf{x}_i)$. The maximum-margin linear classification rule is learned in the feature space $X$. Figure 3a shows a training set that is not linearly separable in $(x_1,x_2)$. Figure 3b shows the same problem after the nonlinear transformation. It should be noted that although such a mapping $\Phi(\mathbf{x})$ is inefficient to compute, using the special property of SVMs as defined in Boser et al. (1992), it is sufficient to compute dot products in the feature space, i.e., $\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)$, during training and testing. Such dot products can be computed efficiently using kernel functions $K(\mathbf{x}_1,\mathbf{x}_2)$. It can be written as

The top layer in CNN architectures for image classification is traditionally a softmax linear classifier, which produces outputs with a probabilistic meaning. These outputs can then be used to compute the cross-entropy loss with respect to the ground truth and backpropagate the gradients through the CNN. However, the former approach assumes a probabilistic nature for the ground truth as well, i.e. each ground truth vector represents the probability distribution of a sample over all the classes in the dataset. Such an assumption does not hold for the proposed approach, where the CNN maps every input image into a position vector in the output space. For this reason, two modifications to the traditional classification pipeline need to be done: (1) the softmax classifier is replaced by a projection matrix which maps the visual features extracted by the CNN into the output embedding, and (2) a loss function different from the cross-entropy is used to train the network. We have proposed three extensions to two different loss functions previously used to learn these mappings from the input to the output space:

where $f(\text{image})$ is a column vector obtained at the output of the CNN for the given image, $\mathbf{y}_{\text{label}}$ is the column vector embedding of the class label in the output space, $M=[\mathbf{y}_0,\ldots,\mathbf{y}_{N-1}]^T$, and $N$ is the number of classes in the training set. All vectors were constrained to have unit norm, so that the inner product between $M$ and a vector in the output space results in an $N$-dimensional vector containing the similarity of the latter with respect to each class in the ground truth.

Hinge rank loss [10]. This ranking loss aims to minimize the distance between the output of the CNN and the target vector while isolating the former from all the other vectors, thus penalizing all errors equally. It is defined in Eq. (16.6),

Weighted hinge rank loss. We argue that the hinge rank loss does not completely suit our problem, as some mistakes should have a larger penalization than others, e.g. mistaking a happy boy for a happy child is an acceptable error, but mistaking it for a tropical house should have a large associated cost. The weighted hinge rank loss scales the loss associated to each pair depending on the prior information given by the embedding, as defined in Eq. (16.7),

Variable margin hinge rank loss. We extend the weighted hinge rank loss by imposing more strict conditions for dissimilar classes, while relaxing the margin for similar ones, as described in Eq. (16.8),