| Support Vector Machine |
Article Index for Support |
Website Links For Support Vector |
Information AboutSupport Vector Machine |
| CATEGORIES ABOUT SUPPORT VECTOR MACHINE | |
| classification algorithms | |
| ensemble learning | |
| machine learning | |
| kernel methods for machine learning | |
|
Support vector machines (SVMs) are a set of related Supervised Learning methods used for Classification and Regression . They belong to a family of generalized Linear Classifiers . They can also be considered a special case of Tikhonov Regularization . A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as '''maximum margin classifiers'''. Support vector machines map input vectors to a higher dimensional space where a maximal separating Hyperplane is constructed. Two parallel hyperplanes are constructed on each side of the hyperplane that separates the data. The separating hyperplane is the hyperplane that maximizes the distance between the two parallel hyperplanes. An assumption is made that the larger the margin or distance between these parallel hyperplanes the better the generalisation error of the classifier will be. An excellent tutorial has been produced by C.J.C Burges Christopher J. C. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition". Data Mining and Knowledge Discovery 2:121 - 167, 1998 http://research.microsoft.com/~cburges/papers/SVMTutorial.pdf . A comparison of the SVM to other classifiers has been made by van der Walt and Barnard (C.M. van der Walt and E. Barnard,“Data characteristics that determine classifier performance”, Proc., Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, pp.160-165, 2006.) http://www.meraka.org.za/pubs/CvdWalt.pdf. MOTIVATION Often we are interested in classifying data as a part of a machine-learning process. Each data point will be represented by a -dimensional vector (a list of numbers). Each of these data points belongs to only one of two classes. We are interested in whether we can separate them with an "p minus 1" dimensional Hyperplane . This is a typical form of Linear Classifier . There are many linear classifiers that might satisfy this property. However, we are additionally interested in finding out if we can achieve maximum separation ( Margin ) between the two classes. By this we mean that we pick the hyperplane so that the distance from the hyperplane to the nearest data point is maximized. That is to say that the nearest distance between a point in one ''separated'' hyperplane and a point in the other ''separated'' hyperplane is maximized. Now, if such a hyperplane exists, it is clearly of interest and is known as the '' Maximum-margin Hyperplane '' and such a linear classifier is known as a maximum margin classifier. FORMALIZATION We consider data points of the form: : where the ''c''''i'' is either 1 or −1, a constant denoting the class to which the point belongs. Each is a -dimensional Real vector, usually of normalised ( Normalizing Constant ) or [-1,1 values. The scaling is important to guard against variables (attributes) with larger variance that might otherwise dominate the classification. We can view this as ''training data'', which denotes the correct classification which we would like the SVM to eventually distinguish, by means of the dividing (or separating) hyperplane, which takes the form : The vector points perpendicular to the separating hyperplane. Adding the offset parameter ''b'' allows us to increase the margin. In its absence, the hyperplane is forced to pass through the origin, restricting the solution. As we are interested in the maximum margin, we are interested in the support vectors and the parallel hyperplanes (to the optimal hyperplane) closest to these support vectors in either class. It can be shown that these parallel hyperplanes can be described by equations (by scaling w and b if not) : : | ||
|   | The Problem Now Is To Minimize '''''w''''' Subject To The Constraint (1) This Is A | "http://wwwinformationdelightinfo/information/entry/quadratic_programming" class="copylinks">Quadratic Programming (QP) Optimization problem More clearly, |
|   | This Constraint In (2) Along With The Objective Of Minimizing '''''w''''' Can Be Solved Using | "http://wwwinformationdelightinfo/information/entry/Lagrange_multipliers" class="copylinks">Lagrange Multipliers The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant ''C'' appearing only as an additional constraint on the Lagrange multipliers Non-linear penalty functions have been used, particularly to reduce the effect of outliers on the classifier, but unless care is taken, the problem becomes non-convex, and thus it is considerably more difficult to find a global solution |
|   | Radial Basis Function: <math>k(\mathbf{x},\mathbf{x}') | \exp(-\gamma \\mathbf{x} - \mathbf{x'}\^2)</math>, for <math>\gamma > 0</math> |
|   | Gaussian Radial Basis Function: <math>k(\mathbf{x},\mathbf{x}') | \exp\left(- rac{\\mathbf{x} - \mathbf{x'}\^2}{2 \sigma^2}
ight)</math> |
|
|