

IEEE Trans Syst Man Cybern B Cybern. Author manuscript; available in PMC 2015 Jun 15.

PMCID: PMC4467789

NIHMSID: NIHMS695433

Fast Support Vector Machines for Continuous Data

Kurt A. Kramer

Computer Science and Engineering, ENB118, 4202 E. Fowler Ave., University of South Florida, Tampa, FL 33620

Lawrence O. Hall

Computer Science and Engineering, ENB118, 4202 E. Fowler Ave., University of South Florida, Tampa, FL 33620

Dmitry B. Goldgof

Computer Science and Engineering, ENB118, 4202 E. Fowler Ave., University of South Florida, Tampa, FL 33620

Andrew Remsen

College of Marine Science, University of South Florida, St. Petersburg, FL

Abstract

Support vector machines can be trained to be very accurate classifiers and have been used in many applications. However, the training time, and to a lesser extent the prediction time, of support vector machines on very large data sets can be very long. This paper presents a fast compression method to scale up support vector machines to large data sets. A simple bit reduction method is applied to reduce the cardinality of the data by weighting representative examples. We then develop support vector machines trained on the weighted data. Experiments indicate that the bit reduction support vector machine produces a significant reduction in the time required for both training and prediction with minimal loss in accuracy. It is also shown to typically be more accurate than random sampling when the data are not over-compressed.

1 Introduction

Support vector machines (SVMs) achieve high accuracy in many application domains, such as character [35] and face recognition [40], as well as our work [23][22] in recognizing underwater zooplankton. However, scaling up SVMs to very large data sets is still an open problem. Training an SVM requires solving a constrained quadratic programming problem, which usually takes O(m^3) computations, where m is the number of examples. Predicting a new example involves O(s_v) computations, where s_v is the number of support vectors and is usually proportional to m. As a consequence, an SVM's training time, and to a lesser extent its prediction time, on a very large data set can be quite long, making it impractical for some real-world applications. In plankton recognition, retraining is often required as new plankton images are labeled by marine scientists and added to the training library on the ship. As we acquire a large number of plankton images, training an SVM with all labeled images becomes extremely slow.

In this paper, we propose a simple strategy to speed up the training and prediction procedures for an SVM: bit reduction. Bit reduction reduces the resolution of the input data and groups similar data into one bin. A weight is assigned to each bin according to the number of examples from a particular class in it, and a weighted example is created. This data reduction and aggregation step is very fast and scales linearly with respect to the number of examples. Then an SVM is built on a set of weighted examples which are the exemplars of their respective bins. Optimal compression parameters need only be computed once and can be re-used when data arrives incrementally, as in plankton recognition, online learning, image labeling, or any other domain in which data is labeled incrementally.

This approach allows us to investigate the use of SVMs on very large data sets where it has previously been impractical. We know that for smaller data sets a support vector machine is often significantly more accurate than a decision tree or naive Bayes classifier, both of which scale much better.

This approach can be applied to any learning algorithm which takes weighted examples, though its utility for algorithms that do not rely on support points is unclear. We are only concerned with support vector machines because they are much more time intensive, even on much smaller data sets, than other learning algorithms such as decision trees. In fact, decision trees train fast enough that they would not need data compression for most of the data sets in our experiments. Our experiments indicate that bit reduction SVM (BRSVM) significantly reduces the training time and prediction time with a minimal loss in accuracy. It results in more accurate classifiers than random sampling on most data sets when the data are not over-compressed.

2 Previous work

There are two main approaches to speeding up the training of SVMs. One approach is to find a fast algorithm to solve the quadratic programming (QP) problem for an SVM. "Chunking", introduced in [42], solves a QP problem on a subset of data. Chunking only keeps the support vectors chosen on the subset and replaces the others with data that violate the Karush-Kuhn-Tucker (KKT) conditions. Using an idea similar to chunking, decomposition [18] puts a subset of data into a "working set", and solves the QP problem by optimizing the coefficients of the data in the working set while keeping the other coefficients unchanged. In this way, a large QP problem is decomposed into a series of small QP problems, thus making it possible to train an SVM on large scale problems. Sequential minimal optimization (SMO) [33] and its enhanced versions [19][12] take decomposition to the extreme: each working set has only two examples, and their optimal coefficients can be solved for analytically. SMO is easy to implement, does not need any third-party QP solvers, and is widely used to train SVMs. Another way of solving large scale QP problems [16][44] is to use a low-rank matrix to approximate the Gram matrix of an SVM. As a consequence, the QP optimization on the small matrix requires significantly less time than on the whole Gram matrix. In [41] a Core Vector Machine that implements an approximation algorithm using minimum enclosing balls was proposed. In [2], a multiclass core vector machine was introduced which scales better with data set size and many classes. In [25] a Lagrangian support vector machine was introduced which produced speed-ups in training with comparable test set accuracy.

In [39], a priori knowledge is incorporated in the training of a least-squares support vector machine. Our approach does not use knowledge about the training set, but rather a more general data compression approach.

The other main approach to speeding up SVM training comes from the idea of "data squashing", which was proposed in [13] as a general method to scale up data mining algorithms. Data squashing divides massive data into a limited number of bins. The statistics of the examples from each bin are computed. A model is fit by only using the statistics instead of all examples within a bin. The reduced training set results in significantly less training time. Several clustering algorithms [46][38][3] were used to partition data and build an SVM based on the statistics from each cluster. In [3], the SVM model built on the reduced set was used to predict on all of the training data. Examples falling in the margin or being misclassified were taken from their original clusters and added back into the training data for retraining. However, both [46] and [38] assumed a linear kernel which might not generalize well to other kernels. In [3], two experiments were done with a linear kernel and only one experiment used a third-order polynomial kernel. Moreover, it is not unusual for many examples to fall into the margin of an SVM model especially for an RBF kernel. In such cases, retraining with all examples within the margin is computationally expensive.

In [20] a method of compressing data for SVMs was introduced. It involves a time-consuming clustering approach. On one small data set they found higher accuracy at the cost of longer training times, but shorter testing times. Given that it increases the training time, it does not appear to be scalable for large data sets. In [1] the authors used clustering to reduce the size of the training set. On just two data sets they showed comparable accuracy to using all the data with speedups in training of 38.7 times and 15.6 times. It is not clear that the pre-processing clustering times were included in determining speedups.

Most work [4][5][28][36] on enabling fast prediction with SVMs has focused on the problem of reducing the number of support vectors (SVs) obtained. Since the prediction time of an SVM depends on the number of support vectors, a search was done for a reduced set of vectors which can approximate the decision boundary. The prediction using the reduced set was faster than using all support vectors. However, reduced set methods involve searching for a set of pre-images [36][37], which is a set of constructed examples used to approximate the solution of a SVM. It should be noted that the searching procedure is computationally expensive.

In [17], two separate decision boundaries were created, one for each class, rather than a single one. For each decision boundary only the examples for its respective class were used, thus reducing by about half the number of examples used for finding the decision boundary. The solvers are of polynomial complexity, so the training time for each solver individually is reduced by more than 50%. The experiments performed in the paper used relatively small data sets, the largest being 1473 examples.

Data squashing approaches seem promising and can be combined with fast QP approaches like SMO for fast training and prediction. However, most work [38][3][46] in data squashing + SVM requires clustering the data and/or linear kernels [38][46][30]. Clustering usually needs O(m^2) computations, and high-order kernels, like the RBF kernel, are widely used and essential to many successful applications. Therefore, a fast squashing method and experiments on high-order kernels are necessary to apply data squashing + SVMs to real-world applications. In this paper, we propose a simple and fast data compression method: bit-reduction SVM (BRSVM). It does not require any computationally expensive clustering algorithms and works well with RBF kernels, as shown in our experiments. This work is an expansion of that in [21], with more data sets used, significantly more detailed experimental analysis, and a guide to parameter setting.

3 Bit reduction SVM

Bit reduction SVM (BRSVM) works by reducing the resolution of examples and representing similar examples as a single weighted example. In this way, the data size is reduced and training time is saved. It is simple and much faster than clustering, for instance. Another even simpler data reduction method is random sampling. Random sampling subsamples data without replacement. Compared to weighted examples, random sampling suffers from high variance of estimation in theory [8]. In spite of its high variance, random sampling has been shown to work very well in experiments [29][38]: It was as accurate as or slightly less accurate than more complicated methods.

3.1 Bit reduction

Bit reduction is a technique to reduce the data resolution. It was used to build a bit reduction fuzzy c-means (BRFCM) method [15], which applied bit reduction to speed up the fuzzy c-means (FCM) clustering algorithm. However, classification problems differ in that only examples from the same class should be aggregated together.

There are three steps involved in bit reduction for an SVM: normalization, bit reduction and aggregation.

  1. Normalization is used to ensure equal resolution for each feature. To avoid losing too much information during quantization, an integer is used to represent each normalized feature value. The integer I(υ) for a floating point value υ is constructed as follows:

    I(υ) = int(υ × Z)

    where Z is a domain dependent number used to scale υ and the function int(k) returns the integer part of k. In this way, the true value of υ is kept and only I(υ) is used in bit reduction. In our experiments, we used Z = 1000. It is important to note that, prior to the bit reduction process described here, all feature data goes through a z-normalization so that each feature has a mean centered on zero and unit variance. Unless different features are intended to have different levels of significance, it is important that they all be treated the same.

  2. Bit reduction is performed on the integer I(υ). Given b, the number of bits to be reduced, I(υ) is right-shifted and its precision is reduced. We slightly abuse notation here by letting the I(υ) on the right hand side of Eq. (1) be the I(υ) before bit reduction and the I(υ) on the left hand side be the I(υ) after bit reduction:

    I(υ) = I(υ) ≫ b                                                   (1)

    where k ≫ b shifts the integer k to the right by b bits. Given an r-dimensional example x_i = (x_i1, x_i2, …, x_ir), its integer expression after bit reduction is (I(x_i1), I(x_i2), …, I(x_ir)).

  3. The aggregation step groups the examples from the same class whose integer expressions fall into the same bin. For each class, its mean within the bin is calculated separately from only the examples with that class label. The weight of the representative mean equals the number of examples from that class. During the mean computation, the real values (x i1, x i2, …, xir ) are used.

Note the bit reduction procedure reduces data precision. A very large b results in too many examples falling in the same bin. The mean statistic is not enough to capture the location information of many examples. A small b does not provide enough data reduction, thus still leaving training slow. The best number of bits reduced (b) varies for different data sets. It can be found by search on a small subset of data as described later. The optimal number b for bit reduction will be used for retraining on the same type of data.

During bit reduction, it is very likely that some bins have examples from many different classes. Therefore, in the aggregation step, the mean statistic of examples in a bin was computed individually for each class based only on the examples in that class. This alleviates the side effect of grouping examples from different classes into the same bin. As a result, one bin may contain weighted examples for multiple classes.

Table 1 describes the bit reduction procedure for four 1-d examples with class label yi .

Table 1

A 1-d example of bit reduction in BRSVM

i Example (xi, yi) I(xi) and its bit expression (Z = 1000) I(xi) after 2-bit reduction

1 (0.008, 1) 8 (1000) 2 (10)
2 (0.009, 1) 9 (1001) 2 (10)
3 (0.010, 2) 10 (1010) 2 (10)
4 (0.011, 2) 11 (1011) 2 (10)

The four examples from two classes are first scaled to integer values by using Z = 1000. Then 2-bit reduction is performed by right shifting its integer expression by 2 bits. All four examples end up having the same value, which means all four examples fall into one bin after a 2-bit reduction. Table 2 shows the weighted examples, one for each class, after the aggregation step.

Table 2

Weighted examples after the aggregation step.

i New examples (xi , yi ) Weight

1 (0.0085, 1) 2
2 (0.0105, 2) 2

Since all four examples are in the same bin, we aggregate them by class and compute their mean for each class using the original values xi . The weight is computed by simply counting the number of examples from the same class.
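The quantize-shift-aggregate procedure above can be sketched in a few lines of Python. This is a minimal 1-d illustration with names of our own choosing; the z-normalization step is omitted so that the numbers match Tables 1 and 2:

```python
from collections import defaultdict

Z = 1000  # scaling constant, as in the paper's worked example

def bit_reduce(examples, b):
    """Group (x, y) pairs whose bit-reduced integer codes and class labels
    match, and return one weighted mean example per (bin, class)."""
    bins = defaultdict(list)  # (reduced integer code, class) -> original values
    for x, y in examples:
        code = int(x * Z) >> b        # quantize, then drop b low-order bits
        bins[(code, y)].append(x)
    # One representative per (bin, class): (mean of originals, class, weight)
    return [(sum(v) / len(v), y, len(v)) for (code, y), v in sorted(bins.items())]

# The four 1-d examples from Table 1, with 2-bit reduction
data = [(0.008, 1), (0.009, 1), (0.010, 2), (0.011, 2)]
reduced = bit_reduce(data, b=2)
print(reduced)  # two weighted examples, one per class, as in Table 2
```

Because the bin means are computed from the original real values, no precision beyond the bin assignment itself is lost in the representatives.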

Although bit reduction is fast, a sloppy implementation of aggregation may easily cost O(m^2) computations, where m is the number of examples. We implemented a hash table for the aggregation step as done in [15]. Universal hashing [10] was used as the hash function, and collisions were resolved by chaining. When inserting the bit-reduced integer values into the hash table, we used a list to record the places that were filled in the hash table. The mean statistics were then computed by re-visiting all the filled places. The average computational complexity of our implementation is O(2m). Please see [10] for more information about the universal hashing function.
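For integer keys, a universal hash family of the kind described in [10] can be sketched as follows. The prime and all names here are our own choices for illustration; the paper does not give its exact hash function:

```python
import random

# A large Mersenne prime, larger than any bit-reduced integer key we expect
P = 2_305_843_009_213_693_951  # 2^61 - 1

def make_universal_hash(table_size):
    """Draw one member h(k) = ((a*k + b) mod P) mod table_size of a
    universal family; a and b are chosen at random once per table."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda key: ((a * key + b) % P) % table_size

h = make_universal_hash(1024)
slots = [h(code) for code in (2, 2, 10, 11)]  # bit-reduced codes from Table 1
```

For r-dimensional bit-reduced codes, the r integers would first be combined into a single integer key (or hashed coordinate-wise) before this function is applied.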

3.1.1 Unbalanced bit reduction

The simplest way to apply bit reduction is equally to all attributes. However, at some point in the process, say after a bits, there may be a very large compression (say from 1000 examples to 10). At that point, the data is typically over-compressed and accuracy will be poor. We would like the option of compressing the training data only a little more than at a − 1 bits. To do this we apply bit reduction to only a subset of the attributes.

The attributes will be ordered from lowest to highest variance and will be reduced in order by 1 more bit. We will denote the number of attributes processed in the unbalanced way by s which can vary between zero and one less than the number of attributes in the data set.
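The choice of per-attribute shift amounts can be sketched as follows (function and variable names are ours; the paper specifies only that attributes are ordered by variance, lowest first, and that the first s of them receive one extra bit of reduction):

```python
from statistics import pvariance

def per_attribute_shifts(X, b, s):
    """Return the number of bits to shift each attribute: the s attributes
    with the lowest variance get b + 1 bits, the rest get b."""
    r = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(r)]
    order = sorted(range(r), key=lambda j: variances[j])  # lowest variance first
    shifts = [b] * r
    for j in order[:s]:
        shifts[j] = b + 1
    return shifts

X = [[0.1, 5.0], [0.2, -4.0], [0.15, 3.0]]
print(per_attribute_shifts(X, b=3, s=1))  # column 0 has lower variance -> [4, 3]
```

Reducing the low-variance attributes first makes sense because an extra bit of quantization discards the least information there.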

3.2 Weighted SVM

It is important to weight aggregated examples by the number of source examples. Bins that contain more examples should have a greater impact on the decision boundary, just as they would if bit reduction did not take place. Consider 10 examples that fall into the same bin and one noisy (mislabeled) example in its own bin. Without weighting, the noisy example would have as much impact on the decision boundary as the 10 examples together.

Pavlov et al. [30] proposed a method to train a weighted SVM, although its description in [30] is concise and lacks significant details. Following their work, we describe how to train a weighted SVM in more detail in this subsection.

Given examples x_1, x_2, …, x_m with class labels y_i ∈ {−1, 1}, an SVM solves the following problem

minimize    (1/2)⟨w, w⟩ + (C/m) Σ_{i=1}^{m} ξ_i

subject to: y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,    ξ_i ≥ 0

(2)

where w is normal to the decision boundary (a hyperplane), C is the regularization constant that controls the trade-off between the empirical loss and the margin width, and the slack variable ξ_i represents the empirical loss associated with x_i. In the case of weighted examples, the empirical loss of x_i with a weight β_i is simply β_i ξ_i. Intuitively, it can be interpreted as β_i identical examples x_i; accumulating the loss of the β_i examples results in a loss of β_i ξ_i. Substituting β_i ξ_i for ξ_i in Eq. (2), we derive the primal problem of a weighted SVM:

minimize    (1/2)⟨w, w⟩ + (C/m) Σ_{i=1}^{m} β_i ξ_i

subject to: y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,    ξ_i ≥ 0,    i = 1, …, m

(3)

The constraints in Eq. (3) remain unchanged because the constraint for each of the β_i copies of x_i is identical, so the β_i identical constraints reduce to the single constraint shown in Eq. (3).

The dual form of a weighted SVM is as follows.

maximize    Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j k(x_i, x_j)

subject to: 0 ≤ α_i ≤ Cβ_i/m,    i = 1, …, m
            Σ_{i=1}^{m} α_i y_i = 0

(4)

The dual form of a weighted SVM is almost identical to that of a normal SVM except for the box constraint 0 ≤ α_i ≤ Cβ_i/m, whereas in a normal SVM 0 ≤ α_i ≤ C/m. Therefore, efficient solvers for an SVM such as SMO [33] can be used to solve a weighted SVM by modifying the boundary condition slightly. We note that this formulation is equivalent to having multiple examples that are exactly the same, which does not affect the convergence properties.
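The equivalence noted above, a weight β_i behaving like β_i identical copies of x_i, can be checked with any solver that accepts per-example weights. Here is a small sketch using scikit-learn's SVC, whose sample_weight argument scales C per example; the library choice and the toy data are ours, not from the paper:

```python
import numpy as np
from sklearn.svm import SVC

# Two aggregated (weighted) examples, one per class, with bin counts as weights
X_unique = np.array([[0.0, 0.0], [1.0, 1.0]])
y_unique = np.array([0, 1])
weights = np.array([3.0, 2.0])

# The equivalent "unrolled" training set: each example repeated weight times
X_dup = np.repeat(X_unique, [3, 2], axis=0)
y_dup = np.repeat(y_unique, [3, 2])

clf_weighted = SVC(kernel="linear", C=1.0).fit(X_unique, y_unique,
                                               sample_weight=weights)
clf_duplicated = SVC(kernel="linear", C=1.0).fit(X_dup, y_dup)

# Both classifiers should agree on new points
grid = np.array([[0.2, 0.1], [0.9, 0.8], [-0.5, -0.5], [1.5, 1.5]])
print(clf_weighted.predict(grid), clf_duplicated.predict(grid))
```

Training on the two weighted examples touches far fewer rows than training on the five duplicates, which is exactly the saving BRSVM exploits at scale.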

4 Experiments

We discuss in detail experiments with BRSVM on twelve data sets including banana [34], phoneme [14], shuttle [27], page [26], pendigit [26], letter [26], SIPPER II plankton images [22], waveform [26] and satimage [26]. These nine come from several sources, ranging in size from 5000 to 58,000 examples and from 2 to 40 attributes. We also ran experiments on the Adult, Forest, and Web data sets to compare with previous work, as detailed in subsection 4.3. The details of all the data sets are summarized in Table 3.

Table 3

Description of the data sets.

data set # of data # of attributes # of classes

Adult 45222 14 2
banana 5300 2 2
Forest 495141 54 2
phoneme 5404 5 2
shuttle 58000 9 7
page 5473 10 5
pendigit 10992 16 10
letter 20000 16 26
plankton 8440 17 5
waveform 5000 40 3
satimage 6435 36 6
Web 36818 293 2

The Libsvm tool [7] for training support vector machines was modified and used in all experiments. For four of the data sets the kernel, the γ, and the regularization constant (C) were the same as used in the original source papers. For the Adult, Forest, and Web data sets the linear kernel was used with C set to 1, as in [30]. The plankton data set used the RBF kernel (k(x, y) = exp(−γ‖x − y‖^2)) with C = 16 and γ = 0.04096, as in [22]. For the remaining eight data sets the RBF kernel was used, with the kernel parameter γ and C tuned by a two-pass grid search, coarse followed by fine. In the first pass γ started at 0.00001 and was incremented by multiples of 2; in the second it was incremented by multiples of 1.1. In both passes C goes from 1 to 18 in increments of 1. A 5-fold cross validation on the training data was used to evaluate each γ and C parameter pair. The starting and ending γ for the second pass were derived from the smallest and largest γ chosen from the 2% of parameter pairs (C and γ) that resulted in the most accurate classifiers in the first pass. The starting γ was then rounded down to the nearest power of 10 (e.g. 0.16384 would reduce to 0.10000).
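The two γ schedules and the "round down to the nearest power of 10" step can be made concrete with a small helper (a sketch; the function and variable names are ours):

```python
import math

def pow10_floor(x):
    """Round a positive value down to the nearest power of 10,
    e.g. 0.16384 -> 0.1 (used for the fine pass's starting gamma)."""
    return 10.0 ** math.floor(math.log10(x))

# Coarse pass: gamma starts at 0.00001 and grows by factors of 2
coarse_gammas = [1e-5 * 2 ** i for i in range(20)]

# Fine pass: from the floored starting gamma, grow by factors of 1.1
start = pow10_floor(0.16384)
fine_gammas = [start * 1.1 ** i for i in range(10)]
print(start)  # 0.1
```

Each (C, γ) pair from these grids would then be scored by 5-fold cross validation on the training data, as described above.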

We used the same training and test data separation as given by the original users of the data sets. For those data sets which do not have a separate test set, we randomly selected 80% of the examples as the training set and 20% as the test set. Since the nine data sets without given train/test splits have more than 5000 examples, 20% of the total data amounts to more than 1000 examples. We believe this provided a relatively stable estimation. We built SVMs on the training set with the optimal parameters and report the accuracy on the test set. Z-normalization (zero mean, unit variance per feature) was applied to all data sets. Table 4 gives the specific parameter settings used for each data set. All our experiments were done on a Sun Fire 880 using a single UltraSPARC III processor at 750 MHz with 32 GB memory under the SunOS 5.8 operating system.

Table 4

SVM Parameters Used.

Data Set Kernel C (γ) for RBF Kernel

Adult Linear 1
Banana RBF 1 3.4004
Forest Linear 1
Letter RBF 6 0.079578
Page RBF 1 0.08955
Pendigit RBF 5 0.08955
Phoneme RBF 13 1.48005
Plankton RBF 16 0.04096
SatImage RBF 9 0.14202
Shuttle RBF 16 0.274727
Waveform RBF 1 0.008138
Web Linear 1

4.1 Experiments with pure bit reduction

The experimental description starts with pure or balanced bit reduction in which all attributes are compressed by the same number of bits.

For brevity, detailed experimental results using BRSVM are examined on three representative data sets, banana, plankton and phoneme, as shown in Tables 5–7 and Figures 1–3. Results from the other data sets will be summarized. The last row of each table records the result of an SVM trained on the uncompressed data set. The other rows present the results from BRSVM. The first column is the number of bits reduced. The second column is the compression ratio, which is defined as (# of examples after bit reduction) / (# of examples). We start off with 0-bit reduction, which may not correspond to a 1.0 compression ratio: repeated examples are grouped together even when no bit is reduced, which results in compression ratios less than 1.0 at 0-bit reduction in some cases. The third column is the accuracy of BRSVM on the test set. McNemar's test [11] is used to check whether BRSVM accuracy is statistically significantly different from the accuracy of an SVM built on the uncompressed data set. An accuracy percentage in bold indicates the difference is not statistically significant at the p = 0.05 level. The fourth column is the number of support vectors found during the training process. The fifth column is the time for bit reduction, which includes the time required to do example aggregation. The sixth column is the BRSVM training time. The seventh column is the prediction time on the test set. All of the timing results were recorded in seconds; the precision of the timing measurement was 0.01 seconds. The training and prediction speedup ratios are defined as (SVM training time) / (BRSVM training time) and (SVM prediction time) / (BRSVM prediction time), respectively.

For random sampling, the sub-sampling ratio was set to equal the compression ratio of BRSVM. Since random sampling has a random factor, we did 50 experiments for each sub-sampling ratio and report the average accuracy and average number of support vectors (SVs). These two values are listed in the last two columns of Tables 5–7, titled random sampling and random SVs.
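McNemar's test, used for all of these comparisons, needs only the counts of test examples on which the two classifiers disagree. A stdlib-only sketch with the usual continuity correction (our implementation and example counts, not the authors' code):

```python
import math

def mcnemar(b, c):
    """Two-sided McNemar's test with continuity correction.
    b: examples classifier A gets right and B gets wrong; c: the reverse.
    Returns the chi-square statistic and its p-value (1 degree of freedom)."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: P(X > chi2) = erfc(sqrt(chi2/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical disagreement counts: A beats B on 10 examples, B beats A on 2
chi2, p = mcnemar(10, 2)
print(chi2, p)
```

A p-value below 0.05 means the two accuracies differ significantly; entries in bold in Tables 5–7 are those where this test does not reject equality.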

Figure 1. Reduction vs. Accuracy and Support Vectors for the Banana dataset.

Figure 3. Reduction vs. Accuracy and Support Vectors for the Plankton dataset.

Table 5

BRSVM on the banana data set. An accuracy in bold is not statistically significantly different from the accuracy of the SVM.

bit reduction compression ratio BRSVM accuracy support vectors BRSVM reduction time BRSVM training time BRSVM prediction time random sampling random SVs

0 1.000 90.57% 1055 0.03s 4.47s 0.49s 90.57% 1055.0
1 0.999 90.57% 1055 0.03s 4.45s 0.50s 90.56% 1055.2
2 0.997 90.57% 1055 0.03s 4.48s 0.49s 90.58% 1052.8
3 0.988 90.57% 1052 0.03s 4.48s 0.49s 90.57% 1045.5
4 0.960 90.57% 1034 0.02s 5.21s 0.48s 90.65% 1015.5
5 0.846 90.57% 976 0.03s 3.87s 0.45s 90.70% 905.8
6 0.575 90.57% 775 0.02s 1.77s 0.36s 90.67% 641.7
7 0.244 90.66% 433 0.02s 0.38s 0.20s 90.50% 315.1
8 0.077 90.28% 174 0.01s 0.07s 0.08s 89.71% 132.5
9 0.024 86.23% 72 0.01s 0.01s 0.04s 87.07% 61.3
10 0.007 76.32% 30 0.02s 0.00s 0.03s 77.10% 25.6

SVM 1.000 90.57% 1055 0.00s 4.61s 0.50s

Table 7

BRSVM on the plankton data set. An accuracy in bold is not statistically significantly different from the accuracy of the SVM.

bit reduction compression ratio BRSVM accuracy Support vectors BRSVM reduction time BRSVM training time BRSVM prediction time random sampling random SVs

0 0.995 89.60% 2615 0.08s 24.32s 2.94s 89.42% 2623.3
1 0.995 89.60% 2615 0.09s 24.51s 2.96s 89.42% 2623.3
2 0.995 89.60% 2615 0.08s 24.56s 2.96s 89.42% 2623.3
3 0.995 89.60% 2615 0.09s 24.60s 2.91s 89.42% 2623.3
4 0.995 89.60% 2615 0.09s 24.70s 2.95s 89.42% 2623.3
5 0.995 89.60% 2615 0.08s 24.67s 2.85s 89.42% 2623.3
6 0.995 89.60% 2615 0.08s 24.58s 2.97s 89.42% 2623.3
7 0.995 89.60% 2615 0.09s 24.71s 2.89s 89.42% 2623.3
8 0.995 89.60% 2610 0.08s 24.32s 2.84s 89.40% 2622.3
9 0.961 89.30% 2547 0.09s 23.17s 2.88s 89.19% 2547.8
10 0.353 84.50% 928 0.06s 4.00s 1.05s 86.37% 1092.9
11 0.065 68.40% 236 0.05s 0.20s 0.27s 80.35% 278.1

SVM 1.000 89.60% 2628 0.00s 24.70s 2.96s

Figures 1–3 show the effect on classification accuracy and the number of support vectors as bit reduction goes from none to the maximum tested. The x-axis indicates the bit reduction level, the left y-axis indicates classification accuracy, and the right y-axis indicates the number of support vectors.

The experimental results on the banana data set are shown in Table 5 and Figure 1. As more bits were reduced, fewer examples were used in training. Thus training time was reduced. Also, less training data results in a classifier with fewer support vectors. The prediction time is proportional to the total number of support vectors. Therefore, the prediction time of BRSVM was reduced accordingly. When 8 bits were reduced, BRSVM was 57 times faster during training and 6 times faster during prediction than a normal SVM. Its accuracy was not statistically significantly different from an SVM built on all the data at the p = 0.05 level. Figure 1 provides a good visual picture of what happens as the data is compressed. Starting at a bit reduction level of 5 the number of support vectors starts to decline while the classification accuracy held until a bit reduction level of 9 was reached.
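As a check on the arithmetic, the reported 57× and 6× figures for the 8-bit row of Table 5 are consistent with counting the bit-reduction time as part of BRSVM's cost (our reading; the paper does not state this explicitly):

```python
# Table 5, 8-bit row: reduction 0.01s, training 0.07s, prediction 0.08s
# Table 5, SVM row: training 4.61s, prediction 0.50s
train_speedup = 4.61 / (0.01 + 0.07)  # ~57.6, reported as 57 times
pred_speedup = 0.50 / 0.08            # 6.25, reported as 6 times
print(train_speedup, pred_speedup)
```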

Phoneme is another relatively low-dimensional data set with five attributes. Table 6 and Figure 2 present the experimental results of BRSVM on this data set. When 8 bits were reduced, BRSVM was 1.8 times faster during training and 1.1 times faster during prediction than a normal SVM. Its accuracy was not statistically significantly different from an SVM built on all the data at the p = 0.05 level. BRSVM was as or more accurate than random sampling when the compression ratio was larger than 0.060. The number of support vectors reduces in a way similar to that seen with the banana data set. In Figure 2 the classification accuracy holds right up to bit reduction level 8 while the number of support vectors starts to drop after a bit reduction level of 6.

Figure 2. Reduction vs. Accuracy and Support Vectors for the Phoneme dataset.

Table 6

BRSVM on the phoneme data set. An accuracy in bold is not statistically significantly different from the accuracy of the SVM.

bit reduction compression ratio BRSVM accuracy Support vectors BRSVM reduction time BRSVM training time BRSVM prediction time random sampling rand SVs

0 0.993 90.93% 1353 0.03s 13.69s 0.83s 90.73% 1345.7
1 0.988 90.93% 1353 0.03s 13.63s 0.83s 90.66% 1337.0
2 0.979 90.93% 1343 0.03s 13.21s 0.82s 90.57% 1325.1
3 0.973 90.93% 1336 0.03s 13.10s 0.82s 90.52% 1317.4
4 0.969 90.93% 1336 0.03s 13.17s 0.82s 90.50% 1313.4
5 0.965 90.93% 1332 0.03s 12.94s 0.81s 90.46% 1307.8
6 0.955 90.93% 1327 0.03s 13.22s 0.82s 90.39% 1296.6
7 0.899 90.84% 1292 0.03s 11.71s 0.82s 90.11% 1233.4
8 0.686 90.93% 1164 0.03s 7.47s 0.74s 89.26% 995.2
9 0.306 86.68% 745 0.02s 1.98s 0.46s 86.25% 534.1
10 0.060 78.35% 234 0.02s 0.07s 0.15s 80.07% 166.5

SVM 1.000 90.93% 1356 0.00s 13.66s 0.85s

Table 7 and Figure 3 show the experimental results on a higher dimensional data set, plankton. BRSVM was slightly more accurate than random sampling when the number of reduced bits was up to 9. At the 10-bit reduction level, the compression ratio of BRSVM drops sharply from 0.961 to 0.353, resulting in a significant loss in accuracy. At the 10 and 11 bit reduction levels, where the compression ratios were less than or equal to 0.353, the accuracies of BRSVM were much lower than random sampling. This phenomenon was observed on several other data sets in our experiments. The reason is that when the compression ratio is small, it is very likely that many examples from different classes fall into the same bin and the number of examples is distributed far from uniformly among the bins. For instance, suppose bit reduction compresses the data into several bins and one bin has 80% of the examples, drawn from different classes. BRSVM uses the mean statistic as the representative for each class, which may not be able to capture the information about the decision boundary in its bin. Random sampling, on the other hand, selects the examples more uniformly. If 80% of the examples fall into one bin, random sampling will effectively sample four times more examples from this bin than from all others together, and preserve the local information about the decision boundary much better than BRSVM. As a result, random sampling is likely to be as or more accurate than BRSVM when the compression ratio is very low. This tends to happen on high dimensional data sets. On the other hand, at a higher compression ratio, where examples from the same class fall into the same bin and the distribution of examples among bins is not very skewed, BRSVM preserves the statistics of all examples while random sampling suffers from high sampling variance. Therefore, BRSVM is more accurate than random sampling when the compression ratio is relatively high.

Results ranging from training speedups of 34.7 times to 1.2 times were observed on shuttle, page, letter, and waveform as summarized in Table 8. The least improvement in training time came on the 16 feature letter data set.

Table 8

Summary of BRSVM on the shuttle, page, letter, and waveform data sets.

Data set Optimal b Comp Ratio SVM Accuracy BRSVM Accuracy Train Speedup Pred Speedup Random Sampling Accuracy

shuttle b = 8 0.031 99.86% 99.84% 34.7 1.2 99.41%
page b = 9 0.180 95.44% 95.35% 8.6 1.8 93.93%
letter b = 9 0.887 97.53% 97.30% 1.2 1.1 97.22%
waveform b = 11 0.450 86.70% 85.80% 4.2 1.8 86.47%

4.2 Adding unbalanced bit reduction

As shown in Section 4.1, with balanced bit reduction there is a point at which the number of examples drops sharply when going from a − 1 to a bits of reduction (e.g. Tables 5–7). In this section, we explore how to obtain a smaller reduction in the number of examples, and hence higher accuracy, by applying unbalanced bit reduction after compression by a − 1 bits.
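
A sketch of how such unbalanced bin keys could be formed, assuming the features have already been quantized to integers as in the balanced case (which s attributes receive the extra reduction is our choice for illustration; the paper's selection may differ):

```python
import numpy as np

def unbalanced_keys(Xq, b, s):
    """Bin keys for unbalanced bit reduction: every feature loses the b
    low-order bits, and the first s features lose one more."""
    keys = Xq >> b          # balanced reduction by b bits
    keys[:, :s] >>= 1       # extra 1-bit reduction on s attributes
    return keys
```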

In the experiments shown in Tables 9 and 10, the number of attributes for unbalanced bit reduction, s, was varied from 0 to one less than the number of attributes in the data set. The entry with the largest b and the largest number of unbalanced bits (s) whose accuracy is still not statistically different from that of the SVM, as per McNemar's test [11] with p = 0.05, is considered to give the best values of b and s.
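
The selection criterion relies on McNemar's test; below is a minimal sketch of the continuity-corrected version described in [11], comparing against 3.841, the chi-squared critical value with one degree of freedom at p = 0.05 (the function name is ours):

```python
def mcnemar_differs(n01, n10, chi2_crit=3.841):
    """McNemar's test with continuity correction. n01 counts test examples
    the first classifier got wrong and the second got right; n10 counts
    the reverse. Returns True when accuracies differ at p = 0.05."""
    if n01 + n10 == 0:
        return False
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return chi2 > chi2_crit
```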

Table 9

Unbalanced bit reduction on the plankton data set with b = 9. The accuracies in bold are not statistically different from the accuracy produced by an SVM.

UnBal Bits (s) Compression Ratio Accuracy SVs Red Time Train Time Test Time Random Sampling Random SVs

SVM 1.0 89.60% 2628 - 24.70s 2.96s

0 0.961 89.30% 2547 0.09s 23.17s 2.88s 89.19% 2547.8
1 0.951 89.60% 2517 0.11s 22.28s 2.84s 89.15% 2525.9
2 0.934 89.20% 2467 0.12s 21.87s 2.80s 89.06% 2490.6
3 0.912 89.30% 2411 0.10s 21.20s 2.76s 88.96% 2437.0
4 0.874 89.30% 2299 0.10s 19.54s 2.65s 88.89% 2350.2
5 0.843 89.20% 2220 0.10s 18.09s 2.51s 88.76% 2280.7
6 0.799 88.80% 2076 0.10s 15.97s 2.36s 88.58% 2174.9
7 0.747 88.20% 1924 0.10s 14.57s 2.15s 88.36% 2055.3
8 0.706 88.20% 1815 0.10s 12.51s 2.00s 88.18% 1959.3
9 0.644 87.10% 1639 0.10s 10.84s 1.81s 87.97% 1814.8
10 0.603 86.40% 1545 0.10s 9.57s 1.69s 87.88% 1715.1
11 0.557 86.70% 1434 0.09s 8.47s 1.59s 87.69% 1602.1
12 0.531 86.10% 1387 0.10s 8.01s 1.53s 87.54% 1537.3
13 0.487 85.60% 1254 0.09s 6.75s 1.39s 87.37% 1433.6
14 0.463 86.00% 1190 0.09s 6.11s 1.32s 87.14% 1371.8
15 0.413 85.10% 1044 0.09s 5.12s 1.18s 86.70% 1244.4
16 0.359 84.80% 939 0.09s 4.03s 1.06s 86.48% 1106.1

Table 10

Unbalanced bit reduction on the letter data set with b = 9. The accuracies in bold are not statistically different from the accuracy produced by an SVM.

UnBal Bits (s) Compression Ratio Accuracy SVs Red Time Train Time Test Time Random Sampling Random SVs

SVM 1.000 97.53% 7051 - 75.60s 49.39s

0 0.887 97.50% 6778 0.17s 66.99s 46.99s 97.22% 6491.2
1 0.869 97.50% 6743 0.22s 65.39s 46.60s 97.17% 6402.6
2 0.859 97.50% 6710 0.22s 64.53s 46.83s 97.14% 6356.4
3 0.850 97.43% 6679 0.23s 64.36s 49.26s 97.13% 6309.9
4 0.843 97.45% 6633 0.22s 63.66s 47.33s 97.12% 6278.9
5 0.836 97.43% 6602 0.22s 62.89s 46.99s 97.09% 6245.5
6 0.819 97.40% 6559 0.21s 61.10s 46.52s 97.07% 6158.7
7 0.801 97.43% 6501 0.22s 59.46s 46.76s 97.01% 6066.9
8 0.785 97.38% 6437 0.21s 58.87s 45.77s 96.98% 5990.6
9 0.748 97.40% 6292 0.22s 54.75s 44.94s 96.87% 5798.0
10 0.738 97.45% 6234 0.21s 53.73s 43.77s 96.84% 5746.4
11 0.700 97.38% 6050 0.22s 49.95s 42.59s 96.74% 5552.0
12 0.670 97.35% 5879 0.20s 46.98s 41.90s 96.65% 5395.7
13 0.654 97.30% 5758 0.21s 45.32s 41.32s 96.61% 5306.8
14 0.583 97.00% 5410 0.20s 38.57s 38.55s 96.33% 4909.0
15 0.538 96.98% 5093 0.20s 34.66s 36.39s 96.14% 4646.7

We leave the investigation of multiple-resolution unbalanced bit reduction to future work. In that scheme, different features could be reduced by different numbers of bits, e.g., feature 1 by a + 1 bits, feature 2 by a bits, feature 3 by a + 4 bits, and so forth. An optimization algorithm would need to be developed to determine the level of bit reduction for each feature.

The results in Tables 9 and 10 consist of nine columns. The first column indicates the number of unbalanced bits (s). The second column indicates the compression achieved. The third column is the classification accuracy. The fourth column is the number of support vectors. The fifth, sixth, and seventh columns are the times in seconds to perform reduction, training, and testing, respectively. The eighth column shows the average random sampling accuracy on the test set, and the ninth is the average number of SVs for random sampling. The sub-sampling ratio is set equal to the compression ratio of BRSVM. The random sampling accuracy is the mean over 50 experiments.

We discuss experiments with UBR on the phoneme, pendigit, plankton, waveform and satimage data sets, for which pure bit reduction did not result in incremental changes in compression ratios. In this paper we define a good compression ratio as the minimum compression ratio where accuracy was not statistically significantly different from that obtained from an SVM trained on the uncompressed data set.

The pure bit reduction experiments on the plankton data set were recorded in Table 7. After 9-bit reduction, BRSVM yields a 0.961 compression ratio and a 1.06 times speedup in the training phase with no loss in accuracy. After 10-bit reduction, however, the compression ratio drops to 0.353 and the corresponding 4.80% accuracy loss cannot be tolerated. Table 9 shows the results of applying UBR to search for a compression ratio between 0.961 and 0.353. We expected this to yield more speedup than the 1.06 times from 9-bit reduction without sacrificing accuracy. Nine-bit reduction is first applied to the data, and then additional attributes are reduced further by 1 bit, starting with 1 attribute (s = 1) and going up to 16 attributes (s = 16). At s = 8, an approximately 2.0 times speedup in training time was achieved with a loss in accuracy of 1.4%.

On the letter data set, pure bit reduction (b = 9) resulted in a 0.887 compression ratio and a 0.03% loss in accuracy, with only a 1.1 times speedup in training time. At b = 10, accuracy dropped by 0.82%. By using UBR (see Table 10), a speedup of 1.7 was achieved with a loss in accuracy of 0.23%, using 9-bit reduction (b = 9) plus an additional 13 attributes (s = 13) reduced by one more bit.

Figure 4 shows a typical example of how accuracy changes with the compression ratio, here on the phoneme data set. In this case BRSVM remained more accurate than random sampling until the compression ratio became low. Figure 5 shows how accuracy changes with compression on the page data set. In this case, BRSVM was significantly better until nine bits plus five unbalanced bits were compressed; after that point, its selection of examples results in lower accuracy.

Figure 4

Phoneme accuracy with BRSVM and Random selection of examples at different compression ratios.

Figure 5

Page accuracy with BRSVM and Random selection of examples at different compression ratios.

4.3 Comparison

A comparison of bit reduction to the squashing methods of Pavlov and Chudova [30] was done for the Adult, Forest, and Web data sets. Their results are the best we have found.

Following the idea of likelihood-based squashing [24] [29], a squashing method was developed for an SVM by Pavlov and Chudova [30]. The likelihood squashing method assumes a probability model as the classifier. Examples with similar probability p(xi , yi |θ) are grouped together and taken as a weighted exemplar. Pavlov and Chudova used a probabilistic interpretation of SVMs to perform the likelihood squashing with a linear kernel. They call the approach squash-SMO because it was paired with sequential minimal optimization. Simply taking a random 1–5% of the data was called srs-SMO.

The boost-SMO algorithm [32] trains a sequence of classifiers on a small subset of the data (say 1%) while concentrating on examples misclassified (in all the data). While generally suboptimal, it is fast to train and provides accuracy comparable to using all the data.

For comparison, the C parameter of the SVM was set to 1.0 for all three data sets, and a linear kernel was used. All three data sets contain nominal features, for which no compression, just strict matching, was done. In the Adult data set, the 8 nominal attributes took on multiple values. We encoded these attributes as groups of inputs. For example, an attribute that took on three values would consist of three inputs, only one of which could be non-zero. This resulted in a total of 99 nominal features and 6 continuous features for the SVM.
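
The group-of-inputs encoding described above is ordinary one-hot encoding; a minimal sketch (names are ours):

```python
def one_hot(values, categories):
    """Encode a nominal attribute as a group of 0/1 inputs, one per
    category, with exactly one input non-zero per example."""
    return [[1 if v == c else 0 for c in categories] for v in values]
```

For instance, one_hot(['b', 'a'], ['a', 'b', 'c']) yields [[0, 1, 0], [1, 0, 0]], so a three-valued attribute becomes three inputs.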

For the Adult data set, a compression ratio of 0.634 was achieved using 9-bit reduction (b = 9) and 4 unbalanced attributes (s = 4). The reduction itself took 0.69 seconds, and the time required to train using the reduced set was 526.21 seconds. This resulted in a speedup of 1.88 times with only a 0.01% accuracy loss. In comparison, the squash-SMO method had an 8.7 times speedup with a 1.2% accuracy loss, boost-SMO had a 153.5 times speedup with only a 0.9% accuracy loss, and the srs-SMO method had a 677.1 times speedup with a 4.5% reduction in accuracy. The results are shown in Table 11. The training times are significantly lower using the 1% selection methods of [30], but the accuracy suffers as a result. However, a 9-bit reduction plus 4 unbalanced attributes produced no statistically significant difference in accuracy.

Table 11

Results for the Adult data set.

Method Accuracy Training Time Reduction Time Speed Up Compression Ratio

Majority Class 75.53%

9-Bits s = 0 84.56% 694.25s 0.73s 1.424 0.788
9-Bits s = 4 84.64% 526.21s 0.69s 1.878 0.634
No Reduction 84.65% 989.69s 1.000

srs-SMO 79.7% 0.2s 1.5s 677.1 0.01
squash-SMO 83.0% 0.2s 132.4s 8.7 0.01
boost-SMO 83.3% 7.5s - 153.47 0.01
full-SMO 84.2% 1151

The forest cover data set has more than 500,000 examples. To compare with the work in [30, 32], rather than other work with more training data but no comparable compression experiments [9], we used the two most populated classes, which had the highest misclassification rates: Spruce-fir and Lodgepole pine. The training data set consisted of 98,884 examples and the test set of 396,257 examples, with details in [31]. On the Forest cover data set, bit reduction using 10 bits (b = 10) and 4 unbalanced attributes (s = 4) resulted in a speedup of 54.0. This speedup is roughly 30 times greater than the one obtained for the Adult data set. This occurs because more than half of the features in the Adult data set are nominal, in comparison to 17% in the Forest data set. The number of nominal features is significant for bit-reduction based squashing since the reduction calculation is not applied to nominal features; they are used as-is. In contrast, the squash-SMO method only achieved a 1.79 times speedup and suffered a 0.6% accuracy loss. Boost-SMO had a 14.36 times speedup with only a 1.1% accuracy loss. Using bit reduction, a 0.1% gain in accuracy was achieved, as shown in Table 12.

Table 12

Results for the Forest data set.


Method Accuracy Training Time Reduction Time Speed Up Compression Ratio

Majority Class 57.26%

10-Bits s = 0 76.10% 271.01s 2.09s 27.50 0.181
10-Bits s = 4 75.89% 136.64s 2.34s 54.04 0.076
10-Bits s = 9 74.18% 4.62s 2.45s 1061.18 0.017
No Reduction 75.85% 7511.02s 1.000

srs-SMO 74.2% 2.0s 13.0s 81.67 0.01
squash-SMO 77.1% 2.0s 682.9s 1.79 0.01
boost-SMO 76.6% 85.3s - 14.36 0.01
full-SMO 77.7% 1225s - 1.00

For the Web data set, the effectiveness of bit reduction is marginal since all of the attributes are boolean and the reduction calculation is not applied to them. Thus, a 0-bit reduction is the only level of reduction that will produce meaningful results. In other words, the best that can be done is to remove all of the duplicate examples from the training set. Even so, we achieved a compression ratio of 0.352, which indicates that there are a significant number of duplicate examples in the Web data set. The speedup was 2.81 times with minimal loss (0.02%) in classification accuracy. From Table 13, we can see that squash-SMO results in a speedup of 11 times at the cost of a 0.8% drop in accuracy, and boost-SMO results in a 402 times speedup with just a 0.6% loss in accuracy.

Table 13

Results for the Web data set.


Method Accuracy Training Time Reduction Time Speed Up Compression Ratio

Majority Class 68.05%

0-Bits s = 0 74.56% 162.87s 5.13s 2.81 0.352
No Reduction 74.58% 471.47s 1.000

srs-SMO 71.2% 0.2 1.5 2363.53 0.01
squash-SMO 74.4% 0.2 361.7 11.10 0.01
boost-SMO 74.6% 10 - 401.8
full-SMO 75.2% 4018

The results of our comparison to [30] show that bit reduction has limited utility for data sets that contain more nominal than continuous attributes. In general, the presence of nominal attributes will mean lower compression ratios. Thus, squash-SMO or boost-SMO may be the more appropriate technique for such data sets. However, since bit-reduction uses more of the original data than the squash-SMO method, the classification accuracy will suffer less. It should also be pointed out that the time needed to reduce the training set is significantly lower when using bit reduction.

The Core Vector Machine (CVM) paper [41] gave results for the Forest data set with a different train/test split. At 100,000 examples, CVM training was less than twice as fast as the SVM, while BRSVM at 98,000 examples was 54 times faster.

4.4 Summary and discussion

Table 14 summarizes the performance of BRSVM on all data sets. The first column gives the name of the data set with the number of attributes in parentheses. The second column has the optimal b and s results for a "good" compression ratio, at which BRSVM achieves speedup with an accuracy that is not statistically different from that obtained by SVM using uncompressed data. A McNemar's test with p = 0.05 was used to determine statistical significance. The third column indicates the compression ratio achieved. The fourth column shows the accuracy achieved with SVM. The fifth column shows the accuracy when using BRSVM with the settings indicated in the second column. The sixth and seventh columns indicate the speedup in training and test times respectively. The training time speedup reflects the time needed for bit reduction and SVM training. The last three columns show the accuracy of Random Sampling with the same number of examples as the optimal BRSVM, BRSVM accuracy with s=0 and random sampling accuracy with the same compression as for BRSVM with s=0.

Table 14

Summary of BRSVM on all twelve data sets.

Data set Optimal b and s Comp Ratio SVM Accuracy BRSVM Accuracy Train SpeedUp Pred SpeedUp Random Sampling BRSVM s = 0 Random s = 0

adult(105) b = 9,s = 4 0.634 84.65% 84.64% 1.9 1.8 84.62% 84.56% 84.67%
banana(2) b = 8,s = 0 0.077 90.57% 90.28% 57.6 6.3 89.71% 90.28% 89.70%
forest(54) b = 10,s = 4 0.076 75.85% 75.89% 54.0 14.6 75.79% 76.10% 75.79%
letter(16) b = 9,s = 13 0.654 97.53% 97.30% 1.7 1.2 96.61% 97.22% 95.15%
page(10) b = 9,s = 3 0.112 95.44% 95.26% 15.7 2.4 93.31% 95.35% 93.93%
pendigit(16) b = 9,s = 11 0.621 98.14% 98.03% 1.6 1.0 98.01% 98.17% 98.14%
phoneme(5) b = 8,s = 1 0.609 90.93% 90.29% 2.1 1.2 88.96% 90.93% 89.26%
plankton(17) b = 9,s = 8 0.706 89.60% 88.20% 2.0 1.5 88.18% 89.30% 89.18%
satimage(36) b = 9,s = 28 0.833 91.65% 90.95% 1.2 1.1 91.41% 91.65% 91.65%
shuttle(9) b = 8,s = 5 0.019 99.86% 99.83% 48.9 1.3 99.11% 99.84% 99.41%
waveform(40) b = 11,s = 22 0.185 86.70% 85.60% 19.4 4.1 85.17% 85.80% 86.47%
web(293) b = 0,s = 0 0.352 74.58% 74.56% 2.8 2.6 74.88% 74.56% 74.89%

BRSVM yields speedup ratios from 1.2 to 57.6 in training and from 1.0 to 14.6 in prediction. Its application holds the promise of applying SVMs to larger problems, as even in the worst case it provided a 20% speedup. It can be very effective (e.g. for forest, which has a large number of examples, the speedup is 54 times during training and 14.6 times during testing). There is a small accuracy loss (≤ 0.76% for all but two data sets).

Results are typically improved by using unbalanced bit reduction. The banana data set is the only one for which unbalanced bit reduction does not improve the speedup, but it already had a 57.6 times training speedup. BRSVM with pure bit reduction is more accurate than random sampling on 8 data sets (and with unbalanced bit reduction, more accurate on 10). On the letter, satimage, and plankton data sets, which are relatively high dimensional, pure bit reduction fails to provide a very good compression ratio. It is more difficult to get smooth changes in the compression ratio with the larger search space induced by higher dimensionality. The best bit reduction and compression ratio vary across data sets. For example, pure bit reduction fails to provide a compression ratio between 0.362 and 0.962 on the plankton data set. When unbalanced bit reduction was introduced, an intermediate compression ratio of 0.706 was obtained, resulting in a training speedup of 1.96.

Although random sampling has higher variance in theory, it works fairly well in our experiments, except on page, phoneme, and shuttle, where random sampling was more than 1.2% less accurate than BRSVM. It performs only slightly worse than BRSVM on seven of the twelve data sets and provides slightly higher accuracy on web and satimage. This phenomenon was also observed in [29][38], where complicated data squashing strategies brought small or no advantages over random sampling.

On lower dimensional data sets such as page, phoneme, and shuttle, the accuracy of BRSVM was, respectively, more than 2.0%, 1.5%, and 1.2% higher than that of random sampling. BRSVM also gives finer control in the search for an appropriate level of compression than random sampling. By adjusting compression one bit at a time, a level of compression that does not significantly sacrifice classification accuracy can be found quickly. Unbalanced bit reduction then provides a fine-tuning mechanism to squeeze out more compression. Section 4.4.1 discusses how and when to tune these two new parameters, b and s. We note that for random sampling to achieve the results given here, one would need an algorithm to determine the size of the random sample, which was here chosen to match the compression obtained by bit reduction.

BRSVM should also work well for high dimensional data sets that have relatively tightly packed groupings within classes. If the classes are well spread out in feature space, it may not be as effective and will require effective unbalanced bit reduction, which with lots of attributes (say > 100000) will take more time.

One advantage of our approach when compared with other squashing approaches [3, 30] is that the time to do the squashing is minimal. For nine data sets it was less than 1 second, one required 1.09 seconds, and two data sets, forest and web, had the longest times at 2.3 and 5.1 seconds, respectively. The much longer time required for the web data set was because there were 31,932 examples with 293 attributes each. Forest, with 98,884 examples and 54 features, had the second longest reduction time at 2.3 seconds. The compression time is typically orders of magnitude less than the training time, whereas in [30] it was sometimes two orders of magnitude greater than the training time. We specifically compare on the Adult data set from the UCI repository using a linear kernel. The same training/test sets were used. Our accuracy went from 84.65% to 84.64% after 9-bit reduction and 4 unbalanced attributes (b = 9, s = 4). With squash-SMO, accuracy went from 84.2% down to 83.0%, a 1.2% drop. BRSVM had a speedup of 1.88 compared to 8.7 for squash-SMO. If one is willing to build an ensemble of classifiers, boost-SMO generally incurs only a small loss of accuracy and was fast for the three data sets for which the authors reported results. However, in [32] there is a random selection step in choosing a small subset of data for boosting that needs to be fully explored to ensure boost-SMO is stable. Generally, BRSVM was more accurate (no loss in accuracy) on the comparison data sets and provided competitive speedups.

In [3], compression by clustering was used. The compression time was one sixth of the training time. They examined just two data sets and found speedups of less than two times.

The BRSVM approach does not change the computational complexity of SVM training. It simply reduces the number of examples for training, which is crucial because the most straightforward training algorithm is O(m³) for m examples, and SMO, which is used here, scales approximately quadratically in the number of examples. Hence, BRSVM can be used in conjunction with faster SVM training algorithms [43, 45] that perform approximations within the learning algorithm. It can also be used in conjunction with a parallel learning algorithm such as that in [6]. In both cases, this will enable SVMs to be applied to large data sets where their use was not previously feasible.

4.4.1 Bit Reduction Parameter Tuning

The two bit reduction parameters, b and s, need to be treated as additional parameters to be tuned for a given data set, in the same way that the cost (C) and γ parameters are tuned for the SVM. This tuning only needs to be redone when there is a drastic change in the training data; otherwise the values initially determined can be reused.

For very large data sets, such as Adult and Forest, it is simple to select a small subset on which to tune the bit reduction parameters before training on the full training set, and still achieve a significant speedup. For smaller data sets, a potentially noticeable time cost is incurred in tuning the parameters. However, if more labeled examples were collected for the smaller data set, new compression parameters would not need to be chosen.

We have implemented a simple down-hill search strategy to automatically select b and s on a random subset of the data. The search was performed on a random sample of n% of the data, which was randomly divided into a training set consisting of 50% of the sample and a test set of the remainder. The strategy is to find the best compression while keeping the accuracy loss below a user-selected threshold. First, the level of bit reduction b that produces a compression ratio of more than 0.99 was determined. Starting from that b, we incremented b by one until the accuracy loss on the tuning test set was greater than a threshold of 0.5%, and chose the value of b just before the threshold was reached. Next, the number of unbalanced bits s was determined. Initially s = 1, and it was incremented by one until the accuracy loss exceeded the 0.5% threshold. The selected s was the one just before the accuracy loss exceeded the threshold.
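
The search just described could be sketched as follows, where evaluate(b, s) and compress_ratio(b, s) are assumed callbacks returning tuning-set accuracy and compression ratio for given parameters (a simplified sketch of the strategy, not our exact implementation):

```python
def tune_b_s(evaluate, compress_ratio, max_b, max_s, tol=0.005):
    """Down-hill search for (b, s): start from the largest b whose
    compression ratio still exceeds 0.99, then grow b, then s, stopping
    each phase just before the accuracy loss exceeds `tol` (0.5%)."""
    base = evaluate(0, 0)          # accuracy without compression
    b = 0
    while b < max_b and compress_ratio(b + 1, 0) > 0.99:
        b += 1                     # skip levels that barely compress
    while b < max_b and base - evaluate(b + 1, 0) <= tol:
        b += 1                     # grow b until accuracy loss > tol
    s = 0
    while s < max_s and base - evaluate(b, s + 1) <= tol:
        s += 1                     # then grow s the same way
    return b, s
```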

Experimental results are reported on three large data sets, Forest, Shuttle and Adult. The timing results include the tuning time and are an average over 30 trials with different random seeds. The values for b and s are median values. We have kept the subsampling size small, but enough to have a representative sample. The subsample sizes were 8% for the forest cover data and 20% for the other two data sets.

Table 15 shows the results. The average accuracy loss was just 0.08% on the Forest data, with an average 45 times speedup. The average speedup on the shuttle data was 6.6 times with an average accuracy loss of 0.56%. The average accuracy loss on adult was 2.31% with a 2.25 average training speedup, which is not so good. The median compression is 1 bit more than the best that can be found by hand, resulting in some accuracy loss. Hence, the proposed tuning strategy is suboptimal in some cases, and improvements are left for future work.

Table 15

Bit reduction tuning example on Adult, Forest and Shuttle datasets.

Data set Parameter Search time Bit Red Parameters Train Time Using Parameters Pure SVM Train Time Overall Speedup Accuracy Loss

forest 217.12s b = 10,s = 2 981.17s 54152.90s 45.19 0.09%
shuttle 13.87s b = 9, s = 0 0.95s 98.30s 6.63 0.56%
adult 84.28s b = 10, s = 6 349.69s 976.16s 2.25 2.31%

5 Conclusion

In this paper, a bit reduction SVM was proposed to speed up SVMs during training and prediction. BRSVM groups similar examples together by reducing their resolution. This simple method significantly reduced the training and prediction time of an SVM in our experiments when bit reduction compressed the data well. It is more accurate than random sampling when the data set is not over-compressed. BRSVM tends to work better on relatively lower dimensional data sets, on which it is more accurate than random sampling and also yields larger speedups. Therefore, feature selection methods might be used to reduce the data dimensionality and potentially help BRSVM obtain further speedups.

We note that the bit reduced data sets can be given to other types of learning algorithms. Certainly, this approach can be applied to SVM-variant classifiers [41] and to different learning paradigms within SVMs. Neural networks or import vector machines [47], for instance, take a significant amount of training time on very large data, to varying degrees. However, it is not clear that the same amount of compression would be acceptable. The algorithms must be modified to accept weighted examples, and their characteristics may or may not lend themselves to this type of compression.

We can also conclude that if a very high speedup, requiring heavy compression, is desired, random sampling may be a competitive choice. This tends to happen with high dimensional data. For such data sets, BRSVM and random sampling have the potential to be used together. Instead of using one weighted exemplar per bin, one might randomly sample several examples at a ratio proportional to the number of examples in the bin, and use several weighted exemplars to represent the examples in that bin. This combination can help when the example distribution is skewed across bins, and has the potential to improve BRSVM on high dimensional data sets.
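
One way to sketch that combination, with bins mapping each bin key to the indices of the examples it contains (the names and the at-least-one-per-bin rule are our assumptions):

```python
import numpy as np

def sample_per_bin(bins, rate, rng=None):
    """Randomly draw examples from each bin in proportion to its size
    (at least one per bin) instead of keeping a single mean exemplar."""
    rng = rng or np.random.default_rng(0)
    chosen = []
    for idx in bins.values():
        k = max(1, round(rate * len(idx)))   # proportional allocation
        chosen.extend(rng.choice(idx, size=k, replace=False).tolist())
    return chosen
```

A skewed bin holding 80% of the examples would then contribute roughly 80% of the sampled exemplars, preserving local boundary information better than a single class mean.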

Acknowledgments

The research was partially supported by the United States Navy, Office of Naval Research, under grant number N00014-02-1-0266 and the NSF under grant EIA-0130768. The authors thank Kevin Shallow, Scott Samson, and Thomas Hopkins for their cooperation in producing and classifying the plankton data set.

Contributor Information

Kurt A. Kramer, Computer Science and Engineering, ENB118, 4202 E. Fowler Ave., University of South Florida, Tampa, FL 33620.

Lawrence O. Hall, Computer Science and Engineering, ENB118, 4202 E. Fowler Ave., University of South Florida, Tampa, FL 33620.

Dmitry B. Goldgof, Computer Science and Engineering, ENB118, 4202 E. Fowler Ave., University of South Florida, Tampa, FL 33620.

Andrew Remsen, College of Marine Science, University of South Florida, St. Petersburg, Fl.

Tong Luo, Searchspark.

References

1. Asharaf S, Murty MN. Scalable non-linear support vector machine using hierarchical clustering. Pattern Recognition, International Conference on. 2006;1:908–911. [Google Scholar]

2. Asharaf S, Murty MN, Shevade SK. ICML '07: Proceedings of the 24th international conference on Machine learning. New York, NY, USA: ACM; 2007. Multiclass core vector machine; pp. 41–48. [Google Scholar]

3. Boley D, Cao D. Training support vector machines using adaptive clustering; SIAM International Conference on Data Mining; 2004. [Google Scholar]

4. Burges CJC. Simplified support vector decision rules; International Conference on Machine Learning; 1996. pp. 71–77. [Google Scholar]

5. Burges CJC, Schölkopf B. Improving the accuracy and speed of support vector machines. Advances in Neural Information Processing Systems. 1997;9:375–381. [Google Scholar]

6. Cao L, Keerthi S, Ong C-J, Zhang J, Periyathamby U, Fu X-J, Lee H. Parallel sequential minimal optimization for the training of support vector machines. IEEE Transactions on Neural Networks. 2006;17(4):1039–1049. [PubMed] [Google Scholar]

7. Chang C, Lin C. LIBSVM: a library for support vector machines (version 2.3) 2001 [Google Scholar]

8. Cochran WG. Sampling Techniques. 3 edition. John Wiley and Sons, Inc.; 1977. [Google Scholar]

9. Collobert R, Bengio Y, Bengio S. Scaling large learning problems with hard parallel mixtures. Int. J. Pattern Recogn. Artific. Intell. 2003;17(3):349–365. [Google Scholar]

10. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. 2 edition. MIT Press; 2001. [Google Scholar]

11. Dietterich TG. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation. 1998;10(7):1895–1924. [PubMed] [Google Scholar]

12. Dong JX, Krzyzak A. A fast SVM training algorithm. International Journal of Pattern Recognition and Artificial Intelligence. 2003;17(3):367–384. [Google Scholar]

13. DuMouchel W, Volinsky C, Johnson T, Cortes C, Pregibon D. Squashing flat files flatter. Data Mining and Knowledge Discovery. 1999:6–15. [Google Scholar]

15. Eschrich S, Ke J, Hall L, Goldgof D. Fast accurate fuzzy clustering through data reduction. IEEE Transactions on Fuzzy Systems. 2003;11(2):262–270. [Google Scholar]

16. Fine S, Scheinberg K. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research. 2001;2:243–264. [Google Scholar]

17. Jayadeva, Khemchandani R, Chandra S. Twin support vector machines for pattern classification. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2007;29(5):905–910. [PubMed] [Google Scholar]

18. Joachims T. Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Machines. 1999:169–184. [Google Scholar]

19. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to Platt's SMO algorithm for SVM design. Neural Computation. 2001;13:637–649. [Google Scholar]

20. Liu Y, You Z, Cao L. A novel and quick SVM-based multiclassifier. Pattern Recognition. 2006;39:2258–2264. [Google Scholar]

21. Luo T, Hall LO, Goldgof DB, Remsen A. Bit reduction support vector machine; Fifth IEEE International Conference on Data Mining (ICDM'05); 2005. [Google Scholar]

22. Luo T, Kramer K, Goldgof D, Hall L, Samson S, Remsen A, Hopkins T. Active learning to recognize multiple types of plankton. 17th conference of the International Association for Pattern Recognition. 2004;3:478–481. [Google Scholar]

23. Luo T, Kramer K, Goldgof D, Hall L, Samson S, Remsen A, Hopkins T. Recognizing plankton images from the shadow image particle profiling evaluation recorder. IEEE Transactions on System, Man, and Cybernetics–Part B: Cybernetics. 2004 Aug;34(4):1753–1762. [PubMed] [Google Scholar]

24. Madigan D, Raghavan N, Dumouchel W, Nason M, Posse C, Ridgeway G. Likelihood-based data squashing: a modeling approach to instance construction. Data Mining and Knowledge Discovery. 2002;6(2):173–190. [Google Scholar]

25. Mangasarian OL, Musicant DR. Lagrangian support vector machines. J. Mach. Learn. Res. 2001;1:161–177. [Google Scholar]

27. Michie D, Spiegelhalter DJ, Taylor CC. Machine learning, neural and statistical classification. 1994 [Google Scholar]

28. Osuna E, Girosi F. Reducing the run-time complexity of support vector machines. Advances in Kernel Methods: support vector machines. 1999 [Google Scholar]

29. Owen A. Data squashing by empirical likelihood. Data Mining and Knowledge Discovery. 2003:101–113. [Google Scholar]

30. Pavlov D, Chudova D, Smyth P. Towards scalable support vector machines using squashing; Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining; 2000. pp. 295–299. [Google Scholar]

31. Pavlov D, Chudova D, Smyth P. Towards scalable support vector machines using squashing; Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining; 2000. pp. 295–299. [Google Scholar]

32. Pavlov D, Mao J, Dom B. Scaling-up support vector machines using boosting algorithm. Proceedings of the 15th International Conference on Pattern Recognition. 2000;2:219–222.

33. Platt J. Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning. The MIT Press; 1999. pp. 185–208.

34. Rätsch G, Onoda T, Müller K. Soft margins for AdaBoost. Machine Learning. 2001;42(3):287–320.

35. Schölkopf B, Sung K-K, Burges C, Girosi F, Niyogi P, Poggio T, Vapnik V. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing. 1997;45(11):2758–2765.

36. Schölkopf B, Mika S, Burges C, Knirsch P, Müller KR, Rätsch G, Smola A. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks. 1999;10(5):1000–1017.

37. Schölkopf B, Smola AJ. Learning with kernels. The MIT Press; 2002.

38. Shih YCL, Rennie JDM, Karger DR. Text bundling: statistics-based data reduction; Proceedings of the Twentieth International Conference on Machine Learning; 2003. pp. 696–703.

39. Sun B-Y, Huang D-S, Fang H-T. Lidar signal denoising using least-squares support vector machine. IEEE Signal Processing Letters. 2005;12(2):101–104.

40. Sung K-K, Poggio T. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20(1).

41. Tsang IW, Kwok JT, Cheung P-M. Core vector machines: fast SVM training on very large data sets. Journal of Machine Learning Research. 2005;6:363–392.

42. Vapnik V. Estimation of Dependences Based on Empirical Data. Springer; 2001.

43. Wang L, Sun S, Zhang K. A fast approximate algorithm for training L1-SVMs in primal space. Neurocomputing. 2007;70:1554–1560.

44. Williams CKI, Seeger M. Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems. The MIT Press; 2001. pp. 682–688.

45. Wu G, Chang E, Chen Y-K, Hughes C. Incremental approximate matrix factorization for speeding up support vector machines; Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2006. pp. 760–766.

46. Yu H, Yang J, Han J. Classifying large data sets using SVM with hierarchical clusters; Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining; 2003. pp. 306–315.

47. Zhu J, Hastie T. Kernel logistic regression and the import vector machine. 2001.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4467789/