Jupyter Snippet P4M IntroToDataScience
A basic introduction to Data Science
Disclaimer: This is not a full introduction and unashamedly focusses on the areas I’m most interested in (optimisation). We are going to spend a small amount of time looking at how to extract data contained in text files. The focus is then on giving you an introduction to how some of the machine learning algorithms work under the hood (rather than just calling the existing implementations from a library). Overall this is still very basic, with a thorough treatment requiring at least a semester-long course.
Scenario
In this exercise we are going to work through a fairly common scenario. After developing a couple of algorithms for solving a particular problem (or implementing some “standard” algorithms as well as your shiny new and hopefully improved algorithm), you test the algorithms on some instances and want to know: which algorithm is “best”? Or which algorithm is best for what type of data?
Here we are going to go through this type of analysis by focussing on alternative linear programming algorithms. So here are the elements that go into the analysis:
- The problem to be solved: $\min c^T x$ subject to $A x = b,\ x \ge 0$ where $A$ is a $m\times n$ matrix, $x$ is a vector of decision variables and $b$ and $c$ are appropriate constant vectors. In practice we normally allow some variants (for example inequality constraints, arbitrary lower and upper bounds on the variables).
- Algorithms: Three standard algorithms, primal simplex, dual simplex, and the barrier method, as implemented by the CPLEX solver library
- Instances: The MIPLIB2010 “standard” test instances. The data sets are actually for mixed integer problems (where some variables are required to be binary or general integers), but for these tests we will throw away the integrality requirements. There are 87 instances in this data set.
To make things easy you will not be required to re-run all of the tests but are given the results from running each algorithm on each instance.
Getting the data
In an ideal world the data would already be sitting in a database or simple CSV file for you to read. However this may not always be possible. It is not uncommon for the data to be sitting in text files (perhaps the log files produced as the algorithms are run). Since such text files can get quite large, it makes sense to compress these.
Your first task then is to obtain the data and display a few lines of one of the log files to see what these look like. That data can be found as a zip file at http://users.monash.edu/~andrease/Files/Other/cplexLP.zip. You could download this, unzip it and look at it - but we want to do things with python here. The basic library functions you need are:
- `urllib`: use the `urllib.request.urlopen("name.of.url/to-load")` function. Basic usage information is provided in this how-to. It is probably easiest to just read this data and write it to file.
- `zipfile`: this module provides methods for reading (and writing) zip archives. The basic usage is to create `zip=ZipFile(<file>,'r')` from an archive file, then use `zip.namelist()` to inspect the contents and `zip.open(name,'r')` to open the named file. Complete documentation is here.
- Note that all log files are ASCII. To convert a binary (ASCII) string to the standard python unicode string use `.decode("utf-8")`.
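If you want something to compare against, a minimal sketch of this step might look like the following (assuming the URL is reachable; the local file name cplexLP.zip is our own choice):

```python
from urllib.request import urlopen
from zipfile import ZipFile

# download the archive and save a local copy (the file name is our own choice)
raw = urlopen("http://users.monash.edu/~andrease/Files/Other/cplexLP.zip").read()
with open("cplexLP.zip", "wb") as f:
    f.write(raw)
print(len(raw))  # number of bytes downloaded

# open the archive, list a few members and peek at the start of one log file
zf = ZipFile("cplexLP.zip", "r")
print(zf.namelist()[:5])
name = next(n for n in zf.namelist() if not n.endswith("/"))  # skip any directory entry
with zf.open(name, "r") as logfile:
    for line in logfile.readlines()[:10]:
        print(line.decode("utf-8").rstrip())
```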
# add your code here to get http://users.monash.edu/~andrease/Files/Other/cplexLP.zip
519756
# Add code here to read one of the files
# Note: if the output is too long you may want to manually
# toggle scrolling under `Cell->Current Outputs` in the notebook menu
Extracting Features
For each instance we want to extract the following information:
- The name (identifier) for the data set. This is encoded in the name of each file in the Zip archive (after the cplexLP/) or in the log file itself as part of the “Problem name”
- Characteristics or features of the instance to be solved. The log file contains some “statistics” of the problem such as the number of constraints and variables, number of non-zero coefficients, etc. See the CPLEX documentation on displaying problem statistics for more information
- The solution algorithm used: this is again shown in the “Problem name” or can be seen as an integer in the log file as “method for linear optimisation” (1=primal, 2=dual, 4=barrier)
- The time required using the solution algorithm: You could take the solution time in seconds, but for the analysis here using the “Deterministic time” in “ticks” is likely to be more useful as a way of assessing which algorithm is more efficient. Having said that, you may also want to extract the time in seconds. See this brief discussion on ticks or CPLEX documentation for a little more information about what these mysterious ticks mean
In particular from the section of the log files that looks like this:
Objective sense : Minimize
Variables : 18380 [Fix: 7282, Box: 11098]
Objective nonzeros : 2
Linear constraints : 576 [Less: 486, Equal: 90]
Nonzeros : 109706
RHS nonzeros : 30
Variables : Min LB: 0.000000 Max UB: 212.0000
Objective nonzeros : Min : 51.00000 Max : 100.0000
Linear constraints :
Nonzeros : Min : 1.000000 Max : 224.0000
RHS nonzeros : Min : 1.000000 Max : 1.000000
...lines deleted...
Deterministic time = 296.40 ticks (956.13 ticks/sec)
We want to extract fields like the number of variables, the number of non-zeros, or the range of coefficient values.
Python functions and libraries to use
When dealing with text files, regular expressions are your friend. They form a mini-language of their own for expressing string patterns. Regular expressions are not unique to python, with relatively similar syntax used in a variety of tools (particularly in linux/unix). The regular expression documentation gives all the details, but to get you started the string `r"^[abc]\d*"` would match any line that
- starts at the beginning of a line (as indicated by the `^`)
- has one of the characters in the set {‘a’,‘b’,‘c’} (written as “[abc]") first
- is followed by zero or more digits (“\d” = digit, “*” means “zero or more”; you could use “+” for “one or more”)

Some other points to note:
- The string is written as a raw string (`r""`) to make sure that “\” is not interpreted as an escape character, since regular expressions make frequent use of the backslash.
- It is also particularly useful and easy to mark a substring as a group. For example `r"([xyz]+)\1"` means that we have a group consisting of one or more characters in {x,y,z} followed by the same group (group 1) repeating a second time.
- `import re` to get the library, then use `re.search(pattern,string)` to find a pattern in the given string; `re.search(pattern,string).group(1)` would give the substring of the first group at the point where the pattern was matched (if the pattern was found at all).
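For example, a small illustration of `re.search` and groups on a line in the format of the log excerpt shown earlier:

```python
import re

line = "Variables            :   18380  [Fix: 7282,  Box: 11098]"
# group 1 captures the total variable count, group 2 the number of fixed variables
m = re.search(r"^Variables\s*:\s*(\d+)\s*\[Fix:\s*(\d+)", line)
if m:
    print(m.group(1), m.group(2))   # -> 18380 7282
```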
Regular expression testing
To test your skills at writing regular expressions, try the following exercises:
import re
def testre(regex,string):
    "Simple test function to print all matches of a regular expression in a string"
    matches = re.findall(regex,string)
    print("%d:"%len(matches),";".join(matches))
    return matches
testre(__INSERT_REGEX_HERE__, # insert regular expressions to find any capitalised words with 2 letters or more
"""Some words, like Australia I know, have Capital letters, see Figure 3 for more"""
)# should find 4: Some;Australia;Capital;Figure
testre(__INSERT_REGEX_HERE__, # outer LaTeX environments: \begin{somename}up to\end{somename}
r"""\begin{enumerate}\item
\end{enumerate}\begin{nested}\begin{env}\end{env}\end{nested}
\begin{test}\end{fail} \begin{array}1 & 2\\3 & 4 \\end{array}
""") # should find 3: enumerate;nested;array
[float(f) for f in
testre(__INSERT_REGEX_HERE__,
# find floating point numbers (including integers and scientific notation)
"15.3 - -100 1.2e-03 NaN 1ee4 +2E+6 ") # 7: 15.3;-100;1.2e-03;NaN;1;4;+2E+6
] # output 15.3, -100.0, 0.0012, nan, 1.0, 4.0, 2000000.0]
4: Some;Australia;Capital;Figure
3: enumerate;nested;array
7: 15.3;-100;1.2e-03;NaN;1;4;+2E+6
[15.3, -100.0, 0.0012, nan, 1.0, 4.0, 2000000.0]
Extracting the data
Have a go at extracting some/all of the data. Store the data in a pandas DataFrame object. See the 10 minute introduction to Pandas if you have not come across this before. Essentially it is a “table like” data structure that makes it easy to manipulate data.
We want to extract the following fields relating to Variables (Var), Constraints (Con) or the Objective function (Obj): VarCnt, VarFix, VarBox, ObjNZ, ConCnt, ConLess, ConEqual, ConNZ, RhsNZ, VarMin, VarMax, ObjMin, ObjMax, ConMin, ConMax, RhsMin, RhsMax. We also want the performance of the algorithm: PrimalT, DualT or BarrierT for the ticks.
For this exercise there is no need to get all of the fields that might be interesting (as getting them all can be a bit fiddly). To get a complete data set we will load a DataFrame of results extracted from these files below.
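To give a feel for how the pieces fit together, here is a rough sketch that pulls just a few of the fields out of each log file. It assumes the archive was saved locally as cplexLP.zip as above; the patterns and column names would need to be extended for the full field list, and the file name still needs to be split into instance name and algorithm:

```python
import re
import pandas
from zipfile import ZipFile

rows = {}
with ZipFile("cplexLP.zip", "r") as zf:
    for name in zf.namelist():
        if name.endswith("/"):
            continue                                   # skip the directory entry
        text = zf.open(name, "r").read().decode("utf-8")
        row = {}
        m = re.search(r"Variables\s*:\s*(\d+)", text)
        if m: row["VarCnt"] = int(m.group(1))
        m = re.search(r"Linear constraints\s*:\s*(\d+)", text)
        if m: row["ConCnt"] = int(m.group(1))
        m = re.search(r"Deterministic time = ([\d.]+) ticks", text)
        if m: row["Ticks"] = float(m.group(1))
        rows[name] = row

sketch = pandas.DataFrame(rows).T                      # one row per log file
sketch.head()
```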
import re,pandas
import numpy as np
###### insert your code here ###############
data # finish by displaying the DataFrame obtained
# let's just check for missing values
data[np.logical_or.reduce(data.isnull(),axis=1)]
Starting with just the data
To make things easy, here is a way to just start with the data you should have by now (you can also skip the above and just start from here). As a reminder on using data frames, this pandas selection tutorial provides a good summary of how to access rows, columns, etc.
Also to make things easy there is a version of the data that has:
- All timing (tick) information for the different algorithms removed
- The min/max values of coefficients replaced with the magnitude (log) of the range
- All columns scaled to have entries in the range $[-1,1]$ (to have similar magnitudes)
You can load the raw data (including ‘PrimalT’, ‘DualT’ and ‘BarrierT’ performance fields) and the preprocessed features using the commands below:
import pandas
from urllib.request import urlopen
data = pandas.read_json(urlopen("http://users.monash.edu/~andrease/Files/Other/cplexLP.json"))
features = pandas.read_json(urlopen("http://users.monash.edu/~andrease/Files/Other/features.json"))
features.head(5) # show first few rows
Reload using: data = pandas.read_json(open("cplexLP.json","r"))
Visualising our data
The set of features is fairly high dimensional. How do we make sense of this? Let’s try a plot. For this we need to reduce the data to two dimensions. Enter PCA (Principal Component Analysis). For most of the “machine learning” type algorithms in this workshop we will use the scikit-learn set of libraries, which includes a PCA method. However, so that we can see more easily what is going on, let’s do it directly using the numpy matrix routines.
Useful functions:
- `data.values` gives the whole frame as a matrix.
- We need to form the covariance matrix (subtract the column mean, multiply by the transpose and divide by n-1): $cov= \frac{1}{n-1}(X-I\bar x)^T (X-I\bar x)$. For this you need `np.mean`, which takes an argument `axis=d` to take the mean in the d’th dimension, and `X.T` is the transpose of numpy matrix `X`. Or just call `numpy.cov` - but be careful not to get features & observations (columns and rows) mixed up.
- `np.linalg.eig(X)` returns the (eigenvalues, eigenvectors) as a vector/matrix respectively.
- We only want to keep the eigenvectors corresponding to the two largest eigenvalues and plot the projection onto these two axes.
- I’m assuming you are using matplotlib (`matplotlib.pyplot.scatter`) but you can use anything else for plotting.
Question: Is the scaling of the different attributes proposed above sufficient? Should we do something more radical?
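For reference, a minimal sketch of the eigenvector approach (using np.linalg.eigh, the symmetric variant of eig, to avoid spurious complex values; features is the scaled DataFrame loaded above):

```python
import numpy as np
import matplotlib.pyplot as plt

X = features.values                     # observations (rows) x features (columns)
Xc = X - X.mean(axis=0)                 # subtract the column means
cov = Xc.T @ Xc / (len(Xc) - 1)         # covariance matrix of the features
evals, evecs = np.linalg.eigh(cov)      # eigenvalues returned in ascending order
proj = Xc @ evecs[:, -2:]               # project onto the two largest eigenvectors
plt.scatter(proj[:, 0], proj[:, 1])
plt.xlabel("component 1"); plt.ylabel("component 2")
plt.show()
```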
import numpy as np
import matplotlib.pyplot as plt
## insert your code here
For alternative, more complicated ways to map data down to 2 dimensions there are plenty of other tools in the sklearn library (look at Manifold Learning). Looking at the underlying algorithms is well beyond the scope of this tutorial. However, you may want to have a go at a few of these, such as `TSNE` or `MDS`. Note that some of these manifold learning methods are randomised, so they won’t return the same result each time.
import numpy as np
from sklearn.manifold import TSNE, MDS,LocallyLinearEmbedding,SpectralEmbedding
## you may want to give this a go ###
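For example, a t-SNE embedding could be plotted along these lines (random_state is set only so the plot is reproducible):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, random_state=0).fit_transform(features.values)
plt.scatter(emb[:, 0], emb[:, 1])
plt.title("t-SNE embedding of the instance features")
plt.show()
```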
Unsupervised learning - clustering
In unsupervised learning we are looking for some structure without having any real idea of what is “correct”. The simplest case of unsupervised learning is clustering - grouping data into clusters of points that are close together.
The simplest way to do this is to apply the `cluster.KMeans` model from the `sklearn` library. See this tutorial for more information.
Try splitting the data (based on the features table) into 3 clusters.
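A minimal sketch of the library call (the labels_ attribute of the fitted model gives the cluster index of each instance, which can be passed to the c= argument of a scatter plot):

```python
from sklearn import cluster

ncluster = 3
k_means = cluster.KMeans(n_clusters=ncluster)
k_means.fit(features.values)            # fit on the scaled feature matrix
print(k_means.labels_)                  # cluster index (0, 1 or 2) for each instance
```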
from sklearn import cluster
ncluster=3
### insert your code here and plot the results ###
Writing our own k-means algorithm
The basic algorithm is very simple - so simple that we can write our own:
- Pick an initial assignment of points to clusters (or an initial set of centres of the clusters)
- Repeat until converged:
  - Assign each point to the nearest centre (euclidean distance)
  - Move the centre of each cluster to the centroid (average of all points in the cluster)
This method is a heuristic: It is not guaranteed to give an optimal solution (the one with the least average distance of points to their centre) and it will converge to different solutions depending on the initial solution we start with.
Some useful numpy functions:
- `np.random.randint(low,high,n)` returns an array of length n of integers in `range(low,high)`
- `np.mean(matrix,axis=d)` = array of means (along either columns or rows depending on the axis)
- `np.linalg.norm` = norm of an array
- `np.argmin(matrix,axis=d)` = like mean, iterates over elements along just one axis and returns the index of the minimum
- `X[v > 0]` returns a submatrix of X depending on which elements of `v` are positive. Here `v` may be of length number of rows (to select whole rows) or the same dimensions as X (to select a submatrix). Similarly `X[:,v>0]` would select a submatrix based on the columns where `v` is positive.
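As a point of comparison, one possible way to implement the heuristic is sketched below (the function name kmeansHeurSketch is ours; it starts from a random assignment and stops when the labels no longer change):

```python
import numpy as np

def kmeansHeurSketch(data, p):
    """Sketch of the k-means heuristic: returns an array of labels in range(p)."""
    X = np.array(data)
    n = len(X)
    lbl = np.random.randint(0, p, n)              # random initial assignment
    while True:
        # centroid of each current cluster (re-seed any empty cluster with a random point)
        centres = np.array([X[lbl == k].mean(axis=0) if np.any(lbl == k)
                            else X[np.random.randint(n)] for k in range(p)])
        # distance of every point to every centre, then pick the nearest centre
        dist = np.array([[np.linalg.norm(x - c) for c in centres] for x in X])
        newlbl = np.argmin(dist, axis=1)
        if np.all(newlbl == lbl):                 # converged: labels unchanged
            return lbl
        lbl = newlbl
```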
import numpy as np
def kmeansHeur(data,p):
    """Given a n x m matrix with n points and m features
    return list/np.array length n of labels (in range(p))
    """
    data = np.array(data) # just to convert from DataFrame or other data type
    ### insert your algorithm here #####
    return lbl
lblsHeur = kmeansHeur(features,ncluster) # fit model to data
print("Points per cluster=",[sum(1 for i in lblsHeur if i==k) for k in range(ncluster)])
# do a scatter plot here using c=lblsHeur to get the dots coloured by cluster
This answer is probably not the same as the one you got with KMeans from sklearn. Which one is “right”, or at least “better”? We need to formally define the objective function that we are trying to optimise. Essentially we are minimising the sum of squared norm distances: $$\min_{c_k,C_k} \sum_k \sum_{i\in C_k} ||c_k - x_i||^2$$ where $c_k$ and $C_k$ are the centre of cluster $k$ and the set of points in the cluster respectively. Each $x_i$ is a point of the data.
Compute the objective function for both your solution and the KMeans solution. Which one is better? Is either of these optimal (and how would we know?)
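A possible sketch of such a computation (named clusterObjSketch to keep it separate from the version you write below):

```python
import numpy as np

def clusterObjSketch(data, label):
    """Sum of squared distances of each point to the centre of its cluster (a sketch)."""
    X = np.array(data)
    label = np.array(label)
    total = 0.0
    for k in set(label):
        pts = X[label == k]               # all points assigned to cluster k
        centre = pts.mean(axis=0)         # centroid of the cluster
        total += ((pts - centre) ** 2).sum()
    return total
```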
def clusterObj(data,label):
    "Input: feature matrix (pts x features) and a cluster label for each point."
    data = np.array(data)
    # compute the centre for each cluster and return the sum of squared distances
clusterObj(features,k_means.labels_),clusterObj(features,lblsHeur)
(35.92617097573589, 44.328479351670445)
Note that our method is a randomised heuristic and the result depends on the starting point. Rerun your `kmeansHeur` repeatedly (say 1000 times) and keep the best result. How does this compare to the result from a single run of `sklearn.cluster.KMeans`?
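A sketch of the restart loop, assuming you have completed kmeansHeur and clusterObj above (1000 restarts of a pure-python heuristic may take a little while; reduce the count if needed):

```python
best, bestlbl = float("inf"), None
for _ in range(1000):
    lbl = kmeansHeur(features, ncluster)    # one run from a random starting assignment
    obj = clusterObj(features, lbl)
    if obj < best:                          # keep the best solution seen so far
        best, bestlbl = obj, lbl
print("Best objective over the restarts:", best)
```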
### insert your code here ####
Supervised Learning - Classification with Support Vector Machines
We might want to figure out which algorithm works fastest for a given problem. That is, given a linear program and its characteristics, can we decide which algorithm to use to get the solution as fast as possible? Support vector machines are one model for such a classification task. The model supposes that there is a linear function of the factors that determines which of 2 classes a point belongs to. That is, that there exists a set of weights $w_i$ and a constant $W$ such that $\sum_i w_i f_i < W$ if one algorithm is faster and $\sum_i w_i f_i > W$ when the other method is faster, where $f_i$ are the features of the instance we are interested in solving. How do we decide what the $w_i$ (and $W$) should be? This is supervised learning, where we use existing training data to fit the parameters.
Let’s start by trying to work out for which instances the primal vs dual simplex method is faster. We have `PrimalT` and `DualT` that tell us which is faster for the test data. So we want to find a vector $w$ and constant $W$ such that $\sum_i w_i f_{ki} \le W-1$ if `PrimalT` < `DualT` and $\sum_i w_i f_{ki} \ge W+1$ otherwise (where $f_{ki}$ is the i-th feature of instance k). Finding such a set of $w_i$ and $W$ is just a linear programming problem. Some things we need to consider are:
- The limit of 1 is just to ensure we get some minimum separation (we could pick any number, as multiplying each row by a constant would not change the problem other than to increase the non-zero difference between the LHS and $W$).
- If the instances can be separated, we want to make the 2 hyperplanes (defined by the $\pm1$ on the right hand side) as far apart as possible (so that we get a clean separation). The separation distance is $2/||w||$. So maximising the separation is equivalent to $\min ||w||^2$, which gives us a quadratic program with linear constraints.
- It may not be possible to separate the points cleanly into two groups with a single hyperplane. Hence we might want to penalise any point that is mis-classified, perhaps based on how far it is away from the hyperplane. Let $v_k$ be the distance that point $k$ is away from the hyperplane; then we might want to $\min ||w||^2 + \sum_k v_k$ with $\sum_i w_i f_{ki} \le W-1+v_k$ (and the corresponding $\ge$ constraints for the other points).
- We can put the two objectives together - by noting that maximisation is the same as minimising the negative and placing greater priority on minimising violations: $\min ||w||^2 + \alpha \sum_k v_k$ subject to $\sum_i w_i f_{ki} \le W+v_k-s$ for $k$ such that the primal solution is better (with corresponding $\ge$ constraints for the other points). Here $\alpha$ is an arbitrary weight that provides a trade-off between violations (misclassifications) and the distance separating the hyperplanes.
- Some variants of this are possible. For example we could replace the $||w||^2$ term by $\max_i |w_i|$ (for example), which can be written in a linear program, or even drop it entirely. When there are violations these are minimised if $||w||$ is minimised, so we may not need this.
Hence for an initial experiment we want to set up an SVM training function that takes our features as input data together with the list of instances for which `data['PrimalT'] < data['DualT']` (this python expression returns a boolean column). Let `I` be those instances and `IC` be the complement. Then we want to solve the following linear program:
$$\min_{v,w,W} \sum_{i\in I\cup IC} v_i$$
$$s.t.\ \sum_{f\in F} D_{if} w_f \le W - 1 + v_i\quad\forall\ i\in I$$
$$\quad \sum_{f\in F} D_{if} w_f \ge W + 1 - v_i\quad\forall\ i\in IC$$
$$ v_i \ge 0\quad\forall\ i$$
Here $D_{if}$ is the data for instance $i$ and feature $f$ out of the set of features $F$.
How to set this LP up using the CPLEX solver:
- `import docplex.mp.model as MP`
- `mdl = MP.Model()` creates a mathematical programming (optimisation) model
- `x = mdl.binary_var_list(n)` creates a list of variables `x[0]` … `x[n-1]` that are all in {0,1} - not needed here
- `x = mdl.continuous_var_list(n,lb=-mdl.infinity,ub=3*np.ones(n))` creates a list of variables $-\infty < x[i] \le 3$ for i=0,…,n-1
- `mdl.add( mdl.sum(x[i] for i in range(n)) == 1)` adds the constraint $\sum_{i=0}^{n-1} x_i=1$
- `mdl.minimize( x[0]+x[1]**2)` sets the objective (linear + quadratic term). Note: CPLEX handles linear and quadratic programming problems but not more general non-linear optimisation problems
- `CPXparam = mdl.context.cplex_parameters` accesses parameters - complete list of CPLEX parameters. It may be a good idea to set `CPXparam.timelimit=60` seconds (to stop it running too long if the problem is not set up correctly) and `CPXparam.threads=1`. This model should run in no more than a second though.
- `solution = mdl.solve(log_output=True)` does the optimisation and logs some information to the screen as it runs. `solution == None` if the solver fails to obtain a solution. `mdl.get_solve_status()` gives the final solution status (as a string)
- `solution.get_value(x[0]+x[1])` would return the final value of `x[0]+x[1]` in the solution. Alternatively you can also use `solution[x[2]]` to get the value of `x[2]` in the solution. (More information on the solution class: see SolveSolution)
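Pulling those calls together, a sketch of the linear program above might look like this (the names SVMsketch, w, W and v are our own; it assumes docplex with a working CPLEX installation, and uses continuous_var for the single constant $W$ in addition to the calls listed above):

```python
import numpy as np
import docplex.mp.model as MP

def SVMsketch(features, label):
    """Sketch of the LP above: returns the weights w_f followed by the constant W."""
    D = np.array(features)
    label = np.array(label)
    npts, nfeat = D.shape
    mdl = MP.Model()
    w = mdl.continuous_var_list(nfeat, lb=-mdl.infinity)   # one weight per feature
    W = mdl.continuous_var(lb=-mdl.infinity)               # the constant term
    v = mdl.continuous_var_list(npts)                      # violations (>= 0 by default)
    for i in range(npts):
        lhs = mdl.sum(float(D[i, f]) * w[f] for f in range(nfeat))
        if label[i]:                                       # e.g. PrimalT < DualT
            mdl.add(lhs <= W - 1 + v[i])
        else:
            mdl.add(lhs >= W + 1 - v[i])
    mdl.minimize(mdl.sum(v))                               # minimise total violation
    solution = mdl.solve(log_output=False)
    return [solution.get_value(w[f]) for f in range(nfeat)] + [solution.get_value(W)]
```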
# Mathematical programming optimisation model
import docplex.mp.model as MP
def SVM(features,label):
    """features: the usual matrix of points x features
    label: a list/array (length number of points) of bools/integers that
           identifies the points to be separated from the rest
    Returns: array/list length number of features+1 of w_i plus the constant W"""
    # insert your code here and return the solution
    return [solution.get_value(w[f]) for f in F]
soln = SVM(features,data['PrimalT']<data['DualT'])
print("Weights =",soln)
What can we do to improve the fit? How do we deal with non-linear separations? Try adding additional “features” that capture the non-linear effects. For example, we might expect that the effectiveness depends on the size or density of the constraint matrix (density = ‘ConNZ’ / (‘VarCnt’ x ‘ConCnt’)). Create an extended feature set and try again. The idea here is that if we have, say, a two dimensional set of features (x,y for each point) and we wanted to separate those points that are inside an ellipse centred at the origin from those outside, then we can find such a separating ellipse by including features $x^2$ and $y^2$, with the optimisation choosing some combination $w_x x^2 + w_y y^2 \le W$ giving us an ellipse while still solving a linear problem (since the $x^2$ is just a constant coefficient for the variable $w_x$ in the optimisation).
Try creating some additional features. Remember, for `DataFrames` we can easily create new columns by doing arithmetic with whole columns. For example `features['eqSq'] = features['ConEqual']**2` would create a new column `eqSq` that contains the number of equality constraints (`ConEqual`) squared.
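For instance, a hypothetical extension could add the density ratio mentioned above plus a squared term or two. The column names below are assumptions (they suppose the raw counts are available in data and features with these names, and that data and features share the same index); whether such features actually help, and whether they need re-scaling, is exactly what the next experiment tests:

```python
expFeat = features.copy()                  # expanded set of features
# density of the constraint matrix: nonzeros / (variables * constraints)
expFeat['Density'] = data['ConNZ'] / (data['VarCnt'] * data['ConCnt'])
# a couple of simple non-linear transformations of existing (scaled) features
expFeat['ConNZSq'] = features['ConNZ'] ** 2
expFeat['VarCntSq'] = features['VarCnt'] ** 2
```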
expFeat = features.copy() # expanded set of features
### add features here ####
expSoln = SVM(expFeat,data['PrimalT']<data['DualT'])
How well is our classifier doing?
To analyse the performance we should think about:
- Do we care about:
  - how far we are on the wrong side of the line? (this is what we are optimising at the moment)
  - how many instances are classified incorrectly?
  - how much extra compute time we would incur if we used the minimum?
- How well it works for data it hasn’t seen?
Below, compute some alternative measures of the quality of the fit. Then address the second issue by using a random split of the instances to create 4 groups (of about 21 each). We then use 3 parts as the “training” data to fit the SVM model, and the other group for testing, to validate that the model is actually OK on data that wasn’t used for the training. By repeating this 4 times, once for each group, we can compute a more accurate average performance of the approach on this type of data.
def SVMviolation(feat,w,lower=data['PrimalT'],higher=data['DualT']):
    """Input: feat = matrix of features, w - list/array of weights (length features + 1),
    higher/lower = performance measure
    Returns total violation of sum( w[i]*feat[i]) <= W or >= W depending on if lower[i] < or > higher[i]"""
    ### insert your code here

def SVMextraT(feat,w,lower=data['PrimalT'],higher=data['DualT']):
    """Input: feat = matrix of features, w - list/array of weights (length features + 1),
    higher/lower = performance measure (if feat * w < W we run the 'lower' method)
    Returns total extra time for running the slower algorithm"""
    #### insert your code here

def SVMcount(feat,w,lower=data['PrimalT'],higher=data['DualT']):
    """Input: feat = matrix of features, w - list/array of weights (length features + 1),
    higher/lower = performance measure (if feat * w < W we run the 'lower' method)
    Returns the number of instances that are misclassified"""
    ### insert your code here
print("Results from training on whole data")
print("Effectiveness of solution: %.2f viol, %.2f ticks, %d misclassified"%(
SVMviolation(features,soln),SVMextraT(features,soln),SVMcount(features,soln)))
print("Expanded feature set soln: %.2f viol, %.2f ticks, %d misclassified"%(
SVMviolation(expFeat,expSoln),SVMextraT(expFeat,expSoln),SVMcount(expFeat,expSoln)))
Now partition your data into 4 subsets of about 21 instances each. Take each subset in turn as the test data and use the remainder to train (fit) the SVM. How well, on average, does this approach work for data it hasn’t been trained on?
Note: while in general one might want to do the splitting randomly, this could lead to somewhat misleading results. The data sets are very variable in size & complexity, so it would make sense to split them a bit more systematically to ensure none of the groups contain only very big/complex problems or only small trivial data sets. If you are feeling creative, make up a way to split the instances that is deterministic, or at least less random and more likely to be balanced.
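One simple way to set up the split is sketched below (purely random, so a stratified or sorted split along the lines suggested above would be a refinement; the commented line shows where your SVM fit on the training part would go):

```python
import numpy as np

nGroup = 4
group = np.arange(len(features)) % nGroup   # group id 0..3 for each instance
np.random.shuffle(group)                    # make the assignment random
faster = data['PrimalT'] < data['DualT']
for g in range(nGroup):
    test = (group == g)                     # hold this group out for testing
    train = ~test                           # train on the other three groups
    # w = SVM(features[train], faster[train])   # fit on the training data only
    # ... evaluate w on features[test] here ...
```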
# do things correctly using a subset of data for training only
def SVMtest(feat,nGroup=4,method=SVM,lower=data['PrimalT'],higher=data['DualT']):
    """Split features feat into nGroup groups, train using method, evaluate using lower/higher"""
    ### insert your code here
    return (violation,extraT,count) # total performance across all 4 groups with the 3 performance measures
print("Results when using original features: %.2f violation, %.2f ticks, %d misclassified"%
SVMtest(features))
print("Results when using expanded features: %.2f violation, %.2f ticks, %d misclassified"%
SVMtest(expFeat))
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 2 rows and 18 columns.
Aggregator did 1 substitutions.
Reduced LP has 55 rows, 57 columns, and 745 nonzeros.
Presolve time = 0.00 sec. (0.14 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 25.846194539417073 7/3 points misclassified out of 58
Separation = 0.0001707715384478243
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 2 rows and 14 columns.
Reduced LP has 68 rows, 74 columns, and 920 nonzeros.
Presolve time = 0.00 sec. (0.13 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Iteration: 62 Dual objective = 33.875482
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 34.94454272579978 5/11 points misclassified out of 70
Separation = 3.8695707264822075e-05
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 13 columns.
Reduced LP has 64 rows, 69 columns, and 889 nonzeros.
Presolve time = 0.00 sec. (0.12 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Iteration: 62 Dual objective = 37.870247
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 38.05281703487499 5/13 points misclassified out of 64
Separation = 0.001642740150009337
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 11 columns.
Reduced LP has 69 rows, 76 columns, and 959 nonzeros.
Presolve time = 0.00 sec. (0.13 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Iteration: 62 Dual objective = 35.678993
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 35.904024755019826 3/14 points misclassified out of 69
Separation = 0.007035393760542038
Performance with 4 groups:1856.27 viol, 9805082.54 ticks, 48 misclassified
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 18 columns.
Reduced LP has 62 rows, 68 columns, and 1230 nonzeros.
Presolve time = 0.00 sec. (0.17 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Perturbation started.
Iteration: 51 Dual objective = 0.000000
Removing perturbation.
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 16.6574843028596 2/6 points misclassified out of 62
Separation = 0.0005100711480481556
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 16 columns.
Reduced LP has 61 rows, 69 columns, and 1202 nonzeros.
Presolve time = 0.00 sec. (0.16 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Iteration: 62 Dual objective = 22.943956
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 23.44995181413229 6/6 points misclassified out of 61
Separation = 0.0007552419195052046
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 1 rows and 20 columns.
Reduced LP has 67 rows, 72 columns, and 1301 nonzeros.
Presolve time = 0.00 sec. (0.21 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 21.538678238361175 4/2 points misclassified out of 68
Separation = 1.1258324539094591e-05
CPXPARAM_Read_DataCheck 1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 17 columns.
Reduced LP has 70 rows, 77 columns, and 1385 nonzeros.
Presolve time = 0.00 sec. (0.19 ticks)
Initializing dual steep norms . . .
Iteration log . . .
Iteration: 1 Dual objective = 0.000000
Iteration: 62 Dual objective = 16.821517
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation = 18.917275053481482 5/1 points misclassified out of 70
Separation = 0.0001448685861530269
Performance with 4 groups:1125.87 viol, 9787073.67 ticks, 36 misclassified
(1125.871757496142, 9787073.669999996, 36)
Extension exercises
- We could modify our SVM approach to use an objective that better matches what we want to achieve (e.g. minimise additional computational effort from misclassified instances)
- Use the sklearn builtin method `LinearSVC` to implement a support vector machine. Documentation is available here. The basic approach is:
from sklearn import svm
mdl = svm.LinearSVC()
mdl.fit(X,y) # X = training data, y is -1 when PrimalT < DualT and +1 otherwise
mdl.predict(x) # predict outcome using test data x