Friday, November 10, 2017

Linear Programming and Chebyshev Regression

The LAD (Least Absolute Deviation) or \(\ell_1\) regression problem (minimize the sum of the absolute values of the residuals) is often discussed in Linear Programming textbooks: it has a few interesting LP formulations [1]. Chebyshev or \(\ell_{\infty}\) regression is a little bit less well-known. Here we minimize the maximum (absolute) residual.

\[\begin{align}\min_{\beta}\>&\max_i \> |r_i|\\
&y-X\beta = r\end{align}\]

As in [1],  \(\beta\) are coefficients to estimate and  \(X,y\) are data. \(r\) are the residuals.

Some obvious and less obvious formulations are:

  • Variable splitting:\[\begin{align}\min\>&z\\& z \ge  r^+_i + r^-_i\\&r^+_i - r^-_i =y_i –\sum_j X_{i,j}\beta_j\\&r^+_i, r^-_i\ge 0\end{align}\] With variable splitting we use two non-negative variables \(r^+_i - r^-_i\) to describe a value \(r_i\) that can be positive or negative. We need to make sure that one of them is zero in order for \(r^+_i + r^-_i\) to be equal to the absolute value \(|r_i|\). Here we have an interesting case, as we are only sure of this for the particular index \(i\) that has the largest absolute deviation (i.e. where \(|r_i|=z\)). In cases where \(|r_i|<z\) we actually do not  enforce \(r^+_i \cdot r^-_i = 0\). Indeed, when looking at the solution I see lots of cases where \(r^+_i > 0, r^-_i > 0\). Effectively those \(r^+_i,  r^-_i\) have no clear physical interpretation. This is very different from the LAD regression formulation [1] where we require all \(r^+_i \cdot r^-_i = 0\).
  • Bounding:\[\begin{align}\min\>&z\\ & –z  \le y_i –\sum_j X_{i,j}\beta_j \le z\end{align}\]Here \(z\) can be left free or you can make it a non-negative variable (it will be non-negative automatically). Note that there are actually two constraints here: \(–z  \le y_i –\sum_j X_{i,j}\beta_j\) and \(y_i –\sum_j X_{i,j}\beta_j \le z\). This model contains the data twice, leading to a large number of nonzero elements in the LP matrix.
  • Sparse bounding: This model tries to remedy the disadvantage of the standard bounding model by introducing extra free variables \(d\) and extra equations:\[\begin{align}\min\>&z\\ & –z  \le d_i \le z\\& d_i = y_i –\sum_j X_{i,j}\beta_j \end{align}\] Note again that \(–z  \le d_i \le z\) is actually two constraints: \(–z  \le d_i\) and \(d_i \le z\).
  • Dual:\[\begin{align}\max\>&\sum_i y_i(d_i+e_i)\\&\sum_i X_{i,j}(d_i+e_i) = 0 \perp \beta_j\\&\sum_i (d_i-e_i)=1\\&d_i\ge 0, e_i\le 0\end{align}\]The estimates \(\beta\) can be found to be the duals of equation \(X^T(d+e)=0\).

We use the same synthetic data as in [1] with \(m=5,000\) cases, and \(n=100\) coefficients. Some timings with Cplex (default LP method) yield the following results (times are in seconds):


Opposed to the LAD regression example in [1], the bounding formulation is very fast here. The dual formulation remains doing very good.

Historical Note

The use of this minimax principle goes back to Euler (1749) [3,4].

Leonhard Euler (1707-1783)

  1. Linear Programming and LAD Regression,
  2. A.Giloni, M.Padberg, Alternative Methods of Linear Regression, Mathematical and Computer Modeling 35 (2002), pp.361-374.
  3. H.L. Harter, The method of least squares and some alternatives, Technical Report, Aerospace Research Laboratories, 1972.
  4. L. Euler, Pièce qui a Remporté le Prix de l'Academie Royale des Sciences en 1748, sur les Inegalités de Mouvement de Saturn et de Jupiter. Paris (1749).

Thursday, November 9, 2017

Linear Programming and LAD Regression

I believe any book on linear programming will mention LAD (Least Absolute Deviation) or \(\ell_1\) regression: minimize the sum of the absolute values of the residuals.

\[\begin{align}\min_{\beta}\>&\sum_i |r_i|\\
&y-X\beta = r\end{align}\]

Here \(\beta\) are coefficients to estimate. So they are the decision variables in the optimization model. \(X,y\) are data (watch out here: in many optimization models we denote decision variables by \(x\); here they are constants). \(r\) are the residuals; it is an auxiliary variable that can be substituted out.

The standard LP models suggested for this problem are not very complicated, but still interesting enough to have them cataloged.

There are at least three common LP formulations for this: variable splitting, bounding and a dual formulation. Here is a summary:

  • Variable splitting:\[\begin{align}\min\>&\sum_i r^+_i + r^-_i\\&r^+_i - r^-_i =y_i –\sum_j X_{i,j}\beta_j\\&r^+_i, r^-_i\ge 0\end{align}\]In this model, automatically one of the pair \((r^+_i,r^-_i)\) will be zero. We don’t have to add an explicit complementarity condition \(r^+_i \cdot r^-_i = 0\). This is fortunate: we can keep the model linear.
  • Bounding:\[\begin{align}\min\>&\sum_i r_i\\&-r_i \le y_i –\sum_j X_{i,j}\beta_j \le r_i\end{align}\]Here \(r_i\) can be left free or you can make it a non-negative variable. It will be non-negative automatically. Note that there are actually two constraints here: \(-r_i \le y_i –\sum_j X_{i,j}\beta_j\) and \(y_i –\sum_j X_{i,j}\beta_j\le r_i\). This formulation is mentioned in [1].
  • Sparse bounding:
    In the standard bounding formulation we have all the data \(X_{i,j}\) twice in the model, leading to a large number of non-zero elements in the LP matrix. We can alleviate this by introducing an extra free variable \(d\): \[\begin{align}\min\>&\sum_i r_i\\&-r_i \le d_i \le r_i\\&d_i = y_i –\sum_j X_{i,j}\beta_j\end{align}\] Effectively we reduced the number of non-zeros by a factor of two compared to the standard bounding formulation. Note that the first constraint \(-r_i\le d_i \le r_i\) is actually two constraints: \(-r_i\le d_i\) and \(d_i\le r_i\). The sparse version will have fewer non-zero elements, but many more constraints and variables. Advanced LP solvers are based on sparse matrix technology and the effect of more nonzeros is often underestimated by novice users of LP software. The same arguments and reformulations can be used in \(\ell_1\) portfolio models [3].
  • Dual:\[\begin{align}\max\>&\sum_i y_i d_i\\&\sum_i X_{i,j} d_i=0 \perp \beta_j \\&-1\le d_i \le 1\end{align}\] The optimal values for \(\beta\) can be recovered from the duals for the constraint \(X^Td=0\) (this is what the notation \(\perp\) means).

One could make the argument the last formulation is the best: it has fewer variables than variable splitting, and fewer equations than the bounding approach. In addition, as mentioned before, the standard bounding formulation has the data twice in the model, resulting in a model with many nonzero elements.

Modern LP solvers are not very well suited for these type of models. They like very sparse LP models, while these models are very dense. Let’s try anyway with a large, artificial data set with \(m=5,000\) cases, and \(n=100\) coefficients. The data matrix \(X\) has 500,000 elements. Some timings with Cplex (default LP method) yield the following results:


Times are in seconds.

The dual formulation seems indeed quite fast. It is interesting that the bounding model (this formulation is used a lot) is actually the slowest. Note that these results were obtained using default settings. This effectively means that Cplex selected the dual simplex solver for all these instances. These timings will change when the primal simplex method or the barrier algorithm is used.


The R package L1pack [5] contains a dense, simple (compared to say Cplex), specialized version of the Simplex method based on an algorithm from [6,7]. It is actually quite fast on the above data:


The number of Simplex iterations is 561. This confirms that sparse, general purpose LP solvers have a big disadvantage on this problem (usually Cplex would beat any LP algorithm you can come up yourself, by a large margin).

See the comment from Robert Fourer for some notes on the Barrodale-Roberts algorithm. So I think I need to refine my previous statement a bit: there are two reasons why l1fit is doing so well: (1) dense data using a dense code and (2) a specialized Simplex method handling the absolute values. The number of iterations is close to our dual formulation (which is also very compact), so the time per iteration is about the same as Cplex.

Historical note

The concept of minimizing the sum of absolute deviations goes back to [4].


The term quaesita refers to “unknown quantities”.

F.Y. Edgeworth, 1845-1926.

I noticed that Edgeworth was from Edgeworthstown, Ireland (population: 2,335 in 2016).

  1. Least absolute deviations,
  2. A.Giloni, M.Padberg, Alternative Methods of Linear Regression, Mathematical and Computer Modeling 35 (2002), pp.361-374.
  3. L1 portfolio formulation,
  4. Francis Ysidro Edgeworth, On observations relating to several quantities, Hermathena 6 (1887), pp. 279-285
  5. Osorio, F., and Wołodźko, T. (2017). L1pack: Routines for L1 estimation,
  6. Barrodale, I., and Roberts, F.D.K. (1973). An improved algorithm for discrete L1 linear approximations. SIAM Journal of Numerical Analysis 10, 839-848
  7. Barrodale, I., and Roberts, F.D.K. (1974). Solution of an overdetermined system of equations in the L1 norm. Communications of the ACM 17, 319-320.

Tuesday, November 7, 2017

Suguru: GAMS/MIP vs Python/Z3

 For many logical puzzles a constraint solver like Z3 may be more suitable than a Mathematical Programming approach using a MIP solver. In [1] a MIP formulation was presented for Suguru puzzle. Now let’s see how much difference we observe when we use Z3.

There are two major implications for the model itself:

  1. We can use integers directly, i.e.  \(x_{i,j} \in \{1,\dots,u_{i,j}\}\) instead of \(x_{i,j,k} \in \{0,1\}\).
  2. We have access to constructs like Distinct and != (unequal).

However most of the code is devoted to data manipulation as we will see below. Although the approaches are different (different languages, different solver technologies), the actual code is remarkably similar.

Data entry: GAMS

First we need to get the data into GAMS. Unfortunately we need to enter two matrices: one for the blocks and one for the initial values.

'rows' /i1*i7/
'columns' /j1*j9/
'values' /1*9/
'blocks' /b1*b15/
table blockno(i,j)
j1 j2 j3 j4 j5 j6 j7 j8 j9
i1   1  1  2  2  2  3  3  4  6
i2   1  1  2  2  3  3  4  4  6
i3   1  7  8  8  3  4  4  5  6
i4   7  7  7  8  8  9  9  6  6
i5   7 10 10  8 12  9  9 14 15
i6  11 11 10 10 12 12  9 15 15
i7  11 11 10 12 12 13 13 15 15
table initval(i,j)
j1 j2 j3 j4 j5 j6 j7 j8 j9
i1   1        1
i2      2                 3
i3   4     4              1  4
i4            1
i5   4     5     4     5  1
i6                        3
i7               2           5

The set k is a bit of an over-estimate: we actually need only values 1 through 5 for this problem (the maximum size of a block is 5).

Data entry: Python

Entering the same boards in Python can look like:

from z3 import *

blockno = [ [
1, 1, 2, 2, 2, 3, 3, 4, 6],
1, 1, 2, 2, 3, 3, 4, 4, 6],
1, 7, 8, 8, 3, 4, 4, 5, 6],
7, 7, 7, 8, 8, 9, 9, 6, 6],
7,10,10, 8,12, 9, 9,14,15],
11,11,10,10,12,12, 9,15,15],
11,11,10,12,12,13,13,15,15] ]

initval = [ [
1, 0, 0, 1, 0, 0, 0, 0, 0],
0, 2, 0, 0, 0, 0, 0, 3, 0],
4, 0, 4, 0, 0, 0, 0, 1, 4],
0, 0, 0, 1, 0, 0, 0, 0, 0],
4, 0, 5, 0, 4, 0, 5, 1, 0],
0, 0, 0, 0, 0, 0, 0, 3, 0],
0, 0, 0, 0, 2, 0, 0, 0, 5] ]

m = len(blockno) 
# number of rows
n = len(blockno[
0]) # number of columns
R = range(m)
C = range(n)
bmax = max([blockno[i][j]
for i in R for j in C])
B = range(bmax)

The order is a bit reversed here: first enter the boards with the block numbers and the initial values and then calculate the dimensions.

Derived data: GAMS

We need to calculate a number of things before we can actually write the equations.

alias (i,i2),(j,j2);

'maps between b <-> (i,j)'
'allowed (b,k) combinations'
'allowed values'
'number of cells in block'

bmap(b,i,j) = blockno(i,j)=
len(b) =
bk(b,k) =
ijk(i,j,k) =

n(i,j,i,j+1)   = blockno(i,j)<>blockno(i,j+1);
n(i,j,i+1,j-1) = blockno(i,j)<>blockno(i+1,j-1);
n(i,j,i+1,j)   = blockno(i,j)<>blockno(i+1,j);
n(i,j,i+1,j+1) = blockno(i,j)<>blockno(i+1,j+1);

Note that the set n (indicating neighbors we need to inspect for different values) only looks “forward” from cell \((i,j)\). This is to prevent double-counting: we don’t want to compare neighboring cells twice in the equations.

Derived data: Python

The Python/Z3 model needs to calculate similar derived data:

# derived data
maxb = [
0]*bmax    # number of cells in each block
bij = [[]
for b in B]  # list of cells in each block
nij = [ [ []
for j in C ] for i in R ] # neighbors
for i in R:
for j in C:
         bn = blockno[i][j]-
         maxb[bn] +=
         bij[bn] += [(i,j)]
if j+1and blockno[i][j]!=blockno[i][j+1]:
             nij[i][j] += [(i,j+
if i+1and j-1>=0 and blockno[i][j]!=blockno[i+1][j-1]:
             nij[i][j] += [(i+
if i+1and blockno[i][j]!=blockno[i+1][j]:
             nij[i][j] += [(i+
if i+1and j+1and blockno[i][j]!=blockno[i+1][j+1]:
             nij[i][j] += [(i+

# upperbound on X[i,j]
maxij = [ [ maxb[blockno[i][j]-
1] for j in C ] for i in R ]

I use here nested lists. bij has a list of cells for each block. nij contains a list of neighboring cells for each cell.

Model constraints: GAMS

The decision variables and linear constraints are described in [1]. The GAMS representation is very close the mathematical formulation.

variable dummy 'objective variable';
binary variable x(i,j,k);

* fix initial values
ord(k)) = 1;

'pick one value per cell'
'unique in a block'
'neighboring cells must be different'
'dummy objective'

sum(ijk(i,j,k), x(i,j,k)) =e= 1;

sum(bmap(b,i,j), x(i,j,k)) =e= 1;

and ijk(i,j,k) and ijk(i2,j2,k))..
    x(i,j,k)+x(i2,j2,k) =l= 1;

edummy.. dummy =e= 0;

Notice we added a dummy objective to the model.

Model constraints: Z3
# Model
X = [ [ Int(
"x%d.%d" % (i+1, j+1)) for j in C ] for i in R ]

Xlo =
And([X[i][j] >= 1 for i in R for j in C])
Xup =
And([X[i][j] <= maxij[i][j] for i in R for j in C])

Uniq =
And([ Distinct([X[i][j] for (i,j) in bij[b]]) for b in B])
Nbor =
And([ And([X[i][j] != X[i2][j2] for (i2,j2) in nij[i][j] ]) for i in R for j in C if len(nij[i][j])>0 ])

Xfix =
And([X[i][j] == initval[i][j] for i in R for j in C if initval[i][j]>0])

Cons = [Xlo,Xup,Uniq,Nbor,Xfix]

Most of this is boilerplate. We use Distinct to make the contents of the cells unique within a block. The != operator is used to make sure neighbors of different blocks do not have the same value. Both these constructs are not so easy to implement in a MIP model, but there we were rescued by using binary variables instead.

Solving and printing results: GAMS

We only look for a feasible solution (any feasible integer solution will be globally optimal immediately). This means we don’t have to worry about setting the OPTCR option in GAMS. The default gap of 10% is irrelevant here as the solver will stop after finding the first integer solution.

model m /all/;
solve m minimizing dummy using mip;

parameter v(i,j) 'value of each cell';
v(i,j) =
sum(ijk(i,j,k), x.l(i,j,k)*ord(k));
option v:0;
display v;

To report a solution we recover the integer cell values. The results look like:

----     81 PARAMETER v  value of each cell

            j1          j2          j3          j4          j5          j6          j7          j8          j9

i1           1           5           4           1           5           1           5           2           1
i2           3           2           3           2           3           2           4           3           5
i3           4           1           4           5           4           1           5           1           4
i4           5           2           3           1           3           2           3           2           3
i5           4           1           5           2           4           1           5           1           4
i6           2           3           4           3           5           3           4           3           2
i7           4           1           2           1           2           1           2           1           5

Solving and printing results: Python/Z3
# solve and print
s = Solver()
if s.check() == sat:
     m = s.model()
     sol = [ [ m.evaluate(X[i][j])
for j in C] for i in R]

The solution looks like:

[[1, 5, 4, 1, 5, 1, 5, 2, 1], [3, 2, 3, 2, 3, 2, 4, 3, 5], [4, 1, 4, 5, 4, 1, 5, 1, 4], 
[5, 2, 3, 1, 3, 2, 3, 2, 3], [4, 1, 5, 2, 4, 1, 5, 1, 4], [2, 3, 4, 3, 5, 3, 4, 3, 2],
[4, 1, 2, 1, 2, 1, 2, 1, 5]]
Check for uniqueness: GAMS

To check if the solution is unique we add a cut and resolve. If this is infeasible we know there was only one solution.

* test if solution is unique

equation cut 'forbid current solution';
sum(ijk(i,j,k), x.l(i,j,k)*x(i,j,k)) =l= card(i)*card(j)-1;

model m2 /all/;
solve m2 minimizing dummy using mip;
abort$(m2.modelstat=1 or m2.modelstat=8) "Unexpected solution found",x.l;

We abort if the model status is not "optimal" (1) or "integer solution found" (8). (It should be enough to check on optimal only, we err on the safe side here).

Check for uniqueness: Z3

# Check for uniqueness
Cut =
Or([X[i][j] != sol[i][j] for i in R for j in C])
if s.check() == sat:
print("There is an alternative solution.")
print("Solution is unique.")
  1. Suguru,