I wrote this post and am also happy to comment. Hopefully we'll be following this up soon with a post on treating robustness and cost simultaneously in a multicriteria setting. Also, special thanks again to Devon Sigler at the University of Colorado Denver for his help editing this post.
The methods in scipy.optimize are great if your function is serial, cheap, deterministic, and convex. Many ML models and real world problems don't fit into this context though.
We have a Jupyter notebook showing how SigOpt compares to several scipy.optimize methods as well as standard methods like grid/random on a simple non-convex problem here [1]. These results only get more striking as the dimensionality increases.
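As a rough illustration of the kind of comparison in that notebook (the test function and method choices below are just my own assumptions for illustration, not what the notebook actually uses): a single local solve can get trapped, while random multistart recovers the global basin at the cost of many more evaluations.

```python
import numpy as np
from scipy.optimize import minimize

# A simple non-convex test function: many local minima on a slow quadratic trend.
def f(x):
    return np.sin(5 * x[0]) + 0.1 * x[0] ** 2

# A single local search from a fixed start can get stuck in a local minimum.
local = minimize(f, x0=[2.0], method="Nelder-Mead")

# Random multistart: re-run the local method from random points, keep the best.
rng = np.random.default_rng(0)
results = [minimize(f, x0=[s], method="Nelder-Mead")
           for s in rng.uniform(-5, 5, size=20)]
best = min(results, key=lambda r: r.fun)
# best.fun should land near the global minimum (about -0.99 at x ~ -0.31),
# while the single-start solve typically settles in a worse basin.
```

The catch, of course, is that multistart multiplies the number of function evaluations, which is exactly what hurts when each evaluation is expensive.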
Many of these scipy methods can cope with non-convex, noisy functions. There's a bit of skill/alchemy in selecting starting values and tolerance parameters. It's interesting to see that your approach has superior performance in that case, but in many cases these free tools, with no need for an API, would be sufficient.
> The tradeoffs between these two criteria can either be managed by some supervisory decision maker (the driver of the car in this example) or by merging the multiple criteria into some single criteria and phrasing the problem as a standard optimization problem.
You always have to reduce the problem to a scalar if you want a single answer.
First off, I am very hesitant to say anything about biconvex problems - I only see them in passing and they are definitely not in my wheelhouse. If anyone out there is an expert, or even just has a solid (basic) reference on biconvex problems, please feel free to drop some knowledge on me.
For this particular problem, which has only one input variable, yes the answer can be resolved with a good-old fashioned Plug-In-The-Answer strategy. For problems with more than one input variable, that will almost certainly not be the case.
Really, all I was trying to say there at the end is that converting the multicriteria problem to a constraint based problem has potential benefits over scalarization. Speaking only for myself, I always default to treating multicriteria problems in some sort of norm-scalarized sense: minimize ||g|| for some vector norm. I thought it was valuable to remind myself, and maybe others, that there are other ways to naturally rephrase multicriteria problems as scalar optimization problems. I'm definitely not saying anything about how easy it is to solve, as in general these constrained problems are going to be harder than the non-constrained linear (or norm) scalarization.
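A tiny sketch of that contrast with scipy (the toy objectives here are my own, not from the post): scalarize two criteria into one objective, versus minimizing one criterion subject to an epsilon-constraint on the other.

```python
from scipy.optimize import minimize

# Two competing toy criteria on a single input x.
g1 = lambda x: (x[0] - 1.0) ** 2   # wants x = 1
g2 = lambda x: (x[0] + 1.0) ** 2   # wants x = -1

# Scalarized version: minimize the sum (a squared 2-norm of (x-1, x+1)).
scal = minimize(lambda x: g1(x) + g2(x), x0=[0.0])   # optimum at x = 0

# Constraint-based (epsilon-constraint) version:
# minimize g1 subject to g2 <= 0.25; optimum at x = -0.5.
cons = {"type": "ineq", "fun": lambda x: 0.25 - g2(x)}
eps = minimize(g1, x0=[-1.0], constraints=[cons], method="SLSQP")
```

The two phrasings land on different points of the same Pareto frontier; the constrained version makes the tolerance on g2 explicit rather than baking it into a weight.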
That makes sense, at least as long as the vector- or matrix-valued objective behaves somewhat. I guess I've been using this all along (with transformations to enforce proper behavior) for matrix and tensor completion anyways... Hmm. I just didn't implement it terribly elegantly!
Now that I think about it, all of the methods I've ever seen for matrix-valued time series fits (i.e. multiple measurements at multiple sites per time point) are Bayesian. That's about the most irreducible constrained optimization problem I can think of in this setting.
Do you have a reference for fitting matrix-valued time series with nonlinear criteria? I'm familiar with the standard Box-Jenkins methods but I usually see that done with linear least-squares methods. I'd love to up my game on that front.
I do research in combinatorial optimization. Does your company solve any problems with a combinatorial flavor (say, things that can be optimized using combinatorial algorithms instead of going for gradient descent)?
We don't (yet), but we are growing rapidly and always looking for ways to help our customers solve optimization problems. Fwiw, we do support categorical parameters (as well as continuous and integer ones), and our ensemble of Bayesian optimization techniques is able to solve this mixed-type problem much more efficiently than techniques like gradient descent. That said, the way we handle the purely combinatorial (categorical-only) problem isn't as fleshed out as our mixed-type problems. We are looking to grow the team (and our offerings) if you're interested though [1].
> [We] are able to solve this mixed type problem much more efficiently than techniques like gradient descent.
Naive gradient descent is probably the simplest strategy that one can imagine. How do your algorithms compare to Newton methods for minimizing the primal-dual gap (interior point methods)?
We compare to some standard convex optimization techniques implemented in scipy here [1]. We have comparisons to some other Bayesian methods here [2]. I'm happy to answer any questions!
What are the bread and butter of combinatorial optimization? In other words, the concepts you would first come across, at an undergrad level if possible?
I do research in mixed-integer linear optimization (but wrote my diploma thesis on a topic in combinatorial optimization):
> What are the bread and butter of combinatorial optimization? In other words, the concepts you would first come across, at an undergrad level if possible?
- Polyhedral combinatorics (books: "Alexander Schrijver - Combinatorial Optimization: Polyhedra and Efficiency" (more focus on polyhedral combinatorics; IMHO the best book, but not the most approachable) and "Bernhard Korte, Jens Vygen - Combinatorial Optimization: Theory and Algorithms" (more focus on algorithms; easier to read)). This of course includes (mixed-)integer linear programming ((M)ILP).
- Of course, learning about (M)ILPs means understanding linear programming (LP). Here I personally prefer "Alexander Schrijver - Theory of Linear and Integer Programming" (this book also covers ILP aspects, but not MILPs).
- Other books for learning about ILPs are "Dimitris Bertsimas, Robert Weismantel - Optimization Over Integers" (main focus is ILP, nevertheless a good book) and "Conforti, Cornuejols, Zambelli - Integer Programming". There are no really good books about MILPs that I know of, but these two books at least will cover some aspects of it.
- Sometimes semidefinite relaxations will occur (most famous example: the Goemans-Williamson algorithm; less famous, but also important: the Lovász-Schrijver, Sherali-Adams, and Lasserre hierarchies).
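To make the integrality issue from those references concrete, here is a minimal knapsack-style sketch of an ILP versus its LP relaxation (assuming SciPy >= 1.9, which ships `scipy.optimize.milp`; the problem data is made up):

```python
import numpy as np
from scipy.optimize import linprog, milp, LinearConstraint, Bounds

# Tiny knapsack: maximize 3*x1 + 2*x2 subject to 2*x1 + 2*x2 <= 3, x in {0, 1}.
# Both solvers minimize, so negate the objective.
c = np.array([-3.0, -2.0])
A = np.array([[2.0, 2.0]])
b = np.array([3.0])

# LP relaxation (x continuous in [0, 1]): optimum x = (1, 0.5), value 4.
lp = linprog(c, A_ub=A, b_ub=b, bounds=[(0, 1), (0, 1)])

# ILP (x integer in {0, 1}): optimum x = (1, 0), value 3.
ip = milp(c=c,
          constraints=LinearConstraint(A, ub=b),
          integrality=np.ones(2),
          bounds=Bounds(0, 1))
```

The gap between the relaxed value (4) and the integer value (3) is exactly the kind of thing cutting planes and branch & bound exist to close.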
I'd also like to throw in some work on nonlinear programming by a former colleague of mine at Argonne, Sven Leyffer:
- A compendium he co-edited named (appropriately enough) Mixed Integer Nonlinear Programming
- A review paper he co-authored for Acta Numerica: http://www.mcs.anl.gov/papers/P3060-1112.pdf
Also, yeah, the "Alexander Schrijver - Theory of Linear and Integer Programming" reference is solid.
The survey article by Sven Leyffer is a good paper, but comes from a completely different direction:
It comes from people in convex optimization trying to additionally impose integrality conditions (treated a little bit as a second-class citizen). Classical combinatorial optimization, on the other hand, treats integrality conditions as a first-class citizen. I, coming from (M)ILP, would argue that the MINLP people coming from convex optimization tend to sidestep all the problems that make ILP so hard (and interesting). On the other hand, MINLP people would equally vocally argue that the (M)ILP people tend to prefer "academic" problems and don't grasp how many important research questions they miss.
It's up to the reader to decide which side is right. :-)
My personal opinion in this "flamewar" is that if you come from a computer science background (in particular theoretical computer science) you will probably prefer classic MILP culture. On the other hand if you come from engineering you will probably prefer MINLP theory as outlined in Sven Leyffer's survey paper.
Yeah, I think that's probably the split - folks from computer science/discrete math on one side and folks from engineering on the other. I grew up in math, but I was on the numerical analysis side so I definitely ended up on the MINLP side, which is why that's what I generally reference. There is certainly something elegant about ILP problems which gets lost when treating them with the sledgehammer that is gradient-based convex optimization.
> There is certainly something elegant about ILP problems which gets lost when treating them with the sledgehammer that is gradient-based convex optimization.
It is funny that you call gradient-based convex optimization a "sledgehammer", since people working in combinatorial optimization (as opposed to ILP) tend to refer to ILP methods (e.g. cutting-plane algorithms, branch & bound, branch & cut, relaxation hierarchies, ...) as a "sledgehammer" too. :-D They are just jealous. :-)
I almost exclusively work on problems that can be solved (by a combinatorial algorithm) in polynomial time. I have never used the theory of mixed-integer linear programs, but that's because I focus on different aspects of combinatorial optimization.
If I taught an undergrad course, I would model it after Michel Goemans's course: http://www-math.mit.edu/~goemans/18433S15/18433.html
I would also introduce some submodular functions (and touch on submodular flow).
Submodularity captures half of the things encountered in the course, and it is general and simple enough to be the first thing to try.
For example, the following problem might be difficult if one tries to create an algorithm by modifying the standard matching algorithms. However, one can easily show it is polynomial-time solvable by proving some submodular property.
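The classic first encounter with submodularity is the greedy algorithm for monotone submodular maximization under a cardinality constraint (maximum coverage is the canonical example); a minimal sketch, with made-up sets:

```python
# Greedy maximization of a monotone submodular function: maximum coverage.
# Under a cardinality constraint k, greedy achieves a (1 - 1/e) approximation
# (Nemhauser, Wolsey, Fisher). The sets below are made up for illustration.
def greedy_max_coverage(sets, k):
    covered, chosen = set(), []
    for _ in range(k):
        # Pick the set with the largest marginal gain (newly covered elements).
        best = max(sets, key=lambda name: len(sets[name] - covered))
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}
chosen, covered = greedy_max_coverage(sets, k=2)
# chosen == ["c", "a"], covering all 7 elements
```

The marginal-gain structure (adding a set helps less the more you already cover) is exactly the diminishing-returns property that makes the greedy analysis go through.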
Does your platform support more than one level of preemption / lexicographic goal setting? (Like the ε-constraint scalarization technique, but with a second, third, nth set of goals between the first set and the objective?)
Our customers who are working with multicriteria problems have, thus far, had primarily two criteria, so we have been helping them recast their two-criteria problems in a scalar setting. As such, we do not, at this moment, support the layers of ordering you suggest through our API. To do so internally would introduce a complicated bifurcation between problems phrased with real-valued observations (our standard workflow) and the less informative comparative structure you're suggesting, whereby we would only be able to make statements about the relative order of points and not the magnitude by which they differ. If we were willing to impose a magnitude, doing so would revert the problem back into the weighted-combination scalarization setting (or at least some norm-scalarization setting, if not the linear setting discussed in the post). I do not foresee us implementing such a tiered preemptive ordering any time soon.
I agree that expressing the problem can become more confusing in the presence of more levels of preemption; however, it can be an effective way to organize large numbers of criteria. If your customers are mainly working with biobjective problems, I can see why you wouldn't be too interested in adding more levels!
And I can absolutely agree that, as more criteria arise, the mechanism for linear scalarization probably becomes more fragile (subject to inconsistent behavior from the coefficients). As a result, something less sensitive but more robust, such as the tiered ordering, is probably preferable. But yeah, we just have not seen the demand yet. What actually seems to be most common is that people who have ~10 metrics spend some time thinking about it, and then realize that they mostly only cared about 1-2 so long as the rest did not cause problems/failures. That was part of the reason I wrote about the epsilon-constraint idea.
We do this in one sense within our company, but it's actually not within the context of a numerical multicriteria optimization problem. We are always trying to optimize around our customers' needs, which is in some ways a multicriteria problem involving balancing: 1) the "best" parameterization of a model subject to some (usually cross-validation) metric, 2) the "cost" (number of samples) required to optimize the model quality, 3) the "robustness" of the resulting solution (the degree to which small parameter changes impact it), and 4) the "parallel speed" (number of simultaneous suggestions) of the optimization process.
We consult with enterprise customers to understand their needs and expectations regarding these criteria to produce a sort of hierarchical ordering (as you've suggested) which helps inform our optimization procedure (maybe a customer doesn't care as much about speed but definitely cares about robustness). Obviously, it's a relatively restricted problem, and we're not considering it in a rigorous mathematical framework (just how best to serve our customers). Because these factors have no real numerical relationship, the only mechanism we can use to balance the concerns is a relative ordering, which is then managed internally. We spoke about this design at the ICML AutoML workshop this year ("A Strategy for Ranking ..." at https://sites.google.com/site/automl2016/accepted-papers).
> The reason this is a more complicated situation is that an ordering of vectors in R^k does not exist... how would one order the vectors u = (1,2,3), v = (2,1,3), w = (3,2,1)?
It seems straightforward to order those vectors by first comparing the first component, then the second, then the third. The result is u,v,w. It's as if you wanted to sort a multi-column report in Excel. What am I missing?
Such an ordering certainly does exist, and we could apply it to sort results from the vector objective function so as to find the "answer" to the multicriteria problem. The Wikipedia article on multiobjective optimization discusses this strategy: https://en.wikipedia.org/wiki/Multi-objective_optimization#A.... On that note, lemme throw a shout out to the wonderful person who took the time to write that Wikipedia article - it is outstanding.
That said, such an ordering may not be appropriate in all circumstances. Sorting objective vectors from the function suggested in this post would first sort by "time to destination" and then break ties in "time to destination" with "cost of trip". That would mean that (1, 1000) < (1.0000001, 2), but I think most people would be willing to arrive 0.0000001 hours later to save 998 dollars. The flexibility in interpreting the vector objective and making tradeoffs is why the standard lexicographic ordering is not always appropriate.
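In code terms, the lexicographic ordering is just tuple comparison, which makes the pathology easy to see:

```python
# (time to destination in hours, cost of trip in dollars)
routes = [(1.0000001, 2.0), (1.0, 1000.0)]

# Python tuples compare lexicographically: first component, then the second.
best = min(routes)
# best == (1.0, 1000.0): the ordering happily spends 998 extra dollars
# to arrive 0.0000001 hours earlier.
```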
> most people would be willing to arrive 0.0000001 hours later to save 998 dollars
Unfortunately, if you don't know how your model behaves, you can't tell when you're setting γ whether you're actually going to get a solution that's way past the point of diminishing returns. You may not even be able to tell after the fact. This is one of the reasons for doing sensitivity analysis on γ. (Your ε-constraint scalarization helps with this problem.)
You make a perfectly accurate point that, in practice, it is unlikely one would be able to make such a prediction without significant info about the model. The above comment was meant in more of a post hoc "we've executed our multicriteria optimization, approximated our Pareto frontier, now let's make a decision" sense, not within a specific scalarization context. Indeed sensitivity analysis on the gamma parameter is very important; I tried to hint at that by showing that the interplay of the choice of currency and the gamma on the optimal value is nontrivial (though predictable since this is just a toy problem).
Do you happen to have any references talking about such sensitivity analysis on scalarization parameters? I would love to add them to my reading list. Thanks.
I don't have any references -- I'm going off something one of my committee members said. But the basic idea is just that, if your weight is 0.2, you want to make sure you also obtain solutions for 0.18 and 0.22, to give you some sense of whether you're on the edge of a cliff. This is pretty cheap in general for convex optimization problems because you've already got a solution that should be close to optimal for the new weight.
Good call - if the problem is well behaved then small changes in gamma should be able to use the previous solution as an initial guess. And I absolutely agree with the robustness idea you're talking about; I was hinting at it when I was talking about the impact of choice of currency. For a well scaled problem there is a consistent and well-behaved impact on the solution for small changes in gamma. But when the currency was changed to RMB, small changes in gamma no longer had a consistent impact on the optimum.
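That warm-starting sweep in a minimal sketch (the toy objectives are my own, not from the post):

```python
from scipy.optimize import minimize

# Scalarized objective f1 + gamma * f2 with two competing toy criteria.
f = lambda x, gamma: (x[0] - 1.0) ** 2 + gamma * (x[0] + 1.0) ** 2

x0, solutions = [0.0], {}
for gamma in (0.18, 0.20, 0.22):
    res = minimize(f, x0, args=(gamma,))
    solutions[gamma] = res.x[0]
    x0 = res.x  # warm start the next solve from the previous optimum

# For this quadratic, the exact optimum is (1 - gamma) / (1 + gamma); if the
# three solutions sit close together, small changes in gamma are benign here.
```

For a well-scaled problem the three solutions vary smoothly; a large jump between neighboring gammas is the "edge of a cliff" signal described above.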
If you change the basis of R^n (e.g., you rotate the axes), you get a different result from this comparison. So if there's no intrinsic reason for using a particular basis, this ordering is more or less completely arbitrary.
This implies you're giving more weight to the first element, then the second.
Let's say you were optimizing something to be pretty, delicious, and soft.
You tune the system and evaluate the prettiness, deliciousness, and softness.
You tuned it several times and got three products with scores (1,2,3), (2,1,3), (3,2,1), where the concrete values are accurate evaluations. How exactly do you choose the best one? Is prettiness always more important?
Pareto efficiency comes to mind - it's also discussed in the article [1].
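A quick sketch of the Pareto check for exactly those three score vectors (treating larger as better):

```python
# Pareto dominance for maximization: u dominates v when u is at least as good
# everywhere and strictly better somewhere.
def dominates(u, v):
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

points = [(1, 2, 3), (2, 1, 3), (3, 2, 1)]
front = [p for p in points
         if not any(dominates(q, p) for q in points if q != p)]
# front == points: no product dominates another, so all three are
# Pareto-efficient and the scores alone cannot pick a single winner.
```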
Consider a simpler case: two-dimensional real vectors.
Let's consider the vector space R^2; since all finite-dimensional vector spaces of the same dimension are isomorphic, R^2 is isomorphic to the complex numbers, which famously admit no ordering compatible with their arithmetic.
As is suggested there, though, implementing this no-preference strategy requires some clean rescaling of the component functions in order to yield equal significance for all of them. If you have such a rescaling, that's outstanding; however, as I suggested in the section of the article dealing with the impact of the choice of currency, rescaling a problem may be a difficult proposition. This is especially true for problems that aren't as simple as the toy problem I've proposed here.
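One common (if fragile) version of that rescaling is to normalize each criterion by its best ("ideal") and worst ("nadir") values over the candidates seen; a minimal sketch with made-up objective vectors:

```python
import numpy as np

# Rows are candidate solutions; columns are criteria on very different scales
# (say, hours and dollars). Values are made up for illustration.
F = np.array([[1.0, 1000.0],
              [1.5,  500.0],
              [3.0,  100.0]])

ideal = F.min(axis=0)   # best value of each criterion over the candidates
nadir = F.max(axis=0)   # worst value of each criterion over the candidates

# After rescaling, each column spans [0, 1], so "equal significance" at least
# means equal ranges -- though this depends entirely on the candidates seen.
F_scaled = (F - ideal) / (nadir - ideal)
```

Note the fragility: the scaling depends on which candidates you happened to evaluate, which is the same sensitivity-to-units issue as the choice-of-currency example in the post.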