## Sets with Small Intersection

Suppose that we want to construct subsets $S_1, \ldots, S_m \subseteq \{1,\ldots,n\}$ with the following properties:

1. $|S_i| \geq k$ for all $i$
2. $|S_i \cap S_j| \leq 1$ for all $i \neq j$

The goal is to construct as large a family of such subsets as possible (i.e., to make $m$ as large as possible). If $k \geq 2\sqrt{n}$, then up to constants it is not hard to show that the optimal number of sets is $\frac{n}{k}$ (that is, the trivial construction with all sets disjoint is essentially the best we can do).

Here I am interested in the case when $k \ll \sqrt{n}$. In this case I claim that we can substantially outperform the trivial construction: we can take $m = \Omega(n^2 / k^3)$. The proof is a very nice application of the asymmetric Lovasz Local Lemma. (Readers can refresh their memory here on what the asymmetric LLL says.)

Proof. We will take a randomized construction. For $i \in \{1,\ldots,n\}$, $j \in \{1,\ldots,m\}$, let $X_{i,a}$ be the event that $i \in S_a$. We will take the $X_{i,a}$ to be independent each with probability $\frac{2k}{n}$. Also define the events

$Y_{i,j,a,b} = I[i \in S_a \wedge j \in S_a \wedge i \in S_b \wedge j \in S_b]$

$Z_{a} = I[|S_a| < k]$

It suffices to show that with non-zero probability, all of the $Y_{i,j,a,b}$ and $Z_{a}$ are false. Note that each $Y_{i,j,a,b}$ depends on $Y_{i',j,a',b}, Y_{i',j,a,b'}, Y_{i,j',a',b}, Y_{i,j',a,b'}, Z_a, Z_b$, and each $Z_a$ depends on $Y_{i,j,a,b}$. Thus each $Y$ depends on at most $4nm$ other $Y$ and $2$ other $Z$, and each $Z$ depends on at most $n^2m/2$ of the $Y$. Also note that $P(Y_{i,j,a,b}) = (k/n)^4$ and $P(Z_a) \leq \exp(-k/4)$ (by the Chernoff bound). It thus suffices to find constants $y, z$ such that

$(k/n)^4 \leq y(1-y)^{4nm}(1-z)^2$

$\exp(-k/4) \leq z(1-y)^{n^2m/2}$

We will guess $y = \frac{k}{n^2m}$, $z = \frac{1}{2}$, in which case the bottom inequality is approximately $\exp(-k/4) \leq \frac{1}{2}\exp(-k/2)$ (which is satisfied for large enough $k$, and the top inequality is approximately $\frac{k^4}{n^4} \leq \frac{k}{4n^2m} \exp(-4k/n)$, which is satisfied for $m \leq \frac{n^2}{4ek^3}$ (assuming $k \leq n/4$). Hence in particular we can indeed take $m = \Omega(n^2/k^3)$, as claimed.

## Linear algebra fact

Here is interesting linear algebra fact: let $A$ be an $n \times n$ matrix and $u$ be a vector such that $u^{\top}A = \lambda u^{\top}$. Then for any matrix $B$, $u^{\top}((A-B)(\lambda I - B)^{-1}) = u^{\top}$.

The proof is just basic algebra: $u^{\top}(A-B)(\lambda I - B)^{-1} = (\lambda u^{\top} - u^{\top}B)(\lambda I - B)^{-1} = u^{\top}(\lambda I - B)(\lambda I - B)^{-1} = u^{\top}$.

Why care about this? Let’s imagine that $A$ is a (not necessarily symmetric) stochastic matrix, so $1^{\top}A = 1^{\top}$. Let $A-B$ be a low-rank approximation to $A$ (so $A-B$ consists of all the large singular values, and $B$ consists of all the small singular values). Unfortunately since $A$ is not symmetric, this low-rank approximation doesn’t preserve the eigenvalues of $A$ and so we need not have $1^{\top}(A-B) = 1^{\top}$. The $(I-B)^{-1}$ can be thought of as a “correction” term such that the resulting matrix is still low-rank, but we’ve preserved one of the eigenvectors of $A$.

## Maximal Maximum-Entropy Sets

Consider a probability distribution ${p(y)}$ on a space ${\mathcal{Y}}$. Suppose we want to construct a set ${\mathcal{P}}$ of probability distributions on ${\mathcal{Y}}$ such that ${p(y)}$ is the maximum-entropy distribution over ${\mathcal{P}}$:

$\displaystyle H(p) = \max_{q \in \mathcal{P}} H(q),$

where ${H(p) = \mathbb{E}_{p}[-\log p(y)]}$ is the entropy. We call such a set a maximum-entropy set for ${p}$. Furthermore, we would like ${\mathcal{P}}$ to be as large as possible, subject to the constraint that ${\mathcal{P}}$ is convex.

Does such a maximal convex maximum-entropy set ${\mathcal{P}}$ exist? That is, is there some convex set ${\mathcal{P}}$ such that ${p}$ is the maximum-entropy distribution in ${\mathcal{P}}$, and for any ${\mathcal{Q}}$ satisfying the same property, ${\mathcal{Q} \subseteq \mathcal{P}}$? It turns out that the answer is yes, and there is even a simple characterization of ${\mathcal{P}}$:

Proposition 1 For any distribution ${p}$ on ${\mathcal{Y}}$, the set

$\displaystyle \mathcal{P} = \{q \mid \mathbb{E}_{q}[-\log p(y)] \leq H(p)\}$

is the maximal convex maximum-entropy set for ${p}$.

To see why this is, first note that, clearly, ${p \in \mathcal{P}}$, and for any ${q \in \mathcal{P}}$ we have

$\displaystyle \begin{array}{rcl} H(q) &=& \mathbb{E}_{q}[-\log q(y)] \\ &\leq& \mathbb{E}_{q}[-\log p(y)] \\ &\leq& H(p), \end{array}$

so ${p}$ is indeed the maximum-entropy distribution in ${\mathcal{P}}$. On the other hand, let ${\mathcal{Q}}$ be any other convex set whose maximum-entropy distribution is ${p}$. Then in particular, for any ${q \in \mathcal{Q}}$, we must have ${H((1-\epsilon)p + \epsilon q) \leq H(p)}$. Let us suppose for the sake of contradiction that ${q \not\in \mathcal{P}}$, so that ${\mathbb{E}_{q}[-\log p(y)] > H(p)}$. Then we have

$\displaystyle \begin{array}{rcl} H((1-\epsilon)p + \epsilon q) &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log((1-\epsilon)p(y)+\epsilon q(y))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}[-\log(p(y) + \epsilon (q(y)-p(y))] \\ &=& \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[-\log(p(y)) - \epsilon \frac{q(y)-p(y)}{p(y)} + \mathcal{O}(\epsilon^2)\right] \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon \mathbb{E}_{(1-\epsilon)p+\epsilon q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) - \epsilon^2 \mathbb{E}_{q}\left[\frac{q(y)-p(y)}{p(y)}\right] + \mathcal{O}(\epsilon^2) \\ &=& H(p) + \epsilon(\mathbb{E}_{q}[-\log p(y)]-H(p)) + \mathcal{O}(\epsilon^2). \end{array}$

Since ${\mathbb{E}_{q}[-\log p(y)] - H(p) > 0}$, for sufficiently small ${\epsilon}$ this will exceed ${H(p)}$, which is a contradiction. Therefore we must have ${q \in \mathcal{P}}$ for all ${q \in \mathcal{Q}}$, and hence ${\mathcal{Q} \subseteq \mathcal{P}}$, so that ${\mathcal{P}}$ is indeed the maximal convex maximum-entropy set for ${p}$.

## Convex Conditions for Strong Convexity

An important concept in online learning and convex optimization is that of strong convexity: a twice-differentiable function $f$ is said to be strongly convex with respect to a norm $\|\cdot\|$ if

$z^T\frac{\partial^2 f}{\partial x^2}z \geq \|z\|^2$

for all $z$ (for functions that are not twice-differentiable, there is an analogous criterion in terms of the Bregman divergence). To check strong convexity, then, we basically need to check a condition on the Hessian, namely that $z^THz \geq \|z\|^2$. So, under what conditions does this hold?

For the $l^2$ norm, the answer is easy: $z^THz \geq \|z\|_2^2$ if and only if $H \succeq I$ (i.e., $H-I$ is positive semidefinite). This can be shown in many ways, perhaps the easiest is by noting that $z^THz-\|z\|_2^2 = z^T(H-I)z$.

For the $l^{\infty}$ norm, the answer is a bit trickier but still not too complicated. Recall that we want necessary and sufficient conditions under which $z^THz \geq \|z\|_{\infty}^2$. Note that this is equivalent to asking that $z^THz \geq (z_i)^2$ for each coordinate $i$ of $z$, which in turn is equivalent to $H \succeq e_ie_i^T$ for each coordinate vector $e_i$ (these are the vectors that are 1 in the $i$th coordinate and 0 everywhere else).

More generally, for any norm $\|\cdot\|$, there exists a dual norm $\|\cdot\|_*$ which satisfies, among other properties, the relationship $\|z\| = \sup_{\|w\|_* = 1} w^Tz$. So, in general, $z^THz \geq \|z\|^2$ is equivalent to asking that $z^THz \geq (w^Tz)^2$ for all $w$ with $\|w\|_* = 1$. But this is in turn equivalent to asking that

$H \succeq ww^T$ for all $w$ such that $\|w\|_* = 1$.

In fact, it suffices to pick a subset of the $w$ such that the convex hull consists of all $w$ with $\|w\|_* \leq 1$; this is why we were able to obtain such a clean formulation in the $l^{\infty}$ case: the dual norm to $l^{\infty}$ is $l^1$, whose unit ball is the simplex, which is a polytope with only $2n$ vertices (namely, each of the signed unit vectors $\pm e_i$).

We can also derive a simple (but computationally expensive) criterion for $l^1$ strong convexity: here the dual norm is $l^{\infty}$, whose unit ball is the $n$-dimensional hypercube, with vertices given by all $2^n$ vectors of the form $[ \pm 1 \ \cdots \ \pm 1]$. Thus $z^THz \geq \|z\|_1^2$ if and only if $H \succeq ss^T$ for all $2^n$ sign vectors $s$.

Finally, we re-examine the $l^2$ case; even though the $l^2$-ball is not a polytope, we were still able to obtain a very simple expression. This was because the condition $H \succeq I$ manages to capture simultaneously all dual vectors such that $w^Tw \leq 1$. We thus have the general criterion:

Theorem. $H \succeq M_jM_j^T$ for $j = 1,\ldots,m$ if and only if $H$ is strongly convex with respect to the norm $\|\cdot\|$ whose dual unit ball is the convex hull of the transformed unit balls $M_j\mathcal{B}_j$, $j = 1, \ldots, m$, where $\mathcal{B}_j$ is the $l^2$ unit ball whose dimension matches the number of columns of $M_j$.

Proof. $H \succeq M_jM_j^T$ if and only if $z^THz \geq \max_{j=1}^m \|M_j^Tz\|_2^2$. Now note that $\|M_j^Tz\|_2 = \sup_{w \in \mathcal{B}_j} w^TM_j^Tz = \sup_{w' \in M_j\mathcal{B}_j} (w')^Tz$. If we define $\|z\| = \max_{j=1}^m \|M_j^Tz\|_2$, it is then apparent that the dual norm unit ball is the convex hull of the $M_j\mathcal{B}_j$.

## Convexity counterexample

Here’s a fun counterexample: a function $\mathbb{R}^n \to \mathbb{R}$ that is jointly convex in any $n-1$ of the variables, but not in all variables at once. The function is

$f(x_1,\ldots,x_n) = \frac{1}{2}(n-1.5)\sum_{i=1}^n x_i^2 - \sum_{i < j} x_ix_j$

To see why this is, note that the Hessian of $f$ is equal to

$\left[ \begin{array}{cccc} n-1.5 & -1 & \cdots & -1 \\ -1 & n-1.5 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & n-1.5 \end{array} \right]$

This matrix is equal to $(n-0.5)I - J$, where $I$ is the identity matrix and $J$ is the all-ones matrix, which is rank 1 and whose single non-zero eigenvalue is $n$. Therefore, this matrix has $n-1$ eigenvalues of $n-0.5$, as well as a single eigenvalue of $-0.5$, and hence is not positive definite.

On the other hand, any submatrix of size $n-1$ is of the form $(n-0.5)I-J$, but where now $J$ is only $(n-1) \times (n-1)$. This matrix now has $n-2$ eigenvalues of $n-0.5$, together with a single eigenvalue of $0.5$, and hence is positive definite. Therefore, the Hessian is positive definite when restricted to any $n-1$ variables, and hence $f$ is convex in any $n-1$ variables, but not in all $n$ variables jointly.

## A Fun Optimization Problem

I spent the last several hours trying to come up with an efficient algorithm to the following problem:

Problem: Suppose that we have a sequence of $l$ pairs of non-negative numbers $(a_1,b_1),\ldots,(a_l,b_l)$ such that $\sum_{i=1}^l a_i \leq A$ and $\sum_{i=1}^l b_i \leq B$. Devise an efficient algorithm to find the $k$ pairs $(a_{i_1},b_{i_1}),\ldots,(a_{i_k},b_{i_k})$ that maximize

$\left[\sum_{r=1}^k a_{i_r}\log(a_{i_r}/b_{i_r})\right] + \left[A-\sum_{r=1}^k a_{i_r}\right]\log\left(\frac{A-\sum_{r=1}^k a_{i_r}}{B-\sum_{r=1}^k b_{i_r}}\right).$

Commentary: I don’t have a fully satisfactory solution to this yet, although I do think I can find an algorithm that runs in $O\left(\frac{l \log(l)}{\epsilon}\right)$ time and finds $2k$ pairs that do at least $1-\epsilon$ as well as the best set of $k$ pairs. It’s possible I need to assume something like $\sum_{i=1}^l a_i \leq A/2$ instead of just $A$ (and similarly for the $b_i$), although I’m happy to make that assumption.

While attempting to solve this problem, I’ve managed to utilize a pretty large subset of my bag of tricks for optimization problems, so I think working on it is pretty worthwhile intellectually. It also happens to be important to my research, so if anyone comes up with a good algorithm I’d be interested to know.

## Eigenvalue Bounds

While grading homeworks today, I came across the following bound:

Theorem 1: If A and B are symmetric $n\times n$ matrices with eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ and $\mu_1 \geq \mu_2 \geq \ldots \geq \mu_n$ respectively, then $Trace(A^TB) \leq \sum_{i=1}^n \lambda_i \mu_i$.

For such a natural-looking statement, this was surprisingly hard to prove. However, I finally came up with a proof, and it was cool enough that I felt the need to share. To prove this, we actually need two ingredients. The first is the Cauchy Interlacing Theorem:

Theorem 2: If A is an $n\times n$ symmetric matrix and B is an $(n-k) \times (n-k)$ principle submatrix of A, then $\lambda_{i-k}(A) \leq \lambda_i(B) \leq \lambda_i(A)$, where $\lambda_i(X)$ is the ith largest eigenvalue of X.

As a corollary we have:

Corollary 1: For any symmetric matrix X, $\sum_{i=1}^k X_{ii} \leq \sum_{i=1}^k \lambda_i(X)$.

Proof: The left-hand-side is just the trace of the upper-left $k\times k$ principle submatrix of X, whose eigenvalues are by Theorem 2 bounded by the k largest eigenvalues of X. $\square$

The final ingredient we will need is a sort of “majorization” inequality based on Abel summation:

Theorem 3: If $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ are such that $\sum_{i=1}^k x_i \leq \sum_{i=1}^k y_i$ for all k (with equality when $k=n$), and $c_1 \geq c_2 \geq \ldots \geq c_n$, then $\sum_{i=1}^n c_ix_i \leq \sum_{i=1}^n c_iy_i$.

Proof: We have:

$\sum_{i=1}^n c_ix_i = c_n(x_1+\cdots+x_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(x_1+\cdots+x_i) \leq c_n(y_1+\cdots+y_n) + \sum_{i=1}^{n-1} (c_i-c_{i+1})(y_1+\cdots+y_i) = \sum_{i=1}^n c_iy_i$

where the equalities come from the Abel summation method. $\square$

Now, we are finally ready to prove the original theorem:

Proof of Theorem 1: First note that since the trace is invariant under similarity transforms, we can without loss of generality assume that A is diagonal, in which case we want to prove that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$. But by Corollary 1, we also know that $\sum_{i=1}^k B_{ii} \leq \sum_{i=1}^k \mu_i$ for all k. Since by assumption the $\lambda_i$ are a decreasing sequence, Theorem 3 then implies that $\sum_{i=1}^n \lambda_i B_{ii} \leq \sum_{i=1}^n \lambda_i \mu_i$, which is what we wanted to show. $\square$

## Local KL Divergence

The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions $p$ and $q$, the KL divergence is defined as

$KL(p || q) := \int p(x) \log(p(x)/q(x)) dx$

Note that $KL(p || q) \neq KL(q || p)$. Intuitively, a small KL(p || q) means that there are few points that p assigns high probability to but that q does not. We can also think of KL(p || q) as the number of bits of information needed to update from the distribution q to the distribution p.

Suppose that p and q are both mixtures of other distributions: $p(x) = \sum_i \alpha_i F_i(x)$ and $q(x) = \sum_i \beta_i G_i(x)$. Can we bound $KL(p || q)$ in terms of the $KL(F_i || G_i)$? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:

Theorem: If $\sum_i \alpha_i = \sum_i \beta_i = 1$ and $F_i$ and $G_i$ are all probability distributions, then

$KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right)$.

Proof: If we expand the definition, then we are trying to prove that

$\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx$

We will in fact show that this is true for every value of $x$, so that it is certainly true for the integral. Using $\log(x/y) = -\log(y/x)$, re-write the condition for a given value of $x$ as

$\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)$

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen’s inequality on the $\log$ function:

$\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)$

This proves the inequality and therefore the theorem. $\square$

Remark: Intuitively, if we want to describe $\sum \alpha_i F_i$ in terms of $\sum \beta_i G_i$, it is enough to first locate the $i$th term in the sum and then to describe $F_i$ in terms of $G_i$. The theorem is a formalization of this intuition. In the case that $F_i = G_i$, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.

Today Arun asked me the following question:

“Under what conditions will a set $\{p_1,\ldots,p_n\}$ of polynomials be quadratically independent, in the sense that $\{p_1^2, p_1p_2, p_2^2, p_1p_3,\ldots,p_{n-1}p_n, p_n^2\}$ is a linearly independent set?”

I wasn’t able to make much progress on this general question, but in the specific setting where the $p_i$ are all polynomials in one variable, and we further restrict to just monomials, (i.e. $p_i(x) = x^{d_i}$ for some $d_i$), the condition is just that there are no distinct unordered pairs $(i_1,j_1),(i_2,j_2)$ such that $d_{i_1} + d_{j_1} = d_{i_2} + d_{j_2}$. Arun was interested in the largest such a set could be for a given maximum degree $D$, so we are left with the following interesting combinatorics problem:

“What is the largest subset $S$ of $\{1,\ldots,D\}$ such that no two distinct pairs of elements of $S$ have the same sum?”

For convenience of notation let $n$ denote the size of $S$. A simple upper bound is $\binom{N+1}{2} \leq 2D-1$, since there are $\binom{N+1}{2}$ pairs to take a sum of, and all pairwise sums lie between $2$ and $2D$. We therefore have $n = O(\sqrt{D})$.

What about lower bounds on n? If we let S be the powers of 2 less than or equal to D, then we get a lower bound of $\log_2(D)$; we can do slightly better by taking the Fibonacci numbers instead, but this still only gives us logarithmic growth. So the question is, can we find sets that grow polynomially in D?

It turns out the answer is yes, and we can do so by choosing randomly. Let each element of $\{1,\ldots,D\}$ be placed in S with probability p. Now consider any k, $2 \leq k \leq 2D$. If k is odd, then there are (k-1)/2 possible pairs that could add up to k: (1,k-1), (2,k-2),…,((k-1)/2,(k+1)/2). The probability of each such pair existing is $p^2$. Note that each of these events is independent.

S is invalid if and only if there exists some k such that more than one of these pairs is active in S. The probability of any two given pairs being simultaneously active is $p^4$, and there are $\binom{(k-1)/2}{2} \leq \binom{D}{2}$ such pairs for a given $k$, hence $(D-1)\binom{D}{2} \leq D^3/2$ such pairs total (since we were just looking at odd k). Therefore, the probability of an odd value of k invalidating S is at most $p^4D^3/2$.

For even $k$ we get much the same result except that the probability for a given value of $k$ comes out to the slightly more complicated formula $\binom{k/2-1}{2}p^4 + (k/2-1)p^3 + p^2 \leq D^2p^4/2 + Dp^3 + p^2$, so that the total probability of an even value of k invalidating S is at most $p^4D^3/2 + p^3D^2 + p^2D$.

Putting this all together gives us a bound of $p^4D^3 + p^3D^2 + p^2D$. If we set p to be $\frac{1}{2}D^{-\frac{3}{4}}$ then the probability of S being invalid is then at most $\frac{1}{16} + \frac{1}{8} D^{-\frac{1}{4}} + \frac{1}{4}D^{-\frac{1}{2}} \leq \frac{7}{16}$, so with probability at least $\frac{7}{16}$ a set S with elements chosen randomly with probability $\frac{1}{2}D^{-\frac{3}{4}}$ will be valid. On the other hand, such a set has $D^{1/4}$ elements in expectation, and asymptotically the probability of having at least this many elements is $\frac{1}{2}$. Therefore, with probability at least $\frac{1}{16}$ a randomly chosen set will be both valid and have size greater than $\frac{1}{2}$, which shows that the largest value of $n$ is at least $\Omega\left(D^{1/4}\right)$.

We can actually do better: if all elements are chosen with probability $\frac{1}{2}D^{-2/3}$, then one can show that the expected number of invalid pairs is at most $\frac{1}{8}D^{1/3} + O(1)$, and hence we can pick randomly with probability $p = \frac{1}{2}D^{-2/3}$, remove one element of each of the invalid pairs, and still be left with $\Omega(D^{1/3})$ elements in S.

So, to recap: choosing elements randomly gives us S of size $\Omega(D^{1/4})$; choosing randomly and then removing any offending pairs gives us S of size $\Omega(D^{1/3})$; and we have an upper bound of $O(D^{1/2})$. What is the actual asymptotic answer? I don’t actually know the answer to this, but I thought I’d share what I have so far because I think the techniques involved are pretty cool.

## Useful Math

I have spent the last several months doing applied math, culminating in a submission of a paper to a robotics conference (although culminating might be the wrong word, since I’m still working on the project).

Unfortunately the review process is double-blind so I can’t talk about that specifically, but I’m more interested in going over the math I ended up using (not expositing on it, just making a list, more or less). This is meant to be a moderate amount of empirical evidence for which pieces of math are actually useful, and which aren’t (of course, the lack of appearance on this list doesn’t imply uselessness, but should be taken as Bayesian evidence against usefulness).

I’ll start with the stuff that I actually used in the paper, then stuff that helped me formulate the ideas in the paper, then stuff that I’ve used in other work that hasn’t yet come to fruition. These will be labelled I, II, and III below. Let me know if you think something should be in III that isn’t [in other words, you think there’s a piece of math that is useful but not listed here, preferably with the application you have in mind], or if you have better links to any of the topics below.

I. Ideas used directly

Differential equations: Lyapunov functions, linear differential equations, Poincaré return map, exponential stability, Ito calculus

Linear algebra: matrix exponential, trace, determinantCholesky decomposition, plus general matrix manipulation and familiarity with eigenvalues and quadratic forms

Multivariable calculus: partial derivative, full derivative, gradient, Hessian, Matrix calculus, Taylor expansion

Inequalities: Jensen’s inequality, testing critical points

Optimization: (non-convex) function minimization

III. Other useful ideas

Function Approximation: variational approximation, neural networks

Graph Theory: random walks and relation to Markov Chains, Perron-Frobenius Theorem, combinatorial linear algebra, graphical models (Bayesian networks, Markov random fields, factor graphs)

Miscellaneous: Kullback-Leibler divergence, Riccati equation, homogeneity / dimensional analysis, AM-GM, induced maps (in a general algebraic sense, not just the homotopy sense; unfortunately I have no good link for this one)

Probability: Bayes’ rule, Dirichlet process, Beta and Bernoulli processes, details balance and Markov Chain Monte Carlo

Spectral analysis: Fourier transform, windowing, aliasing, wavelets, Pontryagin duality

Optimization: quasiconvexity