The integral of csc(x)

[NOTE: At the end of editing this, I found that the substitution used below is famous enough to have a name, and for Spivak to have called it the “world’s sneakiest substitution”.  Glad I’m not the only one who thought so.]

In the course of working through some (very good) material on neural networks (which I may try to work through here later), I noticed that it was beneficial for a so-called “activation function” to be able to be written as the solution of an “easy” differential equation.  Here by “easy” I mean something closer to “short to write” than “easy to solve”.

The [activation] function sigma.

The [activation] function sigma.

In particular, two often used activation functions are
\sigma (x) := \frac{1}{1+e^{-t}}
and
\tau (x) := \tanh{x} = \frac{e^{2x}-1}{e^{2x}+1}.

One might observe that these satisfy the equations
\sigma' = \sigma (1-\sigma),
and
\tau' = 1-\tau^2.

By invoking some theorems of Picard, Lindelof, Cauchy and Lipschitz (I was only going to credit Picard until wikipedia set me right), we recall that we could start from these (separable) differential equations and fix a single point to guarantee we would end up at the functions above.  In seeking to solve the second, I found after substituting cos(u) =τ that
-\int\frac{du}{\sin{u}} = x+C,
and shortly after that, I realized I had no idea how to integrate csc(u).  Obviously the internet knows (substitute v = cot(u) + csc(u) to get the integral being –log(cot(u)+csc(u))), which is a really terrible answer, since I would never have gotten there myself.

Not the right approach.

Not the right approach.

Instinctually, I might have tried the approach to the right, which gets you back to where we started, or by changing the numerator to cos2x+sin2x, which leads to some amount of trouble, though intuitively, this feels like the right way to do it.  Indeed, eventually this might lead you to using half angles (and avoiding integrals of inverse trig functions).  We find
I = \int \frac{du}{\sin{u}} = \int \frac{\cos^2{u/2} + \sin^2{u/2}}{2\cos{u/2}\sin{u/2}}.
Avoiding the overwhelming temptation to split this integral into summands (which would leave us with a cot(u)), we instead divide the numerator and denominator by sin2(u) to find
I=\int \frac{1+\tan^2{u/2}}{2\tan{u/2}} du.
Now substituting v = tan(u/2)we find that dv = 1/2 (1+tan2(u/2))du = 1/2(1+v2)du, so making this substitution, and then undoing all our old substitutions:
I = \int \frac{1+v^2}{v}*\frac{2}{1+v^2}dv = \int \frac{dv}{v} = \log{|v|} + C = \log{|\tan{\frac{u}{2}}|}+C = \log{|\tan{\frac{\cos^{-1}\tau}{2}}|}+C.

The function tau we worry so much about.  Looks pretty much like sigma.

The function tau we worry so much about. Looks pretty much like sigma.

Using the half angle formulae that everyone of course remembers and dropping the C (remember, there’s already a constant on the other side of this equation), this simplifies to (finally)
I = \log{|\frac{\sqrt{1-\tau^2}}{1+\tau}|}.  Subbing back in and solving for \tau(x) gives, as desired,

\tau(x) = \frac{e^{2x}-1}{1+e^{2x}}.

Phew.

Expectations II

A contour plot of the function. Pretty respectable looking hills- maybe somewhere in the Ozarks- if I say so myself.

As a further example of yesterday’s post, I was discussing multivariable calculus with a student who had never taken it, and mentioned the gradient.  Putting our discussion into the framework of this post, here is what he wanted out of such a high dimensional analogue of the derivative of a function f: \mathbb{R}^2 \to \mathbb{R} (note to impressionable readers: the function defined below is not quite the gradient):
1. Name the answer: Call the gradient D.
2. Describe the answer:  should be a function from \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^3, which takes a point in the domain, a direction in the domain, and returns the vector in the range.  The idea being that if you had a map, knew where you are and in which direction you wished to travel, then the gradient should tell you what 3-dimensional direction you would head off in.

Certainly there is such a function, though in some sense we are making it too complicated.  As an example we have some pictures of the beautiful hills formed by the function

f(x,y) = \sin{3y} + \sin{(4x + 5y)} - x^2 - y^2 + 4.

The (actual) gradient of this function is

\nabla f(x,y) = \left(4\cos (4x + 5y) - 2x, 3\cos(3y) - 2y + 5\cos(4x + 5y)\right).
Plugging in a point in the plane will give a single vector, and then taking the dot product of this vector with a direction will give a rate of change for f at that point, in that direction.  Specifically, if we start walking north at unit speed from the origin, the gradient will be (4,8), and I take the dot product of this with (0,1) to find that I will be climbing at 8m/s (depending on our units!)

Now the correct answer from my student’s point of view would be that the answer is (0,1,8), since this is the direction in 3 dimensions that one would travel, and that the correct definition for would have
Df(x,y) \cdot v = \left(x,y,\nabla(x,y) \cdot v \right).

The graph of the indicated function, including the vector of the "pseudo-gradient" we discuss.

Of course there are more sophisticated examples of this.  Suppose a function u: \mathbb{R}^n \to \mathbb{R} is harmonic.  That is to say, \Delta u := \sum_{j = 1}^n \frac{\partial^2 u}{\partial x_j^2} = 0.  Notice that in order to write down this equality, we already named our solution u.  But just working from this equation, we can deduce a number of qualities that any solution must have: u is infinitely differentiable and, restricted to any compact set, attains its maximum and minimum on the boundary of that set.  Such properties quickly allow us to narrow down the possibilities for solutions to problems.

A picture of some hills that might be shaped like the function we're looking at. In the Ozarks of all places!

Math-ing the place up

A convex function, with one secant line drawn in red.

Haven’t had a straight up “math” day in quite a while here.  That ends now.  While reading an article [PDF link], I came across the definition of a Young function: a function F: [0,\infty) \to [0,\infty) which is convex and where F(t) = 0 only when t = 0.  Recall that for a real valued function, convex means that every secant line lies above the curve.  So far, so good.  Such a function might serve as a measure of how alike two functions are: rather than looking at (for example) the L^2 norm of u-v, we might first see what we can say about F(|u-v|), a more general question.

But here’s the proposition that made me pause: For 1<q<m<p, there is a differentiable Young function F and a constant C(m,p,q)>0 so that

1. \int_0^{\infty}F'(t)^{-\frac{1}{m-1}}~dt\leq C, and

2. q\leq \frac{tF'(t)}{F(t)}\leq p for all t > 0.

A non-convex function, with a secant line drawn in passing beneath the function.

In fact, there was one more somewhat technical, but non-trivial assertion about this F (proposition 2.1 in the linked paper), but let’s focus on these two.  Initially I was convinced that no such F existed, even satisfying these two conditions.  Here’s how my erroneous thoughts went: suppose such an F were to exist.  Property 2 gives us then that \frac{qF(t)}{t} \leq F'(t).  Solving this “differential inequality” gives us that F(t) \geq A_1t^q.  A similar calculation will also yield that F(t) \leq A_2t^p.

Now as a “back of the envelope” calculation, I tried plugging the bounds of F into property 1.  Specifically, first I computed

\int_0^{\infty} A_1qt^{\frac{q-1}{m-1}}dt.

Since q<m, the exponent is (strictly) smaller than 1, and the integral diverges (the indefinite integral looks like ct^{\alpha}, where 0<\alpha < 1).  In particular, it does great from 0 to any finite number, but has a “fat tail”.  Similarly, the integral \int_0^{\infty} A_1qt^{\frac{p-1}{m-1}}dt diverges, but this time because its singularity near zero is too big (the indefinite integral is the same as the previous one, though now $\alpha < 0$.  So this one does great from a very small number to infinity, but ultimately diverges.

Possibly I’ve given away the answer since I have emphasized my mis-steps, but here is an example of a Young function satisfying the first two properties.  The trick is to construct the function out of two pieces:

The constructed Young function, near where the two "pieces" join.

F(t) = c_1 t^q for small t, and F(t) = c_2t^p for large t.  You can even select c_1, c_2 so that the derivative is continuous.  Explicitly, suppose m = 3.  Then we may set F(t) = t for t < 1, and F(t) = \frac{t^5+4}{5} for t\geq 1.  Notice that F(t) is continuously differentiable, and that \int_0^{\infty}F'(t)^{-\frac{1}{m-1}}~dt = 2+1 = 3, so we know that C(3,1,5) \geq 3.

Slices followup

More from yesterday.  Of course I see the following .gif in the course of my daily reading, which blows all my animations out of the water:

This is a sequence of 2-D slices of the human body, collected from the Visible Human Project.  For those who don’t go to the wikipedia page, this data was collected in a much more… invasive manner than the MRI I described yesterday.  Specifically, a cadaver was “frozen in a gelatin and water mixture”, then they grinded away 1mm, take a picture, and repeat.    There is also a female data set with 0.33mm resolution.  Also, apparently this guy was a Texan murderer, which is a far less pleasant tie-in than, say, Tibor Rado being at Rice before solving Plateau’s problem, or Milgram being at Syracuse before proving his theorem.

There is also a fantastic viewer for this dataset hosted here.

In any case, this sort of image is a good introduction to the next logical place I was going to go with visualizing data.  That is, keeping track of five dimensions of data at once.  Yesterday’s images had three spatial dimension and one color dimension.  Today’s .gif has two spatial dimension, one color dimension, and one time dimension.

Long exposures and fire: another way to "see" another dimension.

This suggests (and I plan to provide examples of this) having three space dimensions, one color dimension, and one time dimension.  A great example would be watching a person’s temperature change over time, or how hot an oven is over time.  For today though, we have two somewhat boring examples, in that the color and time dimensions are equal to each other (i.e., I colored these by recording the height, and making that the intensity.

This first is an implementation of a variant of the heat equation with “random shocks” introduced, and Von-Neumann boundary conditions (which means: no energy escapes the system).  It is meant to model how head would travel, and the “random shocks” are small points of heat introduced into the system.

The second image is an implementation of a variant of the wave equation (in both these cases, when I say “variant”, I am allowing just a little bit of energy to leave the system so that there is not too much “heat” at the end of the simulation… things tended to just white-out without this).  The von Neumann boundary conditions are much more evident in this one, where you can see waves “bouncing” off the walls.

How to solve a variational problem, II

Lots of text today = lots of filler photos. Some ducks from the park next door to Rice.

This is a method used to show that a minimizer exists for a variational problem, though it typically provides no clues as to what the minimizer is.  It is typically referred to as “the direct method in the calculus of variations”.  See yeseterday’s post, at least the first paragraph or two, for background.

First, a point I forgot about yesterday since it is such a convention: “why minimization?”  I would guess this is motivated by physics, and has been explored by Hildebrandt and Tromba in The Parsimonious Universe– roughly, that nature seeks to minimize (rather than maximize) energy spent.  So much of math has been motivated by physics (often useful, as it provides an intuition as to what should be true), it is only natural many of our questions are posed in this manner.  This parsimony can also explain why we typically make sure that the Lagrangian is nonnegative- since energy is nonnegative.

From xkcd. We take the second path here.

Now, the direct method, along with intuition of how we find a least area surface given a prescribed boundary, and what could go wrong in general.  We denote the area of a surface u by F[u]:

1.

Geometry? A bridge in upstate New York?

Show that there is at least one candidate minimizer.  For the least area surface, we pick a point not on the boundary, and draw a straight line from that point to each boundary point to make our first candidate.  As a silly example of what could go wrong, consider trying to minimize the integral of |u|, subject to u(0) = 1 and u(0) = 2.  Well, we just cannot satisfy that boundary condition with any function.

2. Take an infimum.  We now have a nonempty set of candidates \{ u_{\alpha} \}, which is associated with a nonempty set of real numbers \{F[u_{\alpha}]\}.  Going back to the physics explanation, I’ll assume that the integrand is bounded below by zero, and invoke a property of the real numbers to say that there is a greatest lower bound.  It will be our goal to find a surface realizing this infimum.  Notice though that this step is not constructive, so right here we lose our chance at finding a minimizer analytically.  Really the only thing that could go wrong is if our integrand is not bounded below.  It would be somewhat silly to try to find the function whose integral is the “most negative”.

Once we have an infimum, M, we can take a sequence of candidates \{u_j\} which converge to the infimum, F[u_j] \to M.  Notice that the candidate functions do not necessarily converge to anything.  Yet.

"Bikeman" in some of the most beautiful woods in the world.

3. Invoke compactness.  Now, lurking under the surface has been this “space of admissible functions”.  Often this is a Sobolev space, which has the property that the unit ball is “(weak-*) compact”.  See this short paper (pdf link) by Frank Morgan on compactness, which he called “the most important concept in mathematics”.  Compactness implies that every bounded sequence has a convergent subsequence.  Here “bounded” means “with respect to the norm on the space”.  In the case of area minimizers, there is a “mass norm” on the admissible functions (“integer rectifiable currents”) which measures the surface area, and a famous (for some value of the word “famous”)  compactness theorem for these functions (defining a space of admissible functions that was physically relevant and so that such a theorem existed was, perhaps, the most difficult part of Plateau’s problem).  Hence, we now have a subsequence \{u_{j'} \}, and a target (admissible) function u so that u_{j'} \to u and F[u_{j'}] \to M, still.

A sequence of functions converging to the infimum for the integral of |u(x)| subject to u(0) = 0, u(1) = 1. However, there is no smooth function that they are converging to. This sequence appears on the cover of Frank Morgan's real analysis text.

Last thing to show is that F[u] = M.  The two conditions we’ve established certainly nod suggestively at this last equation, but don’t give it themselves.  The figure at right shows what could go wrong.

4. Lower semicontinuity of the functional. We need to show that F[u] = M.  A typical approach is to prove that the functional is lower semicontinuous.  That is to say, F[\liminf u_j] \leq \liminf F[u_j] (fun story: when I asked my advisor to define “upper semicontinuity”, he gave me the finger.  Bonus points if you can see why this is a reasonable definition).  If this is true, then we have M \leq F[u] = F[\liminf u_j] \leq \liminf F[u_j] = M, and are done.  The surface area is lower semicontinuous, which is how we complete the proof of the existence of area minimizing surfaces.

Reusing figures! Non lowersemicontinuity.

The Takagi curves from yesterday provide an example of a functional which is not lower semicontinuous, since F[u_j] \to 0, but F[u] = 1.

How to solve a variational problem, I

Only had 5 figures and 600 words, so here's a photo of a saddle in the Guadalupe Mountains. I like that you can see all the "local maxima" easily. If you look closely, you can even see a trail crossing right at the lowest point of the saddle.

Last week, I discussed how to solve a PDE.  This week, I want to talk about how to solve a variational problem.  By variational problem, I will mean minimizing an integral like

F[u] = \int_U L(Du,u,x)~dx,

subject to some boundary conditions, where U is a subset of some Euclidean space,  u is the function we are solving for (which maps from U to some other Euclidean space), and Du is the differential of u.  If u: U \subset \mathbb{R}^m \to \mathbb{R}^n, then Du is an n x m matrix (though for a real valued u, i.e. where n =1, it is customary to write Du as a row vector, column vectors being hell for typesetting).

Today I focus on examples of variational problems which can be solved analytically, while tomorrow will be an outline of the so-called “direct method in the calculus of variations”.

One minimizer of the integral of u'(x). Any function would do.

First, a variational problem which is (hopefully) easily solved by any student of calculus: Find a function u: [0,1] \to \mathbb{R} which minimizes \int_0^1 u'(x)~dx, subject to u(0) = a, u(1) = b.

In this case, we know that any function that starts at a and ends at b will be a minimizer, since, by the fundamental theorem of calculus,

\int_0^1u'(x)~dx = u(1)-u(0) = b-a.

Hence, we have existence, but not uniqueness.

The most famous variational problem is surely minimizing surface area (or length, or volume depending on the dimension of the domain) of a graph.  That is, finding a real valued function u that minimizes

\int_U \sqrt{1+|Du|^2}~dx

subject to some boundary conditions.

The *only* function that minimizes length, with u(0) = a, u(1) = b.

Longtime reader(s) will recall the Euler-Lagrange equationswhich can sometimes provide a solution to such problems:

\left. D_uL(Du,u,x)-div_x(D_pL(Du,u,x))=0 \right..

As an easy example, if we wanted to minimize the length of a line, we would use the above functional, Du = u’ in this case (the ol’ 1 x 1 matrix), and the Euler-Lagrange equations give

\frac{d}{dx}\left( \frac{u'(x)}{\sqrt{1+(u'(x))^2}} \right) = 0.

Rather than calculate this derivative, we notice that this means

\frac{u'(x)}{\sqrt{1+(u'(x))^2}} = C

so

u'(x) = \frac{C}{\sqrt{1- C^2}} = D.

Hence, u’ is a constant, and our solution must be a straight line, u(x) = Dx+E.

We also point out that it should be surprising that there are any general results for solving variational problems.  This post was inspired by Leonid Kovalev’s discussion of the Takagi curve, which is a fractal whose iterates look like a famous counterexample.  Specifically, suppose we wish to minimize

\int_0^1 u(x)^2+(1-|u'(x)|)^2~dx,

subject to u(0) = u(1) = 0.  Notice that the integrand is

  1. Takagi curves

    Always nonnegative,

  2. small when u is close to 0,
  3. small when u’ is close to +1 or -1,
  4. only 0 if u is (almost) always 0 and |u’| is (almost) always plus or minus 1.

The sequence of functions pictured at the bottom of the post (which forms the Takagi curve) will always have 0 for the derivative part, and will be getting smaller and smaller on the u(x)^2 part of the integrand.  In fact, if you name any very small number \epsilon > 0, we can choose a member of this sequence so the integral that is smaller than \epsilon.  But there is no function where the integral of this Lagrangian is zero, so there is no minimizing function for this particular Lagrangian!

A sequence of functions, where the integrals of the Lagrangians converge to zero, but there is no limit where the integral of the Lagrangian is zero.

As an aside, I’m sure that this Lagrangian is named/famous, but I could not find mention of it in any of my usual sources…

MATLAB code to generate the above gif.  Remove the “pause” commands to just plot everything at once:

function takcurves(n)
%generates n iterates of the Takagi curves
ColOrd = hsv(n);
figure(1)
set(gcf, 'Position', get(0,'Screensize'));
hold on
pause(0.4)
for j = 1:n
x = 0:.5^j:1;
y = zeros(size(x));
y(2:2:size(y,2)) = .5^j;
line(x,y,'LineSmoothing','on','Color',ColOrd(j,:))
hold on
pause(0.4)
end
hold off
end


How to solve a PDE

I briefly considered trying to make the case that smoke was governed by PDEs. Really, I just like this photo.

So here’s an attempt to summarize how to prove that a solution exists for linear, second order partial differential equations.  This is some of my favorite mathematics, though at first I was really turned off by all this talk about Sobolev spaces and functional analysis, instead of just solving the darn thing.  So I encourage those who have never seen any PDE theory yet to dig in (though maybe some ODE familiarity would be helpful).  Unfortunately, I could not come up with any helpful figures/pictures for today (suggestions welcome!), so you get a potpourri of shots.

First, let me define a second order linear differential operator L, but let me do so by describing what it does to a function u: \mathbb{R}^n \to \mathbb{R} (since a second order linear operator is just something that eats functions and gives back functions, in a linear manner):

Lu = -\sum_{j,k = 1}^n (a^{j,k}(x)u_{x_j}(x))_{x_k} + \sum_{j = 1}^n b^j(x)u_{x_j}(x) + c(x)u(x).

The photos on the page are mine. The dog, unfortunately, is not.

One famous example of such an operator is the Laplacian, denoted \Delta, for which a^{j,k} is 1 when j = k, and 0 otherwise, and b^js and c are zero.  Put more simply, the Laplacian is the sum of the pure second derivatives.  Now we provide first an outline, and then a few details of a proof of existence of solutions.

How to solve Lu = f:

1. Associate to the operator L a bilinear form B.  That is, a function B that eats two functions and gives a number which is linear in each coordinate.
2. Remember from functional analysis that, under certain conditions, there exists a unique u so that  B[u,v] = \lambda(v) for any v,  where \lambda is a linear functional (eats functions, gives a number).
3. Define \lambda(v) := \int f(x) v(x)~dx.
4. Pat yourself on the back for defining B in a clever way so that the u produced actually solves Lu = f in a weak sense.

Again, with slightly more detail:

1. We’ll define
B[u,v]:=\int\left(\sum_{j,k = 1}^na^{j,k}u_{x_j}v_{x_k}+\sum_{j = 1}^nb^ju_{x_j}v+cuv\right)dx,

It's a fish!

where everything is a function of x, but that really clutters up the equation.  A reason you might have thought to do this yourself is that it takes a little pressure off of u.  That is, we only need to require that u has one derivative, and that derivative only has to be defined under an integral sign.  If u happened to have two derivatives, we could integrate the first term by parts, and then we would have B[u,v]=\int Lu\cdot v~dx.  In this beautiful, integrable, differentiable world, if we could show that \int Lu \cdot v~dx = \int f \cdot v~dx for all functions v, then it would be reasonable to conclude that Lu = f.
2. The Lax-Milgram theorem says, roughly, that if H is a Hilbert space, then for a bilinear form B: H \times H \to \mathbb{R} and a (continuous) linear functional \lambda: H \to \mathbb{R}, there is a unique u \in H

Again, I could make the case that gravity is a differential equation, whose solutions are parabolas...

so that B[u,v]=\lambda(v) for each v \in H.  There are some other non-trivial hypotheses on B, which may be translated into hypotheses on L, but let’s revel in success now.
3.-4. These steps were actually described pretty well above.

So we need functional analysis to use this great Lax-Milgram theorem.  We need Sobolev spaces because our favorite function spaces from calculus- C^m, the m-times differentiable functions- are not Hilbert spaces.  Certain Sobolev spaces are, though, and they provide precisely the integrability conditions we need to get precise conditions to guarantee solutions for large classes of PDE.

The 100 words/picture is tough sometimes, but other times I get to put artsy photos in, so everything balances out.