The integral of csc(x)

[NOTE: At the end of editing this, I found that the substitution used below is famous enough to have a name, and for Spivak to have called it the “world’s sneakiest substitution”.  Glad I’m not the only one who thought so.]

In the course of working through some (very good) material on neural networks (which I may try to work through here later), I noticed that it is beneficial for a so-called “activation function” to be expressible as the solution of an “easy” differential equation.  Here by “easy” I mean something closer to “short to write” than “easy to solve”.

The [activation] function sigma.

In particular, two often used activation functions are
\sigma (x) := \frac{1}{1+e^{-x}}
and
\tau (x) := \tanh{x} = \frac{e^{2x}-1}{e^{2x}+1}.

One might observe that these satisfy the equations
\sigma' = \sigma (1-\sigma),
and
\tau' = 1-\tau^2.
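
As a quick check (my addition; SymPy is assumed here and below, nothing in the original post depends on it), both identities can be verified symbolically:

import sympy as sp

x = sp.symbols('x')
sigma = 1 / (1 + sp.exp(-x))   # the logistic function from above
tau = sp.tanh(x)

# Both differences simplify to 0 exactly when the ODEs hold.
print(sp.simplify(sigma.diff(x) - sigma * (1 - sigma)))  # 0
print(sp.simplify(tau.diff(x) - (1 - tau**2)))           # 0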

By invoking some theorems of Picard, Lindelöf, Cauchy, and Lipschitz (I was only going to credit Picard until Wikipedia set me right), we recall that we could start from these (separable) differential equations and fix a single point to guarantee we would end up at the functions above.  In seeking to solve the second, I found after substituting \cos(u) = \tau that
-\int\frac{du}{\sin{u}} = x+C,
and shortly after that, I realized I had no idea how to integrate csc(u).  Obviously the internet knows (substitute v = \cot(u) + \csc(u) to get -\log|\cot(u)+\csc(u)| as the integral), which is a really terrible answer, since I would never have gotten there myself.

Not the right approach.

Instinctively, I might have tried the approach to the right, which gets you back to where we started, or rewriting the numerator as \cos^2(u)+\sin^2(u), which leads to some amount of trouble, though intuitively this feels like the right way to do it.  Indeed, eventually this might lead you to using half angles (and avoiding integrals of inverse trig functions).  We find
I = \int \frac{du}{\sin{u}} = \int \frac{\cos^2{(u/2)} + \sin^2{(u/2)}}{2\cos{(u/2)}\sin{(u/2)}}\,du.
Avoiding the overwhelming temptation to split this integral into summands (which would leave us with a \cot(u/2)), we instead divide the numerator and denominator by \cos^2{(u/2)} to find
I=\int \frac{1+\tan^2{u/2}}{2\tan{u/2}} du.
Now substituting v = \tan{(u/2)} we find that dv = \frac{1}{2}(1+\tan^2{(u/2)})\,du = \frac{1}{2}(1+v^2)\,du, so making this substitution, and then undoing all our old substitutions:
I = \int \frac{1+v^2}{2v}\cdot\frac{2}{1+v^2}\,dv = \int \frac{dv}{v} = \log{|v|} + C = \log{\left|\tan{\frac{u}{2}}\right|}+C = \log{\left|\tan{\frac{\cos^{-1}\tau}{2}}\right|}+C.

The function tau we worry so much about.  Looks pretty much like sigma.

Using the half angle formulae that everyone of course remembers and dropping the C (remember, there’s already a constant on the other side of this equation), this simplifies to (finally)
I = \log{|\frac{\sqrt{1-\tau^2}}{1+\tau}|}.  Subbing back in and solving for \tau(x) gives, as desired,

\tau(x) = \frac{e^{2x}-1}{1+e^{2x}}.
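
As a sanity check (my addition), SymPy's ODE solver recovers the same function from the initial value problem, and the antiderivative found above really does differentiate back to 1/\sin{u}:

import sympy as sp

x, u = sp.symbols('x u')
f = sp.Function('f')

# tau' = 1 - tau^2 with tau(0) = 0; expect something equivalent to tanh(x).
print(sp.dsolve(sp.Eq(f(x).diff(x), 1 - f(x)**2), f(x), ics={f(0): 0}))

# d/du log(tan(u/2)) should simplify back to csc(u).
print(sp.simplify(sp.diff(sp.log(sp.tan(u/2)), u) - 1/sp.sin(u)))  # 0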

Phew.

Expectations II

A contour plot of the function. Pretty respectable looking hills, maybe somewhere in the Ozarks, if I say so myself.

As a further example of yesterday’s post, I was discussing multivariable calculus with a student who had never taken it, and mentioned the gradient.  Putting our discussion into the framework of this post, here is what he wanted out of such a high dimensional analogue of the derivative of a function f: \mathbb{R}^2 \to \mathbb{R} (note to impressionable readers: the function defined below is not quite the gradient):
1. Name the answer: Call the gradient D.
2. Describe the answer: D should be a function from \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^3, which takes a point in the domain and a direction in the domain, and returns a direction in \mathbb{R}^3 (where the graph of f lives).  The idea being that if you had a map and knew where you were and in which direction you wished to travel, then the gradient should tell you what 3-dimensional direction you would head off in.

Certainly there is such a function, though in some sense we are making it too complicated.  As an example we have some pictures of the beautiful hills formed by the function

f(x,y) = \sin{3y} + \sin{(4x + 5y)} - x^2 - y^2 + 4.

The (actual) gradient of this function is

\nabla f(x,y) = \left(4\cos (4x + 5y) - 2x, 3\cos(3y) - 2y + 5\cos(4x + 5y)\right).
Plugging in a point in the plane will give a single vector, and then taking the dot product of this vector with a direction will give a rate of change for f at that point, in that direction.  Specifically, if we start walking north at unit speed from the origin, the gradient will be (4,8), and taking the dot product of this with (0,1) shows that we will be climbing at 8 m/s (depending on our units!)

Now the correct answer from my student’s point of view would be that the answer is (0,1,8), since this is the direction in 3 dimensions that one would travel, and the correct definition for D would have
Df(x,y) \cdot v = \left(v_1, v_2, \nabla f(x,y) \cdot v \right).
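
Here is a minimal sketch of this “pseudo-gradient” (my code; the name pseudo_grad is made up), which reproduces the (0,1,8) above:

import sympy as sp

x, y = sp.symbols('x y')
f = sp.sin(3*y) + sp.sin(4*x + 5*y) - x**2 - y**2 + 4
grad = [f.diff(x), f.diff(y)]

def pseudo_grad(px, py, v):
    # Df(x,y) . v = (v1, v2, grad f(x,y) . v), as defined above.
    g = [gi.subs({x: px, y: py}) for gi in grad]
    return (v[0], v[1], g[0]*v[0] + g[1]*v[1])

print(pseudo_grad(0, 0, (0, 1)))  # (0, 1, 8): walking north from the origin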

The graph of the indicated function, including the vector of the "pseudo-gradient" we discuss.

Of course there are more sophisticated examples of this.  Suppose a function u: \mathbb{R}^n \to \mathbb{R} is harmonic, that is to say, \Delta u := \sum_{j = 1}^n \frac{\partial^2 u}{\partial x_j^2} = 0.  Notice that in order to write down this equality, we already named our solution u.  But just working from this equation, we can deduce a number of properties that any solution must have: u is infinitely differentiable and, restricted to any compact set, attains its maximum and minimum on the boundary of that set.  Such properties quickly allow us to narrow down the possibilities for solutions to problems.
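
For a concrete instance (my example, not one from the post), SymPy confirms that u(x,y) = x^2 - y^2 is harmonic:

import sympy as sp

x, y = sp.symbols('x y')
u = x**2 - y**2

# The Laplacian is the sum of the pure second partials.
print(u.diff(x, 2) + u.diff(y, 2))  # 0, so u is harmonic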

A picture of some hills that might be shaped like the function we're looking at. In the Ozarks of all places!

Expectations

Hard thinkin' being done today.

It is useful (for me!) to think about the importance of math as teaching us how to think about problems, rather than providing us with useful factoids (I’m looking at you, history class).  There are a lot of problems/puzzles/patterns in the world, and the chance of seeing the same problem twice is very low (and really, I’ve never seen Batman use the Pythagorean theorem even once, so what’s the point?), so we focus on solving problems in as broad a context as possible.  In this way, I’d argue, mathematicians become very good problem solvers (“toot! toot!” <– my own horn).

One method of problem solving I would like to focus on today is to name and describe your answer before you have found it.  As a simple example, in order to answer the question “what number squared is equal to itself?”, we would:
1. Name the answer: Suppose x squared is equal to x.
2. Describe the answer: This is where the previously developed machinery comes in: we know that x^2 = x, so we deduce that x also has the property x(x-1) = 0, and conclude that either x = 0 or x = 1.
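
The same two steps can even be handed to a computer algebra system; a tiny SymPy version of “name, then describe” (my addition):

import sympy as sp

x = sp.symbols('x')                  # 1. Name the answer.
print(sp.solve(sp.Eq(x**2, x), x))   # 2. Describe it; prints [0, 1]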

A geometric way of looking at the word problem. NOT TO SCALE.

As a second example, much of linear algebra is naming objects, describing them, and then realizing you accidentally completely described them.  For example, suppose we wanted to identify every matrix with a number, and make sure that every singular matrix has determinant 0:
1. Name the answer: Let’s call the answer the determinant, or det() for short.
2. Describe the answer: det() should be a function from matrices to numbers, and at least satisfy the following properties: (i) det(I) = 1, so that the identity matrix is associated with the number 1 (so at least some nonsingular matrices will not have determinant zero), (ii) if the matrix A has a row of zeros, then det(A) = 0 (so that at least some singular matrices will have determinant zero), and (iii) the determinant is multilinear and alternating in the rows, which takes some motivation, but is what makes the determinant detect every singular matrix.

Well, it turns out that these three properties have already completely determined the object we are looking for!  If I had been greedy and asked (iv) each nonsingular matrix is associated with a unique number, then I would have deduced that no such map exists.  If I had not included property (iii), then I would have found there are many such maps.  It is a fairly enjoyable exercise to deduce the other properties of determinants starting from just these three rules.
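
A quick numerical spot check of these rules (my code, using NumPy with a fixed seed):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# (i) det(I) = 1.
print(np.linalg.det(np.eye(4)))  # 1.0

# (ii) A row of zeros forces determinant 0.
B = A.copy(); B[2] = 0.0
print(np.linalg.det(B))  # 0.0 (up to rounding)

# (iii) Multilinearity in a single row: replace row 2 by row2 + c * w.
w = rng.standard_normal(4); c = 3.7
C, D = A.copy(), A.copy()
C[2] = A[2] + c * w
D[2] = w
print(np.allclose(np.linalg.det(C), np.linalg.det(A) + c * np.linalg.det(D)))  # True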

More filler photos! This is from Cinque Terre in Italy, between two of the towns.

Another nice theorem

Trying to visualize the projection map using fibers. You'll have to take my word that the lines stop before getting to the origin.

Today’s Theorem of the Day (TM) I used to compute the Jacobian of a radial projection.  In particular, consider the map F: \mathbb{R}^n \to \mathbb{R}^n where x \mapsto x/|x| for all |x| > 1 (and x \mapsto x otherwise).  This projects the exterior of the unit ball onto its boundary sphere, and leaves the interior untouched.  For |x| > 1 we may compute the derivative \frac{\partial F_j}{\partial x_k} = \frac{\delta_{jk}|x|^2 - x_j x_k}{|x|^3}.

To calculate the Jacobian of F means we have to calculate the determinant of that matrix.  With a little figuring, and treating x as a column vector, we can write that last sentence as |JF(x)| = \det \left(\frac{1}{|x|} \left( I - \frac{xx^T}{|x|^2} \right) \right) = \frac{1}{|x|^n} \det \left(I-\frac{xx^T}{|x|^2}\right).

Now we apply The Theorem, which Terry Tao quoted Percy Deift as calling (half-jokingly) “the most important identity in mathematics”, and Wikipedia calls, less impressively, “Sylvester’s determinant formula“.  Its usefulness derives from turning the computation of a very large determinant into a much smaller determinant.  At the extreme, we apply the formula to column vectors u and v, and it says that \det (I+uv^T) = 1+v^Tu.  In our case, it yields |JF(x)| = \frac{1}{|x|^n}\left(1 - \frac{x^Tx}{|x|^2}\right) = 0.  Thus we turned the problem of calculating the determinant of an n x n matrix into calculating the determinant of a 1 x 1 matrix.
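
Numerically (my sketch, NumPy assumed), both the direct determinant and Sylvester’s formula check out:

import numpy as np

rng = np.random.default_rng(1)
n = 6
x = rng.standard_normal(n)
x *= 2.0 / np.linalg.norm(x)   # push x outside the unit ball, so |x| = 2

r = np.linalg.norm(x)
J = (np.eye(n) - np.outer(x, x) / r**2) / r

# Direct computation of the Jacobian determinant: zero up to floating point.
print(np.linalg.det(J))

# Sylvester's formula, det(I + u v^T) = 1 + v^T u, on a random pair.
u, v = rng.standard_normal(n), rng.standard_normal(n)
lhs = np.linalg.det(np.eye(n) + np.outer(u, v))
print(np.isclose(lhs, 1 + v @ u))  # True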

Pretty nifty.

Busy days.

Somehow spring break has turned into one of the busier weeks of my year.  Trying to keep up with real life work has not left a ton of time for writing anything thoughtful/reasonable, though at least for continuity I will try to keep a paragraph or so up here each day with my favorite thought of the day.  This also means I can reuse some old graphics!

Today I really enjoyed a particular fact about Sobolev functions.  Recall that these are actually equivalence classes of functions, as they are really defined under an integral sign, which “can’t see” sets of measure zero.  However, the following quantifies exactly how small the bad set might have to be:

If f \in W^{1,p}(\Omega) for \Omega \subset \mathbb{R}^n, then the limit \lim_{r \to 0} \frac{1}{\alpha(n)r^n}\int_{B(x,r)}f(y)~dy (where \alpha(n) is the volume of the unit ball in \mathbb{R}^n) exists for all x outside a set E with \mathcal{H}^{n-p+\epsilon}(E) = 0 for all \epsilon > 0, where \mathcal{H}^s denotes s-dimensional Hausdorff measure.

Put another way, every Sobolev function may be “precisely defined” outside a set of small dimension, where the dimension gets smaller as p gets larger.  I suppose a given representative may be worse, but this allows you to choose a representative of the equivalence class with these nice properties.

The fibers of two functions in a sequence. I was thinking the above argument might imply that the limit was not Sobolev, but the limit is precisely represented outside a set with positive 1-dimensional measure, so the result is silent on this issue.

L’Hopital’s rule.

Two photos from a recent trip up north. Major bonus points for knowing which of New England's many trails this was taken on.

L’Hopital’s rule is really how every student of calculus (and I believe Leibniz, though I cannot find a reference) wishes the quotient rule worked.  Specifically, that

\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}.

Of course, it can’t be that easy.  We also need that f and g are differentiable in a neighborhood of a with g' \neq 0 near a, that both functions approach 0, or both approach \infty, or both approach -\infty as x approaches this point a, and finally that the limit on the right hand side exists (though we all recall that if it does not work the first time, we may continue to apply L’Hopital until the limit does exist, which then justifies using L’Hopital in the first place).

I was thinking of this rule in relation to generating interesting examples of limits.  In particular, if we are in a situation where L’Hopital’s applies, then we can apply the rule in two ways:

\lim_{x\to a}\frac{f'(x)}{g'(x)}=\lim_{x \to a}\frac{f(x)}{g(x)}=\lim_{x\to a}\frac{\left(\frac{1}{g(x)}\right)'}{\left(\frac{1}{f(x)}\right)'}.

Proceeding informally (i.e., I’m not going to keep track of hypotheses), the right hand side of this evaluates to
\lim_{x\to a}\frac{f(x)}{g(x)}=\lim_{x\to a}\frac{f(x)^2}{g(x)^2}\frac{g'(x)}{f'(x)}.

This is all well and good: the right hand side looks appropriately ugly, but now the trick is picking f and g to get interesting limits.  I have worked out two reasonable examples:

1. Choosing f(x) = \sin{x} and g(x) = x, we get

\lim_{x \to 0} \frac{\sin{x}}{x^2}\tan{x} = 1.

Also, moderate amounts of bonus points for naming (at least) two universities in the northeast with this mascot.

2. Choosing f(x) = e^x-1 and g(x) = \log{x}, and applying (hopefully correctly!) a number of logarithm rules, we can get

\lim_{x \to 0^+} \frac{(e^x-1)^2}{\log{(x^{xe^x\log{x}})}} = 0.

What would be interesting is to find an example where it is difficult/impossible to evaluate without recognizing that it was created using this process.  This second example might fit the “difficult” bill, as I would not want to take the derivative of the denominator directly, but factoring, you might recognize it as xe^x (\log{x})^2, and then be able to reverse engineer this process, somehow.
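
Both limits check out symbolically; a short SymPy confirmation (my addition, with the second denominator written in its factored form):

import sympy as sp

x = sp.symbols('x', positive=True)

# Example 1: (sin x / x^2) * tan x  ->  1 as x -> 0.
print(sp.limit(sp.sin(x)/x**2 * sp.tan(x), x, 0))  # 1

# Example 2: (e^x - 1)^2 / (x e^x log(x)^2)  ->  0 as x -> 0+.
print(sp.limit((sp.exp(x)-1)**2 / (x*sp.exp(x)*sp.log(x)**2), x, 0, '+'))  # 0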

As usual, just a thought I’ve been playing with.

More with fibers of functions

I posted earlier on a way of visualizing the fibers of certain maps from high dimensions to low dimensions.  Specifically, if the range can be embedded in the domain so that f is the identity on the image of the range, then we can draw the inverse image of each point.  I had some images of functions whose inverse images were tori, but had trouble making these sorts of images for maps f: \Omega \subset \mathbb{R}^3 \to \mathbb{R}^2, where the inverse image of a point is a line.  Well, no more!  Here are two images: one is the projection of a cube onto a square, and the other, somewhat more complicated, is the string hyperboloid map.  See the previous post for more details on these specific maps, but I just thought these were nice images!

Fibers of the projection map from the cube to the square.

Fibers of the "twisted cylinder", which are again straight lines.