The integral of csc(x)

[NOTE: At the end of editing this, I found that the substitution used below is famous enough to have a name, and for Spivak to have called it the “world’s sneakiest substitution”.  Glad I’m not the only one who thought so.]

In the course of working through some (very good) material on neural networks (which I may try to work through here later), I noticed that it was beneficial for a so-called “activation function” to be able to be written as the solution of an “easy” differential equation.  Here by “easy” I mean something closer to “short to write” than “easy to solve”.

The [activation] function sigma.

The [activation] function sigma.

In particular, two often used activation functions are
\sigma (x) := \frac{1}{1+e^{-t}}
\tau (x) := \tanh{x} = \frac{e^{2x}-1}{e^{2x}+1}.

One might observe that these satisfy the equations
\sigma' = \sigma (1-\sigma),
\tau' = 1-\tau^2.

By invoking some theorems of Picard, Lindelof, Cauchy and Lipschitz (I was only going to credit Picard until wikipedia set me right), we recall that we could start from these (separable) differential equations and fix a single point to guarantee we would end up at the functions above.  In seeking to solve the second, I found after substituting cos(u) =τ that
-\int\frac{du}{\sin{u}} = x+C,
and shortly after that, I realized I had no idea how to integrate csc(u).  Obviously the internet knows (substitute v = cot(u) + csc(u) to get the integral being –log(cot(u)+csc(u))), which is a really terrible answer, since I would never have gotten there myself.

Not the right approach.

Not the right approach.

Instinctually, I might have tried the approach to the right, which gets you back to where we started, or by changing the numerator to cos2x+sin2x, which leads to some amount of trouble, though intuitively, this feels like the right way to do it.  Indeed, eventually this might lead you to using half angles (and avoiding integrals of inverse trig functions).  We find
I = \int \frac{du}{\sin{u}} = \int \frac{\cos^2{u/2} + \sin^2{u/2}}{2\cos{u/2}\sin{u/2}}.
Avoiding the overwhelming temptation to split this integral into summands (which would leave us with a cot(u)), we instead divide the numerator and denominator by sin2(u) to find
I=\int \frac{1+\tan^2{u/2}}{2\tan{u/2}} du.
Now substituting v = tan(u/2)we find that dv = 1/2 (1+tan2(u/2))du = 1/2(1+v2)du, so making this substitution, and then undoing all our old substitutions:
I = \int \frac{1+v^2}{v}*\frac{2}{1+v^2}dv = \int \frac{dv}{v} = \log{|v|} + C = \log{|\tan{\frac{u}{2}}|}+C = \log{|\tan{\frac{\cos^{-1}\tau}{2}}|}+C.

The function tau we worry so much about.  Looks pretty much like sigma.

The function tau we worry so much about. Looks pretty much like sigma.

Using the half angle formulae that everyone of course remembers and dropping the C (remember, there’s already a constant on the other side of this equation), this simplifies to (finally)
I = \log{|\frac{\sqrt{1-\tau^2}}{1+\tau}|}.  Subbing back in and solving for \tau(x) gives, as desired,

\tau(x) = \frac{e^{2x}-1}{1+e^{2x}}.


Hamming codes in python

As a bit of an experiment I’ve started up a GitHub repository with some code from this blog here [LATE EDIT: you need numpy (and might as well get scipy and matplotlib) to run this].  In particular, this week I’ve implemented a script in Python which contains a script for generating Hamming matrices (one for encoding, one for parity check, one for decoding), which constitutes as much of a proof of existence as I’m willing to get into.

There is also a “Message” class inside where you can play around with how many corrupted bits your message might have versus how long your message is versus the size of Hamming matrix you use.  The defaults are set at 3 corrupt bits in a set of 4000 bits, with the error checking done with the Hamming(7,4) code.  You can run this by downloading and running “python” from the command line.

The specific directory with this project is located inside the “HammingCodes” folder.  Possible experiments with this code later, but now I need sleep!

Hamming Codes III

We are now in a position to actually write down the Hamming (7,4) code.  As explained in the previous three entries, we want some way of both detecting that an error occurred and of correcting that error.  As with many problems, this is much easier once you know it is possible (thanks, Richard Hamming!)  In particular, last time we proved that it is necessary to send a 7-bit message in our scheme of correcting errors for a 4-bit message, but is it sufficient?  An easy way to deduce the solution in this case (and then to see the pattern that proves the general case) is to require that our parity check detects the location of any 1-bit error.

Specifically, someone will receive a (possibly corrupt) 7-bit string v, and we want a matrix that will output 0 if all is well, or communicate a number 1-7 if one of the bits is corrupt.  It takes 3 bits to communicate 8 numbers (8 = 23), so our parity matrix H (following wikipedia’s notation) must be 3 x 7.  To make it easy to remember, we’ll even have column j be the binary representation for j.  More directly:

I am so fed up with wordpress' latex compiler...

I am so fed up with wordpress’ latex compiler…

Now we can work backwards (again, we’re assuming an answer exists),  and for reasons that may be clear later, we’ll set our three parity bits to be the three “singleton” columns of H, so that the “coded” message v = (p1,p2,d1,p3,d2,d3,d4).  Then if everything goes smoothly, we have that Hv0, so that

0 = p1+d1+d2+d4
0 = p2+d1+d3+d4
0 = p3+d2+d3+d4.

GNotice also that if one bit gets corrupted, this is equivalent to sending the message v+ej, and
G(v+ej) = 0+gj,
where gj is the jth column of G (which is the binary representation of the number j). Hence multiplying a message with a 1-bit mistake gives us the index of the corrupt bit.
But this tells us how we must encode our message m = (d1,d2,d3,d4) as well.  We want a matrix G so that G= (p1,p2,d1,p3,d2,d3,d4).   But the above gives us a linear condition for what this matrix must look like (and an explanation for why the parity bits are all “singletons”).


Finally we want to “decode” our message, which is also straightforward at this point, since it will just be the matrix which returns the non-parity bits from the encoded message.

As a review, and to wrap everything up:

1. Start with a message m = (1,0,0,1)

2. Transmit the message v = Gm = (0,0,1,1,0,0,1)
3. Check the parity by confirming that Hv = (0,0,0).
4. Decode the message Rv = (1,0,0,1), as desired.

Wikipedia’s also got an explanation involving Venn diagrams which I did not much like, though I may write a bit about Venn diagrams themselves in the future…

Hamming Codes II

No reasonable pictures for today! Have a cactus!

No reasonable pictures for today! Have a cactus!

So now we know the general idea behind Hamming error correcting codes, and how one might construct and visualize hypercubes.  Now suppose we want to encode, in an error correcting way, 4 bits.  Recall that this means finding a hypercube with enough vertices that we can designate 16 (=42) of them, *and* pick those 16 so that no two “symbol vertices” are closer than distance 2.  This means each “symbol vertex” has a disjoint neighborhood of distance 1.

A back of the envelope calculation gives a necessary condition to allow this: an n dimensional hypercube has 2n vertices and each vertex has n neighbors (so a “symbol neighborhood” takes up n+1 vertices).  Hence it is necessary that n satisfy

16*(n+1) ≤ 2n.

More generally, to encode m bits in n bits, we require 2m*(n+1) ≤ 2n.  Note without proof (for now, hopefully soon by construction) that this is also a sufficient condition.  Interesting from an efficiency point of view is seeing where equality exists.

Taking logs (base 2) of both sides, and realizing that log(n+1) is an integer only when (n+1) is a power of 2, so m = n-log(n+1), or, letting :=2k-1, m = 2k-1 – k.  In fact, one may (and we may) describe a whole class of Hamming (2k-1, 2k-1 – k) codes.

A brief foray into hypercubes

The discussion on error-correcting codes is about to get a little hypercube heavy (never a good state to be in), and a brief foray into how to construct/visualize them may be in order.  I’ll take the liberty of defining an n-dimensional (unit) hypercube as a shape whose

1. vertices are located at coordinates made of entirely 0’s and 1’s, and
2. has an edge wherever two vertices are distance 1 apart.

This would take two more things to make a complete definition: I should let you move the cube about however you like (no reason to have it fixed is space), and I should tell you about the 2-D faces, 3-D hyperfaces, and so on up to the (n-1)-D hyperfaces.  You can use that first one if you want, but I’ll ignore the second.  I think I did a good job of defining what’s called the 1-skeleton of a very particular n-dimensional hypercube.

Two sample vertices representing (1,1,0,0) and (1,0,1,0). These will not be connected in the hypercube.

Two sample vertices representing (1,1,0,0) and (1,0,1,0). These will not be connected in the hypercube.

Anyways.  Wednesday had pictures of a 2-cube and 3-cube.  What about the 4-cube?  Or 5-cube?  It will help to consider this all from a less analytic, more graph theory (or, if that sounds technical, “pictures and problem solving”) point of view.  Condition 1 for a hypercube says that there are 2n vertices, all the binary sequences of length n. Then condition 2 says that two vertices are connected if you can change one vertex’s binary sequence to the other’s by changing a single bit.   We’ll go one step further, by just coloring particles on a line: white for 0, black for 1 (this is something of a homage to my undergraduate thesis advisor’s work with polyhedra).

The only two things left to do are to draw the vertices and arrange them in nice ways (that is, fine a “nice” projection).


A projection of a 4-cube, with vertices labelled.

Below is the image from the wikipedia 5-, 6-, and 7- cubes.  Note the some of the vertices are laying on top of eachother.  I’ll leave it as an exercise to the reader to label these vertices with the appropriate binary sequences.








Hamming Error-Correcting Codes

This is part 1 in hopefully a series on the Hamming error-correcting codes, to be continued on Friday.  The problem is this:  suppose you wish to send a message composed entirely of 0’s and 1’s, but suspect part of it might get “corrupted”.  More concretely, I am going to send you the message


but there’s a chance that you receive instead something like 00101 (so the second bit was flipped), or 01111 (so the fourth bit was flipped).  Is there any way to control for this?  An idea that is both natural and wrong is to send two copies of the message.  In this case, I’d send along


Now if a bit gets flipped (note that there are now more bits that could be flipped), you can see exactly where — perhaps you receive

where the non-matching bits are highlighted.  The problem here is that you cannot tell whether the offending bit should be interpreted as a 0 or a 1 (which might be ok if you are only interested in error detection).  But if we want to be able to correct the error, we’ll have to be a little more clever.  As a very broad outline of our strategy, we are going to take each of the symbols we would like to transmit and encode them into a “big enough” space so that each symbol has a buffer zone around it to account for any errors in transmission.

In some sense, this is a familiar strategy: on touchscreen keyboards you do not have to hit the exact spot for a key to work, only be close enough.  In this case, the “symbols” I’m referring to are letters, and the “buffer zone” is the area of the screen you can touch so the computer recognizes that you are trying to touch this particular letter.

The trick here (and what is sort of confusing about this) is that the symbols we wish to transmit are bits, and the space that we will be embedding our symbols into will also be bits (rather than a shiny retina display!) As a hint of things to come, I’ve displayed below a representation of the space of 2-bit strings (which will not be big enough to detect 1-bit errors), and a representation of the space of 3-bit strings, which is of perfect size to detect 1-bit errors in a 1 bit message.

All 2-bit strings, with a line connecting them if the distance between them is 1, so you can get to one from the other by switching a single bit.




All 3-bit strings, connected whenever the distance between them is 1. Notice that the points (0,0,0) and (1,1,1) have disjoint balls of radius 1 around them. This will be the basis of the error correcting code, by agreeing that anything received in the red “ball” will be interpreted as the symbol (0,0,0), and anything received in the blue “ball” with be interpreted as (1,1,1). Then this can be turned into a 1 bit error correcting code.



Theme of the day is networks, and let’s even say self-organizing networks.  I expect to flesh this out in the future, but I have some pretty pictures now, so that’s a pretty good start.

A graph with 30 cliques of various sizes.

A general problem will be to take a graph (nodes and edges, etc), and identify sub-populations of that graph.  Two great questions, (with some reasonable answers):

– What kind of graph?
We could restrict the discussion to nodes connected by edges, with only one edge allowed between two nodes, and no self connections (this might be a “friend” graph in facebook), but perhaps it would be more useful to look at a directed graph (like a “follow” graph in Google+) where a connection is only one-way.  A weighted graph could let you have strong or weak connections – maybe you are interested in a “normalized” facebook friend graph, where the “strength” of a friendship (it strikes me that the quotation marks might just as well be around “friendship”) depends on how many friends each person has.  A bipartite graph might model facebook “likes”, since a person can like a product (but not other people).

The adjacency matrix for the above graph.

-What does it mean to “identify sub-populations”?
I hope to have some images below that help to give an intuitive understanding of this, and maybe flesh out that understanding in a later post.  This is actually a hard question to answer — must every node belong to some sub-population?  Must a node belong to only one sub-population?  In the first case, where does the crunchy peanut-butter eater belong, in a land of creamy peanut-butter eaters?  In the second case, where does a triathlete belong in a population of runners, swimmers and bikers?

The above matrix with the columns/rows acted on by the same permutation. This is what one might expect their network data to look like, and you’d hope to find the block structure above (or find an indication that such structure does not exist).

Anyways, in order to investigate some of this I’ve got some basic scripts set up that:
1) Generate a graph with 30 cliques of various sizes, sparsely connected to each other
2) Print an image of this graph
3) Print the adjacency matrix of the graph
4) Scramble the matrix, and print that
5) Run the affinity propagation algorithm from sklearn on the scrambled matrix, and print that

If affinity propagation was perfect, it would return a result very close to the original block matrix (possibly with the blocks permuted).

The results of affinity propagation. Notice that it identifies many more clusters than originally existed.  Also notice that this is also an adjacency matrix for the above graph.