Theme of the day is networks, and let’s even say self-organizing networks.  I expect to flesh this out in the future, but I have some pretty pictures now, so that’s a pretty good start.

A graph with 30 cliques of various sizes.

A general problem will be to take a graph (nodes and edges, etc), and identify sub-populations of that graph.  Two great questions, (with some reasonable answers):

– What kind of graph?
We could restrict the discussion to nodes connected by edges, with only one edge allowed between two nodes, and no self connections (this might be a “friend” graph in facebook), but perhaps it would be more useful to look at a directed graph (like a “follow” graph in Google+) where a connection is only one-way.  A weighted graph could let you have strong or weak connections – maybe you are interested in a “normalized” facebook friend graph, where the “strength” of a friendship (it strikes me that the quotation marks might just as well be around “friendship”) depends on how many friends each person has.  A bipartite graph might model facebook “likes”, since a person can like a product (but not other people).

The adjacency matrix for the above graph.

-What does it mean to “identify sub-populations”?
I hope to have some images below that help to give an intuitive understanding of this, and maybe flesh out that understanding in a later post.  This is actually a hard question to answer — must every node belong to some sub-population?  Must a node belong to only one sub-population?  In the first case, where does the crunchy peanut-butter eater belong, in a land of creamy peanut-butter eaters?  In the second case, where does a triathlete belong in a population of runners, swimmers and bikers?

The above matrix with the columns/rows acted on by the same permutation. This is what one might expect their network data to look like, and you’d hope to find the block structure above (or find an indication that such structure does not exist).

Anyways, in order to investigate some of this I’ve got some basic scripts set up that:
1) Generate a graph with 30 cliques of various sizes, sparsely connected to each other
2) Print an image of this graph
3) Print the adjacency matrix of the graph
4) Scramble the matrix, and print that
5) Run the affinity propagation algorithm from sklearn on the scrambled matrix, and print that

If affinity propagation was perfect, it would return a result very close to the original block matrix (possibly with the blocks permuted).

The results of affinity propagation. Notice that it identifies many more clusters than originally existed.  Also notice that this is also an adjacency matrix for the above graph.


Volume of Balls

I like the following question:

“Which dimension has the ‘biggest’ unit ball?”

To define a few words, a “ball” of radius r about a point x in a metric space with metric d (in particular, in R^n with standard metric) is defined by B_X(x,r) = \{ y \in X: d(x,y) \leq r \}.  So the unit ball in Euclidean space is the collection of points of distance less or equal to one from, say, the origin.  If you think I am being overly careful with this definition, you’re right!

1-, 2-, and 3-dimensional balls.

By “biggest”, I will mean the n-dimensional volume.  So if we define V(n) to be the volume of the unit ball in R^n, then V(1) = 2 (the 1-dimensional ball is a line), V(2) = \pi r^2 \approx 3.14, and V(3) = \frac{4}{3} \pi r^3 \approx 4.19.  So the volumes are getting bigger, and maybe we have a good question on our hands.  But now what’s the volume of a four dimensional sphere?

It turns out that one can derive the formula

V(n) = \frac{\pi^{n/2}}{\Gamma(n/2+1)},

where \Gamma is the gamma function, a generalization of the factorial (!) function!  In particular, \Gamma(n+1) = n! for integers n.  Then, for even dimensions,

V(2n) = \frac{\pi^{n}}{n!}.

Now so long as we are only worried about integer dimensions, we can use the half-integer identity for the gamma function:
\Gamma(n+1/2) = \frac{(2n)!}{4^nn!}\sqrt{\pi},
to get a similar formula for odd dimensions:
V(2n-1) = \frac{4^n\pi^{n-1}n!}{(2n)!}.

Then a quick calculation gives:

V(1) = 2.00
V(2) = 3.14
V(3) = 4.19
V(4) = 4.93
V(5) = 5.26
V(6) = 5.17
V(7) = 4.72
V(8) = 4.06
V(9) = 3.30.

We note that the denominator for both odd and even dimensions grows much faster than the numerator to conclude (in academia we could only nod our head suggestively so far at this conclusion, but we’re in industry now!) that the 5-dimensional ball has the most volume.

As an aside, and possibly a subject for a later post, if we allow for fractional dimensions (everything in V(n) is defined for n a real number, not just an integer), then the maximum value of V is at about 5.2569, where the unit ball has volume around 5.2778.

The function V(n), plotted for real numbers n, with max around (5.26, 5.28).

Arrogance Personified

Well, since my last post, I successfully defended (footage from after the defense), found a job as an analytics developer at a startup in Austin, TX, and even moved out here.  Feels great to be out of Houston and — despite the Texas summer — I’ve started playing outside again, running, climbing and biking.  I’ve somewhat competitive plans of doing well at next February’s Austin marathon, as well as some mountain trips further in the future…

For now though, having tons of fun at the job, playing with gigantic data sets, and learning all sorts of new things.  I’d expect more posts about manifold learning and extracting structure from large networks in the future, but for today a track and field video I’ve loved for a long time, featuring Bill McChesney, John Treacy and Steve Ovett.