Theme of the day is networks, and let’s even say self-organizing networks. I expect to flesh this out in the future, but I have some pretty pictures now, so that’s a pretty good start.
A graph with 30 cliques of various sizes.
A general problem will be to take a graph (nodes and edges, etc), and identify sub-populations of that graph. Two great questions, (with some reasonable answers):
– What kind of graph?
We could restrict the discussion to nodes connected by edges, with only one edge allowed between two nodes, and no self connections (this might be a “friend” graph in facebook), but perhaps it would be more useful to look at a directed graph (like a “follow” graph in Google+) where a connection is only one-way. A weighted graph could let you have strong or weak connections – maybe you are interested in a “normalized” facebook friend graph, where the “strength” of a friendship (it strikes me that the quotation marks might just as well be around “friendship”) depends on how many friends each person has. A bipartite graph might model facebook “likes”, since a person can like a product (but not other people).
The adjacency matrix for the above graph.
-What does it mean to “identify sub-populations”?
I hope to have some images below that help to give an intuitive understanding of this, and maybe flesh out that understanding in a later post. This is actually a hard question to answer — must every node belong to some sub-population? Must a node belong to only one sub-population? In the first case, where does the crunchy peanut-butter eater belong, in a land of creamy peanut-butter eaters? In the second case, where does a triathlete belong in a population of runners, swimmers and bikers?
The above matrix with the columns/rows acted on by the same permutation. This is what one might expect their network data to look like, and you’d hope to find the block structure above (or find an indication that such structure does not exist).
Anyways, in order to investigate some of this I’ve got some basic scripts set up that:
1) Generate a graph with 30 cliques of various sizes, sparsely connected to each other
2) Print an image of this graph
3) Print the adjacency matrix of the graph
4) Scramble the matrix, and print that
5) Run the affinity propagation algorithm from sklearn on the scrambled matrix, and print that
If affinity propagation was perfect, it would return a result very close to the original block matrix (possibly with the blocks permuted).
The results of affinity propagation. Notice that it identifies many more clusters than originally existed. Also notice that this is also an adjacency matrix for the above graph.