Wednesday, May 28, 2008

Common ancestors: analysis

After estimating via simulation the kinship distribution K(n) of our simple population model, let us try to solve the problem analytically. As was stated before, K(n) = 0 for n odd, so we will concentrate only on even values of n. First we will calculate K(n) recursively by solving the cases K(0), K(2) and the inductive step K(n + 2). In what follows N is the (constant) size of any generation and x and y denote arbitrary individuals of the same generation. We also define

s(n) := n·r(n)/∑_m m·r(m),
S := ∑_n n·s(n),
D(n) := 1 − (K(0) + K(2) + ··· + K(n)) = P(k(x,y) > n).

s(n) expresses the probability that an individual has n siblings, herself included, as proved in a prior entry.

K(0): this value is simply the probability that x and y are the same individual:

K(0) = P(x = y) = 1/N.

K(2): this is the probability that x and y are siblings. If x and her siblings add up to n, this probability is (n − 1)/N, so in general

K(2) = P(x and y are siblings) =
= ∑_{n > 0} P(x and her siblings add up to n) P(y is one of x's siblings) =
= ∑_{n > 0} s(n)(n − 1)/N = (S − 1)/N.

K(n + 2), n ≥ 2: this value is the probability that k(x,y) = n + 2, i.e. x and y have a youngest common ancestor (n + 2)/2 generations before theirs. We can express this probability like this:

P(k(x,y) = n + 2) = P(k(x,y) = n + 2 | k(x,y) > n) · P(k(x,y) > n) =
= P(k(x,y) = n + 2 | k(x,y) > n) · D(n),

where we have just conditioned the event "k(x,y) = n + 2" on the enclosing event "k(x,y) > n".

Now, from the very definition of the kinship proximity function, k(x,y) > n if and only if

k(m(x),m(y)) > n − 2,
k(m(x),f(y)) > n − 2,
k(f(x),m(y)) > n − 2,
k(f(x),f(y)) > n − 2;


hence

P(k(x,y) = n + 2 | k(x,y) > n) =
= P([k(m(x),m(y)) = n | k(m(x),m(y)) > n − 2] or
[k(m(x),f(y)) = n | k(m(x),f(y)) > n − 2] or
[k(f(x),m(y)) = n | k(f(x),m(y)) > n − 2] or
[k(f(x),f(y)) = n | k(f(x),f(y)) > n − 2]).

Without further justification (though the final results are consistent), we assume that these clauses are statistically independent:

P(k(x,y) = n + 2 | k(x,y) > n) =
= 1 − P([k(m(x),m(y)) ≠ n | k(m(x),m(y)) > n − 2] and
[k(m(x),f(y)) ≠ n | k(m(x),f(y)) > n − 2] and
[k(f(x),m(y)) ≠ n | k(f(x),m(y)) > n − 2] and
[k(f(x),f(y)) ≠ n | k(f(x),f(y)) > n − 2]) =
= 1 − P(k(m(x),m(y)) ≠ n | k(m(x),m(y)) > n − 2) ·
· P(k(m(x),f(y)) ≠ n | k(m(x),f(y)) > n − 2) ·
· P(k(f(x),m(y)) ≠ n | k(f(x),m(y)) > n − 2) ·
· P(k(f(x),f(y)) ≠ n | k(f(x),f(y)) > n − 2).

As kinship proximity is not affected by the individuals' sex (beyond level 0), the four factors have the same value and the expression is equivalent to:

P(k(x,y) = n + 2 | k(x,y) > n) =
= 1 − (1 − P(k(x',y') = n | k(x',y') > n − 2))^4,

where x' and y' are arbitrary individuals of the generation preceding x and y. Using basic probability properties we have

P(k(x',y') = n | k(x',y') > n − 2) =
P(k(x',y') = n and k(x',y') > n − 2)/P(k(x',y') > n − 2) =
P(k(x',y') = n)/P(k(x',y') > n − 2) =
K(n)/D(n − 2),

and thus

K(n + 2) = D(n)(1 − (1 − K(n)/D(n − 2))^4) =
= D(n)(1 − ((D(n − 2) − K(n))/D(n − 2))^4) =
= D(n)(1 − (D(n)/D(n − 2))^4) =
= D(n) − D(n)^5/D(n − 2)^4.

To summarize, the recursive calculation of K(n) can be performed as follows:

  • K(0) ← 1/N
  • D(0) ← (N − 1)/N
  • K(2) ← (S − 1)/N
  • D(2) ← (N − S)/N
  • K(n + 2) ← D(n) − D(n)^5/D(n − 2)^4
  • D(n + 2) ← D(n) − K(n + 2)

We can also provide a closed formula for K(n). The recursive equation

K(n + 2) = D(n) − D(n)^5/D(n − 2)^4

can be rewritten as

D(n) − D(n + 2) = D(n) − D(n)^5/D(n − 2)^4,

i.e.

D(n + 2) = D(n)^5/D(n − 2)^4.

If we define

d(n) := ln D(2n)

the equality can then be expressed as

d(n + 2) = 5d(n + 1) − 4d(n),

which is a standard second-order linear recurrence equation. The roots of the associated characteristic polynomial are 1 and 4, so the classical theory of linear recurrence equations tells us that d(n) can be expressed as

d(n) = a + b·4^n,

or, turning to D(n),

D(n) = A·B^(2^n),

where A and B can be determined by using the known values D(0) and D(2). If we solve this and express K(n) in terms of D(n) we have:

K(0) = 1/N,
K(n) = A·(B^(2^(n−2)) − B^(2^n)), n ≥ 2,
A = (N − 1)^(4/3)/(N·(N − S)^(1/3)),
B = ((N − S)/(N − 1))^(1/3),

which is the closed expression we were after. This formula matches perfectly the experimental results we had previously obtained via simulation. A further confirmation that the analysis is consistent comes from the fact that ∑ K(n) = 1, as it must be for a probability distribution: ∑ K(n) = 1 − lim_{n → ∞} D(n), and D(n) = A·B^(2^n) tends to 0 as n → ∞ because B = ((N − S)/(N − 1))^(1/3) < 1; this follows from S > 1, which holds since S ≥ R and R = 2 by hypothesis, R being the average number of children per individual.

Saturday, May 24, 2008

Making choices

In his talk The Paradox of Choice, Barry Schwartz comments on an experiment conducted around speed dating: two speed dating evenings were set up; in the first one each participant had 12 dates, and only 6 in the second event; results showed that more matches were made in the second case despite the lower number of dates. Schwartz explains the paradox as follows: each person evaluates their potential partners according to a set of traits, like sex appeal, empathy, intelligence and so on. Fully evaluating 12 candidates is a harder task than evaluating only 6, so in the first case the participants simplify their evaluation strategy and focus on only one personal trait, usually sex appeal, which results in worse choices than fully evaluating fewer candidates.

Does this explanation stand up to formal analysis? The hypothesis is that full evaluation of a few candidates gives better results than partial evaluation of more candidates. At first sight this is not entirely obvious, since there is a trade-off between evaluation effort per candidate and number of candidates evaluated: it could well be that sampling more candidates compensates for the poorer evaluation.

Let us build a very simple mathematical model of the situation: we have N candidates c_1,...,c_N to choose from, each one scoring on M different evaluation traits so that s_ij is the score of c_i on the j-th trait. The global score of c_i is then s_i := ∑_j s_ij. Let us assume that the different s_ij are independent random variables following a normal distribution with mean 0 and variance 1. We define a selection strategy parameterized according to an integer 1 ≤ t ≤ M:

Take N/t candidates at random and choose among these one with maximum partial score s'_i := s_i1 + ··· + s_it.

So, for t = 1 the strategy consists in selecting among all the candidates one with maximum score in the first trait; for t = 2, we select half the candidates at random and choose among them based on their first two traits; for t = M, the evaluation is based on all the traits, but only for a fraction 1/M of the candidate pool. The number of trait evaluations for any value of t remains constant at (N/t candidates)·(t traits per candidate) = N, which is an expression of the expected trade-off between evaluation depth and candidate coverage. Schwartz's hypothesis states that the best strategy is t = M.

I have written a Boost-based C++ program to simulate the model for N = 60, M = 5. The figure shows the average score of the candidate chosen by the different selection strategies for t = 1,...,5, along with the average result for the optimum choice.

So, Schwartz's hypothesis is correct (at least for this scenario): the best strategy is to do full evaluation at the expense of not covering all the available candidates. We can also analyze the results of the simulation by depicting, for every t, the probability p_t(n) that the strategy with that t chooses the n-th best candidate.

With t = 5, the probability that we actually select the optimum candidate is 20% (obviously, since we are fully evaluating 20% of the candidate pool), while for t = 1 the probability decreases to 12%.

From this experiment, we can extract an interesting piece of advice for everyday life: when confronted with the task of selecting an option among a wide array of candidates, it is better to carefully study a small subset of candidates than to skim through them all.

From a broader perspective, the strategies we have studied are only a subset of the possible selection schemes constrained to using a fixed amount N of trait evaluations. Seen this way, there are in fact more efficient strategies than these, as we will see in a later entry.

Tuesday, May 20, 2008

Functional characterization of C++ template metaprogramming

C++ template metaprogramming is effectively a tiny programming sublanguage of a functional nature. Let us characterize it from the point of view of functional programming:

Purity. C++ TMP is pure, i.e. without side effects, except for one detail commented on below. The unit of computation is template instantiation, which can only result in the definition of embedded types and numerical constants and in further instantiations of other templates: none of these processes depends on the particular time at which it is triggered or on its relative order of execution with respect to other computations. The only observable side effects are the potential warning messages issued by the compiler during the compilation process, but these do not affect the computations of the metaprogram itself.

Strictness. Strictness is associated with call-by-value semantics (predominant in imperative languages and in many functional languages such as ML or Scheme), whereas non-strict or lazy languages evaluate arguments only as needed (the main representative of this class being Haskell). C++ TMP can be implemented either way, or even in a mixed fashion: the de facto standard library for C++ TMP, Boost.MPL, chooses call-by-value as the default (most functions expect their arguments to be fully evaluated at the point of invocation), but some lazy constructs are provided as well. We enjoy this freedom of choice at the language level because computation (instantiation of some given type T) must be invoked explicitly (in Boost.MPL, by referring to the embedded T::type type): merely mentioning T does not cause T to be instantiated.

Typing. In C++ TMP, the run-time objects of computation or values are C++ types. What are the types of C++ TMP then? It turns out there are no types in C++ TMP if by type we mean static sets of values along with compatible functions. A C++ TMP function is a class template dependent on n arguments like this:

template<Arg1,...,Argn> struct F;

where each Argi is any of the following:

  • a type parameter (typename),
  • a non-type parameter (an integral compile-time constant),
  • a template template parameter.

Non-type template parameters are seldom used in C++ TMP, and template template parameters are usually dropped in favor of equivalent representations based solely on type parameters (for instance, by preferring metafunction classes over metafunctions), so the language is virtually untyped. C++ TMP implements a form of dynamic duck typing: metafunctions assume that their arguments have some specific embedded types and values, and if this is not the case compilation (which is the run-time phase of C++ TMP) fails. For instance, the arithmetic metafunctions of Boost.MPL only work with types representing numerical values; in general, types, type classes etc. only exist in the programmer's mind and are not enforced by the language. The absence of types in C++ TMP is a minor nuisance only, since static typing is mostly about anticipating run-time errors during the compilation phase, and C++ TMP has no actual compilation phase (compilation is C++ TMP run time). With no types in C++ TMP, questions about polymorphism in the language mostly do not apply, although a form of metafunction overloading can be exercised via partial specialization based pattern matching.

Pattern matching. This is one of the hallmarks of functional languages like Haskell. Pattern matching works over algebraic types whose structure is defined recursively from a set of syntactic constructors. Not only is C++ TMP able to implement pattern matching; in fact, pattern matching is its basic computational device. Recursive template instantiations play the role of algebraic values, and their internal structure can be disentangled by means of partial template specialization. For instance:

struct zero;

template<typename N>
struct suc;

template<typename N,typename M>
struct sum;

template<typename M>
struct sum<zero,M>
{
  typedef M type;
};

template<typename N,typename M>
struct sum<suc<N>,M>
{
  typedef suc<typename sum<N,M>::type> type;
};

So, C++ template metaprogramming can be regarded as a pure, non-strict, untyped functional language with pattern matching.

Friday, May 16, 2008

Word frequency and length

What is the relationship between the length of a word and the frequency with which it appears in texts? At first thought it is apparent that longer words tend to be rarer, but we will try to estimate this relationship numerically.

A little Boost-based C++ program has been used to count the occurrences and length of the 1,000 most frequent words in two texts in different languages: Dickens' Our Mutual Friend and Leopoldo Alas' La Regenta. The results are shown in the figures above, where word frequency is plotted in logarithmic scale against word length.

Fig. 1: Word frequency vs. word length, Our Mutual Friend.

Fig. 2: Word frequency vs. word length, La Regenta.

The red line corresponds to the median frequency value as a function of word length: as expected, this frequency decreases as length grows, and seems to do so in an approximately exponential manner (or linear, if frequency is plotted on a logarithmic scale as above).

Another interesting way to look at this information is by estimating the entropy of the words with length n, in bits per character, for each value of n.

The exact formula for this function is

f(n) := (1/n)∑_w h(occurrences(w)/∑_v occurrences(v)),

with w, v ranging over words of length n and h(x) defined as:

h(x) := −x·log_2 x.

The fact that the information per character diminishes as word length grows is explained by the relative sparsity of longer words, which is tantamount to saying that word spelling is more redundant as words grow longer.

Monday, May 12, 2008

Common ancestors: simulation

We wish to study the kinship distribution of a population in the very simplified situation in which the population is closed, constant in size and stratified (split up into perfectly separate generations such that mating occurs only between members of the same generation). We choose to model the fertility distribution r(n) of the population as a Poisson distribution with mean 2 (since the population size is constant). Intersibling incest is banned, and for the sake of simplicity we also assume that strict monogamy applies, i.e. siblings always share mother and father.

Before attempting to calculate the kinship distribution of such population analytically, I have written a small C++ program to estimate it via simulation. You will need Boost to compile the code. The simulation proceeds as follows:

  • A pool of individuals is maintained, with new offspring being inserted after their parent generation.
  • At each iteration, females of the last generation are traversed and mated with available males at random. The couple is assigned a number of children randomly generated according to a Poisson distribution with mean 2.
  • After a number of iterations the kinship distribution of the population is assumed to stabilize. The kinship histogram is calculated from a random sample of pairs of members of the last generation.

The figure shows the results for generation sizes N = 1,000, 10,000 and 100,000 with simulations of 100 generations each. We only depict values of K(n) for n even, since, because of the stratification property, K(n) = 0 if n is odd.

The kinship distribution K(n), n even, initially grows exponentially (in fact, as a·2^n, as we will see when we derive analytical expressions for the function) until reaching a maximum value between 0.4 and 0.5, from where it decays abruptly. The most probable kinship relationships between arbitrary individuals for N = 1,000, 10,000 and 100,000 are fifth, sixth and eighth cousins, respectively.

In a later entry we will try to determine K(n) analytically from basic probability theory.

Thursday, May 8, 2008

Imperative C++ metaprogramming

C++ template metaprogramming is fundamentally a functional business, as it deals with immutable data like types and numeric compile-time constants. How can we go about introducing some degree of imperativeness in C++ metaprogramming? The one thing that distinguishes imperative programming is the existence of variables, entities whose associated value can change over time. In a sense, a variable can be seen as a function of time or the program counter; this can be modeled as the following C++ class template:

template<std::size_t Id,std::size_t T>
struct var
{
  typedef typename var<Id,T-1>::type type;
};

var implements an Id-indexed family of variables {v_Id} dependent on the time argument T. By the way var is defined, the value of v_Id at time t + 1 is the same it had at time t; this is as expected, since a variable does not change value over time unless the program explicitly changes it. Setting a new value for a variable in our model can only be done by specializing var as follows:

// set var(0) to 5 starting at time 69
template<>
struct var<0,69>
{
  typedef boost::mpl::int_<5> type;
};

(We are using MPL constants here.) The problem with this approach is that classical template metaprogramming cannot be used to perform a variable-setting operation: all that template metaprogramming reduces to is template instantiation, which cannot possibly result in a non-predefined specialization. So we need to step back and use preprocessor metaprogramming to effectively use var:
template<typename T> struct unparens_helper;
template<typename T> struct unparens_helper<void(T)>{typedef T type;};
#define UNPARENTHESIZE(f) unparens_helper<void f>::type

#define VAR(Id) var<Id,__COUNTER__>::type
#define LET(Id,x) template<> struct var<Id,__COUNTER__>:UNPARENTHESIZE(x){};
The code above uses the MSVC-specific macro __COUNTER__ to easily simulate a program counter. As the second argument of LET is likely to contain commas, it is required that it be enclosed in parentheses so as not to fool the preprocessor: UNPARENTHESIZE implements some magic to internally remove these parentheses. The computation
v0 = 2;
v1 = 3;
v1 = v0 + v1;
is implemented like this (each macro expansion consumes fresh __COUNTER__ values, so later VAR references resolve to the latest setting):

LET(0,(boost::mpl::int_<2>))
LET(1,(boost::mpl::int_<3>))
LET(1,(boost::mpl::plus<VAR(0),VAR(1)>))

A complete program exercising these ideas is provided. This method of implementing imperativeness inside C++ metaprogramming is less than ideal, as it forces us to do the non-functional part in the preprocessor realm; this makes it impossible, for instance, to define an MPL metafunction whose implementation is done imperatively.

Monday, May 5, 2008

Innovation and evolution

Innovation is a buzzword in vogue lately in the business and technology world: much material has been written on how to promote, foster and drive innovation. I would like to suggest some analogies between the innovation process and evolution as modeled by computational techniques such as genetic algorithms.

Innovation consists in the introduction of new methods, processes or technologies with the goal of improving some business metric (revenue, efficiency, etc.). Thus, the innovation process can be regarded as the optimization of a metric function across a space of possible solutions. Innovation can be incremental, if the improvements are based on an already established solution, or disruptive, when the solution tried completely departs from preexisting practices: it is generally assumed that incremental innovation ultimately brings marginally decreasing benefits, and that it is only through disruptive innovation that real quantum leaps can be made, even if it can take the new solution some time to catch up with established ones. Disruptive innovation is usually associated with a radical change in the underlying technology (for instance, digital vs. chemical photography).

On the other hand, an evolutionary system implicitly optimizes some metric function (individual fitness in a given ecosystem) through extensive search across the genetic space. Genetic algorithms mimic this model for the more mundane task of looking for global maxima of numerical functions. The solution space is sampled by maintaining a pool of candidate solutions that are made to evolve according to:

  • Selection of fitter solutions.
  • Mating between solutions, using crossover (mixing of parental genes) and mutation (random alteration of the genetic material).

Allegedly, crossover is efficient at fine-tuning solutions when the population has reached an evolutionary niche, while mutation allows the population to escape from local maxima and explore new areas of the solution space: the combination of these two aspects yields a mechanism able to efficiently explore vast fitness landscapes of complex shape.

The similarities between these two worlds are obvious. Innovation also takes place in an a priori complex solution space, and different solutions thrive or perish according to their market fitness. Incremental innovation can be compared to the local exploration performed by crossover, while disruptive innovation is akin to the kind of far-reaching displacements achieved primarily by mutation. If we are to take this analogy seriously, genetic algorithm techniques can provide us with some advice for improving the innovation process:

  • Maintain a large pool of "beta" solutions to test for fitness.
  • Quickly discard less promising solutions and replace them with new ones, keeping a fast turn-around cycle. On the other hand, you must allow for a sufficiently large survival rate so that solutions can mature and some diversity is maintained.
  • Incremental innovation is achieved by combining aspects of similar preexisting solutions (crossover).
  • Disruptive innovation results from combining distant solutions and/or radically altering some of the aspects that define a candidate solution. Diversity is key to this process, which could explain why monolithic markets tend to hamper disruptive innovation.