One of the requirements of our objective versus subjective rationality model is to have learning agents that act rationally on their subjective representation of the world. The easiest parameter to consider learning is the probability of agents around you cooperating or defecting. In a one-shot game without memory, your partner cannot condition their strategy on your current (or previous) actions directly. However, we don’t want to build knowledge of this into the agents, so we will allow them to learn the conditional probabilities $p$ of seeing a cooperation if they cooperate, and $q$ of seeing a cooperation if they defect. If the agents’ learning accurately reflects the world then we will have $p = q$.
For now, let us consider learning $p$; the other case will be analogous. In order to be rational, we will require the agent to use Bayesian inference. The hypotheses will be $H_x$ for $0 \leq x \leq 1$, meaning that the partner has a probability $x$ of cooperation. The agent’s mind is then some probability distribution $f(x)$ over the $H_x$, with the expected value of $f$ being $p$. Let us look at how the moments of $f$ change with observations.
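Suppose we have some initial distribution $f_0(x)$, with moments $m_{0,k} = \int_0^1 x^k f_0(x)\,dx$. If we know the moments $m_{t,k}$ up to step $t$ then how will they behave at the next time step? Assume the partner cooperated. Since a partner of type $H_x$ cooperates with probability $x$, Bayes’ rule gives:

$$f_{t+1}(x) = \frac{x f_t(x)}{m_{t,1}} \quad\Rightarrow\quad m_{t+1,k} = \frac{m_{t,k+1}}{m_{t,1}}$$

If the partner defected:

$$f_{t+1}(x) = \frac{(1 - x) f_t(x)}{1 - m_{t,1}} \quad\Rightarrow\quad m_{t+1,k} = \frac{m_{t,k} - m_{t,k+1}}{1 - m_{t,1}}$$

Although tracking moments is easier than updating the whole distribution and sufficient for recovering the quantity of interest ($p = m_{t,1}$, the average probability of cooperation over the $H_x$), it can be further simplified. If $f_0$ is the uniform distribution, then $m_{0,k} = \frac{1}{k+1}$. What are the moments doing at later times? They’re just counting, which we will prove by induction.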
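To make the moment updates concrete, here is a minimal Python sketch, assuming a uniform prior and exact rational arithmetic. Conditioning on a cooperation multiplies the density by $x$ and renormalizes, while a defection multiplies it by $1 - x$; on the moments this becomes the two rules coded below. The function names `uniform_moments` and `update` are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def uniform_moments(K):
    """Moments of the uniform prior on [0, 1]: entry k is 1/(k+1), for k = 0..K."""
    return [Fraction(1, k + 1) for k in range(K + 1)]

def update(moments, cooperated):
    """Return the moments after one observation.

    moments[k] holds the k-th moment, with moments[0] == 1.
    Each update needs the (k+1)-th old moment to produce the k-th new one,
    so the returned list is one entry shorter than the input.
    """
    m1 = moments[1]
    if cooperated:
        # Cooperation: new k-th moment = old (k+1)-th moment / old mean
        return [moments[k + 1] / m1 for k in range(len(moments) - 1)]
    # Defection: new k-th moment = (old k-th - old (k+1)-th) / (1 - old mean)
    return [(moments[k] - moments[k + 1]) / (1 - m1)
            for k in range(len(moments) - 1)]

# Start from the uniform prior and observe: cooperate, cooperate, defect.
moments = uniform_moments(4)
for cooperated in (True, True, False):
    moments = update(moments, cooperated)

estimate = moments[1]  # posterior mean probability of cooperation
print(estimate)
```

Note that tracking a finite list of moments only supports a limited number of updates, since each step consumes the highest-order moment. That is one practical reason to look for something simpler.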
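Our inductive hypothesis is that after $t$ observations, with $c$ of them being cooperations (and thus $d = t - c$ defections), the $k$-th moment is:

$$m_{t,k} = \frac{(c+1)(c+2)\cdots(c+k)}{(c+d+2)(c+d+3)\cdots(c+d+k+1)}$$

Note that this hypothesis implies that

$$m_{t,k+1} = m_{t,k}\,\frac{c+k+1}{c+d+k+2}$$

If we look at the base case of $t = 0$ (and thus $c = d = 0$) then this simplifies to

$$m_{0,k} = \frac{1 \cdot 2 \cdots k}{2 \cdot 3 \cdots (k+1)} = \frac{1}{k+1}$$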
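Our base case is met, so let us consider a step. Suppose that our $(t+1)$-st observation is a cooperation, then we have:

$$\begin{aligned} m_{t+1,k} &= \frac{m_{t,k+1}}{m_{t,1}} \\ &= \frac{(c+1)(c+2)\cdots(c+k+1)}{(c+d+2)(c+d+3)\cdots(c+d+k+2)} \cdot \frac{c+d+2}{c+1} \\ &= \frac{(c+2)(c+3)\cdots(c+k+1)}{(c+d+3)(c+d+4)\cdots(c+d+k+2)} \end{aligned}$$

Where the last line is exactly what we expect: observing a cooperation at step $t+1$ means we have seen a total of $c+1$ cooperations.

If we observe a defection on step $t+1$ instead, then we have:

$$\begin{aligned} m_{t+1,k} &= \frac{m_{t,k} - m_{t,k+1}}{1 - m_{t,1}} \\ &= m_{t,k}\left(1 - \frac{c+k+1}{c+d+k+2}\right)\frac{c+d+2}{d+1} \\ &= m_{t,k}\,\frac{c+d+2}{c+d+k+2} \\ &= \frac{(c+1)(c+2)\cdots(c+k)}{(c+d+3)(c+d+4)\cdots(c+d+k+2)} \end{aligned}$$

Which is also exactly what we expect: observing a defection at step $t+1$ means we have seen a total of $d+1$ defections. This completes our proof by induction, and means that our agents need only store the number of cooperations and defections they have experienced. In particular, the estimated probability of cooperation is the first moment, $m_{t,1} = \frac{c+1}{c+d+2}$.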
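As a numerical sanity check on the induction, the following sketch compares a brute-force Bayesian update on a discretized grid of hypotheses against the counting estimate $(c+1)/(c+d+2)$. It assumes a uniform prior; the grid size and the function names `grid_posterior_mean` and `counting_estimate` are illustrative choices, not part of the model.

```python
def grid_posterior_mean(observations, n=10_000):
    """Posterior mean of x after Bayesian updating on a grid.

    Hypotheses are the midpoints of n equal cells in [0, 1], starting
    from uniform weights. Each observation multiplies the weight of x
    by the likelihood: x for a cooperation, 1 - x for a defection.
    """
    xs = [(i + 0.5) / n for i in range(n)]
    weights = [1.0] * n
    for cooperated in observations:
        weights = [w * (x if cooperated else 1.0 - x)
                   for w, x in zip(weights, xs)]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, xs)) / total

def counting_estimate(observations):
    """Estimate from counts alone: (c + 1) / (c + d + 2)."""
    c = sum(1 for cooperated in observations if cooperated)
    d = len(observations) - c
    return (c + 1) / (c + d + 2)

# Three cooperations and two defections, in some order.
observations = [True, True, False, True, False]
print(grid_posterior_mean(observations))  # numerically close to 4/7
print(counting_estimate(observations))    # 4/7 up to floating point
```

The grid version carries the whole (discretized) distribution through every update, while the counting version needs only two integers, yet both give the same estimate up to discretization error.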
I suspect the above theorem is taught in any first statistics course. Unfortunately, I’ve never had a stats class, so I had to recreate the theorem here. If you know the name of this result then please leave it in the comments. For those who haven’t seen this before, I think it is nice to see explicitly how rationally estimating probabilities based on past data reduces to counting that data.
Our agents are then described by two numbers giving their genotype, and four for their mind. For the genotype, there are the two payoff values that define the cooperate-defect game the agent thinks it is playing:
For the agents’ mind, we have $n_{CC}$ and $n_{CD}$, which are the numbers of cooperations and defections the agent saw after cooperating, and $n_{DC}$ and $n_{DD}$, which are the same following a defection. From these values and the theorem we just proved, the agent knows that $p = \frac{n_{CC} + 1}{n_{CC} + n_{CD} + 2}$ and $q = \frac{n_{DC} + 1}{n_{DC} + n_{DD} + 2}$. With these values, the agent can calculate the expected subjective utility of cooperating and defecting:
If the expected subjective utility of cooperating is higher than that of defecting then the agent will cooperate; otherwise it will defect. This has a risk of locking an agent into one action forever, say cooperate, and then never having a chance to sample results for defection and thus never update $q$, its estimate of the probability of seeing a cooperation after it defects. To avoid this, we use the trembling-hand mechanism (or $\epsilon$-greedy reinforcement learning): with small probability $\epsilon$ the agent performs the opposite action of what it intended.
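The decision rule and the trembling hand can be sketched as below. This is only an illustration under stated assumptions: the payoff parametrization (1 for mutual cooperation, `U` for cooperating against a defector, `V` for defecting against a cooperator, 0 for mutual defection) is an assumption made for this example, and the class name `Agent`, its methods, and the default `epsilon` are all hypothetical.

```python
import random

class Agent:
    """Subjectively rational learner with a trembling hand (illustrative sketch)."""

    def __init__(self, U, V, epsilon=0.05, rng=None):
        # ASSUMED subjective payoffs to the agent: C vs C -> 1, C vs D -> U,
        # D vs C -> V, D vs D -> 0. This parametrization is an assumption.
        self.U, self.V = U, V
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        # Mind: for each own action, [partner cooperations, partner defections].
        self.counts = {"C": [0, 0], "D": [0, 0]}

    def estimate(self, my_action):
        """Counting estimate (c + 1) / (c + d + 2) of partner cooperation."""
        c, d = self.counts[my_action]
        return (c + 1) / (c + d + 2)

    def intended_action(self):
        """Pick the action with the higher expected subjective utility."""
        p, q = self.estimate("C"), self.estimate("D")
        util_c = p * 1 + (1 - p) * self.U
        util_d = q * self.V + (1 - q) * 0
        return "C" if util_c > util_d else "D"

    def act(self):
        """With probability epsilon, play the opposite of the intended action."""
        action = self.intended_action()
        if self.rng.random() < self.epsilon:
            action = "D" if action == "C" else "C"
        return action

    def observe(self, my_action, partner_action):
        """Record what the partner did in response to my_action."""
        self.counts[my_action][0 if partner_action == "C" else 1] += 1

# Usage: an agent that starts out preferring defection under these payoffs.
agent = Agent(U=-0.5, V=1.5, epsilon=0.0)
first_choice = agent.intended_action()
for _ in range(5):
    agent.observe("C", "C")  # cooperation keeps being met with cooperation
later_choice = agent.intended_action()
```

With `epsilon` set to zero the agent in this example could never gather the observations after cooperating that change its mind on its own; the trembling hand is what guarantees both rows of counts keep getting sampled.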
The above agent is rational with respect to its subjective state but could be acting irrationally with respect to the objective game and the actual proportion of cooperation.