So I’ve been working on my networking skills in preparation for a transition to a postdoc by literally chatting up every single person within a 15-foot radius. This is a surprisingly pleasant exercise. The weird thing is, I met no fewer than 3 people named Rachel this week. It’s really not that common a name (in Montreal at least). So I was like, what are the odds?
Turns out I’m a statistics nerd, so I figured I could actually compute this. The US Census Bureau has a list of the 5000 most common boy and girl first names, with their respective populations. Looking at the (truncated) probability density function, the distribution of names is well approximated by a Zipf-Mandelbrot law with q = 55 and s = 1.6 (these values were determined through eyeballing, as seen above). Note that the initial outlier is the name Mary.
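Spelled out, and assuming the usual parameterization of the Zipf-Mandelbrot law (which is also what the code at the end of the post computes), the probability of the name with popularity rank k is given below. The symbol K for the number of names is my own label, picked so it doesn't clash with the N and n used later.

```latex
p(k) = \frac{(k + q)^{-s}}{\sum_{j=1}^{K} (j + q)^{-s}},
\qquad k = 1, \dots, K, \quad q = 55, \quad s = 1.6, \quad K = 5000
```

The offset q flattens the head of the distribution, and the exponent s controls how fast the tail of rare names falls off.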
The rest, as they say, is cake. We can answer a question like: “what are the odds that if you meet N people of the same sex, n or more of them turn out to have the same, relatively uncommon name?” As a benchmark for “relatively uncommon”, 0.5% of girls in the US are called Sarah. The question is easily answered by simulating Zipf-Mandelbrot draws (you can do this through multinomial sampling if you’re lazy like I am). So, to answer my question, if you meet 15 girls, the odds that 3 (or more) of them will have the same name, and that the name is less common than Sarah, are about 0.02%.
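For anyone who wants to poke at this without MATLAB, here is a minimal Python sketch of the same idea, using only the standard library. It is not the code used for the numbers in this post: the function names `zipf_mandelbrot_pmf` and `names_odds` are made up for illustration, the cutoff is taken as a fraction (0.005 for 0.5%), and the default number of trials is far smaller than you'd want for an event this rare.

```python
import random
from collections import Counter
from itertools import accumulate


def zipf_mandelbrot_pmf(n_names=5000, q=55.0, s=1.6):
    """Normalized Zipf-Mandelbrot probabilities for ranks 1..n_names."""
    weights = [(k + q) ** -s for k in range(1, n_names + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def names_odds(N, n, cutoff, trials=100_000, seed=0):
    """Estimate the odds that, among N people, n or more share a name
    whose share of the population is below `cutoff` (a fraction)."""
    pmf = zipf_mandelbrot_pmf()
    cum = list(accumulate(pmf))      # cumulative weights for sampling
    ranks = range(len(pmf))          # index 0 is the most common name
    rng = random.Random(seed)        # seeded so runs are repeatable
    hits = 0
    for _ in range(trials):
        # Drawing N names with replacement and counting them is one
        # multinomial sample.
        counts = Counter(rng.choices(ranks, cum_weights=cum, k=N))
        if any(c >= n and pmf[r] < cutoff for r, c in counts.items()):
            hits += 1
    return hits / trials
```

Calling `names_odds(15, 3, 0.005)` runs the scenario from this post. With so few hits expected, the estimate will be noisy unless `trials` is pushed up considerably.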
So yes, it’s pretty unlikely: even if you keep meeting 15 new girls every week for 50 years, there’s only about a 1 in 2 chance this will ever happen to you. So I’m feeling pretty special. Ain’t statistics grand?
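The 50-year figure is back-of-the-envelope arithmetic. As a rough sketch, assuming each week's batch of 15 is independent of the others, 52 weeks in a year, and taking the rounded 0.02% at face value:

```latex
\begin{aligned}
P(\text{at least once}) &= 1 - (1 - p)^{W}, \qquad p = 0.0002, \quad W = 50 \times 52 = 2600 \\
&\approx 1 - e^{-pW} = 1 - e^{-0.52} \approx 0.41
\end{aligned}
```

That lands in the same ballpark as the 1 in 2 quoted above. The exact value shifts with how the 0.02% was rounded.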
Here’s the code I used (this is admittedly some of the worst code I’ve ever written):
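    function odds = namesdistro(N, n, cutoff)
    % Computes the approximate odds (as a fraction of 1) that, after meeting
    % N people of the same sex, n or more of them will have the same name,
    % and that name is shared by less than a fraction cutoff of the general
    % population (e.g. cutoff = 0.005 for 0.5%).
    % Done by simulation; mnrnd is in the Statistics and Machine Learning Toolbox.
    batchsize = 1000;   % groups simulated per call to mnrnd
    ndraws    = 1e6;    % total number of simulated groups of N people
    nmax      = 5e3;    % number of distinct names in the model

    % Zipf-Mandelbrot probabilities with q = 55 and s = 1.6, normalized to sum to 1
    thepdf = 1 ./ (((1:nmax)' + 55).^1.6);
    thepdf = thepdf / sum(thepdf);

    rare = thepdf < cutoff;   % nmax-by-1 mask of names rarer than the cutoff
    odds = 0;
    for ii = 1:ndraws/batchsize
        % each column of ns holds the per-name counts for one simulated group
        ns = mnrnd(N, thepdf', batchsize)';
        % a group counts as a hit if any rare name shows up n or more times
        odds = odds + sum(any(bsxfun(@and, ns >= n, rare))) / ndraws;
    end
    end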