ReproduceIt is a series of articles that reproduce the results of data analysis articles, with a focus on open data and open code.
Today, as a small return for the ReproduceIt series, I try to reproduce a simple but nice data analysis and webapp that braid.io did, called "Most Beyonces are 14 years old and most Kanyes are about 11".
The article analyses the trend in the names of some music artists (Beyonce, Kanye and Madonna) in the US. It also offers some nice possible explanations for the ups and downs over time; it's a quick read. The data comes from the Social Security Administration and can be downloaded from the SSA website: Beyond the Top 1000 Names
The data is very small, so loading it into pandas and plotting it with bokeh was very easy.
import os
import pandas as pd
data_dir = os.path.expanduser("~/data/names/names")
files = os.listdir(data_dir)
data = pd.DataFrame(columns=["year", "name", "sex", "occurrences"])
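for fname in files:
    if fname.endswith(".txt"):
        fpath = os.path.join(data_dir, fname)
        df = pd.read_csv(fpath, header=None, names=["name", "sex", "occurrences"])
        # The year is taken from characters 3-6 (0-based) of the file name
        df["year"] = int(fname[3:7])
        # DataFrame.append was removed in pandas 2.0; there, use pd.concat([data, df])
        data = data.append(df)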
data.year = data.year.astype(int)
name            object
occurrences    float64
sex             object
year             int64
dtype: object
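On current pandas releases `DataFrame.append` is no longer available, so the loading loop can be restated as a minimal sketch that collects one frame per file and concatenates them once at the end. The function name `load_names` is hypothetical, and the `yobYYYY.txt` file naming is an assumption inferred from the `fname[3:7]` slice.

```python
import os

import pandas as pd


def load_names(data_dir):
    """Read every .txt file in data_dir into one long DataFrame."""
    frames = []
    for fname in sorted(os.listdir(data_dir)):
        if not fname.endswith(".txt"):
            continue
        df = pd.read_csv(
            os.path.join(data_dir, fname),
            header=None,
            names=["name", "sex", "occurrences"],
        )
        # Assumption: file names look like yob2001.txt, so fname[3:7] is the year.
        df["year"] = int(fname[3:7])
        frames.append(df)
    # A single concat avoids copying the growing frame on every iteration.
    return pd.concat(frames, ignore_index=True)
```

Called as `data = load_names(os.path.expanduser("~/data/names/names"))`, this should leave `occurrences` and `year` as integer columns, since nothing is appended to an empty frame first.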
Now that the data is in a simple DataFrame, we can just filter by the name we want and make a bar chart.
beyonce = data[data["name"] == "Beyonce"][["year", "occurrences"]]
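Because each row is keyed by name, sex and year, a name recorded for both sexes in the same year would yield two rows after this filter. The following is a small sketch, not part of the original analysis, that sums them into one count per year and picks the year with the most births; `yearly_counts` and `peak_year` are hypothetical helper names.

```python
import pandas as pd


def yearly_counts(data, name):
    """Total occurrences of `name` per year, summed across sexes."""
    subset = data[data["name"] == name]
    return subset.groupby("year")["occurrences"].sum().sort_index()


def peak_year(counts):
    """Year with the highest count (the earliest one if there is a tie)."""
    return int(counts.idxmax())
```

For example, `peak_year(yearly_counts(data, "Beyonce"))` gives the most common birth year for the name. Subtracting it from the current year gives the most common age, which is presumably how a headline like "most Beyonces are 14 years old" is reached.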
from bokeh.charts import ColumnDataSource, Bar, output_notebook, show
from bokeh.models import HoverTool
p = Bar(data=beyonce, label="year", values="occurrences", title="No. Babies named Beyoncé",
        color="#0277BD", ylabel='', tools="save,reset")
show(p)
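The `bokeh.charts` module used here was dropped from later Bokeh releases, so on a current install the same chart can be approximated with the lower-level `bokeh.plotting` API. This is a sketch under that assumption, not the code this article was built with; `make_bar_figure` is a hypothetical helper name, and the hover tooltip is an addition.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show


def make_bar_figure(df, title="No. Babies named Beyoncé"):
    """Bar chart of occurrences per year, built with bokeh.plotting."""
    # Collapse to one row per year so each bar shows that year's total.
    totals = df.groupby("year", as_index=False)["occurrences"].sum()
    source = ColumnDataSource(totals)
    p = figure(title=title, tools="save,reset", x_axis_label="year")
    p.vbar(x="year", top="occurrences", width=0.8, color="#0277BD", source=source)
    # Show the exact year and count when hovering over a bar.
    p.add_tools(HoverTool(tooltips=[("year", "@year"), ("babies", "@occurrences")]))
    return p


# Usage, given the filtered frame from the article:
# show(make_bar_figure(beyonce))
```

In a notebook, call `output_notebook()` from `bokeh.io` once before `show` so the figure renders inline.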
And that's it! Nothing crazy or big data this time, but a nice example of how to get something done in Python in 30 minutes. Go to the article page and you can search for your own name in a nice webapp.