Timeline release statistics

Published on 2015-07-01.

In [1]:
import re
import subprocess
import dateutil.parser
import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

PATH_TO_REPO = "main-sf"
PATH_TO_CHANGELOG = "%s/doc/changelog.rst" % PATH_TO_REPO

In this notebook I will explore data from Timeline to try to learn something.

In particular, I'm interested in knowing what makes a Timeline release successful. What did we do right that we can keep doing so that future releases of Timeline will also be successful?

That is quite a vague direction, but I hope I can ask more specific questions as I go along.

When did releases happen?

Where do we start? Let's start somewhere.

Let's start by figuring out when the different releases happened. We can parse that information from the changelog:

In [2]:
release_dates = []
release_versions = []
with open(PATH_TO_CHANGELOG) as f:
    while True:
        line = f.readline()
        if not line:
            break
        match = re.match(r"^Version (\d+\.\d+\.\d+)$", line)
        if match:
            version = match.group(1)
            f.readline()
            f.readline()
            match = re.match(r"^\*\*Released on (.*)\.\*\*$", f.readline())
            if match:
                release_dates.append(dateutil.parser.parse(match.group(1)).date())
                release_versions.append(version)
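
To make explicit what the loop above expects, here is a minimal, self-contained sketch of the same idea run against a small sample. The sample text, the date format (`%d %B %Y`), and the name `parse_releases` are my assumptions for illustration. They are inferred from the regular expressions above, not copied from the real changelog.

```python
import datetime
import re

# Hypothetical changelog excerpt; the real file's layout may differ.
SAMPLE_CHANGELOG = """\
Version 1.6.0
=============

**Released on 30 April 2015.**

Some change notes.

Version 1.5.0
=============

**Released on 31 January 2015.**
"""

VERSION_RE = re.compile(r"^Version (\d+\.\d+\.\d+)$")
RELEASED_RE = re.compile(r"^\*\*Released on (.*)\.\*\*$")


def parse_releases(text):
    """Return a list of (version, date) tuples found in changelog text."""
    lines = text.splitlines()
    found = []
    for i, line in enumerate(lines):
        version_match = VERSION_RE.match(line)
        if not version_match:
            continue
        # Assumed layout: heading, underline, blank line, release line.
        if i + 3 >= len(lines):
            continue
        released_match = RELEASED_RE.match(lines[i + 3])
        if released_match:
            # Assumed date format; dateutil.parser.parse is more forgiving.
            date = datetime.datetime.strptime(
                released_match.group(1), "%d %B %Y"
            ).date()
            found.append((version_match.group(1), date))
    return found


parsed = parse_releases(SAMPLE_CHANGELOG)
```

A version heading that is not followed by a "Released on" line (for example a release still in development) is simply skipped, just as in the loop above.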

Let's load that into Pandas so that we can more easily work with it:

In [3]:
releases = pd.DataFrame({
    "date": release_dates,
    "version": release_versions,
})

releases.head()
Out[3]:
         date version
0  2015-04-30   1.6.0
1  2015-01-31   1.5.0
2  2014-11-12   1.4.1
3  2014-11-09   1.4.0
4  2014-06-30   1.3.0

5 rows × 2 columns

What does the frequency look like?

Let's plot when releases occurred in time to get a feel for the distribution.

Let's first add a dummy column that we will use for plotting purposes.

In [4]:
releases["dummy"] = np.zeros(len(release_dates))
releases.head()
Out[4]:
         date version  dummy
0  2015-04-30   1.6.0      0
1  2015-01-31   1.5.0      0
2  2014-11-12   1.4.1      0
3  2014-11-09   1.4.0      0
4  2014-06-30   1.3.0      0

5 rows × 3 columns

Let's also select the major releases (versions ending in ".0"), since those are the only ones we want to label on the x-axis:

In [5]:
major_releases = releases[releases["version"].str.endswith(".0")]
major_releases.head()
Out[5]:
         date version  dummy
0  2015-04-30   1.6.0      0
1  2015-01-31   1.5.0      0
3  2014-11-09   1.4.0      0
4  2014-06-30   1.3.0      0
9  2014-04-05   1.2.0      0

5 rows × 3 columns

Now we are ready to plot:

In [6]:
releases.plot(x="date", y="dummy", style="o", figsize=(15, 3))
plt.xticks(major_releases["date"].values, major_releases["version"].values, rotation=90)
plt.yticks([])
plt.xlabel("")
plt.show()

We see that there are some blue circles to the right of the vertical lines. Those are the minor releases. For example, the circle to the right of 0.12.0 is probably release 0.12.1.

Now we've got an intuitive feel for the distribution. Let's see if we can plot it more precisely:

In [7]:
sorted_major_releases = major_releases.sort_values("date").copy()  # DataFrame.sort was removed in newer pandas
# Convert to datetime64 first so that diff() yields proper timedeltas
release_timestamps = pd.to_datetime(sorted_major_releases["date"])
sorted_major_releases["time_in_development"] = release_timestamps.diff()
sorted_major_releases["days_in_development"] = sorted_major_releases["time_in_development"].dt.days
sorted_major_releases.head()
Out[7]:
          date  version  dummy  time_in_development  days_in_development
38  2009-04-11    0.1.0      0                  NaT                  NaN
37  2009-07-05    0.2.0      0              85 days                   85
36  2009-08-01    0.3.0      0              27 days                   27
35  2009-09-01    0.4.0      0              31 days                   31
34  2009-10-01    0.5.0      0              30 days                   30

5 rows × 5 columns

And now we are ready to plot:

In [8]:
sorted_major_releases.days_in_development.plot(kind="bar")
plt.xticks(
    np.arange(sorted_major_releases.days_in_development.shape[0])+1, # Not sure why +1 is needed
    sorted_major_releases.version.values,
    rotation=90
)
plt.title("Days in development for Timeline releases")
plt.show()

That was fun!

It looks like we released often in the beginning and then changed the release period at version 0.10.0. I know that at some point we decided to make a new release roughly every three months. Was that around 0.10.0? If so, why did releases 0.11.0, 0.17.0, and 1.4.0 take significantly longer?

From the changelog, it looks like version 0.11.0 contained very few changes. So maybe it was "delayed" because we had nothing useful to release.

The same goes for version 0.17.0.

Version 1.4.0 contained the undo feature, which I remember we wanted to test a bit more before making the release. So that is probably why 1.4.0 was a little late.

What about commit frequency?

Now, let's extract some data about commits to see if the data there supports our guesses above.

Let's start by extracting the dates of all commits:

In [9]:
output = subprocess.check_output([
    "hg", "log",
    "--template", "{date|isodate}\n"
], cwd=PATH_TO_REPO).decode("utf-8")  # check_output returns bytes on Python 3

commits = pd.DataFrame({
    "date": [dateutil.parser.parse(x).date() for x in output.strip().split("\n")]
})
commits = commits.sort_values("date")  # DataFrame.sort was removed in newer pandas

In [10]:
commits.head()
Out[10]:
            date
3147  2008-10-28
3146  2008-10-29
3145  2008-11-01
3144  2008-11-02
3143  2008-11-03

5 rows × 1 columns

In [11]:
commits.tail()
Out[11]:
         date
0  2015-06-19
1  2015-06-19
2  2015-06-19
3  2015-06-19
4  2015-06-19

5 rows × 1 columns

In [12]:
commits.describe()
Out[12]:
              date
count         3148
unique         718
top     2014-09-09
freq            32

4 rows × 1 columns

From that we can create a series that has the number of commits per day:

In [13]:
commit_frequency = commits.groupby("date").count().rename(columns={"date": "number_of_commits"}).asfreq("D").fillna(0)
commit_frequency.head()
Out[13]:
            number_of_commits
2008-10-28                  1
2008-10-29                  1
2008-10-30                  0
2008-10-31                  0
2008-11-01                  1

5 rows × 1 columns
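
The one-liner above relies on the behaviour of the pandas version I used at the time. As a hedged sketch, this is one way the same "commits per day, with empty days filled in as zero" series could be built with a more recent pandas. The data and the names `toy_commits` and `toy_frequency` are made up for illustration.

```python
import datetime

import pandas as pd

# Made-up commit dates, one entry per commit (not real Timeline data).
toy_commits = pd.DataFrame({
    "date": [
        datetime.date(2008, 10, 28),
        datetime.date(2008, 10, 29),
        datetime.date(2008, 10, 29),
        datetime.date(2008, 11, 1),
    ]
})

toy_frequency = (
    toy_commits
    # asfreq needs a DatetimeIndex, so convert the plain dates first.
    .assign(date=pd.to_datetime(toy_commits["date"]))
    .groupby("date")
    .size()                         # commits per day that has at least one commit
    .to_frame("number_of_commits")
    .asfreq("D", fill_value=0)      # add the missing days with a count of 0
)
```

Here `size()` counts rows per group, and `asfreq("D", fill_value=0)` inserts the days without commits, so the toy result covers every day from 2008-10-28 to 2008-11-01.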

Let's plot it to see what it looks like:

In [14]:
commit_frequency.plot(figsize=(15, 5))
plt.title("Number of commits over time")
plt.xticks(
    sorted_major_releases.date.values,
    sorted_major_releases.version.values,
    rotation=90
)
plt.show()

Now let's see if we can look at a particular release. Let's look at the three we found took longer: 0.11.0, 0.17.0, and 1.4.0:

In [15]:
def plot_commit_stat(start_release, end_release):
    span = major_releases[(major_releases.version == start_release) | (major_releases.version == end_release)]
    start_date = span.date.min()
    end_date = span.date.max()
    commit_frequency[start_date:end_date].plot(figsize=(15, 3))
    labels = major_releases[(major_releases.date >= start_date) & (major_releases.date <= end_date)]
    plt.title("Number of commits between %s - %s" % (start_release, end_release))
    plt.xticks(
        labels.date.values,
        labels.version.values,
        rotation=90
    )

plot_commit_stat("0.10.0", "0.11.0")
plot_commit_stat("0.16.0", "0.17.0")
plot_commit_stat("1.3.0", "1.4.0")
plt.show()

The two earlier releases seem to have fewer commits per day, which matches what we saw in the changelog. Version 1.4.0 seems to have quite steady commits from the middle of the period onwards. The overall commit frequency graph also shows that the peak of about 32 commits in one day falls within the 1.4.0 period. So it's possible that it took longer because we wanted to add some more tests.
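
One way to put numbers on the "fewer commits per day" impression would be to divide the number of commits in each release period by the length of that period. The following is only a sketch on made-up data: `toy_releases`, `toy_commit_dates`, and `commits_per_day` are hypothetical names, and it assumes the releases are listed in ascending date order with no two releases on the same day.

```python
import datetime

# Made-up release dates and commit dates, for illustration only.
toy_releases = [
    ("0.1.0", datetime.date(2009, 4, 11)),
    ("0.2.0", datetime.date(2009, 4, 21)),
    ("0.3.0", datetime.date(2009, 5, 11)),
]
toy_commit_dates = [
    datetime.date(2009, 4, 12),
    datetime.date(2009, 4, 15),
    datetime.date(2009, 4, 15),
    datetime.date(2009, 4, 20),
    datetime.date(2009, 4, 21),
    datetime.date(2009, 5, 1),
    datetime.date(2009, 5, 10),
]


def commits_per_day(releases, commit_dates):
    """Map each version to (commits, days, commits per day) since the previous release."""
    stats = {}
    # Pair every release with the one before it; the first release has no period.
    for (_, start), (version, end) in zip(releases, releases[1:]):
        days = (end - start).days
        # Count commits after the previous release, up to and including release day.
        count = sum(1 for d in commit_dates if start < d <= end)
        stats[version] = (count, days, count / days)
    return stats


toy_stats = commits_per_day(toy_releases, toy_commit_dates)
```

With the real data, the same calculation could be fed from `sorted_major_releases` and `commits`, which would make it possible to compare 0.11.0, 0.17.0, and 1.4.0 against the other releases directly.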

I'm not sure if I can draw any conclusions from this, but looking at data graphically is quite fun, and I've learned how to use the Pandas library.
