Published on 2015-07-01.
import re
import subprocess
import dateutil.parser
import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
PATH_TO_REPO = "main-sf"
PATH_TO_CHANGELOG = "%s/doc/changelog.rst" % PATH_TO_REPO
In this notebook I will explore data from Timeline to try to learn something.
In particular, I'm interested in knowing what makes a Timeline release successful. What did we do right that we can keep doing so that future releases of Timeline will be successful?
That is quite a vague direction, but I hope I can ask some more specific questions as I go along.
Where do we start? Let's start somewhere.
Let's start by figuring out when the different releases happened. We can parse that information from the changelog:
release_dates = []
release_versions = []
with open(PATH_TO_CHANGELOG) as f:
    while True:
        line = f.readline()
        if not line:
            break
        match = re.match(r"^Version (\d+\.\d+\.\d+)$", line)
        if match:
            version = match.group(1)
            # Skip the two lines between the version heading and the release line
            f.readline()
            f.readline()
            match = re.match(r"^\*\*Released on (.*)\.\*\*$", f.readline())
            if match:
                release_dates.append(dateutil.parser.parse(match.group(1)).date())
                release_versions.append(version)
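To make the parsing step easier to follow, here is a minimal, self-contained sketch of the same idea run on an in-memory sample. The sample changelog text, the ISO date format, and the names `SAMPLE_CHANGELOG` and `parse_releases` are assumptions made up for illustration; the real changelog may be laid out differently, which is why the notebook uses `dateutil` to parse the dates.

```python
import datetime
import re

# Hypothetical changelog excerpt; the real file's layout and date format may differ.
SAMPLE_CHANGELOG = """\
Version 1.1.0
=============

**Released on 2014-02-01.**

Some change.

Version 1.0.1
=============

**Released on 2013-11-15.**

Version 1.2.0
=============

**Planned release on 2014-05-01.**
"""

VERSION_RE = re.compile(r"^Version (\d+\.\d+\.\d+)$")
RELEASED_RE = re.compile(r"^\*\*Released on (.*)\.\*\*$")


def parse_releases(text):
    """Return (version, date) pairs for every released version in text."""
    lines = text.splitlines()
    releases = []
    for index, line in enumerate(lines):
        version_match = VERSION_RE.match(line)
        if not version_match:
            continue
        # The release line is expected three lines below the version heading
        # (heading, underline, blank line, release line).
        if index + 3 >= len(lines):
            continue
        released_match = RELEASED_RE.match(lines[index + 3])
        if released_match:
            # Assumed date format for this sample only.
            date = datetime.datetime.strptime(
                released_match.group(1), "%Y-%m-%d"
            ).date()
            releases.append((version_match.group(1), date))
    return releases


print(parse_releases(SAMPLE_CHANGELOG))
```

Versions without a matching "Released on" line, like the last one in the sample, are skipped.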
Let's load that into Pandas so that we can more easily work with it:
releases = pd.DataFrame({
    "date": release_dates,
    "version": release_versions,
})
releases.head()
Let's plot when releases occurred in time to get a feel for the distribution.
Let's first add a dummy column that we will use for plotting purposes.
releases["dummy"] = np.zeros(len(release_dates))
releases.head()
Let's also select the major releases (versions ending in .0), as we only want to show those on the x-axis:
major_releases = releases[releases["version"].str.endswith(".0")]
major_releases.head()
Now we are ready to plot:
releases.plot(x="date", y="dummy", style="o", figsize=(15, 3))
plt.xticks(major_releases["date"].values, major_releases["version"].values, rotation=90)
plt.yticks([])
plt.xlabel("")
plt.show()
We see that there are some blue circles to the right of the vertical lines. Those are the minor releases. For example, the circle to the right of 0.12.0 is probably release 0.12.1.
Now we've got an intuitive feel for the distribution. Let's see if we can plot it more precisely:
sorted_major_releases = major_releases.sort_values("date")
sorted_major_releases["time_in_development"] = sorted_major_releases.date.diff()
# Convert the time differences to whole days (the first release has no predecessor and stays NaN)
sorted_major_releases["days_in_development"] = pd.to_timedelta(sorted_major_releases["time_in_development"]).dt.days
sorted_major_releases.head()
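The "days in development" figure is just the gap between one major release and the previous one. As a rough sketch of that arithmetic in plain Python (the release dates and the helper names `days_in_development` and `nanoseconds_to_days` are hypothetical, for illustration only):

```python
import datetime

# Hypothetical release dates, for illustration only.
release_dates = [
    ("0.2.0", datetime.date(2009, 5, 1)),
    ("0.1.0", datetime.date(2009, 4, 1)),
    ("0.3.0", datetime.date(2009, 7, 30)),
]


def days_in_development(dated_versions):
    """Map each version (except the earliest) to the days since the previous one."""
    ordered = sorted(dated_versions, key=lambda pair: pair[1])
    result = {}
    for (_, previous_date), (version, date) in zip(ordered, ordered[1:]):
        result[version] = (date - previous_date).days
    return result


def nanoseconds_to_days(nanoseconds):
    """The same unit conversion as the chain of divisions in the notebook."""
    return nanoseconds / 1000000000.0 / 60.0 / 60.0 / 24.0


print(days_in_development(release_dates))
print(nanoseconds_to_days(30 * 24 * 60 * 60 * 1000000000))
```

The earliest release has no predecessor, so it gets no entry here; in the Pandas version that row shows up as a missing value instead.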
And now we are ready to plot:
sorted_major_releases.days_in_development.plot(kind="bar")
plt.xticks(
    np.arange(sorted_major_releases.days_in_development.shape[0])+1, # Not sure why +1 is needed
    sorted_major_releases.version.values,
    rotation=90
)
plt.title("Days in development for Timeline releases")
plt.show()
That was fun!
Looks like we released often in the beginning and then changed the release period at version 0.10.0. I know that at some point we decided to make a new release roughly every third month. Was it around 0.10.0? Then why did releases 0.11.0, 0.17.0, and 1.4.0 take significantly longer?
From the changelog, it looks like version 0.11.0 contained very few changes. So maybe it was "delayed" because we had nothing useful to release.
The same goes for version 0.17.0.
Version 1.4.0 contained the undo feature, which I remember we wanted to test a bit more before making the release. So that is probably why 1.4.0 was a little late.
Now, let's extract some data about commits to see if the data there supports our guesses above.
Let's start by extracting the dates of all commits:
# universal_newlines=True makes check_output return text rather than bytes
output = subprocess.check_output([
    "hg", "log",
    "--template", "{date|isodate}\n"
], cwd=PATH_TO_REPO, universal_newlines=True)
commits = pd.DataFrame({
    "date": [dateutil.parser.parse(x).date() for x in output.strip().split("\n")]
})
commits = commits.sort_values("date")
commits.head()
commits.tail()
commits.describe()
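For reference, here is a small sketch of what happens to each line of the `hg log` output. It assumes that `{date|isodate}` produces lines shaped like `2015-06-30 21:14 +0200`; the sample output and the name `parse_commit_dates` are made up for illustration, and the standard library's `strptime` stands in for `dateutil` here.

```python
import datetime

# Hypothetical output, assuming the "YYYY-MM-DD HH:MM +ZZZZ" shape of {date|isodate}.
SAMPLE_OUTPUT = (
    "2015-06-30 21:14 +0200\n"
    "2015-06-29 08:02 +0200\n"
    "2015-06-29 07:45 +0100\n"
)


def parse_commit_dates(output):
    """Return the commit dates (time and timezone dropped), oldest first."""
    return sorted(
        datetime.datetime.strptime(line, "%Y-%m-%d %H:%M %z").date()
        for line in output.strip().splitlines()
    )


print(parse_commit_dates(SAMPLE_OUTPUT))
```

Only the calendar date as recorded in the commit's own timezone is kept, which is enough for counting commits per day.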
From that we can create a series that has the number of commits per day:
commit_frequency = (
    commits.groupby("date")
    .count()
    .rename(columns={"date": "number_of_commits"})
    .asfreq("D")
    .fillna(0)
)
commit_frequency.head()
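The idea behind that one-liner is to count commits per date and then fill in every day without commits with zero, so that the series has one entry per calendar day. A plain-Python sketch of the same idea, using made-up commit dates and a hypothetical helper `commits_per_day`:

```python
import collections
import datetime

# Hypothetical commit dates, for illustration only.
commit_dates = [
    datetime.date(2015, 6, 1),
    datetime.date(2015, 6, 1),
    datetime.date(2015, 6, 4),
]


def commits_per_day(dates):
    """Return (day, count) pairs for every day from the first to the last commit."""
    counts = collections.Counter(dates)
    first, last = min(dates), max(dates)
    number_of_days = (last - first).days + 1
    result = []
    for offset in range(number_of_days):
        day = first + datetime.timedelta(days=offset)
        # A Counter returns 0 for days that have no commits.
        result.append((day, counts[day]))
    return result


for day, count in commits_per_day(commit_dates):
    print(day, count)
```

Without the zero fill, a line plot would connect commit days directly and hide the quiet periods in between.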
Let's plot it to see what it looks like:
commit_frequency.plot(figsize=(15, 5))
plt.title("Number of commits over time")
plt.xticks(
    sorted_major_releases.date.values,
    sorted_major_releases.version.values,
    rotation=90
)
plt.show()
Now let's see if we can look at a particular release. Let's look at the three we found took longer: 0.11.0, 0.17.0, and 1.4.0:
def plot_commit_stat(start_release, end_release):
    span = major_releases[(major_releases.version == start_release) | (major_releases.version == end_release)]
    start_date = span.date.min()
    end_date = span.date.max()
    commit_frequency[start_date:end_date].plot(figsize=(15, 3))
    labels = major_releases[(major_releases.date >= start_date) & (major_releases.date <= end_date)]
    plt.title("Number of commits between %s - %s" % (start_release, end_release))
    plt.xticks(
        labels.date.values,
        labels.version.values,
        rotation=90
    )

plot_commit_stat("0.10.0", "0.11.0")
plot_commit_stat("0.16.0", "0.17.0")
plot_commit_stat("1.3.0", "1.4.0")
plt.show()
The two earlier releases seem to contain fewer commits per day, which matches what we saw in the changelog. Version 1.4.0 seems to have quite steady commits from the middle of the period onward. Looking at the overall commit frequency graph, we also see that 1.4.0 contains the peak of around 32 commits in a day. So it's possible that it took longer because we wanted to add some more tests.
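One way to put a number on "fewer commits per day" instead of reading it off the graphs would be to compute the mean number of commits per day between two release dates. This is only a sketch with made-up data (written for Python 3); `mean_commits_per_day` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import datetime

# Hypothetical commit dates and release dates, for illustration only.
commit_dates = [
    datetime.date(2015, 1, 2),
    datetime.date(2015, 1, 2),
    datetime.date(2015, 1, 5),
    datetime.date(2015, 1, 9),
    datetime.date(2015, 1, 20),
]
start_date = datetime.date(2015, 1, 1)
end_date = datetime.date(2015, 1, 10)


def mean_commits_per_day(dates, start, end):
    """Average number of commits per calendar day in the span start..end (inclusive)."""
    number_of_days = (end - start).days + 1
    commits_in_span = [date for date in dates if start <= date <= end]
    return len(commits_in_span) / number_of_days


print(mean_commits_per_day(commit_dates, start_date, end_date))
```

Comparing this value across release periods would show whether the longer releases really had a lower commit rate, or just more calendar days.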
I'm not sure if I can draw any conclusions from this, but looking at data graphically is quite fun, and I've learned how to use the Pandas library.