Pythonic python 1: Hamming distance

This workshop is designed to introduce Python to bioinformaticians. It follows on from the introductory workshops on Python programming. In particular, we'll build on our solution to the Hamming distance workshop, and use the various language features we learn to improve our code.

Getting help

If you're using Jupyter, you can always use the built-in help() function or the ? operator to see documentation on a function or type. For instance, help(range) or ?range will give you documentation on the built-in range() function.

You can try this while going through the below. If you want to get help on a Python keyword such as for or if, or on an operator, you'll need to make sure to put quotes around it, like help("for") or help("!=").

Setup

This workshop has been written for Python 3. If you are using Python 2, you should be able to make everything work by first running the following statement:

In [1]:
from __future__ import division, print_function

Hamming distance

In this workshop, we'll use our solution to the novice workshop on Hamming Distance as an example.

The Hamming distance between two strings is the number of positions at which they have different characters.

For instance, compare these strings:

GATTACA
GACTATA

They are different at two locations:

GATTACA
  |  |
GACTATA

So the Hamming distance is 2.

Hamming distance is the simplest edit distance algorithm for string alignment. It tells us the edit distance between two strings if we're only allowed to make substitutions - no insertions or deletions.

We will consider the Hamming distance to be defined only if the two strings are the same lengths. So the Hamming distance between "CAT" and "SUPERCALIFRAGILISTICEXPIALIDOCIOUS" is undefined.

Hamming distance: initial solution

Let's illustrate some basic Python syntax using a solution to the Hamming distance problem. If you've done the introductory workshop, you might have come up with a similar solution yourself.

Exercise: If you already have some experience with Python and want to test yourself on the basics, try to solve this problem before you continue. Write a function called hamming_distance which takes as input two strings, and returns the number of characters which do not match. You may assume the two strings are the same length. You'll be able to use your solution as the basis for the rest of the workshop.

For example, your function should give the following results when called:

hamming_distance("GATTACA","GACTATA")
2
hamming_distance("AGCAGACCCACAA","TGCAGCGAAACGA")
6

The solution is written below, so free to stop here and try to solve this, and to ask for help if you get stuck. Asking for hints is a much better way to develop your understanding than reading the solution.

A couple of hints: you will need to understand strings, loops, and conditionals. You will probably also want to use the built-in Python functions range() and len(). Remember you can use help() to look at the documentation for these built-in functions.

SPOILER ALERT: If you are new to programming, and would like to do the novice workshop, you might want to do it before looking at the solution below!

Here is one possible solution to the Hamming distance problem. The basic plan here is:

  • check the length of the strings so we know how many characters we need to compare
  • look at each pair of corresponding characters in turn by looping over the indices
  • for each pair of characters that are not the same, add 1 to the Hamming distance
  • return the final count
In [2]:
# Return the Hamming distance between string1 and string2.
# string1 and string2 should be the same length.
def hamming_distance(string1, string2): 
    # Start with a distance of zero, and count up
    distance = 0
    # Loop over the indices of the string
    L = len(string1)
    for i in range(L):
        # Add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    # Return the final count of differences
    return distance

We can call this function by passing in two strings as the arguments string1 and string2:

In [3]:
example_dist = hamming_distance("GATTACA", "GACTATA")
print(example_dist)
2

Of course, now that we have a function, we can reuse it on any pair of strings, including ones where the answer is not so obvious.

In [4]:
hamming_distance("ACTGTGCAATACCTAAGTGAAAGGGGTGCATAGATCATTCTTTCGTTACCTCGGGTGCTATGA",
                 "ACTCCTCTGTGTCTAAGTGAAAGGGGTGCTTGCAGGGTAATCCTTCCACCTGATACCGACTCG")
Out[4]:
32

Exercise: To check everything is working properly, write or paste this function into your own Notebook and calculate the Hamming distance between "AGCAGACCCACAA" and "TGCAGCGAAACGA".

This Hamming distance solution works, but there's definitely room for improvement. As we go through some code features and tips and tricks of Python, we'll try to use them to improve this Hamming distance solution.

More tricks with strings

The first thing we're going to try to fix is documentation, and for this we need to learn more about strings. We have comments in our function, but Python has better documentation methods.

We'll also print out strings to test our code as we learn, so it's worth taking some time at the start to learn more about using them.

String formatting

Say we want to print out some debugging information to see what our program is doing. For instance, imagine we're trying to talk about these variables:

In [6]:
string1 = "GATTACA"
string2 = "GACTATA"
L = len(string1)

You may have already seen string concatenation, where we add strings together, and the str() function which converts values to strings. This will let us build a string with our variables in it, but it can get quite clunky to write:

In [7]:
print("String 1 is " + string1 + ", string 2 is " + string2 + ". The lengths are " + str(L) + ".")
String 1 is GATTACA, string 2 is GACTATA. The lengths are 7.

More convenient and flexible string formatting in Python can be done with the .format() method, which is used on strings. It looks like this:

In [8]:
print("String 1 is {0}, string 2 is {1}. The lengths are {2}.".format(string1, string2, L))
String 1 is GATTACA, string 2 is GACTATA. The lengths are 7.

The .format() method prints its arguments into the "replacement fields" surrounded by curly braces {} in the string. It automatically tries to convert variables to strings as it does so. The first argument goes into {0}, the second into {1}, and so on. There's nothing to stop us from using a replacement field more than once:

In [49]:
print("Strings are {0} and {1}, length is {2}.\nThe distance between {0} and {1} is {3}".format(string1, string2, L, hamming_distance(string1, string2)))
Strings are GATTACA and GACTATA, length is 7.
The distance between GATTACA and GACTATA is 2

Notice that here we have added a "\n" to the string, and that Python printed it as a newline. We'll look at this in more detail in a moment.

It's also possible to use named keywords instead of numbers, like so:

In [10]:
"The Hamming Distance between {s1} and {s2} is {dist}".format(s1=string1, s2=string2, dist=hamming_distance(string1, string2))
Out[10]:
'The Hamming Distance between GATTACA and GACTATA is 2'

The replacement fields have many options to change the way variables are displayed. For instance, we can change the number of decimal places printed, or whether a string is left-aligned. We can't cover these here, but you can look up these options in the format string syntax documentation.

Escape characters and multi-line strings

Above, we wrote "\n" in a string, and it caused Python to print a newline (move down to the next line).

A backslash \ indicates an escape character, which means that the character after the backslash is not printed normally, but treated as a special character. "\n" inserts a newline, and "\t" inserts a tab.

In [11]:
s = "Line 1.\nLine 2.\n\t\tLine 3 is indented with two tabs."
In [12]:
print(s)
Line 1.
Line 2.
		Line 3 is indented with two tabs.

Using a backslash in a string can also be used to 'escape' a character that would otherwise have a special meaning, like a quote:

In [13]:
print("The two types of single-quote characters are \" and \'.")
The two types of single-quote characters are " and '.

But what if you want to write an actual backslash in your string, not use it to escape another character? Since backslash itself has a special meaning, the way to write it is to escape the backslash itself!

In [14]:
print("The escape character is \\.")
The escape character is \.

If we have a very long string that will span multiple lines, and we want it to be readable, there is a particularly convenient method: the triple-quote operator, """ or '''. Everything from the first triple-quote to the next will be part of the string, even if it spans multiple lines. Whitespace characters like tabs or spaces, and non-triple quote characters, will just be included as part of the string.

In [15]:
bulldog = '''... clutching out, he found his arms full of totally unexpected bulldog.

"Get out!" whispered Sam tensely, recovering his faculties with a jerk. "Go away!"
'''
In [16]:
print(bulldog)
... clutching out, he found his arms full of totally unexpected bulldog.

"Get out!" whispered Sam tensely, recovering his faculties with a jerk. "Go away!"

Strings created this way are normal strings: they can be indexed, sliced etc.

In [17]:
bulldog[53:71]
Out[17]:
'unexpected bulldog'
In [18]:
bulldog[72]
Out[18]:
'\n'

Docstrings

A docstring is a string which Python interprets as documentation. Any string placed at the start of a function definition will be interpreted as a docstring.

In [19]:
def double(x):
    "Return two times x."
    return(2*x)
In [20]:
double(4)
Out[20]:
8

Docstrings are a very useful convention for human readability, and what's more, they are also treated as documentation by Python itself. If we now use the built-in help() function:

In [21]:
help(double)
Help on function double in module __main__:

double(x)
    Return two times x.

It's a good idea to include a docstring in all your function definitions!

Docstrings can also be used in other Python objects such as class definitions (in object-oriented programming) and in Python scripts themselves (by adding a string at the top of the script file).

Often we'll want to write a longer docstring than will fit on one line. The triple-quote operator is perfect for this.

In [22]:
def double(x):
    """
    Take a numeric value x.
    Return two times x.
    """
    return 2*x

Ideally, a function docstring should tell us:

  • what the function is for
  • what the inputs mean, and the expected types
  • what the output will be, and its type

There is a standard for describing all this information which is followed by numpy, and which is a good convention to follow. That standard is described here. Not only will this make it easy to understand your programs, it also means that you can use tools to automatically produce good documentation from your code. Numpy reference pages, like these, are produced this way. You can even use the example input and output in such a docstring to automatically generate tests for your code.

Exercise 1: Add a docstring to your hamming_distance function to describe what it does. Test it with the help() function.

Exercise 2: Optionally, upgrade your docstring to a numpy-style docstring.

Python-style iteration

We've seen that for loops in Python have the general form

for item in iterable_variable:
    do_something

This means that we can loop over the characters of a string directly, like

In [23]:
for char in "CAT":
    print(char*3)
CCC
AAA
TTT

Or over any kind of list:

In [24]:
odd_numbers = [1, 3, 5, 7, 9, 11]
odd_strings = ["one", "three", "five", "seven", "nine", "eleven"]

for num in odd_strings:
    print(num)
one
three
five
seven
nine
eleven

There are several built-in Python functions that can help take advantage of this kind of iteration.

enumerate() will return an iterable together with its index values, where each item is replaced by a tuple:

In [25]:
for val in enumerate(odd_strings):
    print(val)
(0, 'one')
(1, 'three')
(2, 'five')
(3, 'seven')
(4, 'nine')
(5, 'eleven')

This can be useful when we actually care about the indices:

In [26]:
for (i,num) in enumerate(odd_numbers):
    print("Odd number {0} is {1}".format(i+1, num))
Odd number 1 is 1
Odd number 2 is 3
Odd number 3 is 5
Odd number 4 is 7
Odd number 5 is 9
Odd number 6 is 11

This is a more concise and readable alternative to writing

for i in range(len(odd_numbers)):

and using indexing.

Also notice that we've written

for (i,num) in...

Each item in enumerate(odd_numbers) is a tuple, and this syntax assigns each of these tuples in turn to the tuple of variables (i,num).

Another useful function is zip(). zip() takes multiple lists and turns them into one list of tuples, like so:

In [27]:
zip(odd_numbers, odd_strings)
Out[27]:
[(1, 'one'),
 (3, 'three'),
 (5, 'five'),
 (7, 'seven'),
 (9, 'nine'),
 (11, 'eleven')]

This makes it easy to iterate over multiple lists at once:

In [28]:
for (num,numstring) in zip(odd_numbers, odd_strings):
    print("Number {0} is written {1}.".format(num, numstring))
Number 1 is written one.
Number 3 is written three.
Number 5 is written five.
Number 7 is written seven.
Number 9 is written nine.
Number 11 is written eleven.

Exercise 1:

Write a loop which uses odd_numbers and odd_strings to print out

Odd number 1 is 1, written one.
Odd number 2 is 3, written three.
Odd number 3 is 5, written five.
....

There is more than one way to solve this. Compare your solution to your neighbour's.

Exercise 2:

In our Hamming distance function, we have to loop over two strings simultaneously to compare the characters. So, we currently use a rather clunky construction: we use len() to find the length of the strings, range() to generate numbers, and then we index the strings at every step:

L = len(string1)
for i in range(L):
    if string1[i] != string2[i]:
        distance += 1

Rewrite your Hamming distance function to take advantage of zip() so that you don't need to use string indexing at all.

List comprehensions

List comprehensions are a concise way to create lists in Python, where you would ordinarily have to use a for loop. They act like very compact loops, and in some situations, they can make your program quite clear to read.

The basic form of a list comprehension is:

[f(x) for x in list]

For instance, here is list comprehension that just doubles every number in the original list:

In [29]:
[num*2 for num in odd_numbers]
Out[29]:
[2, 6, 10, 14, 18, 22]

Of course, it's easy to assign the new list to a variable:

In [30]:
doubled_odd_numbers = [num*2 for num in odd_numbers]

To create doubled_odd_numbers using a loop, we would have had to write:

doubled_odd_numbers = []
for num in odd_numbers:
    doubled_odd_numbers.append(2*num)

Which of these makes it most clear what we are trying to do?

Here's another simple example that uses the uppercase conversion method for strings, .upper():

In [31]:
[each_string.upper() for each_string in odd_strings]
Out[31]:
['ONE', 'THREE', 'FIVE', 'SEVEN', 'NINE', 'ELEVEN']

The longer form of list comprehensions includes a conditional, and looks like

[f(x) for x in list if condition]

For instance, all the odd number names that are shorter than five letters long:

In [32]:
[each_string.upper() for each_string in odd_strings if len(each_string) < 5]
Out[32]:
['ONE', 'FIVE', 'NINE']

Or we can leave out the .upper() and just get the value itself, filtered by the condition. We still need the for keyword though, so our list comprehension ends up looking like this:

In [33]:
[each_string for each_string in odd_strings if len(each_string) < 5]
Out[33]:
['one', 'five', 'nine']

Challenge (hard!)

If you have a Hamming distance function that uses zip(), can you think of a way to replace the for loop in your function with a list comprehension?

There are a couple of ways to do this, but it takes some lateral thinking. As a hint, if you're going to produce a list of values, you will need to use a function like len() or sum() to convert them into your final number that gives Hamming distance.

Do you think that this list comprehension makes your program easier to understand, or harder? In general, if you ever find that using a list comprehension is actually making your program less clear rather than more clear, you should consider using a loop instead - it may not be quite as concise, but clarity is much more important.

Error handling: Exceptions

One serious problem with our current solution is the way it handles errors.

We only have a good definition for Hamming distance when the strings are the same length. In our function, we've assumed string1 and string2 will always be the same length. But what if they're not?

In [34]:
hamming_distance("SHORT","THIS_IS_A_LONG_STRING")
Out[34]:
4
In [35]:
hamming_distance("THIS_IS_A_LONG_STRING","SHORT")
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-35-8877eff1c5e5> in <module>()
----> 1 hamming_distance("THIS_IS_A_LONG_STRING","SHORT")

<ipython-input-2-79e0fea0a434> in hamming_distance(string1, string2)
      8     for i in range(L):
      9         # Add 1 to the distance if these two characters are not equal
---> 10         if string1[i] != string2[i]:
     11             distance += 1
     12     # Return the final count of differences

IndexError: string index out of range

This doesn't look very good. We get a different result depending on which order we pass the strings in - that's definitely not something we want. And neither result really makes sense.

In the first case, L=len(string1) took on a value of 5, and our function just compared "SHORT" to "THIS_". It didn't warn us about all the extra, un-matched letters.

In the second case, we get an error message, which is arguably a better thing to happen. It isn't a very informative error message though, especially for someone who's trying to reuse our function and doesn't know the details of how it works: "IndexError: string index out of range".

Stop and think: do you understand what went wrong in this case?

What's happened here is that L=len(string1) took on a value of 21. Since string2 doesn't have 21 letters, the attempt to index string2[i] failed as soon as we reached i==5.

If we are practicing defensive programming, we should check that the inputs to our function are correct, and give an appropriate, informative error message if they're not. This is more useful behaviour than reporting "IndexError", and much more useful behaviour than reporting no errors and just giving the wrong answer.

The standard way to create an error in Python is to raise an Exception. Exceptions are a data type specifically used to represent different types of errors. You've actually seen them already: every time we write a program that crashes, the error message that gets printed out is the result of an Exception. This includes the IndexError message our Hamming distance function gives currently.

We can manually create an error with the raise() command.

In [36]:
raise(Exception("Oh no! Something bad happened!"))
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-36-8f5202952286> in <module>()
----> 1 raise(Exception("Oh no! Something bad happened!"))

Exception: Oh no! Something bad happened!

Here we have created a new Exception with Exception(error_message), and passed it to raise(), which stops the program and prints the error. Exceptions are objects and can be assigned to variables:

In [37]:
my_error = Exception("Oh no!")
print("Nothing bad has happened yet.")
raise(my_error)
Nothing bad has happened yet.
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-37-5f9d4b0bab10> in <module>()
      1 my_error = Exception("Oh no!")
      2 print("Nothing bad has happened yet.")
----> 3 raise(my_error)

Exception: Oh no!

Exercise: Add a test to the start of your hamming_distance() function to test that the length of string1 is the same as the length of string2. If they are different, raise an Exception with an appropriate error message. Test out your new, improved function with some incorrect strings.

Note: There are actually many different subclasses of Exceptions, some of which you've probably seen: IndexError, TypeError, ValueError, etc. You can find the (long) list of Exception types for Python 2 at https://docs.python.org/2/library/exceptions.html . If you'd like, you can make your program even more debugger-friendly by creating the right type of Exception to describe what went wrong - for instance, if hamming_distance() is called with strings of different lengths, that is probably a ValueError. For our purposes today it is fine to just create a generic Exception.

Catching exceptions

A nice side-effect of Python's method of reporting errors is that we can have our program choose what to do about certain kinds of errors, or even ignore them completely, instead of crashing. This is done with a try-except block. We try to execute some code, and if it fails, we can choose what to do about the errors.

The basic form is

try:
    do_something
except:
    handle_errors

The code in the except block is only called if the code in the try block crashes. It's a little like an if-else statement.

For instance, say we want to write a function which calculates the average of a list of numbers and turns it into a string.

In [38]:
def average(numbers):
    """
    Calculate the average of a list of numbers, and return a string representation.
    """
    av = sum(numbers)/len(numbers)
    return str(av)

Since we have a function, we can find the average of many lists, with a loop:

In [39]:
many_lists = [[3,4,5],[4,2,6,3],[4],[],[2,3,4],[7,3,4,6,2],[]]
In [40]:
for number_list in many_lists:
    print(average(number_list))
4.0
3.75
4.0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-40-d6548a04280b> in <module>()
      1 for number_list in many_lists:
----> 2     print(average(number_list))

<ipython-input-38-5c4b88a1efba> in average(numbers)
      3     Calculate the average of a list of numbers, and return a string representation.
      4     """
----> 5     av = sum(numbers)/len(numbers)
      6     return str(av)

ZeroDivisionError: division by zero

The program crashed at the first empty list, because the length of the list was zero. We didn't get to see any results after this one.

Since we're producing a string anyway, maybe we don't want the entire program to crash if a list is empty - we'd just like to return a special string. One way to do this would be to test the length of the list at the start of the function. Another way is to use try-except, and catch the kind of exception we'd like to handle differently: in this case, a ZeroDivisionError.

In [41]:
def average(numbers):
    """
    Calculate the average of a list of numbers, and return a string representation.
    If the list is empty, return "Empty".
    """
    try:
        av = sum(numbers)/len(numbers)
        return str(av)
    except ZeroDivisionError:
        return "Empty"
In [42]:
for number_list in many_lists:
    print(average(number_list))
4.0
3.75
4.0
Empty
3.0
4.4
Empty

Even though an exception was raised (twice!), the program kept running, and it was handled by our except block.

Any kinds of exception other than a ZeroDivisionError will still cause the program to crash, which is a good thing - we want to know if something unexpected goes wrong.

Testing and defensive programming: assertions

As you've probably discovered by now, it's very easy to make mistakes while programming. How can we try to find out if there are still errors lurking in the code we have? And how can we guard against introducing new errors in code as we modify it?

One way is to add assertions to our program so that it checks itself as it runs. An assertion is a statement which will raise an exception, causing the program to stop, if the condition in it is not true. For instance:

In [43]:
# This will do nothing, because the condition is true
assert 5+3==8
In [44]:
# This will stop running, because the condition is false
assert 5+3==10
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-44-daa426fc6a12> in <module>()
      1 # This will stop running, because the condition is false
----> 2 assert 5+3==10

AssertionError: 

Programs like the Firefox browser are littered with assertions: 10-20% of the code they contain are there to check that the other 80-90% are working correctly.

One way we can use assertions is to write tests which our functions should pass. Here's an example function which is supposed to find the number 10 in a list, but which contains bugs.

In [45]:
def find_ten(values):
    """
    Find the index of the first value in the list equal to 10.
    values should be a list of numbers, at least one of which is equal to 10.
    """
    index = 0
    for n in values:
        if n==10:
            return n
        index + 1

We can assert some things about the expected output of this function. If we give it the list [10, -5, 3, 20], we'd expect it to report that 10 is in the first location, index 0. Here are some assertions which our function should pass if it's working correctly.

In [46]:
assert find_ten([10, -5, 3, 20])==0
assert find_ten([20, 32, 10, -5])==2
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-46-a8792ec6866f> in <module>()
----> 1 assert find_ten([10, -5, 3, 20])==0
      2 assert find_ten([20, 32, 10, -5])==2

AssertionError: 

The first assertion failed! Why is this?

In [47]:
find_ten([10, -5, 3, 20])
Out[47]:
10

It looks like we are getting the answer 10 instead of the expected 0.

Exercise: Try to debug the function so that it passes both assertions.

You'll probably find that a bug which doesn't trigger one assertion may trigger the other. Part of writing good tests is thinking of example inputs that could catch different kinds of problems.

But assertions aren't just about catching errors: they also help people understand programs. Each assertion gives the person reading the program a chance to check (consciously or otherwise) that their understanding matches what the code is doing.

Most good programmers follow two rules when adding assertions to their code. The first is, "fail early, fail often". The greater the distance between when and where an error occurs and when it's noticed, the harder the error will be to debug, so good code catches mistakes as early as possible.

The second rule is, "turns bugs into assertions or tests". If you made a mistake in a piece of code, the odds are good that you have made other mistakes nearby, or will make the same mistake (or a related one) the next time you change it. Writing assertions to check that you haven't regressed (i.e., haven't re-introduced an old problem) can save a lot of time in the long run, and helps to warn people who are reading the code (including your future self) that this bit is tricky.