Introductory Programming in Python

This workshop teaches some fundamentals of programming, using Python. We won't have time to cover every programming concept, but it should give you a working introduction.

In this workshop we'll work towards a final goal: write a program in Python to calculate the Hamming distance.

To do this we'll need to learn about:

  • writing functions
  • using variables
  • conditionals (if statements)
  • loops (for statements)
  • lists
  • strings

If you can do this, you'll have learnt lots of important programming concepts, and implemented our first bioinformatics algorithm!

Hamming distance

The Hamming distance between two strings of equal length is the number of positions at which they have different characters.

For instance, compare these strings:

GATTACA
GACTATA

They are different at two locations:

GATTACA
  |  |
GACTATA

So the Hamming distance is 2.

Hamming distance is the simplest of the edit distances that we looked at for string alignment. It tells us the edit distance between two strings if we're only allowed to make substitutions, with no insertions or deletions.

The concept of Hamming distance applies to any kind of string, not just DNA sequences. As an exercise, work out (on paper) the Hamming distance between QUANTUM and QUENTIN.

Short words are easy to do on paper, or in your head. Can you find the Hamming distance between:

ACTGTGCAATACCTAAGTGAAAGGGGTGCATAGATCATTCTTTCGTTACCTCGGGTGCTATGAAGTGGACCATCATTGAGAC
ACTCCTCTGTGTCTAAGTGAAAGGGGTGCTTGCAGGGTAATCCTTCCACCTGATACCGACTCGGGTGGACCATCATTGACGG

This is pretty painful to do by hand. For real analysis you might need to compare strings hundreds or thousands of letters long, and do it again and again.

But this is exactly what computers are for. Once we've written a program to calculate Hamming distance, we'll be able to compare lots of long strings easily.

Our final goal for the workshop is to solve this problem!

Data types and variables

We want to write programs so we can perform calculations on data. For instance, to answer the question above, we want to deal with the strings "GATTACA" and "GACTATA". We can tell the program to keep these strings in memory by assigning them to a variable.

A variable is just a name for a value, such as x, current_temperature, or subject_id. We can create a new variable simply by assigning a value to it using =:

In [1]:
dna_string = "GACTATA"

Once a variable has a value, we can print it:

In [2]:
print dna_string
GACTATA

Notice that we told Python that "GACTATA" is our string data by putting quotes around it. The variable names do not need quotes.

The value we assign to a variable does not need to be a string. It could be a number, for instance:

In [3]:
weight_kg = 55

Just as before, now that the variable has a value, we can print it:

In [4]:
print weight_kg
55

and since it's a number, we can do arithmetic with it:

In [5]:
print 'weight in pounds:', 2.2 * weight_kg
weight in pounds: 121.0

We can also change a variable's value by assigning it a new one:

In [6]:
weight_kg = 57.5
print 'weight in kilograms is now:', weight_kg
weight in kilograms is now: 57.5

As the example above shows, we can print several things at once by separating them with commas.

If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value:

[Figure: Variables as Sticky Notes]

This means that assigning a value to one variable does not change the values of other variables. For example, let's store the subject's weight in pounds in a variable:

In [7]:
weight_lb = 2.2 * weight_kg
print 'weight in kilograms:', weight_kg, 'and in pounds:', weight_lb
weight in kilograms: 57.5 and in pounds: 126.5

[Figure: Creating Another Variable]

and then change weight_kg:

In [8]:
weight_kg = 100.0
print 'weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb
weight in kilograms is now: 100.0 and weight in pounds is still: 126.5

[Figure: Updating a Variable]

Challenges

  1. Draw diagrams showing what variables refer to what values after each statement in the following program:

    mass = 47.5
    age = 122
    mass = mass * 2.0
    age = age - 20
    
  2. Think and answer: what would the following program print out?

    first = 3
    second = 7
    first = 2
    print first, second
    

Data can be of different types, and this is important when programming. For instance, you can do arithmetic with the variable weight_kg because it is a number. If you try to do arithmetic with dna_string you will get an error, as this doesn't make sense for strings.

In [9]:
print (weight_kg / 2)
50.0

In [10]:
print (dna_string / 2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-d57b29287da7> in <module>()
----> 1 print (dna_string / 2)

TypeError: unsupported operand type(s) for /: 'str' and 'int'

In this error message, Python is complaining that the variable was of the wrong type for division: the / operator does not work between a 'str' and an 'int'.

You can see the type of a variable yourself, using Python's built-in type() function.

In [11]:
x = 5
print type(x)
<type 'int'>

This means that x is an integer.

Numbers can also be "floating point" numbers, which means they have a decimal part:

In [12]:
x = 5.5
print type(x)
<type 'float'>

And of course, our data can be strings:

In [13]:
print type(dna_string)
<type 'str'>
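As a small sketch of why types matter for numbers: multiplying an integer by a floating point number gives a floating point result, which is why the weight in pounds earlier was printed with a decimal part. The check below compares types with ==, which is only introduced later in the workshop, and wraps each printed value in parentheses so that the same lines run on both Python 2 and Python 3.

```python
weight_kg = 55                 # an integer, as in the earlier example
weight_lb = 2.2 * weight_kg    # integer times float gives a float

print(type(weight_kg) == int)     # True
print(type(weight_lb) == float)   # True
```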

To solve the Hamming distance problem, we can see we're going to need to deal with strings, since that's the input data. We can also see we're going to need numerical variables, since the output we want from our program is the number of positions which differ.

A bit of terminology: strings are made up of individual characters. For instance, the string "3 dogs!" is made of the characters '3', ' ', 'd', 'o', 'g', 's', and '!'. Notice that the digit, the space and the punctuation count as characters too.

Characters, like strings, should always be enclosed in quotes to distinguish them from variable names and from Python's special keywords.

Python considers characters to be just single-letter strings, so it gives them type 'str':

In [14]:
char = 'C'
print type(char)
<type 'str'>
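The cells above use both single and double quotes. As far as Python is concerned the two are interchangeable, as this minimal sketch illustrates. The variable names are only for illustration, and the == comparison it relies on is covered in the section on conditionals.

```python
single = 'C'
double = "C"

# Both spellings produce the same one-character string.
print(single == double)               # True
print(type(single) == type(double))   # True
```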

Defining a function

Functions are reusable pieces of a program. Let's look at some examples.

We'll start by defining a function celsius_to_kelvin that converts temperatures from degrees Celsius to Kelvin:

In [15]:
def celsius_to_kelvin(celsius_temp):
    kelvin_temp = celsius_temp + 273.15
    return kelvin_temp

The definition opens with the word def, which is followed by the name of the function and a parenthesized list of parameter names. The body of the function—the statements that are executed when it runs—is indented below the definition line, typically by four spaces.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Let's try running our function. Calling our own function is no different from calling any other function:

In [16]:
print 'freezing point of water:', celsius_to_kelvin(0)
print 'boiling point of water:', celsius_to_kelvin(100)
freezing point of water: 273.15
boiling point of water: 373.15
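Following the same pattern, here is a sketch of a function going the other way. kelvin_to_celsius is a hypothetical name, not something defined elsewhere in this workshop; celsius_to_kelvin is repeated so that the example runs on its own.

```python
def celsius_to_kelvin(celsius_temp):
    kelvin_temp = celsius_temp + 273.15
    return kelvin_temp

def kelvin_to_celsius(kelvin_temp):
    # Hypothetical inverse of celsius_to_kelvin.
    celsius_temp = kelvin_temp - 273.15
    return celsius_temp

# The value returned by one function can be passed straight to another.
round_trip = kelvin_to_celsius(celsius_to_kelvin(0))
print(round_trip)   # 0.0
```

The parentheses around the printed value let the same line run on both Python 2 and Python 3.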

Functions can take more than one argument. For instance, here's a function which calculates the difference between two temperatures:

In [17]:
def temperature_difference(temp1, temp2):
    difference = temp1 - temp2
    return difference

When we run this function, the values we give it are assigned (in order) to temp1 and temp2, and then the body of the function is executed, and the value returned.

In [18]:
print temperature_difference(21,12)
9

In [19]:
print temperature_difference(12,21)
-9

A function doesn't have to have a return statement. For instance we might want to write a function that does something directly, like print to the screen. It doesn't need to return anything, so there's no need for a return statement.

In [20]:
def write_announcement(winner):
    print "We are glad to announce that",winner,"has won the contest!"
    print "Please congratulate",winner,"!"

In [21]:
write_announcement("Clare")
We are glad to announce that Clare has won the contest!
Please congratulate Clare !

Because this function has no return statement, it actually returns the special value None. None is not a string; it is a special value, with a type of its own, that represents the absence of a value.

In [22]:
x = write_announcement("Andrew")
print x
We are glad to announce that Andrew has won the contest!
Please congratulate Andrew !
None
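The difference between printing and returning is easy to miss, so here is a small sketch. Both function names are made up for illustration, and print is written with parentheses around a single value so that the code runs on Python 2 and Python 3 alike.

```python
def area_returned(width, height):
    # Hands the result back to the caller.
    return width * height

def area_printed(width, height):
    # Shows the result on screen, but has no return statement.
    print(width * height)

total = area_returned(2, 3) + 1   # a returned value can be used in further arithmetic
nothing = area_printed(2, 3)      # prints 6; nothing is None

print(total)             # 7
print(nothing is None)   # True ("is None" is the usual way to test for None)
```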

Challenge

Define a function, and name it double. Write the function so that it takes in a single number as a parameter, and returns double that number. Once you've defined the function, some example output should look like this:

print double(3)
6
print double(532)
1064

Conditionals: making choices

We want to be able to write programs that do different things depending on what data they are given.

For instance, to be able to calculate the Hamming distance, we need to do something different depending on whether each letter does or does not match the corresponding letter in the other string.

The tool Python gives us for doing this is called a conditional statement, and looks like this:

In [23]:
num = 37
if num > 100:
    print 'greater'
else:
    print 'not greater'
print 'done'
not greater
done

The second line of this code uses the keyword if to tell Python that we want to make a choice. If the test that follows it is true, the body of the if (i.e., the lines indented underneath it) is executed. If the test is false, the body of the else is executed instead. Only one or the other is ever executed:

[Figure: Executing a Conditional]

Conditional statements don't have to include an else. If there isn't one, Python simply does nothing if the test is false:

In [24]:
num = 53
print 'before conditional...'
if num > 100:
    print '53 is greater than 100'
print '...after conditional'
before conditional...
...after conditional

We can also chain several tests together using elif, which is short for "else if". This makes it simple to write a function that returns the sign of a number:

In [25]:
def sign(num):
    if num > 0:
        return 1
    elif num == 0:
        return 0
    else:
        return -1

print "sign of -3 is", sign(-3)
sign of -3 is -1

One important thing to notice in the code above is that we use a double equals sign == to test for equality rather than a single equals sign, because the latter is used to mean assignment. So

num = 5

changes num to be equal to 5, while

num == 5

doesn't change num at all, but evaluates to True or False depending on whether num is equal to 5.
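A minimal sketch of the two side by side. The names is_five and is_seven are only for illustration, and print is written with parentheses around a single value so that it runs on both Python 2 and Python 3.

```python
num = 5                  # assignment: num now refers to 5
is_five = (num == 5)     # comparison: the result is stored in is_five
is_seven = (num == 7)    # another comparison, this time false

print(is_five)    # True
print(is_seven)   # False
print(num)        # 5, the comparisons did not change num
```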

The operator for "is not equal" is also not obvious: it's !=, like this:

In [26]:
num = 5
if num != 37:
    print "The number is not 37"
The number is not 37

We can also combine tests using and and or. and is only true if both parts are true:

In [27]:
if (1 > 0) and (-1 > 0):
    print 'both parts are true'
else:
    print 'one part is not true'
one part is not true

while or is true if either part is true:

In [28]:
if (1 < 0) or ('left' < 'right'):
    print 'at least one test is true'
at least one test is true

In this case, "either" means "either or both", not "either one or the other but not both".

True and False

Python has special values called True and False, which are the results we get when we evaluate a comparison. For our first example, we wrote an if statement which looked like

num = 37
if num > 100:
    ...

Of course, num was not more than 100 in this case. What if we look directly at the result of that comparison?

In [29]:
num = 37
print (num > 100)
False

False is not a string - it's a special Python value that literally means false, or not true. There's also True:

In [30]:
num = 121
print (num > 100)
True
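Because True and False are ordinary values, they can be stored in variables and combined with and and or. A minimal sketch, with variable names chosen only for illustration:

```python
num = 121
is_big = num > 100       # True
is_negative = num < 0    # False

print(is_big and is_negative)   # False: only one part is true
print(is_big or is_negative)    # True: at least one part is true
print(is_big or (num == 121))   # True: "or" also accepts both parts being true
```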

Challenge

  1. To calculate the Hamming distance, we will need to be able to test if characters are the same. In fact the simplest Hamming distance calculation is between just two characters, for instance:

    G
    G
    

    Here the characters are the same, so the Hamming distance is zero.

    G
    T
    

    Here the characters are different, so the Hamming distance is 1.

    Define a function called character_distance which takes two parameters and returns 1 if the two characters are different, and 0 if they are the same. Some example output should look like:

    print character_distance("G", "G")
    0
    
    print character_distance("A", "T")
    1
    

Calculating many things: for loops

A big advantage of writing a program is that we only need to write it once, but can run it again and again, saving effort.

If we need to run an algorithm thousands or millions of times, we don't want to have to type out each call by hand. So, we need to be able to tell the computer to repeat things.

Suppose we want to print each character in the sequence "GACT" on a line of its own. We can specify a character of a string using square brackets []. If the variable that holds the string is called s, then s[0] is the first character, s[1] is the second character, and so on. This is called string indexing.
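A minimal sketch of string indexing, assuming a four-character string like the one above. print is written with parentheses around a single value so that it runs on both Python 2 and Python 3.

```python
s = "GACT"

first = s[0]    # counting starts at zero, so this is 'G'
second = s[1]   # 'A'
last = s[3]     # the fourth and final character, 'T'

print(first)    # G
print(second)   # A
print(last)     # T
```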

So, one way to print out each character of the string would be to use four print statements:

In [31]:
def print_characters(s):
    print s[0]
    print s[1]
    print s[2]
    print s[3]

In [32]:
print_characters('GACT')
G
A
C
T

but that's a bad approach for two reasons:

  1. It doesn't scale: if we want to print the characters in a string that's hundreds of letters long, we'd be better off just typing them in.

  2. It's fragile: if we give it a longer string, it only prints part of the data, and if we give it a shorter one, it produces an error because we're asking for characters that don't exist.

In [33]:
print_characters("CAT")
C
A
T

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-33-35bac565a5ec> in <module>()
----> 1 print_characters("CAT")

<ipython-input-31-d4c50380fc24> in print_characters(s)
      3     print s[1]
      4     print s[2]
----> 5     print s[3]

IndexError: string index out of range

Here's a better approach:

In [34]:
def print_characters(s):
    for char in s:
        print char

In [35]:
print_characters('GACT')
G
A
C
T

This is shorter (certainly shorter than something that prints every character in a hundred-letter string) and more robust as well:

In [36]:
print_characters('oxygen')
o
x
y
g
e
n

You might be able to see how this ability to do something with each character in turn, no matter how long the string, will be useful in solving our Hamming distance problem!

The improved version of print_characters uses a for loop to repeat an operation (in this case, printing) once for each thing in a collection. The general form of a loop is:

for variable in collection:
    do things with variable

We can call the loop variable anything we like, but there must be a colon at the end of the line starting the loop, and we must indent the body of the loop.

Here's another loop that repeatedly updates a variable:

In [37]:
length = 0
for vowel in "aeiou":
    length = length + 1
print "There are", length, "vowels"
There are 5 vowels

It's worth tracing the execution of this little program step by step. Since there are five characters in 'aeiou', the statement on line 3 will be executed five times. The first time around, length is zero (the value assigned to it on line 1) and vowel is 'a'. The statement adds 1 to the old value of length, producing 1, and updates length to refer to that new value. The next time around, vowel is 'e' and length is 1, so length is updated to be 2. After three more updates, length is 5; since there is nothing left in 'aeiou' for Python to process, the loop finishes and the print statement on line 4 tells us our final answer.
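To watch that trace happen, one option is to print the variables inside the loop body. This is only a sketch: it uses the built-in str() function, which has not been introduced in this workshop, to turn a number into a string so it can be joined to other strings with +.

```python
length = 0
for vowel in "aeiou":
    length = length + 1
    # Runs once per character, showing the loop variable and the running count.
    print(vowel + " -> length is " + str(length))

print("There are " + str(length) + " vowels")
```

Each pass through the loop prints one line, so the output shows length going from 1 up to 5.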

Note that a loop variable is just a variable that's being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:

In [38]:
letter = 'z'
for letter in 'abc':
    print letter
print 'after the loop, letter is', letter
a
b
c
after the loop, letter is c

Note also that finding the length of a string is such a common operation that Python actually has a built-in function to do it called len:

In [39]:
print len("aeiou")
5

len is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven't met yet, so we should always use it when we can.

Challenge (hard)

Loops can also be nested: if you put a for loop inside a for loop, every pass through the outer loop will do an entire run-through of all values of the inner loop.

Using nested loops, write a Python program to write out every two-letter combination of A,C,G and T. The output should look like:

A A
A C
A G
A T
C A
C C
C G
C T
G A
G C
G G
G T
T A
T C
T G
T T

Let's think about for loops for solving our Hamming distance problem. We saw above how to loop over the characters in a string:

In [40]:
def print_characters(my_string):
    for char in my_string:
        print char

print_characters("GACT")
G
A
C
T

For Hamming distance, we need to compare each character in one string to the corresponding character in the other. We might think about using a loop to go through each character in turn. Given

GATTACA
GACTATA

we could use a loop to compare G to G, then A to A, then T to C, and so on.

But to do this we need to be able to loop over both strings at once. How can we do this? To understand one way to do it, we'll first learn about lists.

Lists and loops

Just as a for loop is a way to do operations many times, a list is a way to store many values. We create a list by putting values inside square brackets:

In [41]:
odds = [1, 3, 5, 7, 9, 11]
print 'odd numbers are:', odds
odd numbers are: [1, 3, 5, 7, 9, 11]

Just like characters in a string, we select individual elements from lists by indexing them:

In [42]:
print 'first and last:', odds[0], odds[-1]
first and last: 1 11

In [43]:
print 'first, third and fourth:', odds[0], odds[2], odds[3]
first, third and fourth: 1 5 7

and if we loop over a list, the loop variable is assigned elements one at a time:

In [44]:
for number in odds:
    print number
1
3
5
7
9
11

There is one important difference between lists and strings: we can change the values in a list, but we cannot change the characters in a string. For example:

In [45]:
names = ['Newton', 'Darwing', 'Turing'] # typo in Darwin's name
print 'names is originally:', names
names[1] = 'Darwin' # correct the name
print 'final value of names:', names
names is originally: ['Newton', 'Darwing', 'Turing']
final value of names: ['Newton', 'Darwin', 'Turing']

works, but this doesn't:

In [46]:
name = 'Bell'
name[0] = 'b'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-220df48aeb2e> in <module>()
      1 name = 'Bell'
----> 2 name[0] = 'b'

TypeError: 'str' object does not support item assignment

Python has a useful built-in function called range that creates a list of the first N integers, starting from zero.

In [47]:
print range(3)
[0, 1, 2]

In [48]:
print range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

This is particularly useful with for loops! For instance to write something out 5 times:

In [49]:
for i in range(5):
    print "Hello!"
Hello!
Hello!
Hello!
Hello!
Hello!
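The loop variable from range can also be used as an index. The sketch below adds up the first three elements of the odds list from earlier (repeated here so the code runs on its own). One caveat: in Python 3, range returns a range object rather than a list, so print(list(range(3))) is needed there to see [0, 1, 2]; looping over it works the same way in both versions.

```python
odds = [1, 3, 5, 7, 9, 11]

total = 0
for i in range(3):
    # i takes the values 0, 1, 2, so odds[i] is 1, then 3, then 5
    total = total + odds[i]

print(total)   # 9
```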

Challenges

  1. Using range, write a function that prints the first N natural numbers, i.e. counting from 1 up to N:

    print_N(3)
    1
    2
    3
    
  2. Write a function which takes two strings, and prints out the first characters of both strings on the first line, the second characters of both strings on the second line, and so on.

    print_pairs("CAT","DOG")
    C D
    A O
    T G
    

    Hints:

    • remember that you can get the length of a string using the built-in len() function
    • remember that you can use a character from a string by indexing it, like s[0]

Putting it all together: nesting loops and conditionals

Another thing to realize is that if and for statements can be combined with one another just as easily as they can be combined with functions. For example, if we want to sum the positive numbers in a list, we can write this:

In [50]:
numbers = [-5, 3, 2, -1, 9, 6]
total = 0
for n in numbers:
    if n >= 0:
        total = total + n
print 'sum of positive values:', total
sum of positive values: 20

We could equally well calculate the positive and negative sums in a single loop:

In [51]:
pos_total = 0
neg_total = 0
for n in numbers:
    if n >= 0:
        pos_total = pos_total + n
    else:
        neg_total = neg_total + n
print 'negative and positive sums are:', neg_total, pos_total
negative and positive sums are: -6 20
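The same pattern works for counting rather than summing: add 1 each time the test is true. A sketch, reusing the numbers list from above (the name neg_count is just for illustration):

```python
numbers = [-5, 3, 2, -1, 9, 6]

neg_count = 0
for n in numbers:
    if n < 0:
        neg_count = neg_count + 1   # count the value instead of adding it

print(neg_count)   # 2
```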

We can even put one loop inside another:

In [52]:
for consonant in 'bcd':
    for vowel in 'ae':
        print consonant + vowel
ba
be
ca
ce
da
de

As the diagram below shows, the inner loop runs from start to finish each time the outer loop runs once:

[Figure: Execution of Nested Loops]
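One way to check this is to count how many times the innermost statement runs. With three consonants and two vowels it should run 3 x 2 = 6 times. A minimal sketch:

```python
count = 0
for consonant in 'bcd':
    for vowel in 'ae':
        # Runs once for every (consonant, vowel) pair.
        count = count + 1

print(count)   # 6
```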

Challenge

Write a function to count the number of 'C's and 'G's in a DNA sequence. Some sample output should look like:

print count_CG("GATTACA")
2
print count_CG("ACAAAACAGCGAACACTCGC")
10

Final challenge!

Write a function which takes in two strings as parameters, and returns the Hamming distance between them. You can assume the input strings are the same length.

Some sample output should look like this (you can cut and paste these strings to check you get the same answers):

print hamming_distance("GATTACA","GACTATA")
2
print hamming_distance("CAT","DOG")
3
print hamming_distance("CAT", "ACT")
2
print hamming_distance("GTTCTTGGACGACGAAAAGA", "GTTCGTGAGGCCACATCCCG")
13
print hamming_distance("GTGCTTCCAGTCACGCTGTCTTGGGGTAGC", "TCGAGCAGTATCACATTACTAAGGATGTGC")
20
print hamming_distance("GCGAGCCCAGTCACGCTGTCTTGGGGTAGC", "TCGAGCAGTATCACGCTACTAAGGGGGTGC")
12

If you can get this working, well done! This is our first bioinformatics algorithm.