Introductory Programming in Python

This workshop aims to teach some fundamentals of programming, using Python. We won't have time to cover all the concepts involved in programming, but we will give an introduction.

In this workshop we'll work towards a final goal: write a program in Python to calculate the Hamming distance.

To do this we'll need to learn about:

  • writing functions
  • using variables
  • conditionals (if statements)
  • loops (for statements)
  • lists
  • strings

If you can do this, you'll have learnt lots of important programming concepts, and implemented our first bioinformatics algorithm!

Note: This workshop has been written for Python 3 (the f-string examples need Python 3.6 or later, and will not run on Python 2). If you are using Python 2, you should be able to make the other examples work by first running the Python command:

from __future__ import print_function, division

Hamming distance

The Hamming distance between two strings is the number of positions at which they have different characters.

For instance, compare these strings:

GATTACA
GACTATA

They are different at two locations:

GATTACA
  |  |
GACTATA

So the Hamming distance is 2.

Hamming distance is the simplest edit distance metric that we will look at for string alignment. It tells us the edit distance between two strings if we're only allowed to make substitutions - no insertions or deletions.

The concept of Hamming distance applies to any kind of string, not just DNA sequences. As an exercise, work out (on paper) the Hamming distance between QUANTUM and QUENTIN.

Short words are easy to do on paper, or in your head. Can you find the Hamming distance between:

ACTGTGCAATACCTAAGTGAAAGGGGTGCATAGATCATTCTTTCGTTACCTCGGGTGCTATGAAGTGGACCATCATTGAGAC
ACTCCTCTGTGTCTAAGTGAAAGGGGTGCTTGCAGGGTAATCCTTCCACCTGATACCGACTCGGGTGGACCATCATTGACGG

This is pretty painful to do by hand. For real analysis you might need to compare strings hundreds or thousands of letters long, and do it again and again.

But this is exactly what computers are for. Once we've written a program to calculate Hamming distance, we'll be able to compare lots of long strings easily.

Our final goal for the workshop is to solve this problem!

Data types and variables

We want to write programs so we can perform calculations on data. For instance, to answer the question above, we want to deal with the strings "GATTACA" and "GACTATA". We can tell the program to keep these strings in memory by assigning each of them to a variable.

A variable is just a name for a value, such as x, current_temperature, or subject_id. We can create a new variable simply by assigning a value to it using =:

In [1]:
dna_string = "GACTATA"

Once a variable has a value, we can print it:

In [2]:
print(dna_string)
GACTATA

Notice that we told Python that "GACTATA" is our string data by putting quotes around it. The variable names do not need quotes.
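
To make the difference between a variable name and a string concrete, here is a small sketch that prints both. It reuses the dna_string variable from above and adds nothing else.

```python
dna_string = "GACTATA"   # the name dna_string now refers to the text GACTATA

print(dna_string)        # no quotes: Python looks up the variable and prints GACTATA
print("dna_string")      # quotes: Python prints the literal text dna_string
```

Leaving out the quotes by accident is a common slip: Python then goes looking for a variable with that name instead of treating the text as data.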

The value we assign to a variable does not need to be a string. It could be a number, for instance:

In [3]:
weight_kg = 55

Just as before, now that the variable has a value, we can print it:

In [4]:
print(weight_kg)
55

and since it's a number, we can do arithmetic with it:

In [5]:
print('weight in pounds:', 2.2 * weight_kg)
weight in pounds: 121.00000000000001

We can also change a variable's value by assigning it a new one:

In [6]:
weight_kg = 57.5
print('weight in kilograms is now:', weight_kg)
weight in kilograms is now: 57.5

As the example above shows, we can print several things at once by separating them with commas.

If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value:

[Figure: Variables as Sticky Notes]

This means that assigning a value to one variable does not change the values of other variables. For example, let's store the subject's weight in pounds in a variable:

In [7]:
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
weight in kilograms: 57.5 and in pounds: 126.50000000000001

[Figure: Creating Another Variable]

and then change weight_kg:

In [8]:
weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)
weight in kilograms is now: 100.0 and weight in pounds is still: 126.50000000000001

[Figure: Updating a Variable]

Challenges

  1. Draw diagrams showing what variables refer to what values after each statement in the following program:

    mass = 47.5
    age = 122
    mass = mass * 2.0
    age = age - 20
    
  2. Think and answer: what would the following program print out?

    first = 3
    second = 7
    first = 2
    print(first, second)
    

Data can be of different types, and this is important when programming. For instance, you can do arithmetic with the variable weight_kg because it is a number. If you try to do arithmetic with dna_string you will get an error, as this doesn't make sense for strings.

In [9]:
print(weight_kg / 2)
50.0
In [10]:
print(dna_string / 2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-dc5dcd030843> in <module>()
----> 1 print(dna_string / 2)

TypeError: unsupported operand type(s) for /: 'str' and 'int'

You can see that in this error message, Python is complaining that the variable was of the wrong type to do division.

You can see the type of a variable yourself, using Python's built-in type() function.

In [11]:
x = 5
print(type(x))
<class 'int'>

This means that x is an integer.

Numbers can also be "floating point" numbers, which means they have a decimal part:

In [12]:
x = 5.5
print(type(x))
<class 'float'>

And of course, our data can be strings:

In [13]:
print(type(dna_string))
<class 'str'>

To solve the Hamming distance problem, we can see we're going to need to deal with strings, since that's the input data. We can also see we're going to need numerical variables, since the output we want from our program is the number of positions which differ.
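
A type belongs to a value rather than to a variable name, so the same name can refer to an integer at one moment and a float the next, as weight_kg did earlier. A minimal sketch using the same values as before:

```python
weight_kg = 55
print(type(weight_kg))   # <class 'int'>

weight_kg = 57.5         # same name, but the new value has a different type
print(type(weight_kg))   # <class 'float'>
```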

A bit of terminology: strings are made of single-letter characters. For instance, the string "3 dogs!" is made of the characters '3', ' ', 'd', 'o', 'g', 's', and '!'. Notice that the space and the punctuation count as characters too.

Characters, like strings, should always be enclosed in quotes to distinguish them from variable names and from Python's special keywords.

Python considers characters to be just single-letter strings, so it gives them type 'str':

In [14]:
char = 'C'
print(type(char))
<class 'str'>

Defining a function

Functions are reusable pieces of a program. Let's look at some examples.

We'll start by defining a function celsius_to_kelvin that converts temperatures from degrees Celsius to Kelvin:

In [15]:
def celsius_to_kelvin(celsius_temp):
    kelvin_temp = celsius_temp + 273.15
    return kelvin_temp

The definition opens with the word def, which is followed by the name of the function and a parenthesized list of parameter names. The body of the function—the statements that are executed when it runs—is indented below the definition line, typically by four spaces.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Let's try running our function. Calling our own function is no different from calling any other function:

In [16]:
print('freezing point of water:', celsius_to_kelvin(0))
print('boiling point of water:', celsius_to_kelvin(100))
freezing point of water: 273.15
boiling point of water: 373.15
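
Because a function call produces a value, one function can be used inside another. The sketch below is illustrative only: fahrenheit_to_celsius and fahrenheit_to_kelvin are hypothetical helpers that are not part of the workshop, and celsius_to_kelvin is repeated so the example runs on its own.

```python
def celsius_to_kelvin(celsius_temp):
    kelvin_temp = celsius_temp + 273.15
    return kelvin_temp

# Hypothetical helper: convert degrees Fahrenheit to degrees Celsius.
def fahrenheit_to_celsius(fahrenheit_temp):
    celsius_temp = (fahrenheit_temp - 32) * 5 / 9
    return celsius_temp

# Hypothetical helper: reuse the two functions above instead of
# writing the arithmetic out a second time.
def fahrenheit_to_kelvin(fahrenheit_temp):
    celsius_temp = fahrenheit_to_celsius(fahrenheit_temp)
    return celsius_to_kelvin(celsius_temp)

print('freezing point of water in Kelvin:', fahrenheit_to_kelvin(32))
```

Building larger functions out of small ones like this keeps each piece short and easy to check.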

Functions can take more than one argument. For instance, here's a function which calculates the difference between two temperatures:

In [17]:
def temperature_difference(temp1, temp2):
    difference = temp1 - temp2
    return difference

When we run this function, the values we give it are assigned (in order) to temp1 and temp2, and then the body of the function is executed, and the value returned.

In [18]:
print(temperature_difference(21,12))
9
In [19]:
print(temperature_difference(12,21))
-9

A function doesn't have to have a return statement. For instance we might want to write a function that does something directly, like print to the screen. It doesn't need to return anything, so there's no need for a return statement.

In [20]:
def write_announcement(winner):
    print(f"We are glad to announce that {winner} has won the contest!")
    print(f"Please congratulate {winner}!")
In [21]:
write_announcement("Clare")
We are glad to announce that Clare has won the contest!
Please congratulate Clare!

By the way, the letter f before the string, and the {winner} variable inside the string, are an example of "f-strings", which were introduced in Python 3.6. You may also see the .format() method used for string formatting. We won't go into this just yet, but you can ask your demonstrator if you're interested.

Because our write_announcement function has no return statement, it will actually return the special value None. None is not a string; it is a special value that represents having no value at all.

In [22]:
result = write_announcement("Andrew")
print(result)
We are glad to announce that Andrew has won the contest!
Please congratulate Andrew!
None
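
The difference between printing a result and returning it matters as soon as we want to use the result afterwards. The sketch below contrasts the two; show_kelvin is a hypothetical function made up for this illustration.

```python
# Hypothetical function: prints the converted temperature but returns nothing.
def show_kelvin(celsius_temp):
    print(celsius_temp + 273.15)

# Returns the converted temperature so the caller can keep using it.
def celsius_to_kelvin(celsius_temp):
    return celsius_temp + 273.15

shown = show_kelvin(0)            # prints 273.15 as a side effect
converted = celsius_to_kelvin(0)  # prints nothing

print(shown)       # None: there is no value to work with
print(converted)   # 273.15: a number we can do arithmetic with
```

If a calculation gives None where a number was expected, a missing return statement is a likely cause.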

Challenge

Define a function, and name it double. Write the function so that it takes in a single number as a parameter, and returns double that number. Once you've defined the function, some example output should look like this:

print(double(3))
6
print(double(532))
1064

Conditionals: making choices

We want to be able to write programs that do different things depending on what data they are given.

For instance, to be able to calculate the Hamming distance, we need to do something different depending on whether each letter does or does not match the corresponding letter in the other string.

The tool Python gives us for doing this is called a conditional statement, and looks like this:

In [23]:
num = 37
if num > 100:
    print('greater')
else:
    print('not greater')
print('done')
not greater
done

The second line of this code uses the keyword if to tell Python that we want to make a choice. If the test that follows it is true, the body of the if (i.e., the lines indented underneath it) is executed. If the test is false, the body of the else is executed instead. Only one or the other is ever executed:

[Figure: Executing a Conditional]

Conditional statements don't have to include an else. If there isn't one, Python simply does nothing if the test is false:

In [24]:
num = 53
print('before conditional...')
if num > 100:
    print('53 is greater than 100')
print('...after conditional')
before conditional...
...after conditional

We can also chain several tests together using elif, which is short for "else if". This makes it simple to write a function that returns the sign of a number:

In [25]:
def sign(num):
    if num > 0:
        return 1
    elif num == 0:
        return 0
    else:
        return -1

print("sign of -3 is", sign(-3))
sign of -3 is -1

One important thing to notice in the code above is that we use a double equals sign == to test for equality rather than a single equals sign, because the latter is used to mean assignment. So

num = 5

changes num to be equal to 5, while

num == 5

doesn't change num at all, but returns True or False depending on whether num is equal to 5.

The operator for "is not equal" is also not obvious: it's !=, like this:

In [26]:
num = 5
if num != 37:
    print("The number is not 37")
The number is not 37

We can also combine tests using and and or. and is only true if both parts are true:

In [27]:
if (1 > 0) and (-1 > 0):
    print('both statements are true')
else:
    print('one statement is not true')
one statement is not true

or is true if either part is true:

In [28]:
if (1 < 0) or ('left' < 'right'):
    print('at least one test is true')
at least one test is true

For or statements, "either" means "either or both", not "either one or the other but not both".

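By the way, why is "left" < "right"? When we ask Python to compare numbers, it considers the smaller number to be less than the bigger number; when we ask Python to compare strings, it compares them character by character, much like alphabetical ordering in a dictionary. "l" comes before "r" in the alphabet, so this comparison returns True. (One catch: all uppercase letters are ordered before all lowercase letters, so "Z" < "a" is True.)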
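
A few more string comparisons show how the ordering works in practice. This is a small sketch; the variable names are only there to label each case.

```python
lowercase_first = 'left' < 'right'   # 'l' comes before 'r'
mixed_case = 'apple' < 'Zebra'       # uppercase 'Z' is ordered before lowercase 'a'
dna_order = 'GAT' < 'GCT'            # first characters match, so 'A' < 'C' decides

print(lowercase_first, mixed_case, dna_order)   # True False True
```

The mixed-case result is worth remembering: comparing strings that differ in capitalisation does not give plain dictionary order.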

True and False

Python has special values called True and False, which are what a comparison produces when it is evaluated. For our first example, we wrote an if statement which looked like

num = 37
if num > 100:
    ...

Of course, num was not more than 100 in this case. What if we look directly at the result of that comparison?

In [29]:
num = 37
print(num > 100)
False

False is not a string - it's a special Python value that literally means false, or not true. There's also True:

In [30]:
num = 121
print(num > 100)
True
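
Since True and False are ordinary values, the result of a comparison can be stored in a variable and tested later. A minimal sketch, where is_large is an illustrative name:

```python
num = 121
is_large = num > 100   # the comparison is evaluated once and its result stored

print(is_large)        # True
if is_large:
    print('num is greater than 100')

print(num)             # 121: comparing did not change num
```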

Challenge

  1. To calculate the Hamming distance, we will need to be able to test if characters are the same. In fact the simplest Hamming distance calculation is between just two characters, for instance:

    G
    G
    
    

    Here the characters are the same, so the Hamming distance is zero.

    G
    T
    
    

    Here the characters are different, so the Hamming distance is 1.

    Define a function called character_distance which takes two parameters and returns 1 if the two characters are different, and 0 if they are the same. Some example output should look like:

    print(character_distance("G", "G"))
    0
    
    print(character_distance("A", "T"))
    1
    

Calculating many things: for loops

A big advantage of writing a program is that we only need to write it once, but can run it again and again, saving effort.

If we need to run an algorithm thousands or millions of times, we don't want to have to type out a call each time. So, we need to be able to tell the computer to repeat things.

Suppose we want to print each character in the sequence "GACT" on a line of its own. We can specify a character of a string using square brackets []. If the variable that holds the string is called s, then s[0] is the first character, s[1] is the second character, and so on. This is called string indexing.

So, one way to print out each character of the string would be to use four print statements:

In [31]:
def print_characters(s):
    print(s[0])
    print(s[1])
    print(s[2])
    print(s[3])
In [32]:
print_characters('GACT')
G
A
C
T

but that's a bad approach for two reasons:

  1. It doesn't scale: if we want to print the characters in a string that's hundreds of letters long, we'd be better off just typing them in.

  2. It's fragile: if we give it a longer string, it only prints part of the data, and if we give it a shorter one, it produces an error because we're asking for characters that don't exist.

In [33]:
print_characters("CAT")
C
A
T
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-33-35bac565a5ec> in <module>()
----> 1 print_characters("CAT")

<ipython-input-31-3846d08d71a9> in print_characters(s)
      3     print(s[1])
      4     print(s[2])
----> 5     print(s[3])

IndexError: string index out of range

Here's a better approach:

In [34]:
def print_characters(s):
    for char in s:
        print(char)
In [35]:
print_characters('GACT')
G
A
C
T

This is shorter (certainly shorter than something that prints every character in a hundred-letter string) and more robust as well:

In [36]:
print_characters('oxygen')
o
x
y
g
e
n

You might be able to see how this ability to do something with each character in turn, no matter how long the string, will be useful in solving our Hamming distance problem!

The improved version of print_characters uses a for loop to repeat an operation (in this case, printing) once for each thing in a collection. The general form of a loop is:

for variable in collection:
    do things with variable

We can call the loop variable anything we like, but there must be a colon at the end of the line starting the loop, and we must indent the body of the loop.

Here's another loop that repeatedly updates a variable:

In [37]:
count = 0
for vowel in "aeiou":
    count = count + 1
print("There are",count,"vowels")
There are 5 vowels

It's worth tracing the execution of this little program step by step. Since there are five characters in 'aeiou', the statement on line 3 will be executed five times. The first time around, count is zero (the value assigned to it on line 1) and vowel is 'a'. The statement adds 1 to the old value of count, producing 1, and updates count to refer to that new value. The next time around, vowel is 'e' and count is 1, so count is updated to be 2. After three more updates, count is 5; since there is nothing left in 'aeiou' for Python to process, the loop finishes and the print statement on line 4 tells us our final answer.
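
One way to check a trace like this is to let the program report its own progress. The sketch below is the same loop with one extra print statement inside the body, so each pass shows the current values of vowel and count.

```python
count = 0
for vowel in "aeiou":
    count = count + 1
    # Extra line for tracing: runs once per pass through the loop.
    print("vowel is", vowel, "and count is now", count)
print("There are", count, "vowels")
```

Adding a temporary print inside a loop is a simple and effective way to find out what a program is really doing.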

Note that a loop variable is just a variable that's being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:

In [38]:
letter = 'z'
for letter in 'abc':
    print(letter)
print('after the loop, letter is', letter)
a
b
c
after the loop, letter is c

Note also that finding the length of a string is such a common operation that Python actually has a built-in function to do it called len:

In [39]:
print(len("aeiou"))
5

Built-in functions like len are often much faster than functions we write ourselves, and much easier to read; len will also give us the length of many other things that we haven't met yet, so it is the better option here.
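
For the Hamming distance, len gives a quick way to check that two strings are the same length before comparing them. A small sketch using the example strings from earlier; the variable names first and second are only illustrative.

```python
first = "GATTACA"
second = "GACTATA"

if len(first) == len(second):
    print("same length:", len(first))   # same length: 7
else:
    print("different lengths:", len(first), len(second))
```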

Challenge (hard)

Loops can also be nested: if you put a for loop inside a for loop, every pass through the outer loop will do an entire run-through of all values of the inner loop.

Using nested loops, write a Python program to write out every two-letter combination of A, C, G and T. The output should look like:

A A
A C
A G
A T
C A
C C
C G
C T
G A
G C
G G
G T
T A
T C
T G
T T

Let's think about for loops for solving our Hamming distance problem. We saw above how to loop over the characters in a string:

In [40]:
def print_characters(my_string):
    for char in my_string:
        print(char)

print_characters("GACT")
G
A
C
T

For Hamming distance, we need to compare each character in one string to the corresponding character in the other. We might think about applying a loop to go through each character in turn. Given

GATTACA
GACTATA

we could use a loop to compare G to G, then A to A, then T to C, and so on.

But to do this we need to be able to loop over both strings at once. How can we do this? To understand one way to do it, we'll first learn about lists.

Lists and loops

Just as a for loop is a way to do operations many times, a list is a way to store many values. We create a list by putting values inside square brackets:

In [41]:
odds = [1, 3, 5, 7, 9, 11]
print('odd numbers are:', odds)
odd numbers are: [1, 3, 5, 7, 9, 11]

Just like characters in a string, we select individual elements from lists by indexing them:

In [42]:
print('first and last:', odds[0], odds[-1])
first and last: 1 11
In [43]:
print('first, third and fourth:', odds[0], odds[2], odds[3])
first, third and fourth: 1 5 7

and if we loop over a list, the loop variable is assigned elements one at a time:

In [44]:
for number in odds:
    print(number)
1
3
5
7
9
11

There is one important difference between lists and strings: we can change the values in a list, but we cannot change the characters in a string. For example:

In [45]:
names = ['Newton', 'Darwing', 'Turing'] # typo in Darwin's name
print('names is originally:', names)
names[1] = 'Darwin' # correct the name
print('final value of names:', names)
names is originally: ['Newton', 'Darwing', 'Turing']
final value of names: ['Newton', 'Darwin', 'Turing']

works, but this doesn't:

In [46]:
name = 'Bell'
name[0] = 'b'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-220df48aeb2e> in <module>()
      1 name = 'Bell'
----> 2 name[0] = 'b'

TypeError: 'str' object does not support item assignment
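
Although a string cannot be changed in place, we can always build a new string and give it a name. The sketch below is one possible way to do this using only a loop and a conditional; it relies on the fact that + joins two strings together, and new_name is an illustrative variable name.

```python
name = 'Bell'
new_name = ''                        # start with an empty string
for char in name:
    if char == 'B':
        new_name = new_name + 'b'    # swap in the lowercase letter
    else:
        new_name = new_name + char   # keep every other character unchanged

print(name, new_name)                # Bell bell
```

The original string is untouched; new_name simply refers to a different string built from it.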

Python has a useful built-in function called range that produces the first $N$ whole numbers, starting from zero.

In [47]:
print(list(range(3)))
[0, 1, 2]
In [48]:
print(list(range(10)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Notice that we had to use list() to see the contents of range(3). This is because range() does not build a list; it returns a special range object that produces its numbers one at a time as they are needed, a fairly advanced topic that we won't cover yet. list() converts the range object into a list so that we can print it out.

range() is particularly useful with for loops! For instance to write something out 5 times:

In [50]:
for i in range(5):
    print("Hello!")
Hello!
Hello!
Hello!
Hello!
Hello!
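
The loop variable produced by range is itself useful, because it can serve as an index. A minimal sketch, reusing the odds list from above; positions is an illustrative name added only to show which numbers the loop visits.

```python
odds = [1, 3, 5, 7, 9, 11]

positions = list(range(len(odds)))   # [0, 1, 2, 3, 4, 5]
print(positions)

for i in range(len(odds)):
    # i counts up from 0, so odds[i] visits every element in order.
    print("position", i, "holds", odds[i])
```

Looping over positions rather than over the values themselves is handy whenever the position matters as well as the value.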

Challenges

  1. Using range, write a function that prints the first $N$ natural numbers, i.e. counting from $1$:

    print_N(3)
    1
    2
    3
    
  2. Write a function which takes two strings, and prints out the first characters of both strings on the first line, the second characters of both strings on the second line, and so on.

    print_pairs("CAT","DOG")
    C D
    A O
    T G
    

    Hints:

    • remember that you can get the length of a string using the built-in len() function
    • remember that you can use a character from a string by indexing it, like s[0]

Putting it all together: nesting loops and conditionals

Another thing to realize is that if and for statements can be combined with one another just as easily as they can be combined with functions. For example, if we want to sum only the positive numbers in a list, we can write this:

In [51]:
numbers = [-5, 3, 2, -1, 9, 6]
total = 0
for n in numbers:
    if n >= 0:
        total = total + n
print('sum of positive values:', total)
sum of positive values: 20

We could equally well calculate the positive and negative sums in a single loop:

In [52]:
pos_total = 0
neg_total = 0
for n in numbers:
    if n >= 0:
        pos_total = pos_total + n
    else:
        neg_total = neg_total + n
print('negative and positive sums are:', neg_total, pos_total)
negative and positive sums are: -6 20
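
The same pattern of a conditional inside a loop can keep track of a "best so far" value instead of a running total. A sketch, assuming the same numbers list; largest is an illustrative variable name.

```python
numbers = [-5, 3, 2, -1, 9, 6]

largest = numbers[0]       # start with the first value as the best so far
for n in numbers:
    if n > largest:
        largest = n        # found something bigger, so remember it
print('largest value:', largest)   # largest value: 9
```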

We can even put one loop inside another:

In [53]:
for consonant in 'bcd':
    for vowel in 'ae':
        print(consonant + vowel)
ba
be
ca
ce
da
de

As the diagram below shows, the inner loop runs from start to finish each time the outer loop runs once:

[Figure: Execution of Nested Loops]

Challenge

Write a function to count the number of 'C's and 'G's in a DNA sequence. Some sample output should look like:

print(count_CG("GATTACA"))
2
print(count_CG("ACAAAACAGCGAACACTCGC"))
10

Final challenge!

Write a function which takes in two strings as parameters, and returns the Hamming distance between them. You can assume the input strings are the same length.

Some sample output should look like this (you can cut and paste these strings to check you get the same answers):

print(hamming_distance("GATTACA","GACTATA"))
2
print(hamming_distance("CAT","DOG"))
3
print(hamming_distance("CAT", "ACT"))
2
print(hamming_distance("GTTCTTGGACGACGAAAAGA", "GTTCGTGAGGCCACATCCCG"))
13
print(hamming_distance("GTGCTTCCAGTCACGCTGTCTTGGGGTAGC", "TCGAGCAGTATCACATTACTAAGGATGTGC"))
20
print(hamming_distance("GCGAGCCCAGTCACGCTGTCTTGGGGTAGC", "TCGAGCAGTATCACGCTACTAAGGGGGTGC"))
12

If you can get this working, well done! This is our first bioinformatics algorithm.