Introductory Python II: k-mer counting

In this tutorial we'll work out how to do some basic tasks required for de Bruijn graph-based genome assembly. These are:

  • split a read (or set of reads) into k-mers
  • find the unique k-mers, and count how many times they occur (calculate k-mer coverage)
  • find the suffix or prefix of a k-mer (this is needed to find overlaps)

In order to do these things, we'll learn about string slicing, and about two new Python data types: sets and dictionaries. Like lists, sets and dictionaries are used for building collections of things, but they have some different features.

Note: This workshop has been written for Python 3. If you are using Python 2, you should be able to make everything work by first running the Python command:

from __future__ import print_function, division

k-mers

The k-mers in a string are all the substrings of length k. So, given the string "ELEPHANT", and k-mer length k=4, the k-mers are:

ELEP
LEPH
EPHA
PHAN
HANT

When we count k-mers, we want to find out which k-mers occur in our string, and how often they occur. So, for the string "BANANA" and k=2, we would find counts like

BA  1
AN  2
NA  2

In the Assembly lectures we will see how breaking data into k-mers helps us to efficiently assemble genomes. In applying these algorithms, we will be dealing with large amounts of DNA sequence, and values of k will of course be much larger (e.g. in the range 20-80, and sometimes larger; values depend strongly on data).

Finding k-mers: string slicing

In the first workshop you were introduced to string indexing, using square brackets [].

In [1]:
dna_string = "GATTACA"
In [2]:
# First, third and last
print(dna_string[0], dna_string[2], dna_string[-1])
G T A
In [3]:
# Second-last
print(dna_string[-2])
C

Python also allows us to get multiple letters from a string at once using slicing. For instance, s[2:4] will give a slice of the string, starting at index 2 and going up to, but not including, index 4.

In [4]:
print(dna_string[2:4])
TT

We can also leave one (or both) of the numbers out, to go all the way from the start, or all the way until the end of the string.

String slicing will be useful for finding the k-mers in a read, and for finding the overlapping suffixes and prefixes of length (k-1).

Here are some slicing examples: play with this yourself until you get a feel for it.

In [5]:
print(dna_string[1:3])
AT
In [6]:
print(dna_string[1:4])
ATT
In [7]:
# From the start up until just before index 4
print(dna_string[:4])
GATT
In [8]:
# From index zero (first letter) until index 4 - same result as last example
print(dna_string[0:4])
GATT
In [9]:
# From index 4 until the end
print(dna_string[4:])
ACA
In [10]:
# From the second-last character until the end
print(dna_string[-2:])
CA
In [11]:
# From index 0 to index 6
# Since the length of the string is 7, this returns the entire string!
# It's the same as dna_string[:], which means 'from the start till the end'
print(dna_string[0:7])
print(dna_string[:])
GATTACA
GATTACA

Note that because dna_string[:4] gives us the string until just before index 4, and dna_string[4:] gives us the string starting at index 4, this actually splits the string into two parts with no overlaps or missing pieces. That is:

dna_string == "GATTACA"
dna_string[:4] == "GATT"
dna_string[4:] == "ACA"
dna_string[:4] + dna_string[4:] == "GATT" + "ACA" == dna_string

Similarly, dna_string[:2] and dna_string[2:], for instance, split the string neatly into "GA" and "TTACA".

This can be a useful way to remember and understand why a slice starts on the first index, but only goes up to just before the second index, and does not include it.

Challenge

  1. Write a function which takes as input a string and a kmer size k, and prints out all the kmers. It's ok to print a kmer twice. Some output should look like:

    print_kmers("GATTACA",4)
    GATT
    ATTA
    TTAC
    TACA
    
    print_kmers("MISSISSIPPI",5)
    MISSI
    ISSIS
    SSISS
    SISSI
    ISSIP
    SSIPP
    SIPPI
    

    Hints: You will need to use string slicing for this! You will probably want to use a for loop, and the len() function.

  2. Modify your function so that intead of printing out each kmer, it prints out, for each k-mer, the length (k-1) prefix (all but the last letter) and the length (k-1) suffix (all but the first letter). Some sample output should look like:

    print_prefixes_and_suffixes("GATTACA",4)
    GAT ATT
    ATT TTA
    TTA TAC
    TAC ACA
    
    print_prefixes_and_suffixes("MISSISSIPPI",5)
    MISS ISSI
    ISSI SSIS
    SSIS SISS
    SISS ISSI
    ISSI SSIP
    SSIP SIPP
    SIPP IPPI
    

Collections of things: Lists and Sets

In this workshop we're going to want to find all the kmers in our reads, and count kmers. To do this we'll have to be able to build collections of strings.

Python has a few different data types for collecting several values together, which will be useful for us here.

Lists

You've already seen lists: they are ordered collections of values. We'll review them and then introduce one new useful function: append().

We define lists with square brackets and commas, like this:

In [12]:
odd_numbers = [1,3,5,7]
print(odd_numbers)
[1, 3, 5, 7]

The values in a list don't have to be numbers, but can be of any type.

In [13]:
odd_number_strings = ["one","three","five","seven"]
print(odd_number_strings)
['one', 'three', 'five', 'seven']

We can refer to elements of the list by indexing with square brackets, []. Remember that we count from zero when doing this. We can also slice lists, like strings.

In [14]:
# Second element of list (index 1)
print(odd_numbers[1])
3
In [15]:
print(odd_numbers[1:4])
[3, 5, 7]

There are several built-in functions that can be applied to lists. One that we've already seen is len(), to find the length.

In [16]:
len(odd_numbers)
Out[16]:
4

There are also special functions in Python which apply only to certain data types. These functions are called methods and are used by writing them after the variable name, with a dot, like variable.function(). For instance, lists have a method called append() which adds an item to the end of the list:

In [17]:
odd_numbers = [1,3,5,7]
print("Before append:",odd_numbers)
odd_numbers.append(9)
print("After append:",odd_numbers)
Before append: [1, 3, 5, 7]
After append: [1, 3, 5, 7, 9]

The append() function doesn't return a new list. It modifes the list it is attached to.

append() is useful when building up a list piece by piece, for example, in a for loop.

append() is not the only method that works on lists. For instance, try reverse():

In [18]:
odd_numbers = [1,3,5,7]
odd_numbers.reverse()
print(odd_numbers)
[7, 5, 3, 1]

There's no need to learn all these functions at once. You will learn about them as you need them.

Challenge

Write a function which takes a string and a kmer length k as input. This time, instead of printing out the kmers directly inside the function, build and return a list of all the kmers in the string. As before, it's ok to include a kmer more than once. Some example output should look like:

print(find_kmers("GATTACA",4))
['GATT', 'ATTA', 'TTAC', 'TACA']

print(find_kmers("MISSISSIPPI",5))
['MISSI', 'ISSIS', 'SSISS', 'SISSI', 'ISSIP', 'SSIPP', 'SIPPI']

print(find_kmers("MISSISSIPPI",3))
['MIS', 'ISS', 'SSI', 'SIS', 'ISS', 'SSI', 'SIP', 'IPP', 'PPI']

Sets

Sets are a Python data type which, like lists, hold collections of values. The main differences between sets and lists are:

  • a set can only have one copy of each value at most: values must be unique
  • the values in a set are not in order - Python may return them in a random order each time!

If you're familiar with sets as used in mathematics, Python's sets are actually quite similar, and can be used for set calculations by calling set methods like .union() and .intersection(). However we won't use those features in this workshop.

We can define a set by passing in a list to set(), like this:

In [19]:
odd_number_set = set([1,3,5,7])
print(odd_number_set)
{1, 3, 5, 7}

So far, this looks a lot like a list! But what if there are repeated numbers?

In [20]:
odd_number_set = set([1,3,3,5,7,1])
print(odd_number_set)
{1, 3, 5, 7}

The set will contain only unique values - i.e. only one of each value that's been added to it. This can be useful when you're trying to detect all the unique items in a large collection of things:

In [21]:
big_list = ['3', '7', '3', '3', '5', '3', '5', '1', '1', '7', '3', '5', '5',
       '3', '5', '5', '5', '1', '7', '7', '1', '1', '7', '7', '3', '7',
       '3', '5', '1', '5', '5', '3', '3', '5', '1', '1', '1', '1', '7',
       '3', '7', '1', '5', '1', '7', '3', '5', '5', '6', '5', '7', '1',
       '5', '1', '5', '1', '3', '7', '7', '5', '3', '1', '1', '5', '5',
       '5', '1', '5', '1', '7', '3', '7', '1', '7', '5', '7', '1', '5',
       '3', '5', '3', '5', '3', '5', '5', '1', '3', '5', '3', '7', '1',
       '5', '7', '3', '1', '7', '5', '7', '7', '1']
unique_values = set(big_list)
print(unique_values)
{'7', '3', '6', '1', '5'}

One function which might be useful for building a collection of k-mers is the add() method for sets, called like this:

In [22]:
odd_numbers = set([1,3,5,7])
print("Before add:",odd_numbers)
odd_numbers.add(11)
print("After add:",odd_numbers)
Before add: {1, 3, 5, 7}
After add: {1, 3, 5, 7, 11}

Notice that the values are not in any particular order! Sometimes you may notice Python apparently printing out sets in a certain order, but you should never rely on this - it could be random another time.

Adding an item to a set that is already there has no effect:

In [23]:
odd_numbers = set([1,3,5,7])
print("Before add:",odd_numbers)
odd_numbers.add(3)
print("After add:",odd_numbers)
Before add: {1, 3, 5, 7}
After add: {1, 3, 5, 7}

It's also possible to use a for loop with sets, just like lists. The difference is that we can't guarantee what order we will go through the items in, since sets have no order.

In [24]:
odd_number_set = set([1,3,1,7,9,3])
for num in odd_number_set:
    print(num)
1
3
9
7

Challenge

Write a function which takes a string and a kmer length k, and returns a set of all the unique kmers in that string. Some example output should look like this:

print(find_unique_kmers("GATTACA",4))
set(['GATT', 'TACA', 'TTAC', 'ATTA'])

print(find_unique_kmers("MISSISSIPPI",5))
set(['SISSI', 'MISSI', 'SSIPP', 'ISSIP', 'ISSIS', 'SSISS', 'SIPPI'])

print(find_unique_kmers("MISSISSIPPI",3))
set(['SIS', 'SIP', 'ISS', 'PPI', 'IPP', 'SSI', 'MIS'])

Since sets don't preserve the order of the items inside them, your sets may display the kmers in a different order - this is still correct.

If you compare your output to the output from the previous challenge, you should see that if the input string has repeats, there are less unique kmers than total observed kmers:

print(find_kmers("MISSISSIPPI",3))
['MIS', 'ISS', 'SSI', 'SIS', 'ISS', 'SSI', 'SIP', 'IPP', 'PPI']

print(len(find_kmers("MISSISSIPPI",3)))
9

print(find_unique_kmers("MISSISSIPPI",3))
set(['SIS', 'SIP', 'ISS', 'PPI', 'IPP', 'SSI', 'MIS'])

print(len(find_unique_kmers("MISSISSIPPI",3)))
7

Counting k-mers: Dictionaries

We're going to look at one more Python data type for storing collections of values: the dictionary.

This is another built-in Python data type. Dictionaries are a bit like sets - they store unordered collections of items. But dictionaries store key-value pairs. "Keys" are used for indexing and must be unique. "Values" do not have to be unique.

So, just like in sets, all keys in a dictionary must be unique - only one of each can be stored. Also like a set, the keys in a dictionary are stored in no particular order.

Here's an example dictionary, which we've defined using curly braces {}, and by putting a colon : between the each key and its corresponding value. This dictionary stores some people's heights in centimetres.

In [25]:
heights = {"Sam":201, "Fiona":167, "Quentin":167}
print(heights)
{'Sam': 201, 'Fiona': 167, 'Quentin': 167}
In [26]:
print(type(heights))
<class 'dict'>

We can retrieve a value using its key. We use square brackets [], just like getting an item from a list.

In [27]:
print(heights["Sam"])
201

Trying to look up a key that isn't in the dictionary will give an error (a KeyError).

In [28]:
print(heights["Anne"])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-28-96f1ce2d9468> in <module>()
----> 1 print(heights["Anne"])

KeyError: 'Anne'

We can also add a key-value pair to the dictionary this way:

In [29]:
heights["Mary"] = 180
print(heights)
{'Sam': 201, 'Fiona': 167, 'Quentin': 167, 'Mary': 180}

Or we can assign a new value to an existing key. If we do this, the old value is lost. This means keys will always be unique - trying to add a new key that is the same simply overwrites the old one.

In [30]:
# apparently Quentin is still growing.. update his height to the new value
heights["Quentin"] = 170
print(heights)
{'Sam': 201, 'Fiona': 167, 'Quentin': 170, 'Mary': 180}

We can avoid errors like the one we got with height["Anne"] by checking to see if a key is in the dictionary or not. We use the Python keyword in, which returns True or False:

In [31]:
print("Anne" in heights)
False
In [32]:
print("Fiona" in heights)
True
In [33]:
if "Fiona" in heights:
    print("Fiona",heights["Fiona"])
if "Anne" in heights:
    print("Anne",heights["Anne"])
Fiona 167

Dictionaries have methods called keys() and values() which return lists of keys and values (watch out - these can be in random order!)

In [34]:
print(heights.keys())
dict_keys(['Sam', 'Fiona', 'Quentin', 'Mary'])
In [35]:
print(heights.values())
dict_values([201, 167, 170, 180])

In the above, our keys were strings and our values were numbers, but that doesn't have to be the case. Our keys could be numbers, for instance. Our values could be strings, or could even be lists:

In [36]:
prime_factors = {12: [3,4],
                 100: [2,2,5,5],
                 9: [3,3]}
print(prime_factors)
{12: [3, 4], 100: [2, 2, 5, 5], 9: [3, 3]}
In [37]:
print(prime_factors[12])
[3, 4]

Don't be fooled - even though writing prime_factors[12] with a number between the square brackets looks a bit like list indexing, it's not. The difference is that our dictionary only contains the keys 9, 12, and 100, because those are the only ones we've added. A list that contains the index 9 must also contain 0,1,2,3,4,5,6,7 and 8, because the indices just give the order of the items in the list.

One common use of dictionaries, which is what we'd like to do, is to count things. We can make the keys the things we want to count, and the values the number of times we've seen each one.

In [38]:
big_list = ['3', '7', '3', '3', '5', '3', '5', '1', '1', '7', '3', '5', '5',
       '3', '5', '5', '5', '1', '7', '7', '1', '1', '7', '7', '3', '7',
       '3', '5', '1', '5', '5', '3', '3', '5', '1', '1', '1', '1', '7',
       '3', '7', '1', '5', '1', '7', '3', '5', '5', '6', '5', '7', '1',
       '5', '1', '5', '1', '3', '7', '7', '5', '3', '1', '1', '5', '5',
       '5', '1', '5', '1', '7', '3', '7', '1', '7', '5', '7', '1', '5',
       '3', '5', '3', '5', '3', '5', '5', '1', '3', '5', '3', '7', '1',
       '5', '7', '3', '1', '7', '5', '7', '7', '1']

How many times does each number appear in big_list? We could write a for loop and store the counts in variables. But we'd need a new variable for the number of '1's, the number of '3's, the number of '5's, etc, and we might not know in advance how many variables we need.

This is where a dictionary is handy: each time we see a new value that needs counting, we can add it to the dictionary.

In [39]:
# Start by making counts an empty dictionary, using {}
counts = {}

# Now loop through the values in big_list and count them
for num in big_list:
    
    # Check to see if this key is already in the dictionary
    # If not, add it with an initial count of zero
    if not (num in counts):
        counts[num] = 0
    
    # Now that we are sure the key is in the dictionary, we can increment the count
    counts[num] += 1

# After the loop is finished, counts should contain the right counts for each number seen
print(counts)
{'3': 21, '7': 22, '5': 31, '1': 25, '6': 1}

Notice that we could get the unique items in big_list using counts.keys(). This will give us the same unique items that we found before using set(). The keys of a dictionary are a lot like sets, and sets behave a lot like dictionaries with no values.

Dictionaries: algorithmic considerations

In fact, dictionaries (and sets) are an implementation of hash tables, and work by carrying out a hash function on the keys to decide where to store them in the computer's memory. This means that dictionaries are very fast at retrieving information. Even if you have a very large dictionary, finding an item in it using a key takes a roughly constant amount of time that does not grow with the size of the dictionary - the performance of the algorithm is O(1).

"Exact string matching" is extremely fast and simple since we just need to test for equality, not carry out an alignment. If we store our known k-mers as keys of a dictionary, and we are given a new k-mer which we would like to match against them, then we also can search for matches across a huge dataset of known k-mers in O(1) time simply by looking up the key.

Performing inexact string matching can't be done using a hash table, since the hash function only gives the same result if the key is exactly the same. For approximate string matching, we need some kind of alignment algorithm. Not only is this slower for each string, we may need to carry it out on a large number of strings to find a match. This is why, when there are many reads, a hash-based algorithm like De Bruijn assembly is faster than an alignment-based assembly algorithm. It's also why some alignment algorithms use seeds.

Challenge

Write a function which takes a string and a kmer length k, and returns a dictionary. The keys of the dictionary should be the kmers found in the string, and the values should be the kmer counts.

print(count_kmers("GATTACA",4))
{'ATTA': 1, 'GATT': 1, 'TACA': 1, 'TTAC': 1}

print(count_kmers("MISSISSIPI",3))
{'IPI': 1, 'ISS': 2, 'MIS': 1, 'SIP': 1, 'SIS': 1, 'SSI': 2}

print(count_kmers("CTCGGGGGCTATTAATACCTAAGTGCTCGGGGGCTATTAATACCTAAGTGGGGCTATT",3))
{'AAG': 2,
 'AAT': 2,
 'ACC': 2,
 'AGT': 2,
 'ATA': 2,
 'ATT': 3,
 'CCT': 2,
 'CGG': 2,
 'CTA': 5,
 'CTC': 2,
 'GCT': 4,
 'GGC': 3,
 'GGG': 8,
 'GTG': 2,
 'TAA': 4,
 'TAC': 2,
 'TAT': 3,
 'TCG': 2,
 'TGC': 1,
 'TGG': 1,
 'TTA': 2}