Pythonic Python 2: k-mer counting

This workshop is designed to introduce Python to bioinformaticians. It follows on from the introductory workshops on Python programming. In particular, we'll build on our solution to the k-mer counting workshop, and use the various language features we learn to improve our code.

Getting help

If you're using Jupyter, you can always use the built-in help() function or the ? operator to see documentation on a function or type. For instance, help(range) or ?range will give you documentation on the built-in range() function.

You can try this while working through the material below. If you want to get help on a Python keyword such as for or if, or on an operator, you'll need to put quotes around it, like help("for") or help("!=").

Setup

This workshop has been written for Python 3. If you are using Python 2, you should be able to make everything work by first running the following statement:

In [1]:
from __future__ import division, print_function

K-mers

The k-mers in a string are all the substrings of length k. So, given the string "ELEPHANT", and k-mer length k=4, the k-mers are:

ELEP
LEPH
EPHA
PHAN
HANT

Finding all the k-mers in a set of DNA sequences is useful in several bioinformatics applications. One of the best-known uses is to perform genome assembly by constructing a de Bruijn graph.
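
In Python, a single k-mer is just a slice of the string. For instance (a quick illustration; counting them all is the exercise later in this workshop):

"ELEPHANT"[0:4]   # gives 'ELEP', the k-mer starting at position 0
"ELEPHANT"[1:5]   # gives 'LEPH', the k-mer starting at position 1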

K-mer counting: initial solution

Let's start from a basic solution to the k-mer counting problem. If you've done the introductory workshop, you might have come up with a similar solution yourself.

Exercise: If you have not already done so in the introductory workshop, try to solve this problem before you continue. Write a function called count_kmers which takes as input a string and an integer k, and returns a dictionary as output. The keys of the dictionary should be all the k-mers that were seen, and the values should be the number of times each k-mer occurs in the string. Make sure to give your function a docstring! You'll be able to use your solution as the basis for the rest of the workshop.

For example, your function should give the following result when called with these arguments:

count_kmers("GATGAT",3)
{'ATG': 1, 'GAT': 2, 'TGA': 1}

You can test it quickly as you're writing it by running the following assertions:

assert count_kmers("GATGAT",3)["GAT"]==2
assert count_kmers("GATGAT",3)["ATG"]==1
assert len(count_kmers("GATGAT",3))==3
assert len(count_kmers("GATGAT",5))==2

The solution is given below, so feel free to stop here and try to solve this yourself, and to ask for help if you get stuck. Asking for hints is a much better way to develop your understanding than reading the solution.

A couple of hints: you will need to understand string slicing to find the k-mers, and you will need to think about where the starting position of the last k-mer in the string is. You may want the built-in function len(). Remember you can use help() to look at the documentation for built-in functions. You may also need to check whether a k-mer is already in the dictionary before you try to increment the count.

SPOILER ALERT: If you are new to programming, and would like to do the novice workshop, you might want to do it before looking at the solution below!

Here is one possible solution to the k-mer counting problem. The basic plan here is:

  • start with an empty dictionary to hold our results
  • find the length of the string L, and use L and k to work out how many k-mers there are in the string
  • loop over all possible starting positions of the k-mers
  • at each position, extract the k-mer by taking a slice of the string
  • if the k-mer is not already in the dictionary, add it with a count of 0
  • increment the count for this k-mer (if we just added it, it will now be 1)
In [5]:
def count_kmers(read, k):
    """Count kmer occurrences in a given read.

    Parameters
    ----------
    read : string
        A single DNA sequence.
    k : int
        The value of k for which to count kmers.

    Returns
    -------
    counts : dictionary, {'string': int}
        A dictionary of counts keyed by their individual kmers (strings
        of length k).

    Examples
    --------
    >>> count_kmers("GATGAT", 3)
    {'ATG': 1, 'GAT': 2, 'TGA': 1}
    """
    # Start with an empty dictionary
    counts = {}
    # Calculate how many kmers of length k there are
    num_kmers = len(read) - k + 1
    # Loop over the kmer start positions
    for i in range(num_kmers):
        # Slice the string to get the kmer
        kmer = read[i:i+k]
        # Add the kmer to the dictionary if it's not there
        if kmer not in counts:
            counts[kmer] = 0
        # Increment the count for this kmer
        counts[kmer] += 1
    # Return the final counts
    return counts

We can now call our function by passing in a string and a value for k:

In [6]:
count_kmers("GATGAT",3)
Out[6]:
{'ATG': 1, 'GAT': 2, 'TGA': 1}
In [7]:
count_kmers("GATGAT",2)
Out[7]:
{'AT': 2, 'GA': 2, 'TG': 1}
In [8]:
count_kmers("GATGAT",5)
Out[8]:
{'ATGAT': 1, 'GATGA': 1}

Exercise 1: Type or paste this function into your own Notebook and check that it gives the same results as above.

Exercise 2: Modify your count_kmers() function so that instead of taking a string, it takes in a list of strings (representing a set of reads) and counts the kmer totals across all of them together. You still only need to return one dictionary. So, for instance, your function should return the following result:

count_kmers(["GATGAT", "GATGAT", "CCGAT"],3)
{'CGA': 1, 'ATG': 2, 'CCG': 1, 'GAT': 5, 'TGA': 2}

Libraries

Often, some of the code we need to write has been written before, and we can reuse it. We do this by importing a Python library. There are many libraries for Python, and some of them are very useful for scientific computing.

To import a library, use import:

In [9]:
import random

Now all the functions in the random library are available to us.

In [10]:
for i in range(10):
    print(random.randint(1,100))
58
73
60
8
43
90
91
39
44
79

One useful library is the collections library, which you can load with

import collections

It provides several extra data structures: namedtuple, deque, Counter, OrderedDict and defaultdict. We're going to look at defaultdict.

You might have noticed that, in solving the basic k-mer counting problem, we had to deal with a slightly irritating issue: before incrementing any count, we had to check, every time, whether the key was already in the dictionary.

# Add the kmer to the dictionary if it's not there
if kmer not in counts:
    counts[kmer] = 0
# Increment the count for this kmer
counts[kmer] += 1

Trying to run counts[kmer] += 1 without this check will crash the program with a KeyError the first time each k-mer is encountered, because the key isn't yet in the dictionary.
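
You can see this for yourself with a small example:

empty_counts = {}
empty_counts["GAT"] += 1    # raises KeyError: 'GAT'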

A similar pattern occurs whenever we use a dictionary to track a set of things, without knowing in advance what all the keys of the dictionary will be.

Here's a similar problem: count the occurrence of each value in the following list.

In [11]:
big_list = ['3', '7', '3', '3', '5', '3', '5', '1', '1', '7', '3', '5', '5',
       '3', '5', '5', '5', '1', '7', '7', '1', '1', '7', '7', '3', '7',
       '3', '5', '1', '5', '5', '3', '3', '5', '1', '1', '1', '1', '7',
       '3', '7', '1', '5', '1', '7', '3', '5', '5', '6', '5', '7', '1',
       '5', '1', '5', '1', '3', '7', '7', '5', '3', '1', '1', '5', '5',
       '5', '1', '5', '1', '7', '3', '7', '1', '7', '5', '7', '1', '5',
       '3', '5', '3', '5', '3', '5', '5', '1', '3', '5', '3', '7', '1',
       '5', '7', '3', '1', '7', '5', '7', '7', '1']
In [12]:
counts = {}
for num in big_list:
    if num not in counts:
        counts[num] = 0
    counts[num] += 1
print(counts)
{'1': 25, '3': 21, '5': 31, '7': 22, '6': 1}

The collections library will let us handle this pattern more elegantly. Firstly, import it.

In [13]:
import collections

defaultdict is a data type which behaves like a dictionary, but automatically initialises any key that doesn't exist. We have to specify what type the values should be, so that defaultdict knows what default value to give new keys. For instance:

In [14]:
# A defaultdict with values of type int
counts = collections.defaultdict(int)

counts['new_key']
Out[14]:
0

Examining counts['new_key'] caused it to be created with defaultdict's default value for ints, which is zero. We haven't tried to retrieve any other keys, so they don't exist yet:

In [15]:
counts
Out[15]:
defaultdict(<class 'int'>, {'new_key': 0})
In [16]:
counts.keys()
Out[16]:
dict_keys(['new_key'])
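
Returning to the big_list example from above, a defaultdict lets us drop the membership check entirely. Here is a quick sketch (we create a fresh defaultdict so the 'new_key' entry above doesn't get mixed in):

counts = collections.defaultdict(int)
for num in big_list:
    # Any key we haven't seen yet automatically starts at 0
    counts[num] += 1
print(dict(counts))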

We can specify other data types:

In [17]:
coloured_things = collections.defaultdict(list)

coloured_things['yellow']
Out[17]:
[]

Examining coloured_things['yellow'] caused it to be created with defaultdict's default value for lists, which is the empty list [].

This means that we can use .append() to build up lists without worrying about whether the lists already exist:

In [18]:
coloured_things['yellow'].append('sun')
coloured_things['yellow'].append('bananas')
coloured_things['green'].append('grass')
In [19]:
coloured_things
Out[19]:
defaultdict(<class 'list'>, {'green': ['grass'], 'yellow': ['sun', 'bananas']})

Exercise: Use a defaultdict to alter your count_kmers() function so that it does not need to check whether a kmer is already in the dictionary.

There are many other useful libraries, far too many to cover. In general you will learn about these as you need them. A standard place to find Python libraries that you don't already have installed is the Python Package Index, PyPI: https://pypi.org/

The easiest way to install new packages is with pip, which you run from the command line, with a call like

pip install numpy

(You don't need to install collections this way: it is part of Python's standard library, so it is already available.)

If you are using Anaconda, you can also install libraries from the Anaconda package index using conda instead of pip.
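
For example, to install a package with conda instead, you would run something like:

conda install numpy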

Reading and writing files

If we want to count k-mers in real sequencing data, we'll be dealing with a lot of data. We can't type the strings in by hand! We need to know how to read in experimental data from a file.

Here's an example file containing some sequence data: example_reads.fastq. Download this file to your computer by right-clicking it, and save it in the same directory you are running IPython Notebook from.

Firstly, you can see what files are present using %ls, a special IPython command. These commands, which start with %, are called "magics". %ls in IPython Notebook is the same as running ls from the command line, and will list all files in the directory.

Try running this command now. The output you see will depend on the computer you're on. If you saved the example reads file to the directory you're running IPython Notebook in, you should see it listed.

In [37]:
%ls
LICENSE.md                  hammingdist.html            kmer_counting.html
README.md                   hammingdist.ipynb           kmer_counting.ipynb
TODO                        img/                        old_base_img/
Untitled0.ipynb             improved_hammingdist.html   quick_fundamentals.html
css/                        improved_hammingdist.ipynb  quick_fundamentals.ipynb
example_reads.fastq         improved_kmers.html         setup_python.html
extrabits.html              improved_kmers.ipynb        setup_python.ipynb
extrabits.ipynb             index.html                  tiny_reads.fastq
gloss.html                  js/

The most basic way to open a file for reading or writing is with the built-in open() function. This looks like

file_handle = open(filename, mode)

where mode specifies how the file should be opened. Some important options for mode include "r" for reading and "w" for writing.

When we've finished, we can close the file with

file_handle.close()

If we open a file for writing that doesn't exist, it will be created. Let's try creating a new file and writing some text to it.

In [21]:
notefile = open("notes.txt", "w")

Exercise: Our file handle, notefile, is a variable which we use to access the file. What is the type of this variable?

We can write text to the file with a command like

In [22]:
notefile.write("Hello world!\nThis is a new line.\n")

Notice that we explicitly wrote "\n" for each newline; write() does not add newlines for us. We could also write a series of lines using notefile.writelines().
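
For example, writelines() takes a list of strings; note that it will not add newlines for you, so you need to include them yourself:

notefile.writelines(["Another line.\n", "And another one.\n"])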

Once we've finished, we close the file with .close().

In [23]:
notefile.close()

Examine this file using a text editor, and you should see the text we just wrote out.

There are Python libraries which give more advanced methods for reading and writing files than the built-in functions. For instance, the csv library provides functions to read and write CSV files. In fact, there are also libraries designed to parse bioinformatics data formats such as FASTQ, but we're going to do this ourselves in order to learn how to work directly with files.

Now let's try to read in our example data from example_reads.fastq. This file is in FASTQ format, which is a very common way to receive data from next-generation sequencing (NGS) experiments. It represents a set of DNA sequences, or "reads".

First, let's just read the lines and print them out.

In [24]:
# Open the file for reading
# Use "r" to open the file in (text) read mode
fastq_filehandle = open("example_reads.fastq", "r")

# Loop over each line in the file
for row in fastq_filehandle:
    # Use the built-in .strip() method to remove whitespace and newlines from the line
    row = row.strip()
    # Print it out
    print(row)

# Close the file
fastq_filehandle.close()
@HWUSI-EAS-100R_0001:7:1:1:701#TGACCA/1
TCGTACCGTAAGGAACGGTGGACTGGATACGAGTGAGAATGTTGGGATAGATAGATAGATAGATAGATAGATAGATAGA
+
abaabbbb_^__Y_W`_^U[\_a`^^BaY`\\\\__Z[\\^[]ZW\R[BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS-100R_0001:7:1:1:701#TGACCA/2
GGCCTTAGGATTACACAATAACATACCTGTGTCGGTTTCAGTATAGTGCCATCCTTCTGTCTCTAGACACTCTTCCGTG
+
abbbbaBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS-100R_0001:7:1:1:1658#TGACCA/1
TTGGGATATCGCCAAGCGGTAAGGCAACGGACTTTGACTCCGTCATTCGCAGGTTCGAATCCTGCTATCCCAGCCAAAA
+
a`_\X_ababbbbaa_aa`[`_TZa``a]Z^aaa``\_[_]X]`Y\Y\Q[TXLMT^XQLR^[OV[ZVZZ]YJDW\MWJW
@HWUSI-EAS-100R_0001:7:1:1:1658#TGACCA/2
AAAAAAGCATCGACAAAAAAAATATGCTTGGTGCACGACATGTGTATCGAACCCTGGACACCCTGATTCAAAGCTAACT
+
abbbbbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS-100R_0001:7:1:2:928#TGACCA/1
TAGATAGATATTCCTCGATTAAGAGTAATGCAAGGGATGTCAAGTGTAGGTAAGGTTCTTCGCTCGTACCGTAAGGAGT
+
abbbZa``a`abaaba`aaaaaaYYVV``aaa^`a_^Y^ZV`][\V_QJVQLYLXXLX_VWYSRABBBBBBBBBBBBBB

This is FASTQ data. Each sequencing read is represented by four lines, and it looks like there are just five reads in our example file. Here's the meaning of each of the four lines:

@HWUSI-EAS-100R_0001:7:1:1:701#TGACCA/1
TCGTACCGTAAGGAACGGTGGACTGGATACGAGTGAGAATGTTGGGATAGATAGATAGATAGATAGATAGATAGATAGA
+
abaabbbb_^__Y_W`_^U[\_a`^^BaY`\\\\__Z[\\^[]ZW\R[BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

  • The first line, which must start with @, is the name of the sequence. It often gives information about the experiment.
  • The second line is the DNA sequence itself, which is what we're interested in.
  • The third line must start with +, and may repeat the read name.
  • The fourth line gives the ASCII-encoded quality scores, showing how confident we are that the sequencing experiment gave the correct results.

We will ignore the quality scores for our purposes, and just read in the DNA sequences. This means that out of every four lines, we'd like to just store the second. From our example_reads.fastq, we'd like to end up with a list of five DNA sequence strings.

Exercise: Write a function called read_all_rows() which takes in a filename and produces a list of strings, one per row of the file. You can do this with a loop or, if you're feeling ambitious, a list comprehension. You should use strip() on every line to remove any whitespace or newlines from the beginning and end of the line.

Note: File handles have a method, .readlines(), which will read all rows from a file handle into a list. You can use this method if you want to, but it won't strip the newlines for you.

If you're feeling really ambitious, try to work out how to store only every second row out of each four (i.e. only the DNA sequences). We'll show a way to do this in a moment.
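
In case you get stuck, here is one possible version of read_all_rows() (a minimal sketch; your own solution may well look different):

def read_all_rows(filename):
    """Return a list of stripped lines from the given file."""
    filehandle = open(filename, "r")
    lines = []
    for line in filehandle:
        # Remove whitespace and newlines from each end of the line
        lines.append(line.strip())
    filehandle.close()
    return lines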

If you've written a function to store all rows, you can use indexing to get the second row out of every four:

In [25]:
all_lines = read_all_rows("example_reads.fastq")
dna_strings = all_lines[1::4]

This is an indexing syntax we haven't looked at yet. Instead of just writing all_lines[start:end], we've written all_lines[start:end:stride]. The stride (also called the step) tells Python how far to move forward through the list at each step, so a stride of 4 takes every 4th item. Here it means "start at the second row (index 1), go to the end, taking every 4th row".
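
You can see the effect of a stride on a simple list:

list(range(10))[1::4]     # gives [1, 5, 9]: start at index 1, take every 4th item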

If this is confusing, don't worry about it too much at this point.

dna_strings should now look like this:

In [26]:
dna_strings
Out[26]:
['TCGTACCGTAAGGAACGGTGGACTGGATACGAGTGAGAATGTTGGGATAGATAGATAGATAGATAGATAGATAGATAGA',
 'GGCCTTAGGATTACACAATAACATACCTGTGTCGGTTTCAGTATAGTGCCATCCTTCTGTCTCTAGACACTCTTCCGTG',
 'TTGGGATATCGCCAAGCGGTAAGGCAACGGACTTTGACTCCGTCATTCGCAGGTTCGAATCCTGCTATCCCAGCCAAAA',
 'AAAAAAGCATCGACAAAAAAAATATGCTTGGTGCACGACATGTGTATCGAACCCTGGACACCCTGATTCAAAGCTAACT',
 'TAGATAGATATTCCTCGATTAAGAGTAATGCAAGGGATGTCAAGTGTAGGTAAGGTTCTTCGCTCGTACCGTAAGGAGT']

Note that reading in all the rows like this, and then picking out every fourth, is not a very memory-efficient solution, as we read all the rows into the variable all_lines even though we weren't interested in keeping most of them. Can you think of better approaches?
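
One possibility (a sketch of the idea, not the only way to do it) is to keep only the sequence lines as you read the file, using enumerate() to keep track of which line you are on:

dna_strings = []
fastq_filehandle = open("example_reads.fastq", "r")
for line_number, line in enumerate(fastq_filehandle):
    # The DNA sequence is the second line of every four (line numbers 1, 5, 9, ...)
    if line_number % 4 == 1:
        dna_strings.append(line.strip())
fastq_filehandle.close()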

Counting real k-mers

Now that we're able to read in lines from a FASTQ file, we can use this together with our count_kmers() function to count the k-mers in real NGS sequence data.

Exercise: Write a function, count_fastq_kmers(), which takes a filename and a value of k, reads in the DNA sequences, and counts the k-mers. It should return a dictionary of k-mer counts. This function allows you to analyse real sequencing data.

You can test your function by downloading this file: tiny_reads.fastq, which should give the output

count_fastq_kmers("tiny_reads.fastq", 3)
{'CGA': 1, 'ATG': 2, 'CCG': 1, 'GAT': 5, 'TGA': 2}

Here are some assertions using example_reads.fastq that you can run to test your function:

ngs_kmers_6 = count_fastq_kmers("example_reads.fastq", 6)
# Specific k-mer counts from this dataset
assert ngs_kmers_6["ATAGTG"]==1
assert ngs_kmers_6["AGATAG"]==8
# A k-mer that isn't in the reads, isn't in the dictionary
assert "ATCAAA" not in ngs_kmers_6
# K-mers are of length k
assert len(list(ngs_kmers_6.keys())[0])==6
# If k-mers are the length of the reads, we get back the five reads
assert len(count_fastq_kmers("example_reads.fastq",79))==5
# k==1 returns the four individual letters
assert len(count_fastq_kmers("example_reads.fastq",1))==4
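
In case you get stuck, here is one possible count_fastq_kmers() (a sketch that reuses read_all_rows() and the single-read count_kmers() from above; your own version may differ, especially if you changed count_kmers() to accept a list of reads):

def count_fastq_kmers(filename, k):
    """Count kmers of length k across all reads in a FASTQ file."""
    # Keep only the DNA sequences: the second line of every four
    all_lines = read_all_rows(filename)
    dna_strings = all_lines[1::4]
    # Combine the per-read counts into a single dictionary
    counts = {}
    for read in dna_strings:
        for kmer, count in count_kmers(read, k).items():
            if kmer not in counts:
                counts[kmer] = 0
            counts[kmer] += count
    return counts
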
In [34]:
ngs_kmers = count_fastq_kmers("example_reads.fastq", 6)

ngs_kmers is now a dictionary of counts. Even with only five example sequences, using real NGS data means that there will be a lot of unique k-mers and our count dictionary is probably quite big.

In [35]:
len(ngs_kmers)
Out[35]:
320

However, many of these k-mers probably only occur once. In a later workshop, you'll see how to use plotting functions to display information like this. For now, let's just try to filter out everything with a count below 3.

Exercise: Write a loop which prints out each k-mer, and the count for that k-mer, only if the count is 3 or more. You'll have to think about how to iterate through the items in a dictionary.
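
As a hint, dictionaries have an .items() method which gives you (key, value) pairs that you can loop over directly. For example, with a small toy dictionary (not the solution itself):

small_counts = {'GAT': 5, 'ATG': 2, 'CCG': 1}
for kmer, count in small_counts.items():
    print(kmer, count)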