mok0's world

Python for crystallographers 4

Posted in Biopython, Programming, Python, Tutorial by mok0 on January 30, 2012

4 Python for crystallographers 4

This is the fourth in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

4.1 Python modules

We have seen that simple Python modules are often just files containing Python code, typically functions and classes. In presentation 2, we saw the following construct:

from utils import parse_list

that simply imports the function parse_list() from the file This allows the programmer to organize the code in elements that are more easily maintained. However, many large projects are exported as packages. A package is normally a collection of modules, but can also be a single module.

To define a module called course, we can create a directory called course, and in that directory, we place a file called The presence of this file in a directory tells Python that it is a module.

mkdir course/
touch course/

Now, inside Python, we can:

import course

but of course, the module is empty. There are several ways to put code into a module. Most often, is empty, but it can contain Python code. Let us enter the following into the file course/

def hello():
   print "Python is cool"

Now, in IPython, this function can be called:

import course

You can see that the function hello() is now in the modules’ namespace. The directory course can also contain other Python files, which then become part of the module. Let us copy our file from earlier to the new course module:

cp course/

Now, is a part of the course module, and needs to be addressed as a part of that. In Python, we can import the function
parse_list() like this:

from course.utils import parse_list

This imports the function parse_list() directly into the current namespace. However, we can also write:

import course

but then we need to use the fully qualified namespace to access the function which should be called course.utils.parse_list(). It is a matter of taste, and a matter of code clarity, how to use the import statement in your program. As a rule-of-thumb, if you need to use the function many places in your code, it is convinient to import its name into the current namespace. If you only need to call a function (or class) once, it is more clear to leave it fully qualified, because the code then will show its origin.

4.2 The Bio.PDB module

We will now look at the PDB module from Biopython. This module is quite sophisticated, and is supposedely be very well tested. Basically, the PDB module regards a macromolecular structure from a PDB entry as a hierachy called “SMCRA”, which stands for

  • Structure
  • Model
  • Chain
  • Residue
  • Atom

In a typical X-ray structure, the model level is not used, but in a typical NMR structure entry, it is typical to have 20 models or more in one file. Each Model contains one or more chains, each Chain contains one or more residues, and each Residue contains one or more atoms.

The PDB module is designed so it is possible to access this information in several ways. How, exactly, this is done in your program depends on the requirements of the specific application.

4.2.1 Parsing a PDB file

To parse a PDB file we need to instantiate a parser. The parser object has a few methods to access the content of the PDB file, but one method is always used, namely get_structure() which – as the name implies – returns a Structure object.

1: from Bio import PDB
3: p = PDB.PDBParser()
4: s = p.get_structure("3ns0", "3ns0.pdb")

In line 3 the parser object is instantiated in the object
p. Next, in line 4 we retrieve the Structure object. The method get_structure() needs an identifier string (can be any meaningful string, here we use the PDB ident) and the file name of a PDB entry.

In fact, the second argument to get_structure() can also be a file handle, or generally, an object that has a readlines() method. This is useful if for example you would like to read a gzipped PDB entry:

import gzip
fd ="1psr.pdb.gz")
s1 = p.get_structure("1psr", fd)

The gzip module transparently opens both gzipped and text files.

4.2.2 The Structure object

The Structure object is the top-level object, and contains methods to access the child objects. Many of these methods are in fact common to the lower-level objects also, because they are inherited from a common base class called Entity(). The methods include:

return level in hierarchy (“A”, “R”, “C”, “M” or “S”)
return list of children
return the ID of the object

The objects in the SMCRA hierachy also inherit attributes from the base class, including child_dict, which is a dictionary containing the child objects with the child IDs as keys.

However, the Structure object also contains convienient methods that more directly access the content of the structure:

return a chain iterator
return a residues iterator
return an atom iterator

Recall that iterators are objects that cannot be accessed directly,
but must be used in a loop contruct.

Finally, the Entity() base class defines the special method __getitem()= which enables the use of the “[]” operator on the object. In our example, s[0] thus contains the child (a Model object) with ID 0. Another usefule attribute is parent, which contains a reference to the parent object.

4.2.3 The Model object

The Model object is used in PDB entries containing NMR models. An NMR entry typically contains 20 models or more. PDB entries of structures determined by X-Ray crystallography normally only has one model.

4.2.4 The Chain object

Each Model object contains a list of one or more Chain objects. One Chain object is generated from each unique chain identifier in the PDB file (column 5), and the Chain object controls access to the residues that are a part of that chain.

4.2.5 The Residue object

The pattern should be obvious by now. The Residue object defines one residue in the structure.

To retrieve residues in a structure, we first store the model in object m, then loop over chains c, and for each chain, we can loop over residues r:

m = s[0]
for c in m:
   for r in c:
      print r

However, there really is no need to loop over models, then chains, then residues unless you actually need to examine the structure in this way. We can use a convienience method in the Structure object to retrieve all residues from a structure directly:

1: R = s.get_residues()
2: for r in R:
3:  print,

Line 1 gives us an iterator that can be used to visit all residues belonging to structure s. In line 3 we print the ids of all residues. This is the number given in column 6 of a PDB file. Many molecular graphics programs prefix the residue number with the chain ID (for example “A129”) but here, we retrieve the ID of the parent object (the parent of Residue objects is the Chain).

4.2.6 The Atom object

The Atom() class is the most fundamental class in the SMCRA hierachy, and also the class with the most methods. In particular, we have the get_coord() method, which returns the coordinates of the atom in a Numpy array. We can retrieve atoms from a structure using the get_atoms() method:

4: A = s.get_atoms() # get an atom iterator 5: L = []
6: for a in A:
7:  L.append(a.get_coord())
8: print L

In line 7, we stash away the coordinates of a structure in a list L. The Atom object also exposes methods to retrieve the B-factor, the occupancy, serial number, etc.

4.3 List comprehension

In this and earlier presentations, you have seen that we routinely extract objects and organize them into a list. For example in the example above we first created an empty list L, and in the following loop, we extract the objects we are interested in and append them to the list. This construction is used to often in Python, that a special idiom in the language has been created to deal with it. It is called list comprehension. Let us revisit the case of retrieving atoms from a structure:

A = s.get_atoms() # get an atom iterator L = [a for a in A]

This construction might look a bit weird when you first see it, but it quickly becomes a part of your Python vocabulary. Let us break the syntax down a bit. The initial a is the object that gets appended to the list, or in other words, the list will consist of a‘s. The following “for” statement determines what a‘s will be selected, namely those generated by the iterator A.

We can condense it even further:

L = [a for a in s.get_atoms()]

This gives us a list of atoms in L. However, in the example in the previous section (7) we retrieved a list of coordinates not atoms. Using a list comprehension, we can write:

L = [a.get_coord() for a in s.get_atoms()]

A list comprehension can also introduce an if-filter, for example:

T = [r for r in s.get_residues() if r.resname=='TYR']

creates a list T containing tyrosine residues.

4.4 Distance matrix plot

We will now use Biopython and Numpy to write a program to make a C-alpha distance matrix plot. The core logic of this program looks like this:

1:     s = parser.get_structure(id, fnam)
3:  chains = [c for c in s.get_chains()]
5:     x = []
6:     for r in chains[0]:
7:  if 'CA' in r.child_dict:
8:             ca = r.child_dict['CA']
9:  x.append(ca.get_coord())

First, we generate a list of chains (line 3), and then, in the loop, we step through the residues of the first chain only. It is left as an exercise to make the program more general so that it can deal with all chains, or deal with a specific chain specified by the user on the command line. In line 7, we check to see if the key ‘CA’ is in the residues’ child dictionary. For a residue, this dictionary contains the atoms of the residue, indexed by their names. If the atom has the name ‘CA’, the atom object is retrieved, and its coordinate appended to the list x.

The next step is to convert the list of coordinates (each of which is a Numpy array) to a large Numpy array of dimensions (N,3) where N is the number of atoms:

coords = np.array(x)

The final step is to compute the distance matrix. Fortunately, this is possible by using a module from scipy. The function cdist is able to compute a distance matrix between two different sets of coordinates, in our case, we need to compute a distance matrix between the coordinates to themselves. This is why the first two arguments passed to cdist are both coords. The last argument specifies that we want to compute the eucledian distance (normal geometry).

import scipy.spatial

data = scipy.spatial.distance.cdist(coords, coords, 'euclidean')

Now, with the distance matrix stored in the Numpy array data, all we need to do is plot it:

plot (data)

The code in the plot() function was copied from the Gallery on the matplotlib web site, after first finding a plot that looks like the one we want.

The entire source of the program, including the function plot() is shown in the Appendix.


The source code of the program In the plot() function, the commented-out code is needed to create a plot to a PNG file instead of to the computer screen. To generate a plot, the lines:

# import matplotlib # matplotlib.use( 'Agg' ) 

should be activated, and the function pylab.savefig() should be used instead of

import sys
import os
from Bio.PDB.PDBParser import PDBParser
import numpy as np
import warnings

def plot(data):    

# import matplotlib # matplotlib.use( 'Agg' ) 
    import pylab

    fig = pylab.figure()
    ax = fig.add_subplot(111)

    cax = ax.imshow(data, interpolation='nearest')
    ax.set_title('Ca-Ca distance plot')

    # Add colorbar, make sure to specify tick locations to match desired ticklabels 
    min = np.min(data)
    max = np.max(data)
    cbar = fig.colorbar(cax, ticks=[min, max])
# pylab.savefig( 'distmat.png', format='png' ) 

if __name__ == '__main__':
    import re

    fnam = sys.argv[1]
    if not os.path.exists(fnam):
        print "file not found, duh!"
        raise SystemExit

    id = fnam.split('.')[0]

    parser = PDBParser()

    s = parser.get_structure(id, fnam)

    chains = [c for c in s.get_chains()]

    x = []
    for r in chains[0]:
        if 'CA' in r.child_dict:
            ca = r.child_dict['CA']

    coords = np.array(x)

    import scipy.spatial

    data = scipy.spatial.distance.cdist(coords, coords, 'euclidean')
    print data.shape

Date: 2012-01-30 13:59:32 CET

HTML generated by org-mode 6.21b in emacs 23