mok0's world

Python for crystallographers 5

Posted in Biopython, Programming, Python, Tutorial by mok0 on February 1, 2012

5 Python for crystallographers 5

This is the fifth and last in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

5.1 Variable function arguments

Variable arguments can be used in a function when the number of arguments that the function needs to process cannot be predicted. The archetypical example is the C printf function, which accepts a format string as the first parameter, and any number of variables following that. In Python, variable function arguments are specified in the function definition by prefixing the name of the formal parameter with an asterisk:

def func(*args):
    print type (args)
    for i in args:
        print i,

From the point of view of the function, the passed parameters are contained in a tuple:

In [6]: func(1,2,'a')
<type 'tuple'>
1 2 a

In [7]: func(1,2,3,4,5,6)
<type 'tuple'>
1 2 3 4 5 6

The function must of course know what to do with the varying number of arguments, and that is the job of the programmer.

5.2 Keyworded function arguments

This section introduces keyworded function arguments.

Keyworded arguments are somewhat similar to variable argument lists from the previous section, however, the function identifies the arguments by the parameter name, and arguments kan be skipped or they can be placed out of order. This is often a convenience to the programmer, and keyworded arguments are typically used to set various options in the function. Keyworded function arguments are specified in the function definition by prefixing the name of the formal parameter with a double asterisk:

def func1 (**kwargs):
    print type(kwargs)
    for k in kwargs:
        print k, kwargs[k],

From the point of view of the function, the passed parameters are contained in a dictionary. Let’s try to call this function using various keyworded arguments (recall that IPython’s %cpaste magic can be used to insert code snippets from the clipboard):

func1(residues=100, atoms=1000)
In [11]: func1(x=1)
<type 'dict'>
x 1

In [12]: func1(residues=100)
<type 'dict'>
residues 100

In [13]: func1(residues=100, atoms=1000)
<type 'dict'>
residues 100 atoms 1000

In [14]: func1(atoms=1000,residues=100)
<type 'dict'>
residues 100 atoms 1000

Quite typically, keyworded arguments are used to set optional parameters, but the function might still want one or more fixed and required parameters. A template example of such a use is given below:

def func2 (a, b, **kwargs):

and an example of a function call might look like this:

func2(filenam, 0.2, atoms=1000,residues=100)

5.3 Exceptions

Python is extremely smart about errors that arise during excecution of programs. When programs written in compiled languages such as C encounter an error, for example if a number is divided by zero, the operating system simply terminates the process. The programmer then needs to track down where the error happened, fix it, and recompile the program.

When running Python interactively, for example via IPython, the process survives, but it throws an exception, meaning that it stops what is was doing when the error happened. But since the process survives, you can examine the content of the variables, for example by printing them, and this can give vital clues about what went wrong. Of course, you can also use the Python debugger which streamlines this process, enables you to set breakpoints in the program and so on. Curiously to crystallographers, the Python debugger is called PDB. However, this subject is outside the scope of this presentation. IPython has a smart interface to the Python debugger, which you are encouraged to check out.

Here, we attempt to divide by zero:

In [15]: 1/0
ZeroDivisionError                         Traceback (most recent call last)

/Users/mok/dropbox/python-course-2011/<ipython console> in <module>()

ZeroDivisionError: integer division or modulo by zero

Python does not like that, and raises an exception named ZeroDivisionError. The programmer can also raise exceptions, both predefined, or home-made exceptions. Here, we try to raise at ZeroDivisionError:

In [16]: raise ZeroDivisionError
ZeroDivisionError                         Traceback (most recent call last)

/Users/mok/dropbox/python-course-2011/<ipython console> in <module>()


It is seen that an exception is thrown, but Python really doesn’t know the reason (there is no associated message.)

The very smart thing about exceptions is that you can catch them, and let the program deal with the unexpected situation. This is done in a
try/except block, for example:

    f = a/b
except ZeroDivisionError:
    print "oops!"

After the try/except block, everything is normal, and the value of f is 5. Let us try again, this time setting b to zero:

b = 0
    f = a/b
    print "oops!"

The output written this time is “oops!”, and the value of f is unchanged (i.e. 5 if you follow the input above.)

5.3.1 The useful SystemExit exception

A quite common construction in a Python main program is to exit in the very beginning, if the user fails to provide the proper command line arguments, or, as in the following example, the specified file does not exist. When the SystemExit exception is raised, the Python process terminates:

    fnam = sys.argv[1]
    if not os.path.exists(fnam):
        print "file not found, duh!"
        raise SystemExit

5.4 Superposition

In this section, we will take a look at the Superimposer() class that comes with Biopython.

The following code is the main program from the Biopython source file Bio/PDB/, that we copy-paste into its own file called (The Python standard programming guidelines encourages use of lower-case letters for file names.)

 1: import sys
 2: from Bio.PDB import PDBParser, Selection, Superimposer
 3: import numpy
 5: if __name__=="__main__":
 7:     p=PDBParser()
 8:     s1=p.get_structure("FIXED", sys.argv[1])
 9:  fixed=Selection.unfold_entities(s1, "A")
11:     s2=p.get_structure("MOVING", sys.argv[1])
12:  moving=Selection.unfold_entities(s2, "A")
14:  rot=numpy.identity(3).astype('f')
15:  tran=numpy.array((1.0, 2.0, 3.0), 'f')
17:  for atom in moving:
18:         atom.transform(rot, tran)
20:     sup=Superimposer()
22:  sup.set_atoms(fixed, moving)
24:     print sup.rotran
25:     print sup.rms
27:  sup.apply(moving)

The above example is only mildly interesting, because it simply superimposes a molecule on a copy of itself, which has been translated by a small amount. The result of the calculation should therefore be the identity matrix, and a small translation in the opposite direction. We will none-the-less go through the program and examine what it’s doing.

The first few lines ought to be familiar by now. The PDB file from the command line is opened twice, and its atoms are placed in two Structure() instances s1 and s2.

In lines 9 and 12 we are using the convenient Selection() class, that can be used to access the various levels of the SMCRA hierachy. We use the method called unfold_entities(), which can unfold a list of entities to a list of entities of another level. For example:

  • list of atoms -> list of residues
  • list of residues -> list of atoms
  • list of residues -> list of chains

In this case, we unfold a Structure() instance to a list of atoms, specified by the “A” in parameter 2. Those lists of atoms are stored in variables moving and fixed.

In line 14 we define the identity matrix, and in line 15 we define a translation vector (1.0, 2.0, 3.0). That rotation/translation operator is applied to the moving list of atoms in line 17, and after this operation all atoms in that list have been shifted by the translation vector given above.

Now we instance the Superimposer() class in the variable sup, and feed two lists of atoms to it (line 22). The rotation/translation operator and root mean squared deviation is printed, and finally, in line 27, the translation in sup is applied to the moving list, which should result in the atoms being shifted back to their original positions.

It is left as an exercise for the reader to generalize this program so it can take two different PDB files and/or two different chains.

5.5 Neighbor search

Now lets take a look at the NeighborSearch class from Biopython. This is a useful class that can search entities (that can be Atom, Residue, etc. objects) within a certain distance.

To remind yourself how smart IPython is (assuming you are running it) let us first check the documentation for NeighborSearch.

from Bio.PDB import NeigborSearch

Recall that IPython has extensive support of Tab-completion. When typing the above, it is actually sufficient to type:


and IPython will automatically expand “Nei” to “NeighborSearch”.

Now, load a Structure object from a PDB file, and as in the previous example, unfold that structure into a list of atoms.

from Bio.PDB import Selection
s = p.get_structure("test", "1zen.pdb")
atoms = Selection.unfold_entities(s, "A")

Now, choose a random atom in that structure. Below, we choose atom number 30, which happens to be the OH atom of residue 3, which is a Tyr. The get_coord() method returns a Numpy array containing that atom’s coordinates.

1: center = atoms[30].get_coord()
2: ns = NeighborSearch(atoms) 
3: print (center, 5, "A") 
4: print (center, 5, "R")

In line 2, we prime an instance ns of the NeighborSearch class with the list of atoms extracted from PDB structure 1zen. Then, in line 3, we print out the list of atoms within 5.0 Å of the center (atom OH of Tyr 3). In the following line, we print the list of residues found by the search within that radius:

[<Residue TYR het=  resseq=3 icode= >,
 <Residue LYS het=  resseq=105 icode= >,
 <Residue GLU het=  resseq=53 icode= >,
 <Residue LYS het=  resseq=52 icode= >]

NeighborSearch is a very useful class solving an often encountered problem, that you can use in your programs.

5.6 Fragment search

For the next example, we will take a look at Biopython’s Fragmentmapper class, and we shall use it to write a small program that performs a calculation more or less like the Lego_ca command in O: step through the polypeptide chain one residue at the time, and find the best fitting fragment from a library of fragments.

The FragmentMapper class requires that you download and install the fragment database first. It consists of a number of text files, that should be placed in a directory somewhere. You can download the fragment library from In the following, it is assumed that these files are placed in a directory called fragment_data in the current working directory, but it could be anywhere on your file system. Here is the program

 1: from Bio.PDB import PDBParser, FragmentMapper, Selection
 2: import sys
 4: if __name__=="__main__":
 6:     p=PDBParser()
 7:     s=p.get_structure("X", sys.argv[1])
 9:     m=s[0]
10:  fm=FragmentMapper(m, 10, 5, "fragment_data")
12:  for r in Selection.unfold_entities(m, "R"):
13:         print r,
14:         if fm.has_key(r):
15:             print fm[r]
16:         else:
17:             print

Most of the elements of this program should be familiar. The program accepts the file name of a PDB entry on the command line. In line 10, the FragmentMapper class is instanced. It receives a Model object, the size of the library (10 in this case), and the length of the fragments (we use 5 here, same as the Lego_* system from O). The last parameter is the directory where the fragment library files are stored.

In the loop starting at line 12 the model m is unfolded residue by residue, and each is looked up in the fragment mapper object, which functions like a Python dictionary. If there is a match, the fragment object is printed. The output of the program looks something like this:

<Residue SER het=  resseq=1 icode= >
<Residue LYS het=  resseq=2 icode= >
<Residue ILE het=  resseq=3 icode= > <Fragment length=5 id=9>
<Residue PHE het=  resseq=4 icode= > <Fragment length=5 id=9>
<Residue ASP het=  resseq=5 icode= > <Fragment length=5 id=7>
<Residue PHE het=  resseq=6 icode= > <Fragment length=5 id=3>

The first two residues don’t have any matches, the first five-residue fragment match is at residue 3, because the position of the match is the central residue of the five. In other words, fragment id 9 is the best match of residues 1-5, and also the best match of residues 2-6, whereas the best match of residues 3-7 is fragment id 7.

5.7 Shell utilities

Finally, we will introduce a very useful module that is distributed with Python. It is the shutil module, containing various utility functions e.g. for copying files and directory trees.

import shutil

Some of the functions available in shutil are:

copy(src, dst)
copy data and mode bits fra src to dst. Same as
“cp src dst” from the command line.
copy2(src, dst)
copy data and all stat info (same as “cp -p src dst”).
move(src, dst)
Recursively move a file or directory to another location.

The shutil module is thus convenient for performing shell-like operations from within Python.

Date: 2012-02-01 12:08:44 CET

HTML generated by org-mode 6.21b in emacs 23


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: