mok0's world

Python for crystallographers 3

Posted in Biopython, Programming, Python, Tutorial by mok0 on January 29, 2012

3 Python for crystallographers 3

This is the third in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

3.1 Reading a PDB file

The first project is to read a PDB format file. The following snippet of code opens a file and swallows it. The list L will contain a list of lines:

fnam = '1au1.pdb'
f = open(fnam)
L = f.readlines()
print L[10]

But L includes a lot of header lines from the PDB file that we really don’t need. To help filter out the ATOM records, we make use of Pythons regular expression module, called re. Then we step through all lines in the list L, extract the lines that start with ‘ATOM’, and append those to another list: data.

import re

data = []
for rec in L:
    if re.match('ATOM', rec):
        data.append(rec.split())

We check the length of the list data, and print out a few of the first atoms:

len(data)
data[0]
for i in 1,2,3,4:
    print data[i]

Remember, that Python arrays start at 0, so data[0] contains the first atom, but that atom has serial ID number 1.

3.2 Class based approach

Now we will create a more advanced and flexible data structure for atoms. We create a file pdb.py with the following content:

 1: class Atom:
 2:  def __init__(self):
 3:         self.name = ''
 4:         self.id = 0
 5:         self.element_number = 6
 6:         self.x = 0.0
 7:         self.y = 0.0
 8:         self.z = 0.0
 9:         self.b = 1.0
10:         self.occ = 1.0

In line 2 we define the class constructor. In this design, the constructor does not need any parameters. So now we can define an “empty” atom. In IPython, we use the %run magic command to “source” the file pdb.py:

%run pdb
atom = Atom()
print a.x, a.y, a.z, a.occ, a.b

This will print out the default values defined in the constructor.

Now, let us add a method to the Atom class that will parse an ATOM record from a PDB file, and fill the relevant attributes with the relevant data. Here is the method, it should be addede to the class Atom in the file pdb.py:

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(L[9])
        self.b = float(L[10])

Now, in IPython, we run again:

%run pdb
atom = Atom()

We still have a record from the PDB file from before, when we loaded the PDB file into a list of strings called L. Let us try to parse one of these lines:

atom.parse_from_pdb(L[2000])
print atom.x, atom.y, atom.z

This loaded the data from line 2000 in the file into the object atom.

3.3 version 2

Now we would like to make a more complete standalone program, so we add a main program to the file pdb.py:

 1: class Atom:
 2:     def __init__(self):
 3:         self.name = ''
 4:         self.id = 0
 5:         self.element_number = 6
 6:         self.x = 0.0
 7:         self.y = 0.0
 8:         self.z = 0.0
 9:         self.b = 1.0
10:         self.occ = 1.0
11: 
12:     def parse_from_pdb (self,s):
13:         L = s.split()
14:         self.id = int(L[1])
15:         self.name = L[2]
16:         self.x = float(L[6])
17:         self.y = float(L[7])
18:         self.z = float(L[8])
19:         self.occ = float(L[9])
20:         self.b = float(L[10])
21:     #. 22: 
23: if __name__ == '__main__':
24:     import re
25: 
26:     fnam = '1au1.pdb'
27:     f = open(fnam)
28:     L = f.readlines()
29: 
30:  atoms = []
31:  for line in L:
32:  if re.match('ATOM|HETATM', line):
33:             a = Atom()
34:             a.parse_from_pdb(line)
35:             atoms.append(a)

We open the file as before, and swallow the whole thing using readlines(). This gives a list of lines. In line 30, we create an empty list that will be used to store the atom objects. In line 31 we step through all the strings in list L. Something new happens in line 32. Here we use match() of the regular expression module re to “grep” out lines that start with either ‘ATOM’ or ‘HETATM’. Only if one of these two keywords are found, we will execute the code inside that if statement. If we have an ‘ATOM’ or a ‘HETATM’ record, we instance an empty Atom object, and use the parse_from_pdb() method to populate the fields of the object. Finally, we append the newly created object to the list atoms. Now, run this from IPython, and look at some of the elements in list atoms:

%run pdb
print atoms[1]
print atoms[10]

This printout is pretty uninformative. We should add a __repr()__ method to the Atom class. It looks like this:

    def __repr__(self):
        s = "Atom: {0}, x,y,z={1},{2},{3}, occ={4}, B={5}"
        s = s.format(self.name,self.x,self.y,self.z,self.occ,self.b)
        return s

Here, the format() method of the str class is used to format the string. In this case, the tokens marked with curly brackets will be formatted with the corresponding argument. There is a whole mini-language that allows very sophisticated string formatting, it can be studied here: http://docs.python.org/library/stdtypes.html#string-formatting

%run pdb
print atoms[1]

which prints output that looks like this:

Atom: CA, x,y,z=24.887,27.143,6.222, occ=1.0, B=41.36

Now, we can enhance our main program to write out more information:

if __name__ == '__main__':
    import re

    fnam = '1au1.pdb'
    f = open(fnam)
    L = f.readlines()

    atoms = []
    for line in L:
        if re.match('ATOM|HETATM', line):
            a = Atom()
            a.parse_from_pdb(line)
            atoms.append(a)

    for a in atoms:
        print a

However, when we run this, we get a run-time error!

In [7]: %run pdb 1au1.pdb
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call
last)

/u/mok/Dropbox/python-course-2011/pdb.py in <module>()
     59         if re.match('ATOM|HETATM', line):
     60             a = Atom()
---> 61             a.parse_from_pdb(line)
     62             atoms.append(a)
     63

/u/mok/Dropbox/python-course-2011/pdb.py in parse_from_pdb(self, s)
     18         self.y = float(L[7])
     19         self.z = float(L[8])
---> 20         self.occ = float(L[9])
     21         self.b = float(L[10])
     22     #.

ValueError: invalid literal for float(): 1.00100.00
WARNING: Failure executing file: <pdb.py>

The second-last line gives us a hint to what is going on. Python can not convert the string ‘1.00100.00’ to a floating point number. This is a well known limitation of the PDB format. When a B factor becomes 100.0 or larger, there is no longer a space between the occupancy field and the B-factor field. Therefore, the logic we used in the parse_from_pdb() method is too simplistic. We can’t simply split the line into space-separated fields using the split() method. We need to be more careful. So, parse_from_pdb() needs to be changed to this:

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(s[55:60])
        self.b = float(s[60:66])

Now, instead of using the split fields to extract occ and b, we extract them directly from positions 55 to 59 and 60 to 65 in the input string. (This will work for this file, but the logic is likely to fail if tested on every PDB file we can find, because there are other issues with PDB files that may cause the number of fields to be different.)

3.4 A more complete program

Now, let us make the program more versatile, so that we can specify the PDB file from the command line. We change the main program to look like this:

 1: if __name__ == '__main__':
 2:     import re
 3:     import sys
 4:     import os
 5: 
 6:  fnam = sys.argv[1]
 7:  if not os.path.exists(fnam):
 8:         print "File", fnam, "not found, duh!"
 9:         raise SystemExit (ref:exit)
10: 
11:     f = open(fnam)
12:     L = f.readlines()
13: 
14:     atoms = []
15:     for line in L:
16:         if re.match('ATOM|HETATM', line):
17:             a = Atom()
18:             a.parse_from_pdb(line)
19:             atoms.append(a)
20: 
21:     for a in atoms:
22:         print a

In line 6 we use the sys module to access the (first) command line argument. This is assumed to be a file name, but in line 7 we double check to make sure the file exists, by using exists() from the (extremely useful) os.path module. If the file is not found, we print out an error message, and stop the program by raising an exception in line nil. Now the program is more general, and can be used to read in any PDB file.

3.5 Distance between atoms

We want to be able to compute the distance between two atoms. We add a function to pdb.py to do this. The function assumes to be passed two objects that have attributes x, y and z, it then does the math computes the distance between these points.

def distance(atom1, atom2):
    import math
    dist = math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)
    return dist

We need to import the math module so we can look up the square root. We could also have written the function as a method of the Atom class, in which case the call would be:

d = atom1.distance(atom2)
print d

This is a matter of design. We choose to write a function, that is called like this:

d = distance(atom1, atom2)
print d

We can then compute the distance of all atoms to (a random) atom number 300 The full source code of the program is listed in the appendix. We run the program from within IPython like this:

%run pdb 1au1.pdb

3.6 Difference between = and ==

The answer is simple: = is used for assignement, and == is used for comparisons. For example:

In [9]: a=1

In [10]: a
Out[10]: 1

In [11]: a == 2
Out[11]: False

In [12]: a == 1
Out[12]: True

3.7 Plotting with matplotlib

Finally, let us take a look at the incredibly useful plotting library, matplotlib, that is a part of SciPy. First, we create a list of B values from our list of atom objects:

bvalues =[]
for a in atoms:
    bvalues.append(a.b)

Now we can make plot of B-values vs. atom number:

import pylab
pylab.plot(bvalues)

Matplotlib can do a huge amount of different plots. A great way to get started is to go to the matplotlib gallery at http://matplotlib.sourceforge.net/gallery.html and choose a plot that looks like what you need. Then you can cut and paste the source code of the plot into IPython. Use the %cpaste magic to do this.

Appendix

Complete source code of the final version of pdb.py

class Atom:
    def __init__(self):
        self.name = ''
        self.id = 0
        self.element_number = 6
        self.x = 0.0
        self.y = 0.0
        self.z = 0.0
        self.b = 1.0
        self.occ = 1.0

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(s[55:60])
        self.b = float(s[60:66])

    def __repr__(self):
        s = "Atom: {0}, x,y,z={1},{2},{3}, occ={4}, B={5}"
        s = s.format(self.name,self.x,self.y,self.z,self.occ,self.b)
        return s

def distance(atom1, atom2):
    import math
    dist = math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)
    return dist

if __name__ == '__main__':
    import re
    import sys
    import os

    fnam = sys.argv[1]
    if not os.path.exists(fnam):
        print "file not found, duh!"
        raise SystemExit

    f = open(fnam)
    L = f.readlines()

    atoms = []
    for line in L:
        if re.match('ATOM|HETATM', line):
            a = Atom()
            a.parse_from_pdb(line)
            atoms.append(a)

    XXX = atoms[299]

    for a in atoms:
        print distance(a,XXX)

Date: 2012-01-29 16:22:46 CET

HTML generated by org-mode 6.21b in emacs 23

Python for crystallographers 2

Posted in Biopython, Programming, Python, Tutorial by mok0 on January 28, 2012

2 Python for Crystallographers 2

This is the second in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University in 2011. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

2.1 Classes

Classes are a fundamental feature in all Object Oriented programming languages. C++, Java, Objective-C, Smalltalk, etc. all implement classes. Programmers use classes in different ways. Some classes represent concrete objects, for example atoms, residues, or sequences, but in other cases, classes are used to encapsulate various algorithms, or they produce iterators or are factories of objects. As you use various Python modules, you will meet different class designs.

Let us define a very simple class:

class A:
    a = 1
dir(A)

A Python class is not usually used on its own. You can compare a class to a cookie-cutter, it defines the shape and size of the cookie, but to actually get something concrete out of it, you need to press it into the dough to make a cookie. With classes, you need to instance it to make it useful, and you can create as many instances as you want (unless the class has specific logic to prevent that.)

inst = A()
inst.a

Class A only contains one variable, a. Variables of this kind are called class variables, because they are shared by all instances of the class. This is useful for e.g. constants and other data items, but in general, instances need their own data too.

Here we define a class that has an instance variable b:

class A:
    a = 1
    def __init__(self, name):
        self.name = name

The special method __init__()= is called whenever Python constructs an instance of a class. The variable self refers to the instance object itself, so self.name refers to a variable in the namespace of the instance. The __init__() method can also be called a constructor. When we define a constructor that takes an argument, the instance must be defined by passing one to the constructor. This fails:

inst = A()

but this works:

inst = A("abc")
print inst.name

We can make another instance of class A:

inst2 = A("def")
print inst2.a
print inst2.name

You see that the name instance variable is different in the objects inst and inst2, but the class variable a is the same. If we change a:

A.a = 2
print inst2.a
print inst.a

You will see that the class variable a has now been set to 2 in both instances.

2.1.1 Class inheritance

Now we look at class inheriance. Here is a class that describes a person:

class Person:
    def __init__(self, name, addr, email, phone):
        self.name = name
        self.address = addr
        self.email = email
        self.phone = phone

    def set_email(self, email):
        self.email = email
        print "email sat til", email

The instance variables name, address, email, and phone is defined in the constructor, so it is used like this:

pers1 = Person("morten", "århus", "mok@example.com", 86188180)

We can access the attributes via the instance of the person:

print pers1.name
print pers1.email

In the class definition, we have defined a method to set the email address. A method of this type is called a “setter”:

pers1.set_email('nogetandet@gmail.com')

In this case, the setter does nothing else than change the value of the email variable, so we could have done just this:

pers1.email = 'nogetandet@gmail.com'

however, using getters and setters is normally considered good programming style.

If we want to have a class describing a student, we can make use of the functionality already existing in the Person class. This is called the DRY principle: Don’t Repeat Yourself. A student however, has special attributes that is not used for other persons:

class Student(Person):

    def set_studiekortnummer(self, s):
        self.studie_nr = s

We have added a method to set the studynumber, but the name, address etc. is inherited from the base class.

rune = Student("rune", "ukendt", "rune@gmail", 89421111)

We now have an instance of the Student class called rune. Let us set the student number:

print rune
rune.set_studiekortnummer(123456)
print rune.studie_nr
print rune.name

2.2 Biopython

We will now look at the Biopython package, but first, let us write a small program to fetch a sequence from Uniprot that is purely native Python.

2.2.1 fetch sequence from Uniprot

Here is the content of the program called fetch-files-uniprot.py:

#! /usr/bin/env python 
from utils import parse_list
import urllib
import sys

uniprot = "http://www.uniprot.org/uniprot/"
remote = urllib.FancyURLopener()

if __name__ == "__main__":

    fd = open (sys.argv[1])

    for ent in parse_list(fd):
        print "getting ", ent
        f = remote.open(uniprot+ent+".txt")
        seq = f.read()
        f.close()
        g = open(ent+".txt", "w")
        g.write(seq)
        g.close()
    #. #. 

Let us examine the code. Near the top, you see that we import the function parse_list from the module utils. This is our own Python module, stored in a file utils.py, and it looks like this:

def parse_list(f):
    "Parse our little sequence id list files"
    line = 'begin'
    L = []
    for line in f:
        if '#' in line:
            line,comment = line.split('#',1)
        #.         if len(line) > 0:
            L.append(line.strip())
    #. 
    return L
#. 

This function reads a file, skips comments marked with ‘#’, and returns a list of sequence IDs.

Next, we open a file with the file name sys.argv[1]. This is an attribute from the imported sys module, that contains the parameters given on the command line. So, it our program is invoked like this:

$ python fetch-files-uniprot.py ifns.txt

the value of sys.args[1] will be ‘ifns.txt’. Our program then proceeds by reading entries one by one from the file-like object remote, which is an instance of the object urllib.FancyURLopener() from the urllib module, opened with the URL of the relevant sequence entry. Every sequence entry is then stored in a local file.

2.2.2 Fetch sequence using Bio.Entrez

Now we will fetch some sequences using NCBI’s Entrez database. Biopython provides a convienient interface to Entrez’ API. Here is the program fetch-files-entrez.py:

#! /usr/bin/env python 
from Bio import Entrez
from utils import parse_list
import sys

# See http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html 
file_ext = {'gp':'gb',
            'fasta':'fa',
            'ft':'ft',
            'gpc':'gpc',
            'full':'full',
            }

if __name__ == '__main__':

    fd = open (sys.argv[1])
    Entrez.email = "mok+entrez@example.com"

    for ent in parse_list(fd):
        for rettype in file_ext:
            handle = Entrez.efetch(db="protein", id=ent,
                                   rettype=rettype,
                                   retmode="text")
            fnam = ent + '.' + file_ext[rettype]
            g = open(fnam, 'w')
            g.write(handle.read())
            g.close()
            print "wrote", fnam
        #.     #. #. 

This program is a bit more complicated than the Uniprot fetching program, but follows the same general logic. First, we set a class variable Entrez.email, this is requested by the Entrez service and we do that as a matter of courtesy. As before, we pass through all sequence IDs in the input file, but this time, we add an additional loop inside, that loops through a list of different data return types given in the dictionary file_ext, which at the same time maps a file extension to each return type. After running this program, we are left with a series of files for each sequence ID, for example, for ID P01563 we have:

  -rw-r--r--   1 mok  mok        345 May 30 10:11 P01563.fa
  -rw-r--r--   1 mok  mok       2676 May 30 10:11 P01563.ft
  -rw-r--r--   1 mok  mok      80468 May 30 10:11 P01563.full
  -rw-r--r--   1 mok  mok      17099 May 30 10:11 P01563.gb
  -rw-r--r--   1 mok  mok      56359 May 30 10:11 P01563.gpc
  -rw-r--r--   1 mok  mok      11662 May 30 10:03 P01563.txt

Now, lets go back to the IPython interface and play a bit more with Biopython. First, we will use the SeqIO module:

from Bio import SeqIO
dir (SeqIO) # look at what SeqIO contains 
seq = SeqIO.read("P01563.gb", "genbank")

Now, we have a Biopython sequence object (a “SeqRecord”) named seq containing all the information that was parsed from the input Genbank format file:

dir(seq)
print(seq)

We can look at the sequence description, and we see that the sequence is stored in a special object which also contains the alphabet.

seq.description
seq.seq

The object seq can be indexed using the square bracket notation (like normal Python lists), and it can be iterated in a loop:

print seq[0]
for a in seq:
    print a

The SeqRecord object attribute features contains a list of feature objects, all collected from the Genbank format sequence file:

print seq.features

You can investigate this yourself more, look at the various types of features and what data they contain.

2.3 The special method __repr__

We will briefly look at the special method called __repr()__. This is a method that can be defined in any class, and it defines how the class behaves when passed to the print function, and should simply return a string representation of the object. Let us define this method in our Person class from before:

class Person:
    def __init__(self, a, addr, email, tlf):
        self.name = a
        self.adresse = addr
        self.email = email
        self.telefon = tlf

    def set_email(self, email):
        self.email = email
        print "email set to", email

    def __repr__(self):
        return "Name: " + self.name

Now, if we print the person pers1 from before, Python will print:

Name: morten

instead of a strange technical object description. As the base class is now enhanced, the new capability will automatically be available to the Student class too:

rune = Student("rune", "ukendt", "rune@gmail", 89421111)
print rune

2.4 Function returning tuples

A function can return a tuple:

def f():
    return "a", "b"

print f()
i,j = f()

print i
print j

The above example shows how each element in the tuple returned by function f() can be assigned to different variables i and j.

2.5 Listdir

Python is very suited for all kinds of systems work. The os module contains lots of useful functions for interacting with the operating system. For example, the function listdir() will return a list of files in the directory passed as an argument:

import os

this_dir = os.listdir('.')
tmp_dir = os.listdir('/tmp')
for f in os.listdir('/tmp'):
    print f

Date: 2012-01-26 18:53:42 CET

HTML generated by org-mode 6.21b in emacs 23