mok0's world

Python for crystallographers 3

Posted in Biopython, Programming, Python, Tutorial by mok0 on January 29, 2012

3 Python for crystallographers 3

This is the third in a series of tutorials on Python that I gave at the Department of Molecular Biology, Aarhus University. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython. The transcript does not include the output from Python, so you will have to try it yourself.

3.1 Reading a PDB file

The first project is to read a PDB format file. The following snippet opens a file and reads all of it into memory. The list L will contain one string per line of the file:

fnam = '1au1.pdb'
f = open(fnam)
L = f.readlines()
print L[10]
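The same step can also be written with a with block, which closes the file automatically. The following is only a sketch: to keep it self-contained it first writes a tiny stand-in file, and the file name demo.pdb, its two records and the variable name lines are all made up for illustration. It uses print() with a single argument so that it runs under both Python 2 and Python 3.

```python
import os
import tempfile

# Write a tiny stand-in file so the example runs without 1au1.pdb.
# (The two records below are made up for illustration.)
tmpdir = tempfile.mkdtemp()
fnam = os.path.join(tmpdir, 'demo.pdb')
with open(fnam, 'w') as f:
    f.write('HEADER    DEMO\n')
    f.write('ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N\n')

# The with block closes the file for us, even if an error occurs.
with open(fnam) as f:
    lines = f.readlines()

print(len(lines))          # number of lines read
print(lines[0].rstrip())   # first line, without the trailing newline
```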

But L includes a lot of header lines from the PDB file that we really don’t need. To help filter out the ATOM records, we make use of Python’s regular expression module, called re. Its match() function only matches at the beginning of a string, which is exactly what we need here. We step through all lines in the list L, pick out the lines that start with ‘ATOM’, split each of them into fields, and append the resulting list of fields to another list: data.

import re

data = []
for rec in L:
    if re.match('ATOM', rec):
        data.append(rec.split())
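Because re.match() anchors at the start of the string, the same filter could also be written with the string method startswith(). A small sketch on made-up lines (the list lines below stands in for L):

```python
import re

# Made-up lines standing in for the contents of L.
lines = [
    'REMARK   2 RESOLUTION.',
    'ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N',
    'REMARK ATOM records follow',
]

# re.match() only matches at the start of the string ...
by_regex = [rec for rec in lines if re.match('ATOM', rec)]
# ... so str.startswith() selects the same lines.
by_prefix = [rec for rec in lines if rec.startswith('ATOM')]

print(by_regex == by_prefix)
print(len(by_regex))
```

Note that the third line contains the word ATOM but does not start with it, so neither test selects it.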

We check the length of the list data, and print out a few of the first atoms:

len(data)
data[0]
for i in 1,2,3,4:
    print data[i]

Remember that Python lists are indexed from 0, so data[0] contains the first atom, even though that atom has serial ID number 1.

3.2 Class based approach

Now we will create a more advanced and flexible data structure for atoms. We create a file pdb.py with the following content:

 1: class Atom:
 2:     def __init__(self):
 3:         self.name = ''
 4:         self.id = 0
 5:         self.element_number = 6
 6:         self.x = 0.0
 7:         self.y = 0.0
 8:         self.z = 0.0
 9:         self.b = 1.0
10:         self.occ = 1.0

In line 2 we define the class constructor. In this design, the constructor does not need any parameters. So now we can define an “empty” atom. In IPython, we use the %run magic command to “source” the file pdb.py:

%run pdb
atom = Atom()
print atom.x, atom.y, atom.z, atom.occ, atom.b

This will print out the default values defined in the constructor.

Now, let us add a method to the Atom class that will parse an ATOM record from a PDB file and fill the relevant attributes with the data it contains. Here is the method. It should be added to the class Atom in the file pdb.py:

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(L[9])
        self.b = float(L[10])

Now, in IPython, we run again:

%run pdb
atom = Atom()

We still have the records from the PDB file from before, when we loaded it into a list of strings called L. Let us try to parse one of these lines:

atom.parse_from_pdb(L[2000])
print atom.x, atom.y, atom.z

This loaded the data from element 2000 of the list L (the 2001st line of the file, since indexing starts at 0) into the object atom.

3.3 Version 2

Now we would like to make a more complete standalone program, so we add a main program to the file pdb.py:

 1: class Atom:
 2:     def __init__(self):
 3:         self.name = ''
 4:         self.id = 0
 5:         self.element_number = 6
 6:         self.x = 0.0
 7:         self.y = 0.0
 8:         self.z = 0.0
 9:         self.b = 1.0
10:         self.occ = 1.0
11: 
12:     def parse_from_pdb (self,s):
13:         L = s.split()
14:         self.id = int(L[1])
15:         self.name = L[2]
16:         self.x = float(L[6])
17:         self.y = float(L[7])
18:         self.z = float(L[8])
19:         self.occ = float(L[9])
20:         self.b = float(L[10])
21:     #.
22: 
23: if __name__ == '__main__':
24:     import re
25: 
26:     fnam = '1au1.pdb'
27:     f = open(fnam)
28:     L = f.readlines()
29: 
30:     atoms = []
31:     for line in L:
32:         if re.match('ATOM|HETATM', line):
33:             a = Atom()
34:             a.parse_from_pdb(line)
35:             atoms.append(a)

We open the file as before, and swallow the whole thing using readlines(). This gives a list of lines. In line 30, we create an empty list that will be used to store the atom objects. In line 31 we step through all the strings in list L. Something new happens in line 32. Here we use match() from the regular expression module re to “grep” out lines that start with either ‘ATOM’ or ‘HETATM’. The code inside the if statement is executed only if one of these two keywords is found. If we have an ‘ATOM’ or a ‘HETATM’ record, we instantiate an empty Atom object, and use the parse_from_pdb() method to populate the fields of the object. Finally, we append the newly created object to the list atoms. Now, run this from IPython, and look at some of the elements in list atoms:

%run pdb
print atoms[1]
print atoms[10]

This printout is pretty uninformative. We should add a __repr__() method to the Atom class. It looks like this:

    def __repr__(self):
        s = "Atom: {0}, x,y,z={1},{2},{3}, occ={4}, B={5}"
        s = s.format(self.name,self.x,self.y,self.z,self.occ,self.b)
        return s

Here, the format() method of the str class is used to format the string. Each token marked with curly brackets is replaced by the argument with the corresponding number. There is a whole mini-language that allows very sophisticated string formatting. It can be studied here: http://docs.python.org/library/stdtypes.html#string-formatting

%run pdb
print atoms[1]

which prints output that looks like this:

Atom: CA, x,y,z=24.887,27.143,6.222, occ=1.0, B=41.36
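As a small illustration of the formatting mini-language, a format specification after a colon controls the width and precision of each field. This is only a sketch of one possible layout, using the values from the printout above:

```python
# {1:8.3f} means: argument 1, fixed-point notation, width 8, 3 decimals.
# {0:<4s} means: argument 0, a string, left-aligned in a field of width 4.
s = "Atom: {0:<4s} x={1:8.3f} y={2:8.3f} z={3:8.3f} B={4:6.2f}"
line = s.format('CA', 24.887, 27.143, 6.222, 41.36)
print(line)
```

With fixed widths, the numbers line up in columns when many atoms are printed one after another.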

Now, we can enhance our main program to write out more information:

if __name__ == '__main__':
    import re

    fnam = '1au1.pdb'
    f = open(fnam)
    L = f.readlines()

    atoms = []
    for line in L:
        if re.match('ATOM|HETATM', line):
            a = Atom()
            a.parse_from_pdb(line)
            atoms.append(a)

    for a in atoms:
        print a

However, when we run this, we get a run-time error!

In [7]: %run pdb 1au1.pdb
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/u/mok/Dropbox/python-course-2011/pdb.py in <module>()
     59         if re.match('ATOM|HETATM', line):
     60             a = Atom()
---> 61             a.parse_from_pdb(line)
     62             atoms.append(a)
     63

/u/mok/Dropbox/python-course-2011/pdb.py in parse_from_pdb(self, s)
     18         self.y = float(L[7])
     19         self.z = float(L[8])
---> 20         self.occ = float(L[9])
     21         self.b = float(L[10])
     22     #.

ValueError: invalid literal for float(): 1.00100.00
WARNING: Failure executing file: <pdb.py>

The second-last line gives us a hint as to what is going on. Python cannot convert the string ‘1.00100.00’ to a floating point number. This is a well-known limitation of the PDB format. The fields occupy fixed columns, so when a B factor becomes 100.0 or larger, there is no longer a space between the occupancy field and the B-factor field. Therefore, the logic we used in the parse_from_pdb() method is too simplistic. We can’t simply split the line into space-separated fields using the split() method. We need to be more careful. So, parse_from_pdb() needs to be changed to this:

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(s[55:60])
        self.b = float(s[60:66])

Now, instead of using the split fields to extract occ and b, we extract them directly from the input string by slicing: s[55:60] picks out positions 55 to 59 and s[60:66] picks out positions 60 to 65, counting from 0. (This will work for this file, but the logic is likely to fail if tested on every PDB file we can find, because there are other issues with PDB files that may cause the number of fields to differ.)
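The problem and the fix can be reproduced without a PDB file. The sketch below builds a made-up ATOM record with a B factor of 100.00, laid out in the fixed columns of the PDB format, and then reads the coordinates by column as well. Reading x, y and z by slicing is an extension of what parse_from_pdb() does, not part of the program above.

```python
# Build a made-up ATOM record, laid out in the fixed columns of the
# PDB format (x: 31-38, y: 39-46, z: 47-54, occupancy: 55-60,
# B factor: 61-66, counting from 1).
fmt = "ATOM  {0:5d} {1:<4s} ALA A   1    {2:8.3f}{3:8.3f}{4:8.3f}{5:6.2f}{6:6.2f}"
s = fmt.format(1, 'CA', 24.887, 27.143, 6.222, 1.0, 100.0)

# Whitespace splitting fuses occupancy and B factor into one field.
fields = s.split()
print(fields[-1])

# Slicing by column works regardless of missing spaces.
# Python slices count from 0, so columns 31-38 become s[30:38].
x = float(s[30:38])
y = float(s[38:46])
z = float(s[46:54])
occ = float(s[55:60])   # same slice as in parse_from_pdb()
b = float(s[60:66])
print(x)
print(occ)
print(b)
```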

3.4 A more complete program

Now, let us make the program more versatile, so that we can specify the PDB file from the command line. We change the main program to look like this:

 1: if __name__ == '__main__':
 2:     import re
 3:     import sys
 4:     import os
 5: 
 6:     fnam = sys.argv[1]
 7:     if not os.path.exists(fnam):
 8:         print "File", fnam, "not found, duh!"
 9:         raise SystemExit
10: 
11:     f = open(fnam)
12:     L = f.readlines()
13: 
14:     atoms = []
15:     for line in L:
16:         if re.match('ATOM|HETATM', line):
17:             a = Atom()
18:             a.parse_from_pdb(line)
19:             atoms.append(a)
20: 
21:     for a in atoms:
22:         print a

In line 6 we use the sys module to access the (first) command line argument. This is assumed to be a file name, but in line 7 we double check to make sure the file exists, by using exists() from the (extremely useful) os.path module. If the file is not found, we print out an error message, and stop the program by raising an exception in line 9. Now the program is more general, and can be used to read in any PDB file.
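The program still fails with an IndexError if it is started without any argument, because sys.argv then contains only the program name. One possible guard is sketched below as a helper function. The name get_filename() and the wording of the messages are made up for this sketch and are not part of pdb.py.

```python
import os

def get_filename(argv):
    """Return the file name given in argv, or None if it is unusable.

    get_filename() is a hypothetical helper for this sketch.
    """
    if len(argv) < 2:
        print("usage: pdb.py file.pdb")
        return None
    fnam = argv[1]
    if not os.path.exists(fnam):
        print("File {0} not found".format(fnam))
        return None
    return fnam

# Simulated argument lists; a real program would pass sys.argv.
print(get_filename(['pdb.py']))
print(get_filename(['pdb.py', 'no_such_file_here.pdb']))
```

In the main program, one would call get_filename(sys.argv) and raise SystemExit if it returns None.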

3.5 Distance between atoms

We want to be able to compute the distance between two atoms. We add a function to pdb.py to do this. The function expects to be passed two objects that have attributes x, y and z. It then computes the Euclidean distance between these two points.

def distance(atom1, atom2):
    import math
    dist = math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)
    return dist

We need to import the math module so we can use its square root function, sqrt(). We could also have written the function as a method of the Atom class, in which case the call would be:

d = atom1.distance(atom2)
print d

This is a matter of design. We choose to write a function, which is called like this:

d = distance(atom1, atom2)
print d
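A quick way to check the function is to use two points whose distance is known. The sketch below uses a bare stand-in class, Point, which is not part of pdb.py. That is sufficient because distance() only needs the attributes x, y and z.

```python
import math

class Point:
    """Stand-in for Atom; only the x, y, z attributes matter here."""
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

def distance(atom1, atom2):
    dist = math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)
    return dist

p = Point(0.0, 0.0, 0.0)
q = Point(3.0, 4.0, 0.0)
print(distance(p, q))   # a 3-4-5 triangle, so the distance is 5.0
```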

We can then compute the distance of all atoms to an arbitrarily chosen atom, number 300. The full source code of the program is listed in the appendix. We run the program from within IPython like this:

%run pdb 1au1.pdb
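Once the distances are available, a natural next step is to select the atoms that lie close to the reference atom. The following is only a sketch of that idea: the Point class, the coordinates in demo_atoms and the cutoff of 4.0 are all made up for illustration.

```python
import math

class Point:
    """Stand-in for Atom (hypothetical; only x, y, z are needed)."""
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

def distance(atom1, atom2):
    return math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)

# Made-up coordinates standing in for the list atoms.
demo_atoms = [Point(0.0, 0.0, 0.0), Point(1.0, 0.0, 0.0),
              Point(0.0, 2.0, 0.0), Point(0.0, 0.0, 10.0)]
ref = demo_atoms[0]     # the reference atom
cutoff = 4.0            # arbitrary cutoff, in the units of the coordinates

# Indices of all other atoms that are closer to ref than the cutoff.
near = []
for i, a in enumerate(demo_atoms):
    if a is not ref and distance(a, ref) < cutoff:
        near.append(i)
print(near)
```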

3.6 Difference between = and ==

The answer is simple: = is used for assignment, and == is used for comparisons. For example:

In [9]: a=1

In [10]: a
Out[10]: 1

In [11]: a == 2
Out[11]: False

In [12]: a == 1
Out[12]: True
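A common slip is to write = where == is meant, for example in an if statement. Python rejects this as a syntax error before the code runs. The sketch below demonstrates that by handing the faulty statement to the built-in compile() function as a string, so that the example itself remains valid Python:

```python
a = 1              # assignment: bind the name a to the value 1
print(a == 1)      # comparison: evaluates to True or False

# Using = where == is meant is rejected before the code even runs.
try:
    compile("if a = 1: pass", "<example>", "exec")
    accepted = True
except SyntaxError:
    accepted = False
print(accepted)
```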

3.7 Plotting with matplotlib

Finally, let us take a look at the incredibly useful plotting library matplotlib, which is part of the SciPy ecosystem. First, we create a list of B values from our list of atom objects:

bvalues = []
for a in atoms:
    bvalues.append(a.b)
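The same list can be built in one line with a list comprehension. A small sketch follows. The class DemoAtom and the B values are made up so that the example runs on its own, and the names demo_atoms and demo_bvalues are chosen so as not to overwrite the lists from the session.

```python
class DemoAtom:
    """Cut-down stand-in for the Atom class in pdb.py."""
    def __init__(self, b):
        self.b = b

# Made-up B values for illustration.
demo_atoms = [DemoAtom(41.36), DemoAtom(38.20), DemoAtom(100.0)]

# One B value per atom, in the same order as the atoms.
demo_bvalues = [a.b for a in demo_atoms]
print(demo_bvalues)
print(max(demo_bvalues))
```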

Now we can make a plot of B-values vs. atom number:

import pylab
pylab.plot(bvalues)

Matplotlib can produce a huge number of different kinds of plots. A great way to get started is to go to the matplotlib gallery at http://matplotlib.sourceforge.net/gallery.html and choose a plot that looks like what you need. Then you can cut and paste the source code of the plot into IPython. Use the %cpaste magic to do this.

Appendix

Complete source code of the final version of pdb.py

class Atom:
    def __init__(self):
        self.name = ''
        self.id = 0
        self.element_number = 6
        self.x = 0.0
        self.y = 0.0
        self.z = 0.0
        self.b = 1.0
        self.occ = 1.0

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(s[55:60])
        self.b = float(s[60:66])

    def __repr__(self):
        s = "Atom: {0}, x,y,z={1},{2},{3}, occ={4}, B={5}"
        s = s.format(self.name,self.x,self.y,self.z,self.occ,self.b)
        return s

def distance(atom1, atom2):
    import math
    dist = math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)
    return dist

if __name__ == '__main__':
    import re
    import sys
    import os

    fnam = sys.argv[1]
    if not os.path.exists(fnam):
        print "file not found, duh!"
        raise SystemExit

    f = open(fnam)
    L = f.readlines()

    atoms = []
    for line in L:
        if re.match('ATOM|HETATM', line):
            a = Atom()
            a.parse_from_pdb(line)
            atoms.append(a)

    XXX = atoms[299]

    for a in atoms:
        print distance(a,XXX)

Date: 2012-01-29 16:22:46 CET

HTML generated by org-mode 6.21b in emacs 23