mok0's world

Python for crystallographers 5

Posted in Biopython, Programming, Python, Tutorial by mok0 on February 1, 2012

5 Python for crystallographers 5

This is the fifth and last in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

5.1 Variable function arguments

Variable arguments can be used in a function when the number of arguments that the function needs to process cannot be predicted. The archetypical example is the C printf function, which accepts a format string as the first parameter, and any number of variables following that. In Python, variable function arguments are specified in the function definition by prefixing the name of the formal parameter with an asterisk:

def func(*args):
    print type (args)
    for i in args:
        print i,

From the point of view of the function, the passed parameters are contained in a tuple:

In [6]: func(1,2,'a')
<type 'tuple'>
1 2 a

In [7]: func(1,2,3,4,5,6)
<type 'tuple'>
1 2 3 4 5 6

The function must of course know what to do with the varying number of arguments, and that is the job of the programmer.

5.2 Keyworded function arguments

This section introduces keyworded function arguments.

Keyworded arguments are somewhat similar to variable argument lists from the previous section, however, the function identifies the arguments by the parameter name, and arguments kan be skipped or they can be placed out of order. This is often a convenience to the programmer, and keyworded arguments are typically used to set various options in the function. Keyworded function arguments are specified in the function definition by prefixing the name of the formal parameter with a double asterisk:

def func1 (**kwargs):
    print type(kwargs)
    for k in kwargs:
        print k, kwargs[k],

From the point of view of the function, the passed parameters are contained in a dictionary. Let’s try to call this function using various keyworded arguments (recall that IPython’s %cpaste magic can be used to insert code snippets from the clipboard):

func1(x=1)
func1(residues=100)
func1(residues=100, atoms=1000)
func1(atoms=1000,residues=100)
In [11]: func1(x=1)
<type 'dict'>
x 1

In [12]: func1(residues=100)
<type 'dict'>
residues 100

In [13]: func1(residues=100, atoms=1000)
<type 'dict'>
residues 100 atoms 1000

In [14]: func1(atoms=1000,residues=100)
<type 'dict'>
residues 100 atoms 1000

Quite typically, keyworded arguments are used to set optional parameters, but the function might still want one or more fixed and required parameters. A template example of such a use is given below:

def func2 (a, b, **kwargs):
    pass

and an example of a function call might look like this:

func2(filenam, 0.2, atoms=1000,residues=100)

5.3 Exceptions

Python is extremely smart about errors that arise during excecution of programs. When programs written in compiled languages such as C encounter an error, for example if a number is divided by zero, the operating system simply terminates the process. The programmer then needs to track down where the error happened, fix it, and recompile the program.

When running Python interactively, for example via IPython, the process survives, but it throws an exception, meaning that it stops what is was doing when the error happened. But since the process survives, you can examine the content of the variables, for example by printing them, and this can give vital clues about what went wrong. Of course, you can also use the Python debugger which streamlines this process, enables you to set breakpoints in the program and so on. Curiously to crystallographers, the Python debugger is called PDB. However, this subject is outside the scope of this presentation. IPython has a smart interface to the Python debugger, which you are encouraged to check out.

Here, we attempt to divide by zero:

In [15]: 1/0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)

/Users/mok/dropbox/python-course-2011/<ipython console> in <module>()

ZeroDivisionError: integer division or modulo by zero

Python does not like that, and raises an exception named ZeroDivisionError. The programmer can also raise exceptions, both predefined, or home-made exceptions. Here, we try to raise at ZeroDivisionError:

In [16]: raise ZeroDivisionError
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)

/Users/mok/dropbox/python-course-2011/<ipython console> in <module>()

ZeroDivisionError:

It is seen that an exception is thrown, but Python really doesn’t know the reason (there is no associated message.)

The very smart thing about exceptions is that you can catch them, and let the program deal with the unexpected situation. This is done in a
try/except block, for example:

a=10
b=2
try:
    f = a/b
except ZeroDivisionError:
    print "oops!"

After the try/except block, everything is normal, and the value of f is 5. Let us try again, this time setting b to zero:

b = 0
try:
    f = a/b
except:
    print "oops!"

The output written this time is “oops!”, and the value of f is unchanged (i.e. 5 if you follow the input above.)

5.3.1 The useful SystemExit exception

A quite common construction in a Python main program is to exit in the very beginning, if the user fails to provide the proper command line arguments, or, as in the following example, the specified file does not exist. When the SystemExit exception is raised, the Python process terminates:

    fnam = sys.argv[1]
    if not os.path.exists(fnam):
        print "file not found, duh!"
        raise SystemExit

5.4 Superposition

In this section, we will take a look at the Superimposer() class that comes with Biopython.

The following code is the main program from the Biopython source file Bio/PDB/Superimposer.py, that we copy-paste into its own file called superimposer.py. (The Python standard programming guidelines encourages use of lower-case letters for file names.)

 1: import sys
 2: from Bio.PDB import PDBParser, Selection, Superimposer
 3: import numpy
 4: 
 5: if __name__=="__main__":
 6: 
 7:     p=PDBParser()
 8:     s1=p.get_structure("FIXED", sys.argv[1])
 9:  fixed=Selection.unfold_entities(s1, "A")
10: 
11:     s2=p.get_structure("MOVING", sys.argv[1])
12:  moving=Selection.unfold_entities(s2, "A")
13: 
14:  rot=numpy.identity(3).astype('f')
15:  tran=numpy.array((1.0, 2.0, 3.0), 'f')
16: 
17:  for atom in moving:
18:         atom.transform(rot, tran)
19: 
20:     sup=Superimposer()
21: 
22:  sup.set_atoms(fixed, moving)
23: 
24:     print sup.rotran
25:     print sup.rms
26: 
27:  sup.apply(moving)

The above example is only mildly interesting, because it simply superimposes a molecule on a copy of itself, which has been translated by a small amount. The result of the calculation should therefore be the identity matrix, and a small translation in the opposite direction. We will none-the-less go through the program and examine what it’s doing.

The first few lines ought to be familiar by now. The PDB file from the command line is opened twice, and its atoms are placed in two Structure() instances s1 and s2.

In lines 9 and 12 we are using the convenient Selection() class, that can be used to access the various levels of the SMCRA hierachy. We use the method called unfold_entities(), which can unfold a list of entities to a list of entities of another level. For example:

  • list of atoms -> list of residues
  • list of residues -> list of atoms
  • list of residues -> list of chains

In this case, we unfold a Structure() instance to a list of atoms, specified by the “A” in parameter 2. Those lists of atoms are stored in variables moving and fixed.

In line 14 we define the identity matrix, and in line 15 we define a translation vector (1.0, 2.0, 3.0). That rotation/translation operator is applied to the moving list of atoms in line 17, and after this operation all atoms in that list have been shifted by the translation vector given above.

Now we instance the Superimposer() class in the variable sup, and feed two lists of atoms to it (line 22). The rotation/translation operator and root mean squared deviation is printed, and finally, in line 27, the translation in sup is applied to the moving list, which should result in the atoms being shifted back to their original positions.

It is left as an exercise for the reader to generalize this program so it can take two different PDB files and/or two different chains.

5.5 Neighbor search

Now lets take a look at the NeighborSearch class from Biopython. This is a useful class that can search entities (that can be Atom, Residue, etc. objects) within a certain distance.

To remind yourself how smart IPython is (assuming you are running it) let us first check the documentation for NeighborSearch.

from Bio.PDB import NeigborSearch
NeigborSearch?

Recall that IPython has extensive support of Tab-completion. When typing the above, it is actually sufficient to type:

Nei<TAB>?

and IPython will automatically expand “Nei” to “NeighborSearch”.

Now, load a Structure object from a PDB file, and as in the previous example, unfold that structure into a list of atoms.

from Bio.PDB import Selection
s = p.get_structure("test", "1zen.pdb")
atoms = Selection.unfold_entities(s, "A")

Now, choose a random atom in that structure. Below, we choose atom number 30, which happens to be the OH atom of residue 3, which is a Tyr. The get_coord() method returns a Numpy array containing that atom’s coordinates.

1: center = atoms[30].get_coord()
2: ns = NeighborSearch(atoms) 
3: print ns.search (center, 5, "A") 
4: print ns.search (center, 5, "R")

In line 2, we prime an instance ns of the NeighborSearch class with the list of atoms extracted from PDB structure 1zen. Then, in line 3, we print out the list of atoms within 5.0 Å of the center (atom OH of Tyr 3). In the following line, we print the list of residues found by the search within that radius:

Out[39]:
[<Residue TYR het=  resseq=3 icode= >,
 <Residue LYS het=  resseq=105 icode= >,
 <Residue GLU het=  resseq=53 icode= >,
 <Residue LYS het=  resseq=52 icode= >]

NeighborSearch is a very useful class solving an often encountered problem, that you can use in your programs.

5.6 Fragment search

For the next example, we will take a look at Biopython’s Fragmentmapper class, and we shall use it to write a small program that performs a calculation more or less like the Lego_ca command in O: step through the polypeptide chain one residue at the time, and find the best fitting fragment from a library of fragments.

The FragmentMapper class requires that you download and install the fragment database first. It consists of a number of text files, that should be placed in a directory somewhere. You can download the fragment library from http://csb.stanford.edu/rachel/fragments/. In the following, it is assumed that these files are placed in a directory called fragment_data in the current working directory, but it could be anywhere on your file system. Here is the program fragmentmapper.py:

 1: from Bio.PDB import PDBParser, FragmentMapper, Selection
 2: import sys
 3: 
 4: if __name__=="__main__":
 5: 
 6:     p=PDBParser()
 7:     s=p.get_structure("X", sys.argv[1])
 8: 
 9:     m=s[0]
10:  fm=FragmentMapper(m, 10, 5, "fragment_data")
11: 
12:  for r in Selection.unfold_entities(m, "R"):
13:         print r,
14:         if fm.has_key(r):
15:             print fm[r]
16:         else:
17:             print

Most of the elements of this program should be familiar. The program accepts the file name of a PDB entry on the command line. In line 10, the FragmentMapper class is instanced. It receives a Model object, the size of the library (10 in this case), and the length of the fragments (we use 5 here, same as the Lego_* system from O). The last parameter is the directory where the fragment library files are stored.

In the loop starting at line 12 the model m is unfolded residue by residue, and each is looked up in the fragment mapper object, which functions like a Python dictionary. If there is a match, the fragment object is printed. The output of the program looks something like this:

<Residue SER het=  resseq=1 icode= >
<Residue LYS het=  resseq=2 icode= >
<Residue ILE het=  resseq=3 icode= > <Fragment length=5 id=9>
<Residue PHE het=  resseq=4 icode= > <Fragment length=5 id=9>
<Residue ASP het=  resseq=5 icode= > <Fragment length=5 id=7>
<Residue PHE het=  resseq=6 icode= > <Fragment length=5 id=3>
...

The first two residues don’t have any matches, the first five-residue fragment match is at residue 3, because the position of the match is the central residue of the five. In other words, fragment id 9 is the best match of residues 1-5, and also the best match of residues 2-6, whereas the best match of residues 3-7 is fragment id 7.

5.7 Shell utilities

Finally, we will introduce a very useful module that is distributed with Python. It is the shutil module, containing various utility functions e.g. for copying files and directory trees.

import shutil
dir(shutil)

Some of the functions available in shutil are:

copy(src, dst)
copy data and mode bits fra src to dst. Same as
“cp src dst” from the command line.
copy2(src, dst)
copy data and all stat info (same as “cp -p src dst”).
move(src, dst)
Recursively move a file or directory to another location.

The shutil module is thus convenient for performing shell-like operations from within Python.


Date: 2012-02-01 12:08:44 CET

HTML generated by org-mode 6.21b in emacs 23

Advertisements

Python for crystallographers 3

Posted in Biopython, Programming, Python, Tutorial by mok0 on January 29, 2012

3 Python for crystallographers 3

This is the third in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

3.1 Reading a PDB file

The first project is to read a PDB format file. The following snippet of code opens a file and swallows it. The list L will contain a list of lines:

fnam = '1au1.pdb'
f = open(fnam)
L = f.readlines()
print L[10]

But L includes a lot of header lines from the PDB file that we really don’t need. To help filter out the ATOM records, we make use of Pythons regular expression module, called re. Then we step through all lines in the list L, extract the lines that start with ‘ATOM’, and append those to another list: data.

import re

data = []
for rec in L:
    if re.match('ATOM', rec):
        data.append(rec.split())

We check the length of the list data, and print out a few of the first atoms:

len(data)
data[0]
for i in 1,2,3,4:
    print data[i]

Remember, that Python arrays start at 0, so data[0] contains the first atom, but that atom has serial ID number 1.

3.2 Class based approach

Now we will create a more advanced and flexible data structure for atoms. We create a file pdb.py with the following content:

 1: class Atom:
 2:  def __init__(self):
 3:         self.name = ''
 4:         self.id = 0
 5:         self.element_number = 6
 6:         self.x = 0.0
 7:         self.y = 0.0
 8:         self.z = 0.0
 9:         self.b = 1.0
10:         self.occ = 1.0

In line 2 we define the class constructor. In this design, the constructor does not need any parameters. So now we can define an “empty” atom. In IPython, we use the %run magic command to “source” the file pdb.py:

%run pdb
atom = Atom()
print a.x, a.y, a.z, a.occ, a.b

This will print out the default values defined in the constructor.

Now, let us add a method to the Atom class that will parse an ATOM record from a PDB file, and fill the relevant attributes with the relevant data. Here is the method, it should be addede to the class Atom in the file pdb.py:

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(L[9])
        self.b = float(L[10])

Now, in IPython, we run again:

%run pdb
atom = Atom()

We still have a record from the PDB file from before, when we loaded the PDB file into a list of strings called L. Let us try to parse one of these lines:

atom.parse_from_pdb(L[2000])
print atom.x, atom.y, atom.z

This loaded the data from line 2000 in the file into the object atom.

3.3 version 2

Now we would like to make a more complete standalone program, so we add a main program to the file pdb.py:

 1: class Atom:
 2:     def __init__(self):
 3:         self.name = ''
 4:         self.id = 0
 5:         self.element_number = 6
 6:         self.x = 0.0
 7:         self.y = 0.0
 8:         self.z = 0.0
 9:         self.b = 1.0
10:         self.occ = 1.0
11: 
12:     def parse_from_pdb (self,s):
13:         L = s.split()
14:         self.id = int(L[1])
15:         self.name = L[2]
16:         self.x = float(L[6])
17:         self.y = float(L[7])
18:         self.z = float(L[8])
19:         self.occ = float(L[9])
20:         self.b = float(L[10])
21:     #. 22: 
23: if __name__ == '__main__':
24:     import re
25: 
26:     fnam = '1au1.pdb'
27:     f = open(fnam)
28:     L = f.readlines()
29: 
30:  atoms = []
31:  for line in L:
32:  if re.match('ATOM|HETATM', line):
33:             a = Atom()
34:             a.parse_from_pdb(line)
35:             atoms.append(a)

We open the file as before, and swallow the whole thing using readlines(). This gives a list of lines. In line 30, we create an empty list that will be used to store the atom objects. In line 31 we step through all the strings in list L. Something new happens in line 32. Here we use match() of the regular expression module re to “grep” out lines that start with either ‘ATOM’ or ‘HETATM’. Only if one of these two keywords are found, we will execute the code inside that if statement. If we have an ‘ATOM’ or a ‘HETATM’ record, we instance an empty Atom object, and use the parse_from_pdb() method to populate the fields of the object. Finally, we append the newly created object to the list atoms. Now, run this from IPython, and look at some of the elements in list atoms:

%run pdb
print atoms[1]
print atoms[10]

This printout is pretty uninformative. We should add a __repr()__ method to the Atom class. It looks like this:

    def __repr__(self):
        s = "Atom: {0}, x,y,z={1},{2},{3}, occ={4}, B={5}"
        s = s.format(self.name,self.x,self.y,self.z,self.occ,self.b)
        return s

Here, the format() method of the str class is used to format the string. In this case, the tokens marked with curly brackets will be formatted with the corresponding argument. There is a whole mini-language that allows very sophisticated string formatting, it can be studied here: http://docs.python.org/library/stdtypes.html#string-formatting

%run pdb
print atoms[1]

which prints output that looks like this:

Atom: CA, x,y,z=24.887,27.143,6.222, occ=1.0, B=41.36

Now, we can enhance our main program to write out more information:

if __name__ == '__main__':
    import re

    fnam = '1au1.pdb'
    f = open(fnam)
    L = f.readlines()

    atoms = []
    for line in L:
        if re.match('ATOM|HETATM', line):
            a = Atom()
            a.parse_from_pdb(line)
            atoms.append(a)

    for a in atoms:
        print a

However, when we run this, we get a run-time error!

In [7]: %run pdb 1au1.pdb
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call
last)

/u/mok/Dropbox/python-course-2011/pdb.py in <module>()
     59         if re.match('ATOM|HETATM', line):
     60             a = Atom()
---> 61             a.parse_from_pdb(line)
     62             atoms.append(a)
     63

/u/mok/Dropbox/python-course-2011/pdb.py in parse_from_pdb(self, s)
     18         self.y = float(L[7])
     19         self.z = float(L[8])
---> 20         self.occ = float(L[9])
     21         self.b = float(L[10])
     22     #.

ValueError: invalid literal for float(): 1.00100.00
WARNING: Failure executing file: <pdb.py>

The second-last line gives us a hint to what is going on. Python can not convert the string ‘1.00100.00’ to a floating point number. This is a well known limitation of the PDB format. When a B factor becomes 100.0 or larger, there is no longer a space between the occupancy field and the B-factor field. Therefore, the logic we used in the parse_from_pdb() method is too simplistic. We can’t simply split the line into space-separated fields using the split() method. We need to be more careful. So, parse_from_pdb() needs to be changed to this:

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(s[55:60])
        self.b = float(s[60:66])

Now, instead of using the split fields to extract occ and b, we extract them directly from positions 55 to 59 and 60 to 65 in the input string. (This will work for this file, but the logic is likely to fail if tested on every PDB file we can find, because there are other issues with PDB files that may cause the number of fields to be different.)

3.4 A more complete program

Now, let us make the program more versatile, so that we can specify the PDB file from the command line. We change the main program to look like this:

 1: if __name__ == '__main__':
 2:     import re
 3:     import sys
 4:     import os
 5: 
 6:  fnam = sys.argv[1]
 7:  if not os.path.exists(fnam):
 8:         print "File", fnam, "not found, duh!"
 9:         raise SystemExit (ref:exit)
10: 
11:     f = open(fnam)
12:     L = f.readlines()
13: 
14:     atoms = []
15:     for line in L:
16:         if re.match('ATOM|HETATM', line):
17:             a = Atom()
18:             a.parse_from_pdb(line)
19:             atoms.append(a)
20: 
21:     for a in atoms:
22:         print a

In line 6 we use the sys module to access the (first) command line argument. This is assumed to be a file name, but in line 7 we double check to make sure the file exists, by using exists() from the (extremely useful) os.path module. If the file is not found, we print out an error message, and stop the program by raising an exception in line nil. Now the program is more general, and can be used to read in any PDB file.

3.5 Distance between atoms

We want to be able to compute the distance between two atoms. We add a function to pdb.py to do this. The function assumes to be passed two objects that have attributes x, y and z, it then does the math computes the distance between these points.

def distance(atom1, atom2):
    import math
    dist = math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)
    return dist

We need to import the math module so we can look up the square root. We could also have written the function as a method of the Atom class, in which case the call would be:

d = atom1.distance(atom2)
print d

This is a matter of design. We choose to write a function, that is called like this:

d = distance(atom1, atom2)
print d

We can then compute the distance of all atoms to (a random) atom number 300 The full source code of the program is listed in the appendix. We run the program from within IPython like this:

%run pdb 1au1.pdb

3.6 Difference between = and ==

The answer is simple: = is used for assignement, and == is used for comparisons. For example:

In [9]: a=1

In [10]: a
Out[10]: 1

In [11]: a == 2
Out[11]: False

In [12]: a == 1
Out[12]: True

3.7 Plotting with matplotlib

Finally, let us take a look at the incredibly useful plotting library, matplotlib, that is a part of SciPy. First, we create a list of B values from our list of atom objects:

bvalues =[]
for a in atoms:
    bvalues.append(a.b)

Now we can make plot of B-values vs. atom number:

import pylab
pylab.plot(bvalues)

Matplotlib can do a huge amount of different plots. A great way to get started is to go to the matplotlib gallery at http://matplotlib.sourceforge.net/gallery.html and choose a plot that looks like what you need. Then you can cut and paste the source code of the plot into IPython. Use the %cpaste magic to do this.

Appendix

Complete source code of the final version of pdb.py

class Atom:
    def __init__(self):
        self.name = ''
        self.id = 0
        self.element_number = 6
        self.x = 0.0
        self.y = 0.0
        self.z = 0.0
        self.b = 1.0
        self.occ = 1.0

    def parse_from_pdb (self,s):
        L = s.split()
        self.id = int(L[1])
        self.name = L[2]
        self.x = float(L[6])
        self.y = float(L[7])
        self.z = float(L[8])
        self.occ = float(s[55:60])
        self.b = float(s[60:66])

    def __repr__(self):
        s = "Atom: {0}, x,y,z={1},{2},{3}, occ={4}, B={5}"
        s = s.format(self.name,self.x,self.y,self.z,self.occ,self.b)
        return s

def distance(atom1, atom2):
    import math
    dist = math.sqrt((atom1.x-atom2.x)**2
                     +(atom1.y-atom2.y)**2
                     +(atom1.z-atom2.z)**2)
    return dist

if __name__ == '__main__':
    import re
    import sys
    import os

    fnam = sys.argv[1]
    if not os.path.exists(fnam):
        print "file not found, duh!"
        raise SystemExit

    f = open(fnam)
    L = f.readlines()

    atoms = []
    for line in L:
        if re.match('ATOM|HETATM', line):
            a = Atom()
            a.parse_from_pdb(line)
            atoms.append(a)

    XXX = atoms[299]

    for a in atoms:
        print distance(a,XXX)

Date: 2012-01-29 16:22:46 CET

HTML generated by org-mode 6.21b in emacs 23

Python for crystallographers 1

Posted in Programming, Python, Tutorial by mok0 on January 26, 2012

1 Python for Crystallographers 1

This is the first in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University in 2011. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

1.1 IPython

Python has a quite decent interactive mode that offers readline editing of your input. However, IPython is an extension of Python that offers some very convenient features. IPython has a number of so-called “magic” commands that can be listed using

%lsmagic

Magic commands start with the percent sign, but if the automagic mode is activated (IPython uses that mode by default) you can omit it. I will activate logging of my session using the magic:

logstart

There are several very good screencasts for IPython at http://showmedo.com, search for tutorials by Jeff Rush and unpingco. You will see that you can do some very cool stuff with IPython!

1.2 Simple Python variables

Python has simple variables, integers, floats, complex and strings. Here is an integer:

a = 1

and a float:

f = 1.

We can see what variables we have assigned with the magic commands %who and %whos. The latter offers more detail.

who

Strings are defined by strings surrounded by quotes or double quotes, so “python” and ‘python’ is the same thing. If you need to put quotes or double quotes inside a string variable, you can do that like this:

s = '"python"'
s = "python's way"

Complex numbers are defined like this:

b = complex(1,5)

A complex number has real and imaginary parts. These are stored in the object attributes real and imag:

b.imag
b.real

A complex number can be used in arithmetic expressions as expected:

a = 3.4
b*a
b*b

1.3 Compound objects

1.3.1 Lists

Pythons true strength comes from the powerful built-in compound objects. First, we have the list, which is defined using square brackets.

L = []

This defines an empty list. When running interactively, you can inspect the value of objects by typing their name,

L

but when running a Python program, you need to print it:

print(L)

Here is a list containing integers:

L = [2,4,6,8]

We can append numbers to the list:

L.append(100)

This also demonstrates that the list object has a number of associated methods. Here we used the “append” method. It is like a function that implicitly operates on that element. You can inspect any Python object using the dir() built-in:

dir (L)

This concept is called introspection. From the dir() output, we find a method called insert(). To see what that does, we use an IPython help feature: typing a question mark after an object name prints a small help text:

L.insert?

Here we insert an element in L after element 3:

L.insert(3, "morten")

This illustrates that a list can contain any type of object, in any mixture. Lists can also contain lists:

M=['hej dude']
M.append(L)
M.append('hej')

We can access the individual elements of a list using the square bracket notation:

L[1]
L[0]
L[4]

We can step through (“iterate”) a list in a for loop:

for x in L:
   print x

1.3.2 Dictionaries

Another very useful built-in compound object is the dictionary. A dictionary is defined using curly brackets:

D = {1:'ditlev', 2:'rune'}

Here, the numbers 1 and 2 are the keys of the dictionary. The strings ‘ditlev’ and ‘rune’ are the values. A key can be any immutable object, like strings:

D['ditlev'] = 'brodersen'

Values of a dictionary can be any object, for example also our list L from before:

D = {1:'ditlev', 2:'rune', 'ditlev':'brodersen', 'mylist':L}

We can iterate through a dictionary in a for loop. Here, we use two dicts ‘isoelectric’ and name that I defined when preparing for this session. (I saved this object earlier using the magic command %store, then it is loaded automatically when IPython starts up.) [Readers of the on-line tutorial can find this and other data in the Appendix.]

for a in isoelectric:
   print a, name[a], isoelectric[a]

1.3.3 Tuples

Tuples are exactly like lists, except they are immutable (elements can not be modified once defined.) A tuple is defined using parentheses:

T = (42,43,44,2)

Tuples are used extensively internally in Python, for example when passing arguments to functions.

1.3.4 Sets

The set is a very useful Python object. You define it like this:

S = set()

This is an empty set. The set constructor can also take a sequence (e.g. lists, tuples)

S = set([1,2,3,4,5,6])

Sets are unique, an element can exist only once. The set constructor silently discards duplicate elements:

S = set([100,2,2,2,2,23,4,5,6])

The Set() class has an add() method:

S.add("morten")

This illustrates that a set can contain different types of objects. The add() method is similar to the append() method of lists, but “append” doesn’t make sense for a set, since it is unordered.

You can iterate through a set in a for loop:

for i in S:
   print i

Sets are very powerful. You can use the well-known set operations like union, difference and intersection. Previously, I have defined the set aa containing one-letter codes of the 20 amino acids:

aa = set("ACDEFGHIKLMNPQRSTVWY")

(NOTE: The set aa defined above does not contain the string “ACD…”! To achieve that I would need to use the constructor set([“ACD…”]). That is because the set constructor takes 1 argument which is an iterable object, and the string “ACDE…” iterates to “A”, “C”, …)

We now define sets of the acidic and basic amino acids:

basic = set (('K', 'R'))
acidic = set(('D', 'E'))

We can find the set of “neutral” amino acids:

neutral = aa - acidic - basic

The set phobic, which I have defined previously, has the set of hydrophobic amino acids. The hydrophilic set is thus:

philic = aa - phobic

The set of amino acids, that is hydrophilic but not charged is:

philic - (acidic|basic)

The set of amino acids without a side chain charge is:

aa - acidic - basic

1.4 Copying objects

Python is very economical with computer memory, and tries to reference the same memory locations if at all possible. This gives a few unexpected effects that you need to know about. It is normally never any problem.

Simple objects (ints, strings, etc.) can be copied like you expect:

a = 10
b = a

We have assigned a to b, and they are separate, independent objects. If we subsequently alter b, a will stay the same. This is not the case for compound objects, and more complicated objects. For example, create a list:

A = range(10)

(this gives a list of integers 0-9), and make an assignment to B:

B = A

Next, let’s change element 1 of B:

B[1] = 100

You will notice that element 1 of A changes too! So, Python is using the same memory for lists A and B, and B is just another reference to that memory.

We can get around that by using the copy module:

import copy

If we now copy A to B, it will work as expected:

B = copy.copy(A)
B[1] = 200

and A stays the same. However, copy.copy() is a shallow copy. If A contains anything but simple objects (for example another list), we need to use a deep copy. For example, this fails:

C = [A,B]
D = copy.copy(C)
D[1][1] = 1000
print  C

but using the deep copy will work:

D = copy.deepcopy(C)
D[1][1] = 2000
print C
print D

1.5 Modules

Since we already met the copy module above, let us briefly introduce the concept of modules in Python. In short, a module is a collection of variables, functions and classes that you can load. An often used module is os. It contains a bunch of objects needed if you want to interact with the operating system. Modules need to be imported:

import os

Now, the power of the os module is available to us. You can inspect its content using dir():

dir(os)

Let’s take a look at the uname() object from the os module. This basically does the same information that is given when you type uname from the shell:

x = os.uname()
type(x)

You see, that x is a tuple. We can access its elements:

hostname = x[1]
print hostname

This was an example of a core module (distributed with Python). There is a very large number of third-party Python packages. One example, that we will study more later, is numpy, Numerical Python.

import numpy
dir(numpy)

You can see that Numpy contains a very large number of objects, which all deal with numerical data (“numbers”) in a very efficient way. The most basic object is the array:

a = numpy.array((1,2,3,4,5,6,7,8,9))

This defines a linear array of 9 elements. We can reshape that to a (3,3) array:

a.reshape(3,3)

That array can be recast to a matrix object, and participate in very efficient matrix and vector operations. We will look more into this later.

1.5.1 Namespace

In the above, you have seen that whenever you need to reference an object from a module, you need to prefix it with the name of the module. That is because each module has its own namespace. If you need to reference a specific object a lot, you can import it into the current namespace, like so:

from os import uname

1.6 Functions

Python provides functions. Let us define a function to add two objects:

def mysum(a,b):
   return a+b

The parenthesis contains the two formal parameters of the function. These exist in the local namespace of the function. Variables a and b in the calling program (if they exist) will not be affected. Let’s test the function:

mysum(10,20)
mysum(0.4,20)

This works as expected. However, you can also do this:

mysum('asdfa','aasdasdf')

which demonstrates, that any object providing the method __add__() can be “added” using the ‘+’ operator! However, this:

mysum(0.4,'a')

will raise an exception because floats and strings can’t be added.

1.7 Files

It is very easy to read files from Python. This opens a file and associates a file object f to it:

f = open("1zen.pdb")

Now f is an ordinary Python object that we can do stuff with:

dir(f)

The readlines() method simply swallows the whole file, and returns a list of lines:

L = f.readlines()

Now the lines in the file can be accessed. L[0] contains the first line, and (since this is a PDB file) line 500 is one of the ATOM records:

print L[500]

You can see that L[500] is a string:

type(L[500])

If we want to get to the individual items in that string (the X coordinate, the B factor etc.) we need to split that string into individual components. To do this, we can use the string’s split() method:

m = L[500]
mm = m.split()

Now, mm is a list of strings containing the individual items. If we wanted to treat the X-coordinate as a number (say, add 0.01 Angstrom to it) we need to convert it to a number:

x = float(mm[5])
x = x + 0.01

When done with the file, you should close it like a good boy:

f.close()

Files can be iterated like we saw for lists:

L = []
f = open("1zen.pdb")
for line in f:
   L.append(line.split())

Here, we did several of the steps above in one operation.

1.8 File-like objects

We saw that Python’s file object is very convenient, so Python operates with so-called file-like objects. These are objects that can be anything, but appear to the programmer as though they are simple files. A simple example shows that you can access a web page as though it was a file:

import urllib
f = urllib.urlopen('http://www.bioxray.au.dk/index_uk.php')

Now we have our “file-like” object f, and we can suck in the web page.

W = f.readlines()
print len(W)

We get a list of 413 lines containing the HTML text. However, the “file-like” object f can do more than an ordinary file. For example, you can get the HTTP header information:

print f.info()

but that is another story 🙂

1.9 Take-home message

The take-home message of this presentation is:

In Python, EVERYTHING IS AN OBJECT

Appendix

# Amino acid properties 
hydropathy = {'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G':
-0.4,
              'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N':
-3.5,
              'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7,
'V': 4.2,
              'W': -0.9, 'Y': -1.3}

tlc = {'A': "ALA", 'C': "CYS", 'D': "ASP", 'E': "GLU", 'F': "PHE",
       'G': "GLY", 'H': "HIS", 'I': "ILE", 'K': "LYS", 'L': "LEU",
       'M': "MET", 'N': "ASN", 'P': "PRO", 'Q': "GLN", 'R': "ARG",
       'S': "SER", 'T': "THR", 'V': "VAL", 'W': "TRP", 'Y': "TYR"}

charge={'A': 0, 'C': 0, 'D': -1, 'E': -1, 'F': 0, 'G': 0, 'H': 0.1,
        'I': 0, 'K': 1, 'L': 0, 'M': 0, 'N': 0, 'P': 0, 'Q': 0,
        'R': 1, 'S': 0, 'T': 0, 'V': 0, 'W': 0, 'Y': 0}

isoelectric={'A': 6.0, 'C': 5.02, 'D': 2.77, 'E': 3.22, 'F': 5.48,
             'G': 5.97, 'H': 7.47, 'I': 5.94, 'K': 9.59, 'L': 5.98,
             'M': 5.74, 'N': 5.41, 'P': 6.3, 'Q': 5.65, 'R': 11.15,
             'S': 5.68, 'T': 5.64, 'V': 5.96, 'W': 5.89, 'Y': 5.66}

mw = {'A': 89.09, 'C': 121.15, 'D': 133.1, 'E': 147.13, 'F': 165.19,
      'G': 75.07, 'H': 155.16, 'I': 131.17, 'K': 146.19, 'L': 131.17,
      'M': 149.21, 'N': 132.12, 'P': 115.13, 'Q': 146.15, 'R': 174.2,
      'S': 105.09, 'T': 119.12, 'V': 117.15, 'W': 204.23, 'Y': 181.19}

name = {'A': 'alanine', 'C': 'cysteine', 'D': 'aspartic acid',
        'E':'glutamic acid', 'F': 'phenylalanine', 'G': 'glycine',
        'H': 'histidine', 'I': 'isoleucine', 'K': 'lysine',
        'L': 'leucine', 'M': 'methionine', 'N': 'asparagine',
        'P': 'proline', 'Q': 'glutamine', 'R': 'arginine', 'S':
'serine',
        'T': 'threonine', 'V': 'valine', 'W': 'tryptophan',
        'Y': 'tyrosine'}

polarity={'A': 'n', 'C': 'n', 'D': 'p', 'E': 'p', 'F': 'n', 'G': 'n',
          'H': 'p', 'I': 'n', 'K': 'p', 'L': 'n', 'M': 'n', 'N': 'p',
          'P': 'n', 'Q': 'p', 'R': 'n', 'S': 'p', 'T': 'p', 'V': 'n',
          'W': 'n', 'Y': 'p'}

aa = set("ACDEFGHIKLMNPQRSTVWY")
phobic = set([ x for x in hydropathy if hydropathy[x] > 0.1])
neutral = set("ANCEGILMFQPSTWYV")
positive = set("RK")
negative = set("DE")
zwitter=set("H")

Date: 2012-01-26 11:22:16 CET

HTML generated by org-mode 6.21b in emacs 23

Tagged with: , ,