Python for crystallographers 2
Table of Contents
2 Python for Crystallographers 2
This is the second in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University in 2011. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.
2.1 Classes
Classes are a fundamental feature in all Object Oriented programming languages. C++, Java, Objective-C, Smalltalk, etc. all implement classes. Programmers use classes in different ways. Some classes represent concrete objects, for example atoms, residues, or sequences, but in other cases, classes are used to encapsulate various algorithms, or they produce iterators or are factories of objects. As you use various Python modules, you will meet different class designs.
Let us define a very simple class:
class A: a = 1
dir(A)
A Python class is not usually used on its own. You can compare a class to a cookie-cutter, it defines the shape and size of the cookie, but to actually get something concrete out of it, you need to press it into the dough to make a cookie. With classes, you need to instance it to make it useful, and you can create as many instances as you want (unless the class has specific logic to prevent that.)
inst = A()
inst.a
Class A only contains one variable, a
. Variables of this kind are called class variables, because they are shared by all instances of the class. This is useful for e.g. constants and other data items, but in general, instances need their own data too.
Here we define a class that has an instance variable b
:
class A: a = 1 def __init__(self, name): self.name = name
The special method __init__()=
is called whenever Python constructs an instance of a class. The variable self
refers to the instance object itself, so self.name
refers to a variable in the namespace of the instance. The __init__()
method can also be called a constructor. When we define a constructor that takes an argument, the instance must be defined by passing one to the constructor. This fails:
inst = A()
but this works:
inst = A("abc") print inst.name
We can make another instance of class A
:
inst2 = A("def") print inst2.a print inst2.name
You see that the name
instance variable is different in the objects inst
and inst2
, but the class variable a
is the same. If we change a
:
A.a = 2 print inst2.a print inst.a
You will see that the class variable a
has now been set to 2 in both instances.
2.1.1 Class inheritance
Now we look at class inheriance. Here is a class that describes a person:
class Person: def __init__(self, name, addr, email, phone): self.name = name self.address = addr self.email = email self.phone = phone def set_email(self, email): self.email = email print "email sat til", email
The instance variables name
, address
, email
, and phone
is defined in the constructor, so it is used like this:
pers1 = Person("morten", "århus", "mok@example.com", 86188180)
We can access the attributes via the instance of the person:
print pers1.name print pers1.email
In the class definition, we have defined a method to set the email address. A method of this type is called a “setter”:
pers1.set_email('nogetandet@gmail.com')
In this case, the setter does nothing else than change the value of the email variable, so we could have done just this:
pers1.email = 'nogetandet@gmail.com'
however, using getters and setters is normally considered good programming style.
If we want to have a class describing a student, we can make use of the functionality already existing in the Person
class. This is called the DRY principle: Don’t Repeat Yourself. A student however, has special attributes that is not used for other persons:
class Student(Person): def set_studiekortnummer(self, s): self.studie_nr = s
We have added a method to set the studynumber, but the name, address etc. is inherited from the base class.
rune = Student("rune", "ukendt", "rune@gmail", 89421111)
We now have an instance of the Student
class called rune
. Let us set the student number:
print rune rune.set_studiekortnummer(123456) print rune.studie_nr print rune.name
2.2 Biopython
We will now look at the Biopython package, but first, let us write a small program to fetch a sequence from Uniprot that is purely native Python.
2.2.1 fetch sequence from Uniprot
Here is the content of the program called fetch-files-uniprot.py
:
#! /usr/bin/env python from utils import parse_list import urllib import sys uniprot = "http://www.uniprot.org/uniprot/" remote = urllib.FancyURLopener() if __name__ == "__main__": fd = open (sys.argv[1]) for ent in parse_list(fd): print "getting ", ent f = remote.open(uniprot+ent+".txt") seq = f.read() f.close() g = open(ent+".txt", "w") g.write(seq) g.close() #. #.
Let us examine the code. Near the top, you see that we import the function parse_list
from the module utils
. This is our own Python module, stored in a file utils.py
, and it looks like this:
def parse_list(f): "Parse our little sequence id list files" line = 'begin' L = [] for line in f: if '#' in line: line,comment = line.split('#',1) #. if len(line) > 0: L.append(line.strip()) #. return L #.
This function reads a file, skips comments marked with ‘#’, and returns a list of sequence IDs.
Next, we open a file with the file name sys.argv[1]
. This is an attribute from the imported sys
module, that contains the parameters given on the command line. So, it our program is invoked like this:
$ python fetch-files-uniprot.py ifns.txt
the value of sys.args[1]
will be ‘ifns.txt’. Our program then proceeds by reading entries one by one from the file-like object remote
, which is an instance of the object urllib.FancyURLopener()
from the urllib
module, opened with the URL of the relevant sequence entry. Every sequence entry is then stored in a local file.
2.2.2 Fetch sequence using Bio.Entrez
Now we will fetch some sequences using NCBI’s Entrez database. Biopython provides a convienient interface to Entrez’ API. Here is the program fetch-files-entrez.py
:
#! /usr/bin/env python from Bio import Entrez from utils import parse_list import sys # See http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html file_ext = {'gp':'gb', 'fasta':'fa', 'ft':'ft', 'gpc':'gpc', 'full':'full', } if __name__ == '__main__': fd = open (sys.argv[1]) Entrez.email = "mok+entrez@example.com" for ent in parse_list(fd): for rettype in file_ext: handle = Entrez.efetch(db="protein", id=ent, rettype=rettype, retmode="text") fnam = ent + '.' + file_ext[rettype] g = open(fnam, 'w') g.write(handle.read()) g.close() print "wrote", fnam #. #. #.
This program is a bit more complicated than the Uniprot fetching program, but follows the same general logic. First, we set a class variable Entrez.email
, this is requested by the Entrez service and we do that as a matter of courtesy. As before, we pass through all sequence IDs in the input file, but this time, we add an additional loop inside, that loops through a list of different data return types given in the dictionary file_ext
, which at the same time maps a file extension to each return type. After running this program, we are left with a series of files for each sequence ID, for example, for ID P01563 we have:
-rw-r--r-- 1 mok mok 345 May 30 10:11 P01563.fa -rw-r--r-- 1 mok mok 2676 May 30 10:11 P01563.ft -rw-r--r-- 1 mok mok 80468 May 30 10:11 P01563.full -rw-r--r-- 1 mok mok 17099 May 30 10:11 P01563.gb -rw-r--r-- 1 mok mok 56359 May 30 10:11 P01563.gpc -rw-r--r-- 1 mok mok 11662 May 30 10:03 P01563.txt
Now, lets go back to the IPython interface and play a bit more with Biopython. First, we will use the SeqIO
module:
from Bio import SeqIO dir (SeqIO) # look at what SeqIO contains seq = SeqIO.read("P01563.gb", "genbank")
Now, we have a Biopython sequence object (a “SeqRecord”) named seq
containing all the information that was parsed from the input Genbank format file:
dir(seq) print(seq)
We can look at the sequence description, and we see that the sequence is stored in a special object which also contains the alphabet.
seq.description seq.seq
The object seq
can be indexed using the square bracket notation (like normal Python lists), and it can be iterated in a loop:
print seq[0] for a in seq: print a
The SeqRecord object attribute features
contains a list of feature objects, all collected from the Genbank format sequence file:
print seq.features
You can investigate this yourself more, look at the various types of features and what data they contain.
2.3 The special method __repr__
We will briefly look at the special method called __repr()__
. This is a method that can be defined in any class, and it defines how the class behaves when passed to the print
function, and should simply return a string representation of the object. Let us define this method in our Person
class from before:
class Person: def __init__(self, a, addr, email, tlf): self.name = a self.adresse = addr self.email = email self.telefon = tlf def set_email(self, email): self.email = email print "email set to", email def __repr__(self): return "Name: " + self.name
Now, if we print the person pers1
from before, Python will print:
Name: morten
instead of a strange technical object description. As the base class is now enhanced, the new capability will automatically be available to the Student
class too:
rune = Student("rune", "ukendt", "rune@gmail", 89421111) print rune
2.4 Function returning tuples
A function can return a tuple:
def f(): return "a", "b" print f() i,j = f() print i print j
The above example shows how each element in the tuple returned by function f()
can be assigned to different variables i
and j
.
2.5 Listdir
Python is very suited for all kinds of systems work. The os
module contains lots of useful functions for interacting with the operating system. For example, the function listdir()
will return a list of files in the directory passed as an argument:
import os this_dir = os.listdir('.') tmp_dir = os.listdir('/tmp') for f in os.listdir('/tmp'): print f
Date: 2012-01-26 18:53:42 CET
HTML generated by org-mode 6.21b in emacs 23
leave a comment