mok0's world

Python for crystallographers 1

Posted in Programming, Python, Tutorial by mok0 on January 26, 2012

1 Python for Crystallographers 1

This is the first in a series of tutorials on Python I gave at the Department of Molecular Biology, Aarhus University in 2011. It is aimed at crystallographers and structural biologists. This text is a commented transcript of the session log file produced by IPython The transcript does not include the output from Python. You will have to try it yourself.

1.1 IPython

Python has a quite decent interactive mode that offers readline editing of your input. However, IPython is an extension of Python that offers some very convenient features. IPython has a number of so-called “magic” commands that can be listed using

%lsmagic

Magic commands start with the percent sign, but if the automagic mode is activated (IPython uses that mode by default) you can omit it. I will activate logging of my session using the magic:

logstart

There are several very good screencasts for IPython at http://showmedo.com, search for tutorials by Jeff Rush and unpingco. You will see that you can do some very cool stuff with IPython!

1.2 Simple Python variables

Python has simple variables, integers, floats, complex and strings. Here is an integer:

a = 1

and a float:

f = 1.

We can see what variables we have assigned with the magic commands %who and %whos. The latter offers more detail.

who

Strings are defined by strings surrounded by quotes or double quotes, so “python” and ‘python’ is the same thing. If you need to put quotes or double quotes inside a string variable, you can do that like this:

s = '"python"'
s = "python's way"

Complex numbers are defined like this:

b = complex(1,5)

A complex number has real and imaginary parts. These are stored in the object attributes real and imag:

b.imag
b.real

A complex number can be used in arithmetic expressions as expected:

a = 3.4
b*a
b*b

1.3 Compound objects

1.3.1 Lists

Pythons true strength comes from the powerful built-in compound objects. First, we have the list, which is defined using square brackets.

L = []

This defines an empty list. When running interactively, you can inspect the value of objects by typing their name,

L

but when running a Python program, you need to print it:

print(L)

Here is a list containing integers:

L = [2,4,6,8]

We can append numbers to the list:

L.append(100)

This also demonstrates that the list object has a number of associated methods. Here we used the “append” method. It is like a function that implicitly operates on that element. You can inspect any Python object using the dir() built-in:

dir (L)

This concept is called introspection. From the dir() output, we find a method called insert(). To see what that does, we use an IPython help feature: typing a question mark after an object name prints a small help text:

L.insert?

Here we insert an element in L after element 3:

L.insert(3, "morten")

This illustrates that a list can contain any type of object, in any mixture. Lists can also contain lists:

M=['hej dude']
M.append(L)
M.append('hej')

We can access the individual elements of a list using the square bracket notation:

L[1]
L[0]
L[4]

We can step through (“iterate”) a list in a for loop:

for x in L:
   print x

1.3.2 Dictionaries

Another very useful built-in compound object is the dictionary. A dictionary is defined using curly brackets:

D = {1:'ditlev', 2:'rune'}

Here, the numbers 1 and 2 are the keys of the dictionary. The strings ‘ditlev’ and ‘rune’ are the values. A key can be any immutable object, like strings:

D['ditlev'] = 'brodersen'

Values of a dictionary can be any object, for example also our list L from before:

D = {1:'ditlev', 2:'rune', 'ditlev':'brodersen', 'mylist':L}

We can iterate through a dictionary in a for loop. Here, we use two dicts ‘isoelectric’ and name that I defined when preparing for this session. (I saved this object earlier using the magic command %store, then it is loaded automatically when IPython starts up.) [Readers of the on-line tutorial can find this and other data in the Appendix.]

for a in isoelectric:
   print a, name[a], isoelectric[a]

1.3.3 Tuples

Tuples are exactly like lists, except they are immutable (elements can not be modified once defined.) A tuple is defined using parentheses:

T = (42,43,44,2)

Tuples are used extensively internally in Python, for example when passing arguments to functions.

1.3.4 Sets

The set is a very useful Python object. You define it like this:

S = set()

This is an empty set. The set constructor can also take a sequence (e.g. lists, tuples)

S = set([1,2,3,4,5,6])

Sets are unique, an element can exist only once. The set constructor silently discards duplicate elements:

S = set([100,2,2,2,2,23,4,5,6])

The Set() class has an add() method:

S.add("morten")

This illustrates that a set can contain different types of objects. The add() method is similar to the append() method of lists, but “append” doesn’t make sense for a set, since it is unordered.

You can iterate through a set in a for loop:

for i in S:
   print i

Sets are very powerful. You can use the well-known set operations like union, difference and intersection. Previously, I have defined the set aa containing one-letter codes of the 20 amino acids:

aa = set("ACDEFGHIKLMNPQRSTVWY")

(NOTE: The set aa defined above does not contain the string “ACD…”! To achieve that I would need to use the constructor set([“ACD…”]). That is because the set constructor takes 1 argument which is an iterable object, and the string “ACDE…” iterates to “A”, “C”, …)

We now define sets of the acidic and basic amino acids:

basic = set (('K', 'R'))
acidic = set(('D', 'E'))

We can find the set of “neutral” amino acids:

neutral = aa - acidic - basic

The set phobic, which I have defined previously, has the set of hydrophobic amino acids. The hydrophilic set is thus:

philic = aa - phobic

The set of amino acids, that is hydrophilic but not charged is:

philic - (acidic|basic)

The set of amino acids without a side chain charge is:

aa - acidic - basic

1.4 Copying objects

Python is very economical with computer memory, and tries to reference the same memory locations if at all possible. This gives a few unexpected effects that you need to know about. It is normally never any problem.

Simple objects (ints, strings, etc.) can be copied like you expect:

a = 10
b = a

We have assigned a to b, and they are separate, independent objects. If we subsequently alter b, a will stay the same. This is not the case for compound objects, and more complicated objects. For example, create a list:

A = range(10)

(this gives a list of integers 0-9), and make an assignment to B:

B = A

Next, let’s change element 1 of B:

B[1] = 100

You will notice that element 1 of A changes too! So, Python is using the same memory for lists A and B, and B is just another reference to that memory.

We can get around that by using the copy module:

import copy

If we now copy A to B, it will work as expected:

B = copy.copy(A)
B[1] = 200

and A stays the same. However, copy.copy() is a shallow copy. If A contains anything but simple objects (for example another list), we need to use a deep copy. For example, this fails:

C = [A,B]
D = copy.copy(C)
D[1][1] = 1000
print  C

but using the deep copy will work:

D = copy.deepcopy(C)
D[1][1] = 2000
print C
print D

1.5 Modules

Since we already met the copy module above, let us briefly introduce the concept of modules in Python. In short, a module is a collection of variables, functions and classes that you can load. An often used module is os. It contains a bunch of objects needed if you want to interact with the operating system. Modules need to be imported:

import os

Now, the power of the os module is available to us. You can inspect its content using dir():

dir(os)

Let’s take a look at the uname() object from the os module. This basically does the same information that is given when you type uname from the shell:

x = os.uname()
type(x)

You see, that x is a tuple. We can access its elements:

hostname = x[1]
print hostname

This was an example of a core module (distributed with Python). There is a very large number of third-party Python packages. One example, that we will study more later, is numpy, Numerical Python.

import numpy
dir(numpy)

You can see that Numpy contains a very large number of objects, which all deal with numerical data (“numbers”) in a very efficient way. The most basic object is the array:

a = numpy.array((1,2,3,4,5,6,7,8,9))

This defines a linear array of 9 elements. We can reshape that to a (3,3) array:

a.reshape(3,3)

That array can be recast to a matrix object, and participate in very efficient matrix and vector operations. We will look more into this later.

1.5.1 Namespace

In the above, you have seen that whenever you need to reference an object from a module, you need to prefix it with the name of the module. That is because each module has its own namespace. If you need to reference a specific object a lot, you can import it into the current namespace, like so:

from os import uname

1.6 Functions

Python provides functions. Let us define a function to add two objects:

def mysum(a,b):
   return a+b

The parenthesis contains the two formal parameters of the function. These exist in the local namespace of the function. Variables a and b in the calling program (if they exist) will not be affected. Let’s test the function:

mysum(10,20)
mysum(0.4,20)

This works as expected. However, you can also do this:

mysum('asdfa','aasdasdf')

which demonstrates, that any object providing the method __add__() can be “added” using the ‘+’ operator! However, this:

mysum(0.4,'a')

will raise an exception because floats and strings can’t be added.

1.7 Files

It is very easy to read files from Python. This opens a file and associates a file object f to it:

f = open("1zen.pdb")

Now f is an ordinary Python object that we can do stuff with:

dir(f)

The readlines() method simply swallows the whole file, and returns a list of lines:

L = f.readlines()

Now the lines in the file can be accessed. L[0] contains the first line, and (since this is a PDB file) line 500 is one of the ATOM records:

print L[500]

You can see that L[500] is a string:

type(L[500])

If we want to get to the individual items in that string (the X coordinate, the B factor etc.) we need to split that string into individual components. To do this, we can use the string’s split() method:

m = L[500]
mm = m.split()

Now, mm is a list of strings containing the individual items. If we wanted to treat the X-coordinate as a number (say, add 0.01 Angstrom to it) we need to convert it to a number:

x = float(mm[5])
x = x + 0.01

When done with the file, you should close it like a good boy:

f.close()

Files can be iterated like we saw for lists:

L = []
f = open("1zen.pdb")
for line in f:
   L.append(line.split())

Here, we did several of the steps above in one operation.

1.8 File-like objects

We saw that Python’s file object is very convenient, so Python operates with so-called file-like objects. These are objects that can be anything, but appear to the programmer as though they are simple files. A simple example shows that you can access a web page as though it was a file:

import urllib
f = urllib.urlopen('http://www.bioxray.au.dk/index_uk.php')

Now we have our “file-like” object f, and we can suck in the web page.

W = f.readlines()
print len(W)

We get a list of 413 lines containing the HTML text. However, the “file-like” object f can do more than an ordinary file. For example, you can get the HTTP header information:

print f.info()

but that is another story 🙂

1.9 Take-home message

The take-home message of this presentation is:

In Python, EVERYTHING IS AN OBJECT

Appendix

# Amino acid properties 
hydropathy = {'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G':
-0.4,
              'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N':
-3.5,
              'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7,
'V': 4.2,
              'W': -0.9, 'Y': -1.3}

tlc = {'A': "ALA", 'C': "CYS", 'D': "ASP", 'E': "GLU", 'F': "PHE",
       'G': "GLY", 'H': "HIS", 'I': "ILE", 'K': "LYS", 'L': "LEU",
       'M': "MET", 'N': "ASN", 'P': "PRO", 'Q': "GLN", 'R': "ARG",
       'S': "SER", 'T': "THR", 'V': "VAL", 'W': "TRP", 'Y': "TYR"}

charge={'A': 0, 'C': 0, 'D': -1, 'E': -1, 'F': 0, 'G': 0, 'H': 0.1,
        'I': 0, 'K': 1, 'L': 0, 'M': 0, 'N': 0, 'P': 0, 'Q': 0,
        'R': 1, 'S': 0, 'T': 0, 'V': 0, 'W': 0, 'Y': 0}

isoelectric={'A': 6.0, 'C': 5.02, 'D': 2.77, 'E': 3.22, 'F': 5.48,
             'G': 5.97, 'H': 7.47, 'I': 5.94, 'K': 9.59, 'L': 5.98,
             'M': 5.74, 'N': 5.41, 'P': 6.3, 'Q': 5.65, 'R': 11.15,
             'S': 5.68, 'T': 5.64, 'V': 5.96, 'W': 5.89, 'Y': 5.66}

mw = {'A': 89.09, 'C': 121.15, 'D': 133.1, 'E': 147.13, 'F': 165.19,
      'G': 75.07, 'H': 155.16, 'I': 131.17, 'K': 146.19, 'L': 131.17,
      'M': 149.21, 'N': 132.12, 'P': 115.13, 'Q': 146.15, 'R': 174.2,
      'S': 105.09, 'T': 119.12, 'V': 117.15, 'W': 204.23, 'Y': 181.19}

name = {'A': 'alanine', 'C': 'cysteine', 'D': 'aspartic acid',
        'E':'glutamic acid', 'F': 'phenylalanine', 'G': 'glycine',
        'H': 'histidine', 'I': 'isoleucine', 'K': 'lysine',
        'L': 'leucine', 'M': 'methionine', 'N': 'asparagine',
        'P': 'proline', 'Q': 'glutamine', 'R': 'arginine', 'S':
'serine',
        'T': 'threonine', 'V': 'valine', 'W': 'tryptophan',
        'Y': 'tyrosine'}

polarity={'A': 'n', 'C': 'n', 'D': 'p', 'E': 'p', 'F': 'n', 'G': 'n',
          'H': 'p', 'I': 'n', 'K': 'p', 'L': 'n', 'M': 'n', 'N': 'p',
          'P': 'n', 'Q': 'p', 'R': 'n', 'S': 'p', 'T': 'p', 'V': 'n',
          'W': 'n', 'Y': 'p'}

aa = set("ACDEFGHIKLMNPQRSTVWY")
phobic = set([ x for x in hydropathy if hydropathy[x] > 0.1])
neutral = set("ANCEGILMFQPSTWYV")
positive = set("RK")
negative = set("DE")
zwitter=set("H")

Date: 2012-01-26 11:22:16 CET

HTML generated by org-mode 6.21b in emacs 23

Tagged with: , ,