class: center, middle, inverse # Python for Bioinformatics for Learning Python ## Martin Schweitzer --- # Outline ## Python is a great tool for Bioinformatics -- ## Learning about Bioinformatics is a great way to learn Python --- class: center, middle # Disclaimer ## I am not a bioinformatician. --- # About me... ## Lectured and run training courses -- ## Ran a course for Bioinformaticians -- ## Came across Rosalind `http://rosalind.info/` --- # What is Bioinformatics? ## A multidisciplinary field covering Computer Science, Biology, mathematics and statistics. --- # DNA ``` GTCCGTCCGAGGGAAATTGCGCATTCTGGTAGCTTTCGAGCTTTGTCTTTG CAGATGTCGGCCATCAGTAAGAAGGAAAAAACCTCTCGTTAACTTCGAACC AACACGTGGCCGTCCTTGTGGGCTGGGCTTTTAACGGCCCCCTGCCACAGC GGCTGGTGGAATCAATGCAGAGGGGTTTCTATCACGAGCACCATTCGGCCC GCCACTGTTGCGCTAAGTTGACTCGTCCATATGTCTCGTCGAGTTTTCGAG ATCGCTGCAGTGGGATGATAAGGCCCAATTTCATTTCTAAATGATGTAGAT ATCGACTCGGCTACACAGAACGCATGGTTGAGATGATCCAGCGTCAGGGGG GACCGTCTACAGTCCCTGTGATATGTGGGATTATCGTACCGCTAACTACGT TCCACCTTTTGGTAACTCAGCCACAAAATCCTGTGGCTAGTGACACCTTGG TCAAAAACGTGGATCTGCATAGTACTAAGTGTCCGGCCTCGCCTAACATAC CGGGCCAAGTAGTGCTTACACCGACCACAAATAGGCTCCGTGACGCTCAGC ATCGTAAGATCCGCTGTCGTGCGGTAGTGGCATCCGGGACAGGCAGGGATT GCGTGTGGCAGACGGTTCTTAAAACCAATTTCGATCATATGGAATGTCCTG TTGAGGAGTGGGTAACGGGCACTGGTCACGTCACACGCATCTGCCTGAAGT GGGCAGTGATTATCGTGAATGCAGTATTAACTGTGAGGGTGGGCTCTCCGG CCCTCACAGGACGACGATATGTACGGTGCATCTTCGGGTTTCGAGGAGAGA ATCCTACATGACGACACCACAGTCTGGTCGTTGGCACTAAATGTGTCCCTT GTCCGTCCGAGGGAAATTGCGCATTCTGGTAGCTTTCGAGCTTTGTCTTTG CAGATGTCGGCCATCAGTAAGAAGGAAAAAACCTCTCGTTAACTTCGAACC AACACGTGGCCGTCCTTGTGGGCTGGGCTTTTAACGGCCCCCTGCCACAGC ``` --- # DNA ```python import random "".join([random.choice('ACGT') for _ in range(10000)]) ``` --- # Determine the GC-Content of some DNA ### GC-Content is the percentage of the DNA which is G or C. ```python dna = 'GTCCGTCCGAGGGAAATTGCGCATTCTGG...' def gc_content(dna): gc = 0 for base in dna: if base == 'C' or base == 'G': gc += 1 return float(gc)/len(dna) ``` --- # Determine the GC-Content of some DNA ```python dna = 'GTCCGTCCGAGGGAAATTGCGCATTCTGG...' def cg_content(dna): cg = [base for base in dna if base in 'CG'] return float(len(cg))/len(dna) ``` --- # Determine the GC-Content of some DNA ```python dna = 'GTCCGTCCGAGGGAAATTGCGCATTCTGG...' def cg_content(dna): return float(dna.count('G') + dna.count('C'))/len(dna) ``` --- # Reverse complement ``` => GTCCGTCCGAGGGAAATTGCGCATTCTGG ||||||||||||||||||||||||||||| CAGGCAGGCTCCCTTTAACGCGTAAGACC <= ``` Given some DNA, find the reverse complement: ```python def reverse_complement(dna): complement = {'A':'T', 'C':'G', 'G':'C', 'T':'A'} rc = "" for base in dna: rc += complement[base] return "".join(reversed(rc)) ``` --- class: bigger_mono # Reverse complement ``` => GTCCGTCCGAGGGAAATTGCGCATTCTGG ||||||||||||||||||||||||||||| CAGGCAGGCTCCCTTTAACGCGTAAGACC <= ``` Given some DNA, find the reverse complement: ```python def reverse_complement(dna): complement = {'A':'T', 'C':'G', 'G':'C', 'T':'A'} return "".join([complement[base] for base in dna[::-1]]) ``` --- # Hamming distance Count in how many places two strings differ ``` GTCCGTCCGAGGGAAATTGCGCATTCTGG | | | GTCTGTCCGACGGAAATTGCACATTCTGG ``` ```python def hamming(lhs, rhs): total = 0 for i in range(len(lhs)): if lhs[i] != rhs[i]: total += 1 return total ``` --- # Hamming distance Count in how many places two strings differ ``` GTCCGTCCGAGGGAAATTGCGCATTCTGG | | | GTCTGTCCGACGGAAATTGCACATTCTGG ``` ```python def hamming(lhs, rhs): set1 = set([(x,y) for x,y in enumerate(lhs)]) set2 = set([(x,y) for x,y in enumerate(rhs)]) return len(set1.difference(set2)) ``` --- # Hamming distance Count in how many places two strings differ ``` GTCCGTCCGAGGGAAATTGCGCATTCTGG | | | GTCTGTCCGACGGAAATTGCACATTCTGG ``` ```python def hamming(lhs, rhs): return len([(x,y) for x,y in zip(lhs, rhs) if x != y]) ``` --- # Finding motifs Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. ```python # Given some DNA, print all the 5-mers (k-mers) k = 5 dna = 'GTCCGTCCGAGGGAAATTGCGCATTCTGG' for i in range(len(dna) - k + 1): print dna[i:i+k] ``` Results: ``` GTCCG TCCGT CCGTC CGTCC GTCCG TCCGA CCGAG ... ``` --- # Finding motifs Given some DNA, print out the most frequent 5-mer motifs ```python k = 5 dna = 'GTCCGTCCGAGGGAAATTGCGCATTCTGG' kmers = {} for i in range(len(dna) - k + 1): motif = dna[i:i+k] if not motif in kmers: kmers[motif] = 0 kmers[motif] += 1 all = sorted(kmers.items(), key=operator.itemgetter(1)) print all[-5:] ``` --- # Finding motifs Given some DNA, print out the most frequent 5-mer motifs ```python from collections import defaultdict k = 5 dna = 'GTCCGTCCGAGGGAAATTGCGCATTCTGG' kmers = defaultdict(int) for i in range(len(dna) - k + 1): motif = dna[i:i+k] kmers[motif] += 1 all = sorted(kmers.items(), key=operator.itemgetter(1)) print all[-5:] ``` --- # Finding motifs Given some DNA, print out the most frequent 5-mer motifs ```python from collections import Counter k = 5 dna = 'GTCCGTCCGAGGGAAATTGCGCATTCTGG' motifs = Counter([dna[i:i+k] for i in range(len(dna)-k+1)]) print motifs.most_common(5) ``` --- # Rabbits Write a program that prints the *n*th Fibonacci number: ```python def fibonacci(n): parent = 1 child = 1 for i in range(n): child, parent = parent, parent + child return child for i in range(10): print fibonacci(i), ``` -- `1 1 2 3 5 8 13 21 34 55` --- # Rabbits Write a program that prints the *n*th Fibonacci number: ```python def fibonacci(n): parent = 1 child = 1 for i in range(n): child, parent = parent, parent + child return child print fibonacci(150) ``` prints: ``` 16130531424904581415797907386349 ``` --- # Calculating minimum skew ``` dna = 'CCTATCGGTGGATTAGCATGTCC...' skew = '0 -1 -2 -2 -2 -3 -2 -1 -1...' ``` Find all positions in dna where skew is a minimum. ```python dna = 'CCTATCGGTGGATTAGCATGTCC...' d = {"C":-1, "G":1} skew = [] val = 0 for ch in genome: if ch in "CG": val += d[ch] skew.append(val) mn = min(skew) indices = [str(i+1) for i, m in enumerate(skew) if m == mn] print " ".join(indices) ``` --- # Calculating minimum skew ``` dna = 'CCTATCGGTGGATTAGCATGTCC...' skew = '0 -1 -2 -2 -2 -3 -2 -1 -1...' ``` Find all positions in dna where skew is a minimum. ```python dna = 'CCTATCGGTGGATTAGCATGTCC...' # d = {"C":-1, "G":1} d = {"A":0, "C":-1, "G":1, "T":0} skew = [] val = 0 for ch in genome: val += d[ch] skew.append(val) mn = min(skew) indices = [str(i+1) for i, m in enumerate(skew) if m == mn] print " ".join(indices) ``` --- # Calculating minimum skew ```python dna = 'CCTATCGGTGGATTAGCATGTCC...' d = {"A":0, "C":-1, "G":1, "T":0} val = 0 def getval(ch): global val val += d[ch] return val skew = [getval(ch) for ch in dna] mn = min(skew) indices = [str(i+1) for i, m in enumerate(skew) if m == mn] print " ".join(indices) ``` --- # Calculating minimum skew ```python dna = 'CCTATCGGTGGATTAGCATGTCC...' d = {"A":0, "C":-1, "G":1, "T":0} def getval(ch, l=[0]): l[0] += d[ch] return l[0] skew = [getval(ch) for ch in dna] mn = min(skew) indices = [str(i+1) for i, m in enumerate(skew) if m == mn] print " ".join(indices) ``` --- # Calculating minimum skew ```python dna = 'CCTATCGGTGGATTAGCATGTCC' d = {"A":0, "C":-1, "G":1, "T":0} def getval(lst): val = 0 for bp in lst: val += d[bp] yield val skew = [val for val in getval(dna)] mn = min(skew) indices = [str(i+1) for i, m in enumerate(skew) if m == mn] print " ".join(indices) ``` --- # Calculating minimum skew ```python dna = 'CCTATCGGTGGATTAGCATGTCC' d = {"A":0, "C":-1, "G":1, "T":0} def accumulate(lst): val = 0 for x in lst: val += x yield val skew = [val for val in accumulate([d[bp] for bp in dna])] mn = min(skew) indices = [str(i+1) for i, m in enumerate(skew) if m == mn] print " ".join(indices) ``` --- # Calculating minimum skew ```python import numpy as np dna = 'CCTATCGGTGGAGCATGCCCTCC' d = {"A":0, "C":-1, "G":1, "T":0} a = np.cumsum([d[bp] for bp in dna]) print np.where(a == min(a)) ``` --- # Hidden Markov Models ``` Probability of a Hidden Path Problem Given: A hidden path π followed by the states States and transition matrix Transition of an HMM (Σ, States, Transition, Emission). Return: The probability of this path, Pr(π). You may assume that initial probabilities are equal. Sample Dataset AABBBAABABAAAABBBBAABBABABBBAABBAAAABABAABBABABBAB -------- A B -------- A B A 0.194 0.806 B 0.273 0.727 Sample Output 5.01732865318e-19 ``` --- # Hidden Markov Models ![Graph](graph2.png) -- `BAAABAABBBBABBBABBBBABBBAABAABAAABBAABABAAABAAABB` --- # Hidden Markov Models ```python transitions = {('A','A'):0.193, ('A','B'):0.807, ('B','A'):0.625, ('B','B'):0.375} dna = "BAAABAABBBBABBBABBBBABBBAABAABAAABBAABABAAABAAABBB" result = 0.5 for ch in zip(dna, dna[1:]): result *= transitions[(ch[0], ch[1])] print result ``` --- # Hidden Markov Models ```python """ Going to first try something naiive... Works! I think I fluked it. """ transitions = {('A','A'):0.193, ('A','B'):0.807, ('B','A'):0.625, ('B','B'):0.375} dna = "BAAABAABBBBABBBABBBBABBBAABAABAAABBAABABAAABAAABBB" result = 0.5 for ch in zip(dna, dna[1:]): result *= transitions[(ch[0], ch[1])] print result ``` --- # Results ## Python is a great language for solving bioinformatics problems. -- ## Rosalind is a great source of (small) problems, and gives great motivation and practice. -- ## To learn, we need to solve problems and relentlessly improve our solutions. --- class: center, middle # Thank You ## pycon@martinschweitzer.com ### Slides: `http://goo.gl/Jscu1T` ![slides](chart.png) ---