Code Tutorial #1: Python Code for reading a FASTA file (Bioinformatics)

A simple exercise with code

When I first started out with my bioinformatics project in grad school, one of the biggest challenges I faced in my research was understanding and accessing information from various file types. So I decided to start off the first code tutorial here with an introduction to reading and extracting information from a FASTA file.

This is going to be useful if you are a biologist planning on taking up research in bioinformatics or computational genomics. You will be dealing with Big Data almost all of the time. A FASTA file holds a nucleotide (DNA) sequence, such as the genome of an organism. The sequence differs from species to species, and there are also differences at the individual level that make each individual unique.

You will need to read a FASTA file when you are analyzing the genetic sequence of an organism. You may need it for analyzing specific areas of the sequence or calculating the GC content for the species and so on.
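To make the GC content calculation mentioned above concrete, here is a minimal sketch. It assumes the sequence is already available as a Python string, and the function name gc_content is my own choice for illustration. It also assumes that every character counts towards the total length, including ambiguous bases such as N, which may not be what you want for real data.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a nucleotide string."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


# Hypothetical short sequence, for illustration only
print(gc_content("ATGCGC"))
```

Multiply the result by 100 if you prefer a percentage.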

A FASTA file is a plain text file. Each record begins with a header line that starts with a '>' and carries the identifier of the sequence. Many online bioinformatics tools that you may use on a regular basis rely on this line. The lines after the header hold the sequence itself.
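The layout is easier to see with a concrete record. The snippet below is only a sketch: the identifier and the sequence are made up for illustration, and describe_lines is a hypothetical helper that labels each line as a header or a sequence line.

```python
# A made-up FASTA record; the identifier and bases are hypothetical.
example_fasta = (
    ">example_seq_1 hypothetical description\n"
    "ATGCGTACGTTAGC\n"
    "GGCTAACGTA\n"
)


def describe_lines(fasta_text):
    """Label each non-empty line as a header or a sequence line."""
    labels = []
    for line in fasta_text.splitlines():
        if not line:
            continue  # ignore blank lines
        kind = "header" if line.startswith(">") else "sequence"
        labels.append((kind, line))
    return labels


for kind, line in describe_lines(example_fasta):
    print(f"{kind:8} {line}")
```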

Here is the Python script that helps you read it:

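def readGenomeFromFasta(file_genome):
    """Return the sequence in a FASTA file as one string, without header lines."""
    genome = ''
    with open(file_genome, 'r') as new_file_1:
        for line in new_file_1:
            # Skip header lines, which start with '>'
            if not line.startswith('>'):
                # rstrip() removes the trailing newline and any trailing spaces
                genome += line.rstrip()
    return genome

This skips every header line that starts with a '>', and it strips the trailing newline and any trailing blank spaces from each sequence line as the file is read, so the sequence comes back as one continuous string. To get the number of nucleotides in the file, call len() on the returned string.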
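Note that the function above joins every sequence line into one string, so a file holding several records would come back as a single merged sequence. As a rough sketch of one way to keep the records apart, the hypothetical function read_fasta_records below returns a dictionary that maps each header to its sequence. It assumes headers are unique (a repeated header would overwrite the earlier record), and the demo file contents are made up.

```python
import os
import tempfile


def read_fasta_records(path):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    chunks = []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Store the previous record before starting a new one
                if header is not None:
                    records[header] = "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
    # Store the last record in the file
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Demo with a small made-up file written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.fasta")
    with open(demo_path, "w") as handle:
        handle.write(">seq_a\nATGC\nGGTA\n>seq_b\nTTAA\n")
    records = read_fasta_records(demo_path)

print(records)
```

Joining the chunks once per record also avoids growing a string with += on every line, which can matter for large genomes.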

The following piece of code is used together with the function definition. It asks for the input and output filenames, calls the function to process the FASTA file, and writes the resulting sequence to a new file:

print("Type filename")
file_genome = input("> ")
print("Type output filename for genome")
output_file = input("> ")

# Read the sequence, then write it to the output file
final_file = readGenomeFromFasta(file_genome)
with open(output_file, 'w') as out_file:
    out_file.write(final_file)

Beyond just reading the FASTA file, here is a bit of code that gives a glimpse of what we can do with the sequence once it is loaded. Here I am simply creating an artificial nested list of "1"s, one for every nucleotide position. This is not helpful by itself (it was part of a bigger problem I was solving in my project), but it serves as a demo of how to use the file once you can read and modify it. For a sequence of seven nucleotides, the list looks like this:

[[1],[1],[1],[1],[1],[1],[1]]

def new_list_for_index(genome):
    """Return a new list with one [1] marker for each position in genome."""
    new_list = []
    for idx in genome:
        new_index = [1]
        new_list.append(new_index)
    return new_list

print("Type name of marker file you want to output index data into")
marker_index = input("> ")

# Build the marker list from the genome and write it out as text
markerIdx = new_list_for_index(readGenomeFromFasta(file_genome))
with open(marker_index, 'w') as markerIndex:
    markerIndex.write(str(markerIdx))
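Because the marker list is written out with str(), reading it back later needs some kind of parser. One possible approach, which was not part of my original project, is to store the list as JSON with Python's standard json module. This is only a sketch: the helper names save_markers and load_markers, the file name markers.json and the seven-base sequence are all hypothetical.

```python
import json
import os
import tempfile


def save_markers(markers, path):
    """Write the nested marker list to a JSON file."""
    with open(path, "w") as handle:
        json.dump(markers, handle)


def load_markers(path):
    """Read a nested marker list back from a JSON file."""
    with open(path, "r") as handle:
        return json.load(handle)


# One [1] marker per position of a made-up seven-base sequence
markers = [[1] for _ in "ATGCGTA"]

with tempfile.TemporaryDirectory() as tmp_dir:
    marker_path = os.path.join(tmp_dir, "markers.json")
    save_markers(markers, marker_path)
    restored = load_markers(marker_path)

print(restored == markers)
```

The restored value is a real Python list again, so individual markers can be updated by position.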

 
