Author: Nirav Malani

Fastq to Fasta

Inspired by solutions posted at biostar, here is my contribution to this simple sequencing file format conversion.

## fastq_to_fa.py
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
for title, seq, qual in FastqGeneralIterator(sys.stdin) :
 print(">%s\\n%s" % (title,seq))

Usage:

 cat reads.fastq | fastq_to_fa.py

If you need to simply extract sequences…that is also simple!

## flattenFastq.py
for title, seq, qual in FastqGeneralIterator(sys.stdin) :
    print("%s" % (seq))

Circular Ideogram

Displaying data in circular fashion is increasingly getting attractive and a great way to make art out of data!

Here I show one of the many ways to make Circos style Ideogram plot of point genomic data with few genomic features. The goal here is to create a cheesy visualization of association between HIV or MLV integration sites to genes & GC content. The data and code is available at my Github repository.

Ciros_plot

The handy cut()

One of the most interesting functions for a bioinformatician in R is cut()

Want to a generate asthetically pleasing window size label given an integer?

getWindowLabel <- function(x) {
    ind <- cut(abs(x), c(0, 1000, 1e+06, 1e+09, 1e+12), include.lowest = TRUE, 
                  right = FALSE, labels = FALSE)
    paste(x/c(1, 1000, 1e+06, 1e+09, 1e+12)[ind], 
          c("bp", "Kb", "Mb", "Gb")[ind], sep = "")
}

getWindowLabel(c(0, 1e+07, 1000, 1e+06, 2e+09))

[1] "0bp"  "10Mb" "1Kb"  "1Mb"  "2Gb" 

How about those significant p-values?

getStars <- function(x) 
     as.character(cut(x, c(0, 0.001, 0.01, 0.05, 1), 
                  c("***", "**", "*", ""), 
                  include.lowest = TRUE))

getStars(c(0, 0.001, 0.01, 0.05, 1))

[1] "***" "***" "**"  "*"   ""

You can use cut() to get equally sized bins of datapoints or convert a numeric variable into a categorical type.

Bioinformatics Art

During my undergrad, I did research in Dr. James Pierce’s lab working on the horseshoe crab genome. Below are the logos I created for a website which was to host the cDNA library from BACs.


Shortly after starting the grad school, hopes of progressing ahead were dim. To boost the spirits and aim for the finish line, I created team Bioinformatics t-shirts.


After wheedling science gods we all passed grad school and I landed a job at UPENN in Rick Bushman’s lab. Great majority of my work involved looking at next-gen data produced by 454 sequencer.

454animation

The secure website of the lab at the time lacked a bit of color and creativity so I created few things to setup the moods for analysis.

webfrontendbushmanlab

Logos

Creating logos is one of my all time favorite activities. Logos with meaning and patterns are especially fun for me. For example, my very first logo below spells out my last name using various shapes. See if you can spot all!

new-ubiquitous-om

 

The very first logo which embarked my creative track! The Malani Logo

The symbolic representation of OM or AUM always fascinated me, hence it’s at the crux of my logo above. With this continued obsession, I created a logo dedicated to OM itself and many other like it for various occasions & groups.

This slideshow requires JavaScript.

Anoopam Mission

AM2004_LOGO_cmyk_300by100

If there was a place where my creations fully blossomed, it would be Anoopam Mission. With the help of Surendra Patel, who registered Anoopam Mission USA in Philadelphia under the guidance of Sahebji, my creative abilities were able to thrive. My design work at Anoopam Mission began in 2003 when Sahebji inaugurated a new Shikhar Bandh Mandir in Allentown, PA. For the celebration, I created an iconic t-shirt which compelled many more designs for the years to follow. Most of the t-shirt designs were created for a yearly summer camp dedicated for the youth growing up in America (Yogi Youth Camp)

During recent years, my involvement with Anoopam Mission increased as my relation with Sahebji and other resident saints strengthened. In 2013, the organization decided to celebrate Sahebji’s 75th birthday & 10 year anniversary of the temple. Below are few designs I made for save the date magnet.

Along with the magnets, I also made the official invitation for the celebration with complementing envelope and an insert. The invitation was a three fold piece design where the front flap with “Mandir Shikhar” exposed the “75 logo” in the last flap!

Here are few other small things which I created for various celebrations at Anoopam Mission.

Brochures & Cover Art

Below are few of my early projects/favors I did for a friend and my uncle. The mosaic of pictures with SPCS logo in the middle was a cover page of a yearly souvenir book published by the SPCS organization.

spcs-cover-page

The dance brochure highlights few classical Indian dances performed by Nilam Mistry during her Nritya Nipuna.

 

Code

R Packages

hiAnnotatorcontains set of functions which allow users to annotate a RangedData or GRanges object with custom set of annotations. The basic philosophy of this package is to take two RangedData or GRanges objects (query & subject) with common set of space/seqnames (i.e. chromosomes) and return associated annotation per space/seqname and rows from the query matching space/seqnames and rows from the subject (i.e. genes or cpg islands).

hiReadsProcessor: contains set of functions which allow users to process LM-PCR products sequenced using any platform. Given an excel/txt file containing parameters for demultiplexing and sample metadata, the functions automate trimming of adaptors and identification of the genomic product. Genomic products are further processed for QC and abundance quantification.

R Code

Collection of posts to do neat stuff with R.

Github

Where I stash most of my pleasant looking code

Easy Reproducible Report using RStudio, Markdown, & Knitr

Why am I doing this?

  • Software for doing these sort of things are gaining momentum in the open science community.
  • Presenting your work in this sort of fashion looks unique & saves you time.

Why Reproducible Report (RR)?

  • Code, analysis, and results are in a single file.
  • Automatically renew the report when the code or data changes.
  • Avoid erros by NOT copy-pasteing results.
  • Preserve contextual narrative about why analysis was performed in a certain fashion [1].
  • Documentation for the analytic and computational processes from which conclusions are drawn [1].

Workflow in a nutshell

Combine R with plain text file format to produce documents (e.g., pdfs, HTML documents, etc.)

Look here for all possible instances: http://cran.r-project.org/web/views/ReproducibleResearch.html

In this document we are going to explore making reports using Knitr & Markdown in RStudio


Why RStudio, Markdown, & Knitr?

  1. Classically RR is made using Sweave (R + LaTeX)
    • Sweave, a function in R to make reports from R code weaved with LaTeX.
      • The problem: Not smart enough to reformat/resize things if unfit for a document.
      • Limited to only understanding LaTeX
    • LaTeX, an elegant markup language
      • The problem: Too many conventions to remember to make a simple report.
      • But if mastered, then fruits are worth it!
  2. Knitr, R package to overcome most limitation of Sweave
    • Allows any input languages (e.g. R, Python and Awk) and any output markup languages (e.g. LaTeX, HTML, Markdown and reStructuredText)
  3. Markdown, a light weight markup language that\’s simple, readable, and intuitive.
    • Notice the irony here: markup languages generally take a piece of text and decorations around it to render the final results (i.e. <strong>BOLD TEXT</strong>), while in markdown you are doing the bare minimum!
  4. RStudio, software with builtin support for Knitr and Markdown….and it\’s simply awesome!

What are we doing essentially?

Weave R code into a document ->
Where text is fashioned by markdown ->
Which Knitr will parse and ->
Write the output into a HTML file.

Prerequisites:

Two best friends in RStudio

\"RStudio

  • MD button is a reference to Markdown documentation
  • Knit HTML button will make the report.

Simplest Example

  1. Open a new R markdown document in RStudio. File -> New -> R Markdown.
  2. This launches a sample document to get you started.
    • All the R code is enclosed in triple tick marks called code chunks
    • Each code chunk by default shows up in the final output. You can add various options to fine tune the behavior of code chunks.
  3. Press Knit HTML button to see what it does.

R Code Chunks

To insert an R code chunk, you can type it manually or just press Chunks - Insert chunks or use the shortcut key. This will produce the following code chunk:

The following R code chunk is as follows:

```r
myData <- data.frame(x = 1:10, y = 20:30)
myData
plot(myData)
```

The code chunk input and output is then displayed as follows:

myData <- data.frame(x = 1:10, y = 21:30)
myData
##     x  y
## 1   1 21
## 2   2 22
## 3   3 23
## 4   4 24
## 5   5 25
## 6   6 26
## 7   7 27
## 8   8 28
## 9   9 29
## 10 10 30

plot(myData)

Some example of ggplot!

library(ggplot2)
suppressPackageStartupMessages(library(tabplot))
tableplot(diamonds)

tabplot

Pressing tab when inside the braces will bring up code chunk options.

Most frequently used options (bold text denote default options):

  • comment: display # in front of text when printing text from code chunk. Set to “” to display nothing.
  • echo: display the code chunk itself in the output. (T/F)
  • eval: evaluate the code chunk. (T/F)
  • result: how should the results be displayed. (markup,asis,hide)
  • fig.width or fig.height : dimensions of the figure. (7)
  • cache : save the code chunk computation for reuse later. To rerun cached code chunks, just delete the contents of the cache folder. (T/F)

References & Excellent resources to learn more