proj: Course Projects


Identification of Differentially Expressed Genes from Acute Alcohol Exposure using Linear Models

by Greg Risser and Bobby Link

Alcohol abuse is detrimental to one’s health and is a known source of many ailments. Utilizing a systems approach can help personalize medicine when a patient is knowingly abusing alcohol. Microarray data from whole blood samples taken from six age-controlled test subjects at varying blood alcohol concentrations (BACs) (BAC = 0.04% ascending, 0.08%, 0.04% descending, and 0.02% descending) were filtered for differential expression (DE) based on Local Pooled Error (LPE) tests (FDR < 0.05). DE genes were then used to create many linear models using MATLAB R2016b, and each gene was assigned an r2 value from its corresponding model. DE genes were also subjected to enrichment analysis using DAVID. Significant genes were subjected to literature review to find similarities. Many of the similar genes found had similar gene expression behavior due to ethanol administration. Results from DAVID revealed some oddly enriched GO terms (e.g. multi-organism cellular process, viral process, interspecies interaction between organisms) and KEGG pathways (e.g. Parkinson's disease, Alzheimer's disease, Huntington's disease), but otherwise normal terms including ‘oxidative phosphorylation’, ‘ribosome’, and ‘protein metabolic process’, which reflect the changes that the body makes in prioritizing alcohol metabolism. A limitation of the study is that the microarray data represent changes in the bloodstream and cannot be extrapolated to other relevant tissue types (e.g. brain, liver, or cardiac tissues).

Tags: qsb,binfo

Pathogenicity in Haemophilus influenzae

by Josh Earl, Jeremy Leipzig

Haemophilus influenzae is a significant clinical problem. The virulence of the bacterium is not completely understood. In this project, we attempt to predict the virulence of the bacterium from genetic sequences. Genes were annotated in 875 whole-genome sequences. More than 4000 gene clusters were obtained based on sequence similarity. The presence or absence of these gene clusters in each genome was used to predict virulence.
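
The presence-absence encoding described above can be illustrated with a short Python sketch. It is not the project's code: the function name, the genome labels, and the cluster labels are all hypothetical, and it only shows how a per-genome list of gene clusters could be turned into binary feature vectors for a classifier.

```python
def presence_absence_matrix(genome_clusters):
    """Build binary feature vectors from per-genome gene cluster lists.

    genome_clusters maps a genome label to the gene clusters found in it
    (labels here are hypothetical placeholders).
    Returns (sorted cluster labels, {genome: [0/1 per cluster]}).
    """
    # Collect every cluster seen in any genome; sort for a stable column order.
    clusters = sorted({c for found in genome_clusters.values() for c in found})
    matrix = {}
    for genome, found in genome_clusters.items():
        found = set(found)  # set lookup keeps the membership test cheap
        matrix[genome] = [1 if c in found else 0 for c in clusters]
    return clusters, matrix


# Toy input with made-up labels, not real annotation output.
example = {
    "genome_1": ["cluster_A", "cluster_B"],
    "genome_2": ["cluster_B", "cluster_C"],
}
columns, features = presence_absence_matrix(example)
```

Each row of `features` could then be paired with a virulence label and passed to whichever classifier is chosen; the abstract does not say which model was used.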

Tags: ml,binfo,seq,r

Pathway based functional analysis of IBD patients using metagenomic analysis of human gut microbiome

by Dhruv Sakalley

The human gut microbiome contains a collection of diverse species that help carry out various functions needed for the proper functioning of the human body. However, IBD is a condition where the diversity of this microbiome is significantly altered. Little is known about the causes and effects of these variations in the present literature. Next-generation sequencing techniques provide suitable data for metagenomic analysis, leading to the identification of uncultured microorganisms and making it possible to get detailed insights into the functional footprint of these altered microorganisms. This study uses KEGG pathways to map the functionality of the diverse gene sets in order to better understand function-level changes caused by the altered microbiome in IBD.

Tags: qsb,binfo

HIV-1 Protease Cleavage Sites

by Jessica Eager, Prince Jacob

Uses a database of known HIV-1 cleavage peptides. Uses a neural network and decision trees to distinguish cleavage sites from non-cleavage sites.

Tags: matlab,ml,binfo,seq

Training Artificial Neural Network to identify splice-end sites in eukaryotic DNA sequences

by Jessica Eager and Brinda Kamalia

We describe a few types of machine learning techniques as applied to splice junction sequence data. The key aim is to use nucleotide data to predict whether a sequence will become an exon/intron border, an intron/exon border, or neither. Using an artificial neural network and support vector machine, we perform classification of sequence samples. Principal component analysis was employed to visualize clustering and to aid in SVM training. The ANN performed poorly as compared to a knowledge-based artificial neural network. Our SVM, however, performed better than the KBANN.

Tags: qsb,binfo

Transcriptome Assembly and Differential Expression Analysis for RNA-Seq

by Vivian Wu and Haiyue Lu

Evidence that multiple splice and promoter isoforms are often co-expressed in the same tissue sample raises questions about the distribution of different isoforms across cell types and physiological states. Mapped reads of mouse myoblast sequence fragments were assembled using Cufflinks, and feature extraction was performed to identify significantly expressed primary transcripts across all stages. Of the significant transcripts, 16 of 36 were previously annotated, and 662 of all isoforms were present in all four stages.

Tags: qsb,binfo

Random forest compared to Microarray Analysis to observe coordinate plant defense in Arabidopsis

by David Goodman, Eric Dluhy

Project to compare gene identification of a random forest classifier with that of simple thresholding of fold changes.

Tags: qsb,binfo

Splice site prediction through neural networks

by Peter

Tags: qsb,binfo

Composition of the microbiome at various host body sites using data from the human microbiome project

by Nicole Ferraro

The human body contains a large proportion of bacterial cells that perform a variety of vital functions to maintain human health. Dysbiosis in these communities can lead to disease, and so an understanding of these populations and their functions is essential to understanding how they may impact disease states. Here, we look at sequenced microbiota samples from 15 separate sites on the human body and answer the two fundamental questions associated with the microbiome: Who’s there and what are they doing? The answers to these questions could provide insight into how the microbiota interact with the host and what the potential mechanisms for disease are, and also shed light on the differences in these interactions throughout the body. The Quantitative Insights into Microbial Ecology (QIIME) package was used to conduct this analysis by assigning sequences to OTUs, examining the taxonomy of those OTUs, and assessing host site diversity. We expected to find great variation within these sites, based on the functions necessary for that area of the host, with high beta diversity and lower alpha diversity. This was not consistent with the results, but as the complete sample files could not be analyzed, this finding is not surprising, and the expectation could still hold true if the entirety of the files were analyzed.

Tags: binfo,qsb

DeNovo Genome Assembly

by Carl Eberle

The problem chosen for this project is the de novo alignment and assembly of NGS fragments into a complete genome. Genome assembly is well characterized as a variant of the Shortest Common Superstring (SCS) problem in computer science and is known to be NP-hard. The solution chosen for this problem is a greedy approximation, in which the two sequences with the best overlap are merged into one larger string. The process is then repeated until there are no more strings to overlap.
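
As a rough illustration of the greedy merging loop described above, here is a minimal Python sketch. It is not the project's implementation: the function names and the `min_length` parameter are assumptions made for the example, and it ignores sequencing errors and reverse complements, which a real assembler would have to handle.

```python
from itertools import permutations


def overlap(a, b, min_length=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_length=1):
    """Repeatedly merge the pair of strings with the largest overlap.

    Returns the list of strings left when no pair overlaps by at least
    min_length characters (a single string if everything merged).
    """
    # Drop duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_length)
            if k > best_len:
                best_len, best_a, best_b = k, a, b
        if best_len == 0:
            break  # no remaining overlaps, so stop merging
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads


# Toy reads (made up for the example).
contigs = greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"])
```

Because the merge choice is greedy, the result is a superstring of the reads but is not guaranteed to be the shortest one. This pairwise search is also quadratic in the number of reads per merge, so it is only practical for small inputs.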

Tags: qsb,binfo

Comparison of Hierarchical Clustering Algorithms Applied to Reconstructing B-Cell Receptor Lineages

by Adam Craig

Statistically rigorous methods for reconstructing B-cell receptor lineages exist but are computationally intensive, making application of them to large data sets impractical. Faster clustering methods exist but are predicated on assumptions that may not apply in this biological context. Methods: I propose to compare the results and speeds of several commonly used clustering methods, namely UPGMA, complete linkage, single linkage, neighbor joining, relaxed neighbor joining, maximum parsimony, and spectral hierarchical clustering, to the statistically rigorous iterative maximum likelihood method implemented in Kepler’s Antigen Receptor Probabilistic Parser software.

Tags: binfo,qsb

Gut Microbial Gene Composition Correlated to Inflammatory Bowel Disease

by Yemin Lan

Gene composition of gut microbial communities is potentially associated with several life states or health conditions. This project intends to study the impact of gut microbial organization on human health and well-being by assessing their functional diversity and corresponding disease states. The raw data can be obtained from a human gut microbial gene catalogue established from faecal samples of 124 individuals in a previous study. We first extracted the functional annotation information of non-redundant genes for each sample, then assigned protein families to each annotation using the Pfam database to get abundance information of the protein families covered by these genes. Finally, functional profiles based on the protein families selected by various feature selection methods are built and analyzed by PCA. The functions of selected features will be studied to find biological implications for IBD samples versus healthy gut microbial samples, and to see whether certain feature selection methods are preferable to others in finding differences between distinct environmental samples for this particular analysis based on gut microbial sequences.

Tags: qsb,binfo

Identification of Active Sites

by Rachel Benway

There is a need for a tool that can correctly identify how proteins interact; specifically, how an enzyme interacts with its substrate. Enzymes are a subset of proteins that catalyze reactions by lowering the activation energy needed to complete the reaction. Enzyme-substrate pairs are highly specific, and the enzymes are not consumed by the reaction. Thus, they can perform the reaction over and over. This tool will take the protein sequence and identify pairs of amino acids that are part of active sections of the protein based on seven features of the pair. Decision trees were successfully created with an accuracy of 98%.

Tags: binfo,qsb

Micro Array Analysis for Complement Regulation in Auto-immunity

by Francis Bell

The complement immune system is associated with several auto-immune conditions. These conditions are similar. Analyzing microarray data could reveal a commonality. This could lead to potential new treatment options. So far, microarrays from multiple sclerosis studies have been analyzed. The genes encoding for complement proteins are not consistently abnormally expressed. Only a few profiles have shown a change in their expression. This does not mean that the complement is a cause of primary symptoms. On the contrary, it suggests the complement is the fundamental problem, at least in some cases.

Tags: binfo,qsb

SarmadZe

Tags: qsb,binfo

Finding Drugs of Interest for Non-Small Cell Lung Cancer Using Microarray Analysis and cMap

by Robert Saporito and Adam Wojnar

This analysis looks to determine whether the best lung cancer treatment for a particular individual depends on the cancer stage. Pairwise normal-tumor samples from four datasets were analyzed to determine significant differentially expressed genes. These genes were submitted for enrichment and connectivity map analysis to determine the best drug targets. Meta-analysis was then performed to find common drug targets. Vorinostat and trichostatin A were found to be common drugs of interest between the current study and previous research.

Tags: qsb,binfo

RachelSamantha

Tags: qsb,binfo

Differential Transcript Expression in Various Brain Regions with and without Alzheimer’s Disease

by Jessica Fegely, Brandon Gordon, and Kevt'her Hoxha

Alzheimer’s disease (AD) is a critical disease that affects millions each year, with expectations for significant growth in the next generation. Understanding the mechanisms by which AD functions can be accomplished through studies of both the normal brain and the diseased brain. This analysis of AD microarrays of brain tissue samples was conducted using MATLAB and rankProd in R. The results from this analysis illustrate that the various tissues studied have different genes that are differentially expressed when comparing AD and non-AD brains. The variety of genes shown to be differentially expressed can be used in future studies as potential drug targets or for pathway analysis in the progression of Alzheimer’s disease.

Tags: qsb,binfo

Analyzing miRNA of Glioblastoma and Alzheimer’s Disease

by Alvee Hoque, Maggy Carka, Kathryn Markey

Glioblastoma (GBM) and Alzheimer’s disease (AD) are diseases with a growing impact on the global population as life expectancy increases. There is evidence that leads researchers to believe that GBM and AD involve the same pathways, with an inverse relationship. Using the Gene Expression Omnibus (GEO) datasets from the paper “The analysis of miRNA expression profiling datasets reveals inverse microRNA patterns in glioblastoma and Alzheimer’s disease” (2019), we analyzed and reviewed the paper’s results. Using GEO2R and MATLAB to find p-values and fold changes, we were unable to replicate the findings of the paper. We were not able to identify inverse relationships between the top 10 and bottom 10 significant genes. Although we were not able to find an inverse relationship using miRNA data, we did find molecular pathways similar to those reported in the paper. Running the significant miRNA genes through DAVID produced hits in the endocytosis, TGF-beta, axon guidance, signaling pathways regulating pluripotency of stem cells, and circadian rhythm pathways. This reinforces the evidence that GBM and AD share similar pathways, although we were not able to replicate an inverse association. Further testing must be done to clarify the workings of these pathways in GBM and AD, in order to identify potential therapeutic targets or important biomarkers.

Tags: qsb,binfo

Investigation of transcriptomic differences between non-small cell lung cancer subtypes

by Sarah Blatt and Caijie Wang

Non-small cell lung cancer (NSCLC) is among the most prevalent lung cancers, accounting for 80% of all lung cancer cases. Transcriptional, histopathological and clinical differences have been reported for the two major subtypes, adenocarcinoma (AC) and squamous cell carcinoma (SCC). However, no subtype-specific therapy exists. Further investigation into the transcriptomic variability between SCC and AC could provide insight into potential molecular targets for therapeutic intervention in NSCLC. In a previous study, Kuner et al. performed a global gene expression analysis of AC and SCC samples to identify potential molecular targets. The present study utilizes two of their datasets to re-investigate differences between the two histopathological tumor subtypes through expression profiling of 58 human NSCLC samples. Using MATLAB and the Bioinformatics Toolbox, we identify 726 differentially expressed genes between the SCC and AC subtypes, as well as genes associated with GO biological processes and functional category keywords unique to each subtype. Functional analysis through DAVID revealed a potential deregulation of unique sets of genes associated with cell junctions in both SCC and AC. When investigating a select panel of cell adhesion and EMT genes in 56 human NSCLC samples through quantitative real-time PCR, milder transcriptomic differences were found between SCC and AC. Differential expression analysis revealed discrepancies in the expression patterns of EMT transcripts between the present study and the previous study based on these datasets. When evaluating concordance of the fold changes of SCC relative to AC between qRT-PCR and the corresponding microarray subset, we found a significantly negative correlation.

Tags: qsb,binfo

Comparing Ileal Tissue Gene Expression Based on Age-of-Diagnosis in Pediatric Crohn's Disease

by Rawan Shraim and Walker Alexander

Age-of-diagnosis has proven to play a significant role in disease phenotype and location in pediatric patients diagnosed with inflammatory bowel disease (IBD). Differences in disease course based on age-of-diagnosis are commonly noted in the clinical setting; however, differences in gene expression in mucosal tissue have not been studied before. In this study, pediatric patients are stratified into 2 groups: patients diagnosed with Crohn's disease (CD) at 6-10 years old (A1a) vs 10-18 years old (A1b). The aim is to identify differences in genetic signatures between the 2 groups of patients, taking into consideration the location of disease in each group. An existing mRNA data set generated by Haberman et al. from the NCBI GEO database is utilized. Using Python, analyses completed by Haberman et al. were replicated. Results identified 122 differentially expressed genes between ileocolonic A1a and A1b. A1b demonstrated an upregulation of the inflammatory genes IL11 and Irg1 and downregulation of the alpha-defensin gene DEFA5. Clustering also showed that A1a ileocolonic and colonic and A1b colonic patients clustered together, separately from A1b ileocolonic patients. The overall results support clinical findings that A1b ileocolonic CD patients have higher inflammation and a worse phenotype of IBD in comparison to A1a ileocolonic and colonic patients.

Tags: qsb,binfo

Gene Expression analysis to understand gastric cancer at molecular level

by Pushpita Rahman, Nilima Chavan

Gastric cancer is one of the leading causes of cancer deaths in the world, with advanced stages having poor prognosis. Very little is known about the pathways involved in carcinogenesis and progression or about the genes associated with clinical properties. In this project, we set out to follow a paper's analysis of a high-density oligonucleotide microarray conducted with 22 cancerous gastric tissues and 8 noncancerous tissues to gain a better molecular understanding of carcinogenesis, progression, and diversity of gastric cancer. Our analysis was completed in MATLAB using the Bioinformatics Toolbox. Our results were similar in the expression analysis, which showed a distinction between cancerous and noncancerous tissues. Genes expressed in cancer tissues were identified among the significantly expressed genes (p < 0.05). These results provide a better understanding of the molecular structure of gastric cancers, which improves our understanding of their biological properties. This provides knowledge for the future study of carcinogenesis, progression, and diversity of gastric cancer.

Tags: qsb,binfo

Metagenomic Analysis of IBD Microbiome

by Sanders Clark, Meghan Knecht

While Inflammatory Bowel Disease (IBD) has high prevalence in the United States and symptoms that significantly reduce quality of life for those affected, there is little known about the underlying mechanisms that cause the disease. In order to develop effective treatments, the microbial communities associated with IBD must be investigated. In recent studies, researchers have leveraged metagenomic sequencing to gain insight into the IBD microbiome. In this study, metagenomic sequences of 4 samples from patients with Crohn’s Disease and 7 samples from healthy patients are evaluated. Using a command line tool for a microbiome helper virtual environment, Trimmomatic, Bowtie2, HUMAnN2, and MetaPhlAn2 were used to trim, filter, and taxonomically and functionally analyze the data. Firmicutes and Bacteroidetes were found to be the predominant phyla in CD and healthy patients. Clostridium, Escherichia, Streptococcus and other Clostridiales were found to be more prevalent in CD samples when compared to healthy samples. Finally, pathway analysis revealed that peptidoglycan synthesis and maturation was more prevalent in CD samples.

Tags: qsb,binfo

Gene Expression Changes in Children with Autism

by Cassandra Li and Nhat Duong

The objective of this study was to find gene expression differences in children at different levels on the autism spectrum compared to the general population. Autism is associated with a high degree of heritability, yet specific genes of autism have yet to be identified. Data from the Gene Expression Omnibus (GEO) with Tracking Series No. GSE6575 was downloaded, which includes microarray data from a study involving 61 children from the general population to children with autism. MATLAB was used to calculate the fold change and conduct significance analysis to find differentially expressed genes between these two populations. Then, the DAVID database was used for functional gene analysis. The threshold for significant genes was a fold change greater than 1.5 and a p-value of less than 0.05. Using this data, over 20 genes were found that were differentially expressed. Upon submitting these genes to the DAVID database, it was discovered that most of these genes had functions related to natural killer (NK) cells and their use in attacking infected cells.
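
The thresholding step described above (fold change greater than 1.5 and p-value below 0.05) can be sketched in Python as follows. This is an illustration only, since the project itself used MATLAB. The function names are hypothetical, the p-values are taken as already computed by some significance test, expression values are assumed to be on a linear (not log) scale, and treating a 1.5-fold decrease as also passing the fold-change filter is an assumption the abstract does not confirm.

```python
from statistics import mean


def fold_change(case_values, control_values):
    """Ratio of mean expression in cases to mean expression in controls."""
    return mean(case_values) / mean(control_values)


def select_de_genes(fold_changes, p_values, fc_threshold=1.5, p_threshold=0.05):
    """Return genes passing both the fold-change and p-value thresholds.

    fold_changes and p_values are dicts keyed by gene name.
    Assumption: a change counts in either direction, so a gene passes
    if its ratio is above fc_threshold or below 1 / fc_threshold.
    """
    selected = []
    for gene, fc in fold_changes.items():
        changed = fc > fc_threshold or fc < 1 / fc_threshold
        if changed and p_values[gene] < p_threshold:
            selected.append(gene)
    return sorted(selected)


# Made-up values for illustration; not taken from GSE6575.
toy_fc = {"GENE_A": 2.0, "GENE_B": 1.2, "GENE_C": 0.5, "GENE_D": 1.8}
toy_p = {"GENE_A": 0.01, "GENE_B": 0.001, "GENE_C": 0.03, "GENE_D": 0.2}
hits = select_de_genes(toy_fc, toy_p)
```

In the toy input, `GENE_B` is significant but changes too little, and `GENE_D` changes enough but is not significant, so neither is selected.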

Tags: qsb,binfo

Identifying genes with significantly differential expression in brain tissue from patients with psychiatric disorders

by Angela Tomita, Hung-Yi Wang

More than 350 million people globally are affected by major depression, bipolar disorder, or schizophrenia. Previous studies show promising results for genes with significantly differential expression in brain tissue from patients with psychiatric disorders. Our dataset consists of RNA assays of human cerebellum and parietal cortex tissue from patients with depression, bipolar disorder, or schizophrenia and from controls. We use MATLAB to compute p-values and fold changes and to perform hierarchical clustering, and then analyze the significant genes with DAVID and PANTHER. Across all of our PANTHER GO results, the patterns are similar, even when the diseases are compared with each other. The strongest association is with cellular process. In our PANTHER pathway results, the associated pathways relate to the immune system. When compared with depression, bipolar disorder and schizophrenia show similar patterns. They may influence the same pathway in pathology. Using DAVID for gene enrichment with GO terms and KEGG pathways, only about 20% of the results from this study matched those of publications. In conclusion, despite limitations, a better understanding of genes with significantly differential expression, with a focus on the 20% match, may be the key to development of more effective treatment and intervention for certain psychiatric disorders.

Tags: qsb,binfo

Classification Optimization of Small Cell Lung Cancer Based Upon Transcriptomic Analysis

Small cell lung cancer is the most aggressive form of lung carcinoma and has very poor prognosis. It is caused by a collection of mutations and expression dysregulations leading to changes in gene expression. Transcriptomic analysis can be used to study these gene expression differences. This study provides an optimized stochastic gradient descent model based on cross-validated feature reduction for the classification of small-cell lung cancer samples and healthy controls. From the analysis, it was found that optimized classification accuracy could be reached by using only 2% of the genes as features. Ultimately, the model was able to sort through noise to determine the 1000 top genes for optimal classification.

Tags: qsb,binfo

ParthPatel.MarkWelsh.hiv crispr predict

Tags: qsb,binfo

Jennifer.leukemiaCD4CD8

Tags: qsb,binfo

Differential Expression of Common Genes in Breast, Lung, and Prostate Cancer

by Alan & Chad

In order to better understand common mechanisms in cancer progression, a meta-analysis was performed to identify differentially expressed genes (DEGs) common to lung, breast, and prostate cancer datasets obtained from 6 total microarray analyses. DEGs were identified using t-tests and fold change calculation and thresholding, implemented with Python 3 within the Jupyter Notebook interface. Roughly 3,000 probe targets were found to be differentially expressed, with 24 probe targets common to all 3 cancer types. After translating probe targets into genes, 21 unique DEGs were found to be common to breast, lung, and prostate cancer. While the number is lower than in a comparable study by Makhijani et al. using the same datasets, variations in results are likely attributable to differences in data processing methods. A comparison with relevant literature verified that many of the common DEGs identified in the study have previously been associated with breast, lung, prostate, and other cancers. Moreover, pathway and gene ontology searches with the DEGs via the DAVID database identified many biological processes related to cancer pathogenesis. Thus, the work shows that meta-analyses similar to this study can be used to identify common mechanisms in cancer progression. While the analyses performed here provide a potent tool for exploring the genes involved in cancer, the study could be expanded to use additional datasets, including those from other cancer types, to present a wider and more robust study.
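
The step of finding probe targets common to all cancer types amounts to a set intersection, sketched below in Python. The function name and the probe identifiers are hypothetical placeholders, and the sketch assumes the per-cancer DEG lists have already been produced by the t-test and fold change thresholding; it is not the project's notebook code.

```python
def common_degs(degs_by_cancer):
    """Return the identifiers flagged as differentially expressed in every cancer type.

    degs_by_cancer maps a cancer type to an iterable of probe or gene IDs.
    """
    sets = [set(ids) for ids in degs_by_cancer.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


# Placeholder identifiers, not real probe IDs from the study.
toy_degs = {
    "breast": ["probe_1", "probe_2", "probe_3"],
    "lung": ["probe_2", "probe_3", "probe_4"],
    "prostate": ["probe_3", "probe_2", "probe_5"],
}
shared = common_degs(toy_degs)
```

Intersecting at the probe level and then mapping probes to genes, as the abstract describes, can give fewer unique genes than probes when several probes target the same gene.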

Tags: qsb,binfo

BenStear.MohammadHossain.mir135b.breast and prostate cancer

Tags: qsb,binfo

Gene Expression Profiles Based off the FAB Classification System Using Microarray Data

by Nick OGrady, Andrew Kaiser

Tags: qsb,binfo

MicroRNA Classifiers for Predicting Prognosis of Squamous Cell Lung Cancer

by Mengxi Yang, Heba Abid

The purpose of this research article is to demonstrate that distinct miRNA profiles exist in lung SCC and may be robust predictors of prognosis in this disease [1]. In this project, we use microarray data to re-analyze the paper's results and then compare our final results with those of the original paper. First, we used MATLAB to find the significant genes and TargetScan to find the miRNA targets for further analyses. We then used DAVID enrichment for gene ontology and pathway analysis. Because we used different analytical tools from the original article, our results differ from the original research results. However, we had some similar results in miRNA significance and gene enrichment.

Tags: qsb,binfo

Renae

Tags: matlab,gui,biocomp,binfo,seq

Gene selection and classification of microarray data using Neural network and AdaBoost

by Anna Lu, Yang Wan

Gene classification is a common task in expression studies. However, univariate gene selection by ranking is inaccurate and insufficient for multiclass microarray data. AdaBoost and neural networks are proposed as more robust methods for microarray data classification that are appropriate for multi-class problems. MATLAB’s built-in neural network library and AdaBoost in MATLAB’s Classification Learner app are used to assess the error rate in classification of three microarray datasets: leukemia, brain, and NCI 60 human tumor cell lines. These three microarray data sets are selected for their variable number of classes. Findings show that AdaBoost provides greater accuracy than the neural network for the three selected microarray datasets. Both AdaBoost and the neural network are robust to a variable number of classes: two, five, and eight for leukemia, brain, and NCI 60 respectively. github: https://github.com/ahl54/BMES543_microarray_ML

Tags: qsb,binfo

3D structure prediction of proteins from dihedral angles

by Tijo Abraham

The structure of a protein determines its function. There are hundreds of thousands of genes translated to proteins but only about 40,000 solved protein structures. Experimental methods such as X-ray crystallography and NMR are time consuming and expensive, and it is not possible to identify the structure of all proteins with them. There are several algorithms that are used to predict the secondary structure of proteins, but they are not very accurate. My proposed method uses a neural network to predict the dihedral angles of a given sequence. The backbone of the protein will then be calculated from these predicted dihedral angles using trigonometric calculations.

Tags: binfo,qsb

Micro Array Analysis of gene expression profiles of cancer stem cells: Acute myeloid leukemia (AML)

by Aleah Kenner

The purpose of this project was to gain understanding about the effect of clustering data analysis. Currently, two methods are commonly used: hierarchical clustering and k-means clustering. For this project I want to look into a lesser-used method, bi-clustering.

Tags: binfo,qsb

Gene function prediction using sequence similarity

by Amy, Deepthi

Predicting gene function is one of the biggest challenges in biology today. Many methods have been proposed to date, but it is not clear which method can be trusted in terms of efficiency, usability and performance. The project aims to create an algorithm that can link to the NCBI database when a gene is queried. The query is compared with the gene sequences in the database, and the function for the query will be determined based on the predefined methods.

Tags: binfo,qsb

Time course analysis of microarray data in Newcastle disease

by Bailu

Analyzes a GEO time-series experiment on Newcastle disease. Extracts and clusters time-course patterns using StepMiner.

Tags: binfo,qsb

Modelling Predictors of Molecular Response to Targeted Treatment of Cancer Cells With Tyrosine Kinase Inhibitor Dasatinib

by Eaindra Tin Latt

Tags: advbcomp,qsb,binfo
