DE-analysis.knit

—title: “DE analysis” author: “riya” date: “2023-07-18” output: html_document —

RNA seq analysis

There is ongoing research focused on developing therapies for Ewing sarcoma, a type of pediatric bone cancer. One key target is the EWSR1-FLI1 fusion oncogene, which results from the fusion of the EWSR1 and FLI1 genes. However, the exact effects of suppressing EWSR1-FLI1 on the gene expression patterns within Ewing sarcoma tumor cells are still not fully understood.

Loading the necessary libraries

library(dplyr)
library(tidyr)
library(DESeq2)
library(pheatmap)
library(EnhancedVolcano)
library(ggplot2)
library(org.Hs.eg.db)
library(AnnotationDbi)
library(RColorBrewer)
library(GenomicRanges)
library(clusterProfiler)

Reading the data

The dataset is in RangedSummarizedExperiment format, which is a common format used for storing and analyzing genomic data. The metadata associated with this dataset includes a column called “condition” with two levels: (1) “shEF1” representing EWSR1-FLI1 knock-down samples, and (2) “shCTR” representing control samples. There are a total of three shCTR samples and four shEF1 samples in the dataset.

rse <- readRDS("data/EwS.rds")

rse$condition <- ifelse(rse$condition=="shCTR","control","knockout")

fixed_rownames <- sapply(strsplit(rownames(rse),".",fixed = T), function(x) x[1])
rownames(rse) <- fixed_rownames

Here, the rownames or genes in our dataset is represented by ENSEMBL Id along with the version number. I am removing the version number to make it easier for downstream analysis such as performing gene annotation.

Creating DESeq2 object

To run our data through DESeq2, we need to convert it into DESeqDataSet object. For this, we will need a count matrix and a design formula. The design formula specifies column(s) from the metadata we are interested in and how to use it for analysis. Here in our experiment, I am interested in studying differences in gene expression in two different conditions, hence I have specified column condition as my design formula.

dds <- DESeqDataSet(rse,design = ~condition )

## converting counts to integer mode

dim(dds)

## [1] 58037     7

Filtering out any genes with very low expression for all the samples

dds <- dds[rowSums(counts(dds)) >=10,] 
dim(dds)

## [1] 38632     7

Normalize the count data

Normalization is performed to account for differences in library sizes and other sources of technical variation, enabling accurate comparisons of gene expression levels. By normalizing the count data, it becomes possible to remove potential biases and obtain a more reliable representation of gene expression levels, facilitating comparisons between genes across different samples.

Raw expression values are adjusted on the basis of size factors.

# getting the size factors for each sample and assigning it back to dds object.
dds <- estimateSizeFactors(dds)
# getting the normalised counts
normalized_counts <- counts(dds, normalized = TRUE)

Quality control

Before moving along with further analyses, it is often recommended to perform quality control checks on the count data to help us make sure that samples look good. In RNA seq analysis, this helps to provide answers to common questions like - how similar/different samples are ? are there any outliers? Where does the maximum variation in our dataset arise from?

Common sample-level QC include unsupervised clustering methods like Principal Component Analysis (PCA) and hierarchical clustering methods. To improve clustering/distances for visualization, normalised counts are transformed prior to applying these QC methods. DESeq2 offers two transformation methods namely- regularized log (rlog) transformation and variance stabilising transformation (vst).

Why transform data?

When conducting exploratory analysis for high dimensional data using PCA, or hierarchical clustering, it is best that the variance of data remains constant at different ranges of the mean values. But, when dealing with count data such as RNA-seq, it is common to see variance increase with the mean (heteroskedasticity). This can lead to biased representation of data where variables with larger variance dominate the PCA results. So, transformation helps to stabilize the variance of the data, improving further statistical analysis.

rld <- rlog(dds, blind= TRUE)

The function rlog() returns an object based on RangedSummarisedExperiment class. The assay slot contains the actual count which we have stored inside rld object.

rlog vs vst? rlog works well with smaller datasets, but when it comes to larger datasets (hundreds of samples) or ones with high count outliers, vst outperforms.

1. PCA

plotPCA(rld, intgroup = "condition")

PCA aims to identify patterns and reduce the complexity of high-dimensional datasets by transforming the data into a new set of linear variables called principal components. PC1 or the first component accounts for most variation,PC2 accounts for second most and so on. Each data point represents a sample, and its position in the plot reflects it’s relationship to the other samples in the dataset. Here’s more on PCA PCA- explained.

Here in above PCA plot, a clear clustering between knockout and control samples indicates that the gene expression differences between two conditions are strong enough to be captured by principal components. The knockout samples are dispersed along PC1, suggesting that they have diverse gene expression profiles compared to control samples -which are seen clustered together around PC2. This clustering around PC2 indicates that they have similar gene expression levels for the genes that contribute to the variation captured by PC2. Interestingly, knockout samples seem to be divided into two sub clusters suggesting likelihood of additional differences in gene expression profiles among these sub clusters.

2. Hierarchical clustering

# creating matrix from the transformed rld object for visualisation
rld_matrix <- assay(rld)
# creating correlation matrix from rld_matrix
rld_cor <- cor(rld_matrix)
## creating annotation df 
annotation <- data.frame(condition = rse$condition)
rownames(annotation)<-colnames(rld_matrix)

pheatmap(rld_cor,annotation_col = annotation)

We can see good clustering among samples belonging to two conditions. Within each conditions, samples should be more similar to each other than samples from other conditions. This indicates that gene expression patterns are consistent within each condition. Hence, the heatmap indicates that the conditions are distinct and can be distinguished based on gene expression patterns.

So far, our samples look good and ready for further analysis.

Performing differential analysis

Let’s run the actual differential analysis using DESeq() function.

dds <- DESeq(dds)

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

In this single line of code, DESeq2 performs normalization, estimation of dispersion, and shrinkage estimation to improve the accuracy of differential expression analysis. It then fits a negative binomial model and conducts hypothesis testing to identify genes with significant changes in expression between the conditions of interest.

# obtain the results
res <-results(dds)

The output of DESeq() is accessed using results() function which contains all the information regarding differentially expressed genes and their statistical properties.

Dispersion plot

plotDispEsts(dds)

The dispersion plot allows us to verify that the model fits to our data. When looking for a good fit, you expect your data to generally scatter around the curve, with the dispersion decreasing with increasing mean expression levels which is what we see in above plot.

Here, the blue “fitted” values represent initial dispersion estimates before any shrinkage. Red represents final estimates i.e. after the shrinkage has been applied. The circled genes have extreme variance and are not well fitted by the red dispersion line.

MA plot

plotMA(res,ylim = c(-4,4))

The MA plot helps to visualize the magnitude (M) and abundance (A) of gene expression changes, particularly for upregulated and downregulated genes. Genes with similar expression values at both control and knockout conditions cluster around M=0 representing no significant difference in gene expression between two conditions. Points lying above the centerline (M=0) represent upregulated genes, indicating higher expression in one condition compared to other. Conversely, points lying below this center line are downregulated genes.

Histogram plot of p-value

res_df <- data.frame(res)
# histogram of p values
ggplot(res_df, aes(x=pvalue))+
  geom_histogram(bins = 100)

Visualising histogram plot for p-values helps to see how the null and alternative hypotheses are distributed. The plot we see above has a peak near near 0 which falls uniformly as we move towards right. This represents a “well-behaved” set of p-values (anti-conservative p values). The taller the peak, the more p-values are close to 0 and are significant. Here’s more on interpreting p-value histogram why plot p-values?.

#Set thresholds
padj.cutoff <- 0.05
lfc.cutoff <- 0.58
#subset significant genes
sig_res <- filter(res_df, padj<padj.cutoff& abs(log2FoldChange)> lfc.cutoff)
dim(sig_res)

## [1] 9531    6

# Adding gene annotation to the table with significant genes
anno <- AnnotationDbi::select(org.Hs.eg.db,rownames(sig_res),
                              columns = c("ENSEMBL","ENTREZID","SYMBOL","GENENAME"),
                              keytype = "ENSEMBL")
results <- cbind(ENSEMBL = rownames(sig_res),sig_res)
anno_results <- left_join(as.data.frame(results),anno)

Volcano Plot

#Volcano Plot
EnhancedVolcano(anno_results, x = "log2FoldChange", y = "padj", lab = anno_results$SYMBOL,
                title = "DESeq2 results",
                subtitle = "Differential expression",
                border = "full")

Similar to MA plot, volcano plot helps to visualise differences in gene expression between two conditions except volcano plot accounts for statistical significance in additions to log fold change (M value). Genes with higher statistical significance or lower p-values are plotted higher up on the Y axis, representing significantly differentially expressed genes .

Heatmap vaisualisation for differntially expressed genes

#Set a color palette
heat_colors <- brewer.pal(8, "YlOrRd")
DE_genes<-normalized_counts[anno_results$ENSEMBL,]

pheatmap(DE_genes,
         annotation_col = annotation,
         color=heat_colors,
         show_rownames = F,
         cluster_rows = T,
         border_color = NA, 
         fontsize = 10, 
         scale = "row", 
         fontsize_row = 10, 
         height = 20,
         main = "differentially expressed genes")

Enrichment analysis

The goal with gene enrichment analysis is to identify the biological functions, pathways or processes that are over represented or enriched in our data.

# Subset the data frame to remove rows with NA values
anno_results <- anno_results[rowSums(is.na(anno_results)) == 0, ]
anno_results <- anno_results[which(duplicated(anno_results$ENTREZID)==F),]

It is always best to remove any missing or redundant values from the data to prevent from introducing biases further in the analysis.

# extract the fold changes
foldChanges <- anno_results$log2FoldChange
names(foldChanges)<- anno_results$ENTREZID
## Sort fold changes in decreasing order
foldChanges <- sort(foldChanges, decreasing = TRUE)
foldChanges <- setNames(as.data.frame(foldChanges), "log2FoldChange") %>%
  cbind(ENTREZID = rownames(.))

gene_list <- foldChanges$ENTREZID
kegg_enrich<-enrichKEGG(
  gene_list,
  organism = "hsa",
  keyType = "kegg",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  minGSSize = 10,
  maxGSSize = 500)

## Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...

## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...

kegg_res <- kegg_enrich@result %>%
  as.data.frame()%>%
  arrange(pvalue)

top_pathways <- head(kegg_res, 10)

# Plot the top 10 enriched pathways
ggplot(top_pathways, aes(x = reorder(Description, -log10(pvalue)), y = -log10(pvalue))) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Top 10 Enriched KEGG Pathways",
       x = "Pathways",
       y = "-log10(p-value)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))