Transformative Insights into RNA-Seq Data Analysis Using PCA
Introduction to RNA Sequencing
In this discussion, we will explore how RNA sequencing analysis is revolutionizing medical research. By the end, you will understand how machine learning algorithms can provide valuable insights into various diseases, and I’ll share the Python code for these processes. Let’s start by establishing the context.
Pancreatic Adenocarcinoma (PAAD) ranks as the third leading cause of cancer-related deaths, with a dismal 5-year survival rate of under 5%. Projections indicate that by 2030, it may become the second leading cause of cancer mortality in the United States.
Cognitive Computing in Modern Research
Cognitive computing is increasingly viewed as a critical aspect of technological advancements. As users, we often overlook the profound impact technology has on our daily lives.
Ribonucleic Acid (RNA) is a vital polymeric molecule that plays essential roles in coding, decoding, regulating, and expressing genes. RNA sequencing, or RNA-Seq, is a technology that quantifies RNA levels in biological samples at specific time points.
Dataset Overview
We have a dataset of normalized RNA sequencing reads from pancreatic cancer tumors, nominally covering about 20,000 genes across 185 samples; the parsed table is slightly smaller, as the shapes below show. The data is stored in GCT, a tab-delimited format that holds gene expression values together with metadata for each sample. Once parsed, the GCT file behaves like three aligned 2-D DataFrames:
- data_df: Contains 18,465 rows (Gene ID) and 183 columns (Sample Name/ID).
- row_metadata_df: Reserved for row (gene) metadata, but empty here because the file carries none.
- col_metadata_df: Includes 183 columns (Sample Names/ID) and 124 rows detailing column metadata, such as histological type and patient status.
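To make the relationship between these three tables concrete, here is a minimal sketch that splits a tiny GCT-style text into the same three pieces. It assumes the GCT 1.3 layout (a version line, a dimensions line, then one tab-delimited table with the column-metadata rows stacked above the expression rows). The toy file contents and the `read_gct` helper are illustrative assumptions, not the code used for this analysis.

```python
import io

import pandas as pd

# Toy GCT 1.3 text (made-up values): 3 genes, 2 samples,
# 0 row-metadata columns, 2 column-metadata rows.
TOY_GCT = (
    "#1.3\n"
    "3\t2\t0\t2\n"
    "id\tsample_A\tsample_B\n"
    "histological_type\tadenocarcinoma\tneuroendocrine\n"
    "patient_status\talive\tdeceased\n"
    "GENE1\t5.1\t7.3\n"
    "GENE2\t\t2.0\n"
    "GENE3\t9.9\t10.4\n"
)


def read_gct(handle):
    """Split a GCT 1.3 text stream into data, row-metadata and column-metadata frames."""
    handle.readline()  # version line, e.g. "#1.3"
    # Dimensions line: data rows, data columns, row-metadata columns, column-metadata rows.
    _, _, n_row_meta, n_col_meta = (int(x) for x in handle.readline().split())
    # Everything after the two header lines is one tab-delimited table.
    table = pd.read_csv(handle, sep="\t", index_col=0, dtype=str)
    col_metadata_df = table.iloc[:n_col_meta, n_row_meta:]
    row_metadata_df = table.iloc[n_col_meta:, :n_row_meta]
    # Expression values arrive as text; convert them column by column.
    data_df = table.iloc[n_col_meta:, n_row_meta:].apply(pd.to_numeric)
    return data_df, row_metadata_df, col_metadata_df


data_df, row_metadata_df, col_metadata_df = read_gct(io.StringIO(TOY_GCT))
print(data_df.shape, row_metadata_df.shape, col_metadata_df.shape)
```

In this layout the metadata fields are rows and the samples are columns, which matches the 124-row by 183-column shape described above.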
Data Cleaning and Gene Expression Distribution
Out of the 18,465 rows, 4,367 contained NULL values in some columns, which were removed during the cleaning process. This left us with 14,098 rows and 183 columns, with each row corresponding to a Gene ID and each column representing a unique sample.
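A cleaning step like this one can be sketched with pandas as follows. The small table is made-up toy data; only the idea of dropping every gene row that has a missing value in any sample comes from the text.

```python
import numpy as np
import pandas as pd

# Toy expression table: rows are gene IDs, columns are samples.
data_df = pd.DataFrame(
    {
        "sample_A": [5.1, np.nan, 9.9, 3.2],
        "sample_B": [7.3, 2.0, 10.4, np.nan],
        "sample_C": [6.0, 2.2, 11.1, 4.8],
    },
    index=["GENE1", "GENE2", "GENE3", "GENE4"],
)

# Flag every gene that is missing a value in at least one sample.
rows_with_nulls = data_df.isna().any(axis=1)

# Drop those genes; how="any" removes a row if any column is null.
clean_df = data_df.dropna(axis=0, how="any")

print(f"{rows_with_nulls.sum()} of {len(data_df)} rows removed; {clean_df.shape} remain")
```

Counting the flagged rows before dropping them gives a quick sanity check: the number removed plus the number remaining should equal the original row count, just as 4,367 + 14,098 = 18,465 above.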
Next, we plotted gene expression across all samples as a heatmap. Expression values range from 0 to 15; the x-axis shows sample names and the y-axis shows gene IDs. The color of each cell encodes the expression level of one gene in one sample, and the color bar on the right maps colors back to values.
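A heatmap of this kind could be drawn along the following lines. This is a sketch with Matplotlib on random toy data scaled to the 0 to 15 range; the plotting library, color map and figure size are my assumptions, since the text does not say how the original figure was produced.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script also runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy expression table with values between 0 and 15 (random, for illustration only).
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.uniform(0, 15, size=(20, 6)),
    index=[f"GENE{i}" for i in range(20)],
    columns=[f"sample_{j}" for j in range(6)],
)

fig, ax = plt.subplots(figsize=(6, 5))
# One cell per (gene, sample) pair; color encodes the expression value.
image = ax.imshow(expr.to_numpy(), aspect="auto", cmap="viridis", vmin=0, vmax=15)
ax.set_xticks(range(expr.shape[1]))
ax.set_xticklabels(expr.columns, rotation=90)
ax.set_xlabel("Sample")
ax.set_ylabel("Gene ID")
# The color bar maps colors back to expression values.
colorbar = fig.colorbar(image, ax=ax)
colorbar.set_label("Expression")
fig.tight_layout()
```

Calling `fig.savefig("heatmap.png")` afterwards writes the figure to disk. With thousands of genes, individual y-axis labels become unreadable, so they are left off here.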
Utilizing PCA for Data Preparation
To focus on exocrine (adenocarcinoma) tumors, we excluded neuroendocrine tumors. This preparation step matters because PCA is then used to visualize the data. Based on the cancer types recorded in the 'histological_type_other' metadata field, we expect to observe two distinct clusters.
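Excluding samples by a metadata field might look like the sketch below. The label strings, the sample names and the substring match on "neuroendocrine" are assumptions for illustration; the actual values stored in 'histological_type_other' may be worded differently.

```python
import pandas as pd

# Toy column metadata: fields are rows, samples are columns (hypothetical labels).
col_metadata_df = pd.DataFrame(
    [["adenocarcinoma", "neuroendocrine carcinoma", "adenocarcinoma", None]],
    index=["histological_type_other"],
    columns=["sample_A", "sample_B", "sample_C", "sample_D"],
)

# Toy expression table sharing the same sample columns.
data_df = pd.DataFrame(
    [[5.1, 7.3, 6.0, 4.4], [9.9, 10.4, 11.1, 8.7]],
    index=["GENE1", "GENE2"],
    columns=col_metadata_df.columns,
)

labels = col_metadata_df.loc["histological_type_other"]
# Case-insensitive match; samples with no label are treated as not neuroendocrine.
is_neuroendocrine = labels.str.contains("neuroendocrine", case=False, na=False)
kept_samples = labels[~is_neuroendocrine].index

exocrine_df = data_df.loc[:, kept_samples]
print(list(exocrine_df.columns))
```

Note the choice made for unlabeled samples: here they are kept. Depending on the analysis, dropping them instead may be the safer option.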
The initial PCA plot shows most data points clustering within a narrow range of PC1 and PC2 values. Outlier samples, defined as those with PC1 values below -100 or above 100 and PC2 values below 50, were then filtered out.
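The projection and the outlier filter can be sketched as follows. PCA is computed here through a singular value decomposition in NumPy, which is my substitution since the text does not name a library. The thresholds come from the text, but reading them as "both conditions must hold" is an assumption, and the data is synthetic.

```python
import numpy as np


def pca_scores(matrix, n_components=2):
    """Project the rows of `matrix` (samples x genes) onto the top principal components."""
    centered = matrix - matrix.mean(axis=0)  # PCA requires mean-centered features
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # sample coordinates on PC1, PC2


def is_outlier(scores, pc1_limit=100.0, pc2_floor=50.0):
    """Flag samples with |PC1| above the limit AND PC2 below the floor (assumed reading)."""
    pc1, pc2 = scores[:, 0], scores[:, 1]
    return (np.abs(pc1) > pc1_limit) & (pc2 < pc2_floor)


# Synthetic data: 30 samples x 200 genes, with sample 0 shifted far from the rest.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(30, 200))
samples[0] += 40.0

scores = pca_scores(samples)
mask = is_outlier(scores)
filtered = samples[~mask]
print(f"{mask.sum()} outlier(s) removed; {filtered.shape[0]} samples remain")
```

Since expression tables are usually stored as genes by samples, the real matrix would need transposing (for example `data_df.T.to_numpy()`) before being passed in. The sign of each component is arbitrary, which is why the PC1 check uses the absolute value.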
Understanding Interferons in Pancreatic Adenocarcinoma
Interferons (IFNs) are signaling proteins produced by host cells in response to various pathogens, including viruses and tumor cells. Type I interferons, a significant subgroup, play a role in regulating immune responses. The genes associated with Type I interferons, collectively known as the Type I IFN signature, comprise a set of 25 genes in humans.
We will now create a DataFrame featuring these 25 genes as rows, with the sample names serving as columns, to visualize gene expression related to pancreatic adenocarcinoma.
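Building that DataFrame amounts to selecting the signature genes from the expression table. In the sketch below, the short gene list is a placeholder: ISG15, IFIT1 and MX1 are well-known interferon-stimulated genes, but this is not the 25-gene signature used in the article, and the expression values are made up.

```python
import pandas as pd

# Placeholder signature (NOT the article's 25-gene list); the last entry
# is deliberately absent from the data to show how gaps can be reported.
ifn_genes = ["ISG15", "IFIT1", "MX1", "NOT_IN_DATA"]

# Toy expression table: rows are gene IDs, columns are samples.
data_df = pd.DataFrame(
    [[10.2, 11.8], [4.1, 3.9], [11.5, 12.0], [7.7, 8.1]],
    index=["ISG15", "ACTB", "MX1", "IFIT1"],
    columns=["sample_A", "sample_B"],
)

# Keep signature order, but only for genes that survived cleaning.
present = [gene for gene in ifn_genes if gene in data_df.index]
missing = sorted(set(ifn_genes) - set(present))

ifn_df = data_df.loc[present]
print(ifn_df)
print("Missing from data:", missing)
```

Checking for missing genes first avoids a `KeyError` from `.loc`, which can happen if a signature gene was removed along with the rows containing NULL values.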
In the heatmap, expression values for these genes cluster mainly around 10 and 12, with fewer values near 4. We then ran the GSVA (Gene Set Variation Analysis) algorithm on the data; its output is printed to the terminal.
Conclusion
In summary, we explored RNA sequencing data and the GCT format used to store it. We applied PCA to visualize the samples and filter outliers, walked through how to read the resulting plots, and examined the role of interferons in immune response. Finally, we ran GSVA in Docker to gain specific insights into the data.
For the complete code, feel free to visit my GitHub profile.