%% Microarray Data Analysis % In this tutorial, we will analyze microarray data available from GEO % repository hosted at NCBI (http://www.ncbi.nlm.nih.gov/geo/). % We will rely on Matlab's Bioinformatics Toolbox for some of the file % parsing and data analysis functionality. % %% % We'll analyze the experimental data that was used in the following study: % "A stromal gene signature associated with inflammatory breast cancer.", % Boersma et.al., 2008. https://www.ncbi.nlm.nih.gov/pubmed/17999412 %% % *Abstract:* The factors that determine whether a breast carcinoma will % develop into inflammatory breast cancer (IBC) remain poorly understood. % Recent evidence indicates that the tumor stroma influences cancer phenotypes. % We tested the hypotheses that the gene expression signature of the tumor % stroma is a distinctive feature of IBC. We used laser capture microdissection % to obtain enriched populations of tumor epithelial cells and adjacent stromal % cells from 15 patients with IBC and 35 patients with invasive, noninflammatory % breast cancer (non-IBC). Their mRNA expression profiles were assessed using % Affymetrix GeneChips. In addition, a previously established classifier for % IBC was evaluated for the resulting data sets. The gene expression profile % of the tumor stroma distinguished IBC from non-IBC, and a previously established % IBC prediction signature performed better in classifying IBC using the gene % expression profile of the tumor stroma than it did using the profile of the % tumor epithelium. In a pathway analysis, the genes differentially expressed % between IBC and non-IBC tumors clustered in distinct pathways. We identified % multiple pathways related to the endoplasmic stress response that could be % functionally significant in IBC. Our findings suggest that the gene expression % in the tumor stroma may play a role in determining the IBC phenotype. %% % The data is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5847 %% GEO data types % * GSM: An individual microarray Sample (e.g., patient, or tissue). % * GSE: A Series, representiing an experimental study. A GSE contains one % or more GSM entries. % * GPL: Platform data, containing information about microarray probes. % Each GSM sample is associated with a GPL. E.g., to find out the gene % symbols for a GSM, you would need to consult with the GPL used in that % study. %% Variable names and terms % * gse: The struct returned from Matlab's geoseriesread(). Contains all % the information contained in a GSE. % * d: The DataMatrix object representing the GSE data. We'll also use the % prefix d to denote other DataMatrix objects, e.g., dpvals. % * m: A numerical matrix (e.g., converting d into a double). % * MAP_GSE_GPL: used to map gse probes to gpl indices, so we can get e.g., % gene symbol information for a probe. % IBC: inflammatory breast cancer % Epithelial: the type of cells forming the breast cancer % Stroma: fatty tissue (adipocytes) and fibroblasts if true %% Download GSE file and use Matlab's geoseriesread() to parse it. % https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5847 % See the bmes_downloadandparsegse() function for details. gse=bmes_downloadandparsegse('GSE5847') % gse.Data is a DataMatrix object, similar to a Matlab table. In addition % to the numerical matrix, it internally stores the row & column names. get(gse.Data) d = gse.Data; %Let's use d for short, because we'll use this often. %% % Print first 6 Probes and first 5 samples d(1:6,1:5) %% GSE Header information gse.Header %% gse.Header.Series %% % Get the title of the first sample gse.Header.Samples.title{1} %% % Print all of the available characteristics of the 20th sample gse.Header.Samples.characteristics_ch1(:,20) %% % Print the diagnostic information of all samples gse.Header.Samples.characteristics_ch1(1,:) %% GPL Platform used in this study gse.Header.Series.platform_id %% % GPL information is contained in a separate file. Download and parse it. % See bmes_downloadandparsegpl(). gpl = bmes_downloadandparsegpl('GPL96') %% % Here's the type of information we get for each microarray Probe. gpl.ColumnNames %% %Let's look at the information for the first probe: [gpl.ColumnNames gpl.Data(1,:)'] %% %% Translate probesets to gene symbols gplprobes = gpl.Data(:, strcmp(gpl.ColumnNames, 'ID')); gplgenes = gpl.Data(:, strcmp(gpl.ColumnNames, 'Gene Symbol')); gseprobes = d.rownames; MAP_GSE_GPL = zeros(numel(gseprobes),1); % For each gseprobe, we need to search gplprobes and use the corresponding % gene. Doing string comparison for each of them will be too slow. Let's % use a Map container to speed this up. map = containers.Map(gplprobes,1:numel(gplprobes)); for i=1:numel(gseprobes) if map.isKey(gseprobes{i}); MAP_GSE_GPL(i)=map(gseprobes{i}); end end gsegenes = gseprobes; %make a copy, so entries not found will keep the probe name. gsegenes(find(MAP_GSE_GPL)) = gplgenes(MAP_GSE_GPL(find(MAP_GSE_GPL))); % Print first 5 probes and the gene symbols we found for them. We convert % to table() just because we like how it prints. table(gseprobes(1:5), gsegenes(1:5),'VariableNames',{'gseprobe','gsegene'}) % continue from here. %% % TODO: the above code is useful. Think about generalizing it to a function % that takes gpl structure and gseprobes and returns any GPL column (e.g., % genesymbol, refseq id, etc.) you need. %% % Let's replace the DataMatrix object so it uses genes as rownames. % we would've liked to say "d.rownames=gsegenes", but datamatrix doesn't % support that. d = d.rownames(':',gsegenes); %% % print a random selection of rows and columns. d( randi(d.size(1),1,6), randi(d.size(2),1,4)) %% Data Analysis: Determine sample groups we'll work with % We are often interested in comparing groups of samples. We need to look % at the header information and decide which information for samples we can % use to group them. For this experiment, we'll work with four groups, % using IBC vs. non-IBC and stroma vs. epithelial characteristics. % Where one finds the sample information is experiment-specific, but % the Header.Samples structure usually contains what we need. %% % first characteristics_ch1 contains diagnosis info (IBC vs. non-IBC) samplegroups = gse.Header.Samples.characteristics_ch1(1,:); unique(samplegroups)' %% % source_name_ch1 contains tissue source info (stroma vs. epithelium) samplesources = gse.Header.Samples.source_name_ch1; unique(samplesources)' %% % create logical vectors to record which samples are IBC and which are % stroma. Iibc = strcmpi(samplegroups,'diagnosis: ibc'); Istroma = strcmpi(samplesources, 'human breast cancer stroma'); Istroma_ibc = Istroma & Iibc; Istroma_nonibc = Istroma & ~Iibc; Iepi_ibc = ~Istroma & Iibc; Iepi_nonibc = ~Istroma & ~Iibc; % create a numerical vector to assign each sample to a group 1-4. Igroups=zeros(1,numel(samplegroups)); Igroups(Istroma_ibc) = 1; Igroups(Istroma_nonibc) = 2; Igroups(Iepi_ibc) = 3; Igroups(Iepi_nonibc) = 4; groupnames={'sIBC' 's~IBC' 'eIBC' 'e~IBC'}; % let's change the columnnames in the datamatrix. colnames=d.colnames; for i=1:4; colnames(Igroups==i) = groupnames(i); end d=d.colnames(':',colnames); %this really means: "d.colnames=colnames;" %% % print a random selection of rows and columns. d( randi(d.size(1),1,6), randi(d.size(2),1,5)) %------------------------------------------------------------------------- %% Data Analysis: Find differentially expressed genes between groups of samples. % Let's find the significantly differentially expressed genes between IBC and % non-IBC samples, in stroma samples. (We are not using the epithelial % samples in this analysis.) [~,pvals]=ttest2(d(:,Istroma_ibc)', d(:, Istroma_nonibc)'); fpvals = mafdr(pvals); % TODO: create dpvals from fpvals. %% % Matlab also offers mattest() function, which can use permutation tests % for false discovery rate correction. It takes a longer time to compute.. [dpvals]=mattest(d(:,Istroma_ibc), d(:, Istroma_nonibc), 'permute',10); %% % Print the list of top 5 most significantly different genes. (Note that % dpvals is a DataMatrix object, so we'll just have it print itself. dpvals_sorted = dpvals.sortrows('p-values'); fprintf('Found %d genes with pvalue<=0.01\n',nnz(dpvals(:,1) <= 0.01)); fprintf('Top 5 most significantly different genes between IBC and non-IBC stroma samples:\n'); disp(dpvals_sorted(1:5,:)) %% Plot the expression levels of top 5 genes in each group [~,Itop5]=sort(dpvals.double(:,1)); Itop5 = Itop5(1:5); top5avg = zeros(5,4); top5std = zeros(5,4); for gi=1:4 top5avg(:,gi) = mean(d.double(Itop5, Igroups==gi),2); top5std(:,gi) = std(d.double(Itop5, Igroups==gi),[],2); end %bmes_fig top5; clf bar(top5avg); ylabel('log2(expression)') h=gca; h.XTickLabel=d.rownames(Itop5); legend(groupnames); %bmes_setniceylim; %% Plot the expression levels of top gene using boxplots boxplot(d.double(Itop5(1),:)',Igroups'); % TODO: set xticks / ylabel. %% % pvalue is good to know, but we usually also want to know the fold % changes (How different the average expression is between groups). % Note that this GSE data is already log2-transformed. So, calculation of % fold change should involve a subtraction, not a division! %log2fc = log2( mean(d(:,Istroma_ibc),2) ./ mean(d(:, Istroma_nonibc),2) ); log2fc = mean(d(:,Istroma_ibc),2) - mean(d(:, Istroma_nonibc),2); scatter(log2fc, -log10(dpvals(:,1)), '.'); xlabel('log_2(IBC:nonIBC) in stroma'), ylabel('-log_{10}(pvalue)'); %mark fc>=1.5 and pvalue<=0.01 hold on; plot(log2([2/3 2/3]), ylim, log2([1.5 1.5]), ylim) plot(xlim, -log10([.01 .01])) %% % Print the top 5 genes, now with fold change. % log2fc is not easy to interpret, let's convert it to negative fold % change. So, a negfc=2 will mean IBC is 2-fold compared to nonIBC and a % negfc=-2 will mean IBC is half of nonIBC. negfc = 2.^log2fc; negfc(negfc<1) = - 1./negfc(negfc<1); % Add the foldchange information to the dpvals object: dpvals=[dpvals bioma.data.DataMatrix(negfc,'ColNames',{'negfc'})]; % Select the genes with pvalue<=0.01 and FC>=1.5. I = dpvals(:,'p-values')<=0.01 & abs(dpvals(:,'negfc'))>=1.5; dsigfc = dpvals(I,:); dsigfc = dsigfc.sortrows('p-values'); fprintf('Found %d genes with pvalue<=0.01 and FC>=1.5. Showing top 5:\n',size(dsigfc,1)); disp(dsigfc(1:5,:)) %% % Matlab has a mavolcanoplot() tool that helps you explore significantly % different genes for different pvalue and fold change cutoffs. mavolcanoplot(d.double(:,Istroma_ibc), d.double(:,Istroma_nonibc), dpvals.double(:,1),'Labels',d.rownames) %% Writing data into Excel File % Create a cell array containing the data that you want appear in an Excel sheet. I=find(dpvals(:,1)<=0.01); nsig=numel(I); xlsdata = cell(nsig, 3); %each row will contain genesymbol,pvalue,negfc for i=1:nsig gene=dpvals.rownames{I(i)}; p=dpvals.double(I(i), 1); nfc=dpvals.double(I(i), 2); xlsdata(i,:) = {gene p nfc}; end xlsdata=[ {'genesymbol' 'pvalue' 'negfc'}; xlsdata]; %add the header row. xlswrite('stromaibc.xlsx',xlsdata,'siggenesIBC_nonIBC'); %------------------------------------------------------------------------- %% Data Analysis: Hierarchical Clustering % Let's cluster genes and samples using hierarchical clustering. Clustering % can be very time consuming, so let's only do it for a subset of the genes. % One idea is to only keep the genes that vary most across samples % (ingoring sample groups.) This can be done using: I=genevarfilter(d, 'Percentile',99); %remove 99% of genes % Removing 99% is too aggressive. If your computing time and memory allow, % use 90%. d2 = d( I, :); % Another option is to keep the genes that vary across sample groups. Since % we already did differential expression analysis above, let's use the genes % resulting from it. (Note that we only compared IBC vs. nonIBC in stroma; % one would really need to repeat differential expression analysis for % other pairs of groups of samples, and use the union of all significant % genes). Isig=dpvals(:,'p-values')<=0.05; %let's use 0.05 threshold to get some more genes. d2 = d(Isig,:); cg = clustergram(d2,'Standardize','Row'); %% Data Analysis: "Flat" clustering % You can cluster genes (or samples) without a hierarchy. There are % different methods for that. You can start off from a hierarchical % clustering and "cut" the tree at a certain level to get a flat level of % groupings. % Let's create a distance matrix between pairs of genes. % pdist() gives a vector (to save space). If you want the symmetric matrix, % just pass the result through squareform(). genedist = pdist(d2,'spearman'); % linkage() hierarchically groups the genes. The result contains % information about which two groups are combined at each branch. tree = linkage(genedist,'average'); % visualize the tree, show only 20 nodes. % Groups of genes will have a numerical id for labels. %bmes_fig geneclust; clf dendrogram(tree,20,'Labels',d2.rownames); h=gca; h.XTickLabelRotation=45; %% % let's now cut the tree to generate 6 clusters. The result is a membership % vector, assigning each gene to one of the 6 clusters. Iclust = cluster(tree, 'maxclust', 6); % Let's plot each group of genes separately. %bmes_fig geneclust; clf for i=1:6 subplot(2,3, i); plot( d2(Iclust==i, :)' ); end %% % The figure is too crowded. Let's show the average expression for the % genes in each cluster. d2avg=zeros(6,size(d2,2)); for i=1:6 d2avg(i,:) = mean(d2(Iclust==i,:),1); end %bmes_fig geneclust; clf clf; plot(d2avg'); end % If this were a time-series experiment, plotting with x axis representing % time would be informative. However, in this experiment, The x axis % represents samples, whose ordering doesn't mean much.. Let's instead % show average values for different groups. d2avgavg=zeros(6,4); for i=1:6; for j=1:4 d2avgavg(i,j) = mean(mean(d2(Iclust==i,Igroups==j))); end; end %bmes_fig geneclust; clf bar(d2avgavg); legend(groupnames); xlabel('gene clusters'); ylim([6 8]); %bmes_setniceylim; %% Data-Analysis: k-means clustering % k-means is another method to generate groups of genes. Iclust = kmeans(d2, 6, 'dist','corr'); % Once you have the Iclust membership values, you can again generate % similar plots as above. d2avgavg=zeros(6,4); for i=1:6; for j=1:4 d2avgavg(i,j) = mean(mean(d2(Iclust==i,Igroups==j))); end; end %bmes_fig geneclust; clf bar(d2avgavg); legend(groupnames); xlabel('gene clusters'); ylim([4 8]); %bmes_setniceylim; %% Data Analysis: Principal Component Analysis (PCA) % Each gene is a multi-dimensional vector (as many dimensions as the number % of samples). We often want to visualize genes in a lower (2D) space. % Let's use PCA to reduce data dimensionality while preserving as much of % the original information as possible. [pc, reduced, pcvars] = pca(d2); % reduced has as many columns as the original data. But the first column % now contains the most information, the second column contains the second % most, etc. Let's just use the first two columns to show genes on a % figure. %bmes_fig pca; clf scatter(reduced(:,1), reduced(:,2)); %% % We can combine the clustering and PCA and show the genes in different % clusters with a different color. %bmes_fig pca; clf gscatter(reduced(:,1), reduced(:,2), Iclust); % we can label the points using the text() function. Let's do that for the % first 5 genes. for i=1:5; text(reduced(i,1),reduced(i,2), d2.rownames{i}, 'FontWeight','bold', 'FontSize',13); end % Whenever you show results of PCA, you should report the amount of % variance captured by the reduced data. pcvars = pcvars / sum(pcvars); xlabel(sprintf('%%%.1f of variance', pcvars(1)*100)); ylabel(sprintf('%%%.1f of variance', pcvars(2)*100)); %------------------------------------------------------------------------- %% Data Analysis: Gene Set Enrichment % We have found the significantly different genes between two groups. But % what do these genes do? Are there significant differences in biological % functions between two groups? To answer these questions, we'll make use % of the Gene Ontology terms, which annotate each gene to one or more % Biological Processes, Cellular Components, and Molecular Functions. gplgobio = gpl.Data(:, strcmp(gpl.ColumnNames, 'Gene Ontology Biological Process')); % The biological process of the first microarray probe: gplgobio{1} %% % The Gene Ontology terms are given as a list separated by '///'. Let's % convert each into a cell array, to make programming with them easier. for i=1:numel(gplgobio) %handle empty separately, b/c strsplit doesn't work as we want for empty strings. if isempty(gplgobio{i}); gplgobio{i}={}; else gplgobio{i} = strsplit(gplgobio{i},' /// '); end end % Let's reduce amount of memory by keeping a master list of Biological % Processes and have each probe have indices to that master list. GOBIO = unique( [gplgobio{:}] ); % And use the "map" technique to search for each term in the GOBIO list. map = containers.Map(GOBIO,1:numel(GOBIO)); for i=1:numel(gplgobio) for j=1:numel(gplgobio{i}) gplgobio{i}{j} = map(gplgobio{i}{j}); end gplgobio{i} = cell2mat(gplgobio{i}); end %% % Now each gplgobio is a numeric array. gplgobio{1} %% % To get the actual terms, use these numbers as indices to the GOBIO list. GOBIO( gplgobio{1} )' %% % We have made the gplgobio more convenient to work with, but we really % need to get the gobio information for the gse probes. We need to map take % the GSE probes and search them in the GPL probes again. We have % previously created the MAP_GSE_GPL, which we can use again. gsegobio = cell(size(d,1),1); %any probe not found in gpl will remain an empty vector. gsegobio(find(MAP_GSE_GPL)) = gplgobio(MAP_GSE_GPL(find(MAP_GSE_GPL))); %% % Now we can go through each GOBIO term and see if our list of significant % genes is enriched for that term (ie. if that Biological Process is % significantly different between the groups we are comparing). We only % need to do this for the GOBIO terms that have at least one significant % gene for it. If it has no significant genes for it at all, we already % know that Biological Process is not affected, so no need to run a % statistical test for it. %% % Find the terms that have at least one significant gene for it. Isig=dpvals(:,'p-values')<=0.01; dsig = d(Isig,:); dpvalssig = dpvals(Isig,:); gsegobiosig = gsegobio(Isig); candidategobio = unique( [gsegobiosig{:}] ); %% % Let's just analyze the first candidate GO Biological process. gobioid = candidategobio(1); GOBIO{ gobioid } %% Hypergeometric Test % To calculate the significance of a Biological Process, we'll use the % hypergeometric test. We need the following four numbers: % % x: number of samples drawn, with the desired characteristic. % K: number of items with the desired characteristic in the population % N: number of samples drawn % M: size of the population % x: The number of significant genes that are in this term. x = nnz([gsegobiosig{:}] == gobioid) %% % K: Number of genes (significant or not) that are in this term. K = nnz([gsegobio{:}] == gobioid) %% % * N: Total number of significant genes. (ignore those with no GO term). N = nnz( ~cellfun(@isempty,gsegobiosig) ) %% % M: Total number of genes. (ignore those with no GO term). M = nnz( ~cellfun(@isempty,gsegobio) ) %% % Calculate and print the pvalue for this Biological Process. pval = hygecdf(x,M,K,N,'upper'); fprintf('pvalue=%f for the GO Biological Process [%s]\n',pval,GOBIO{gobioid}); % We calculated the pvalue for only one of the candidate terms. You would % of course want to calculate a pvalue for all candidate terms, apply FDR % correction, and report only the most significant ones.