R语言代写| False Discovery Rates 案例 R语言代做

发布时间：2021-08-03 23:30:39浏览次数：

R代写|R语言代写|统计学代写: 这是一个利用R语言解决统计学问题的题目In this document we’ll present both some simulations to indicate why multiple testing is a problem and then some tools for false discovery rates in R.Simulation I – Pairwise ComparisonsThis simulation is simply intended to demonstrate the potential effects of multiple testing on comparing factor levels. We’ll assume a completely randomized design in which 500 subjects are each assigned to one of 100 factor levels at random. This gives us 100 groups of five subjects:mat = matrix(rnorm(500),5,100)Note here that all groups are generated the same way – using a standard normal distribution – so that thereshould be no difference between them.We’ll now obtain the average value in each group (note that the apply function here usese the function mean on each column of the matrix mat)means = apply(mat,2,mean)and use the sort function to sort these from smallest to largest. The ‘ix’ component that is returned gives theindex of the smallest mean, then the second smallest etc. We can use this to re-order the columns of mattIf we produce a boxplot of these values, we see a generally increasing trendboxplot(mat,cex.lab=1.5,cex.axis=1.5,col= blue )ord = sort(means,index.return=TRUE)$ix mat = mat[,ord]1 1 9 18 28 38 48 58 68 78 88 98this is because we’ve sorted the columns so they should show that – lowest first, highest last. Now, of course, we were going to examine all pairwise comparisons, including testing between the highest and the lowest meant.test(mat[,1],mat[,100])#### Welch Two Sample t-test#### data: mat[, 1] and mat[, 100]## t = -4.0935, df = 6.996, p-value = 0.004616## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval:## -3.870197 -1.035886## sample estimates:## meanofx meanofy## -1.4961285 0.9569127This is strongly significant! The reason for this is we are choosing the two means that are farther apart. With a large number of groups we expect the lowest mean among the groups to be pretty small, similarly for the group with the highest mean. A randomly-chosen pair of groups has a 5% chance of being declared significantly different, but if we look at all pairs of groups, we will chose these. Note that which pair of groups is farthest apart will change each time you do the simulation, but there will be some pair that is, and they will look significantly different.We can also use this simulation to explore the use of Tukey’s HSD. For this we’ll need to turn mat back into a vector and add a group indicator2−3 −1 123 matdat = data.frame(y=as.vector(mat), x=as.factor( rep(1:100,1,each=5))) and we can run the TukeyHSD procedure asHSD = TukeyHSD(aov(y~x,data=matdat))This produces a 4950 (the number of possible pairwise comparisons) by 4 table with colums being Difference between means Lower confidence interval for difference Upper confidence interval for difference P-value of difference corrected for multiple comparisons If we look just at the smallest corrected p-values we getmin(HSD$x[,4]) ## [1] 0.1935257which is now much larger than 0.05 and we can correctly conclude that none of the pairs of means are different from each other.Simulation II – Inflation of P-valuesThis simulation is designed to illustrate the inflation of the effective p-value through multiple testing. That is, we normally conduct a test with the view that “If the null hypothesis was true, and we re-did the experiment many times, we would incorrectly reject the null 5% of the time.”Here we will examine the situation where you are conducting 5 experiments and will publish whichever experiment results in a statistically significant result. The question posed is “Suppose all 5 null hypotheses are true, what is the probability that I will get to publish at least one paper?” That is, what is the chance that one out of five will come up. (Note the contrast to Simulation I above – there we had many levels to compare but only did one experiment; here we are repeating our five experiments many times).We are going to suppose that all the hypotheses can be tested with a t-test with five degrees of freedom. So each time we repeat the experiment we get five t-statistics. If all five null hypotheses are true then each statistic has a t5 distribution and we can generate 1000 replicated experiments as follows:tsim = matrix( rt(5000,5), 1000,5)(we could complicate this so that we don’t the same distribution in each column, but this just makes thissimple).Now let’s look at the first experiment (the first column of tsim). If we draw up a histograph of these replicatedexperiments, it looks like a standard t distributionhist(tsim[,1],cex.lab=1.5,cex.axis=1.5,100) abline(v=c(0,qt(0.95,5)))3Histogram of tsim[, 1]−10 −5 0 5tsim[, 1]and we reject the null hypothesis about 5% of the time. We can see this empirically by asking how frequently is the first column of tsim larger than the critical value of a t5 distribution (we’re only doing one-sided tests here)mean(tsim[,1]>qt(0.95,5)) ## [1] 0.045However, this isn’t the way our experiment actually works – we get to write a paper if any of our five tests rejects. That is, if the largest t-statistic is significant, we write, so let’s look at the maximums of each of the five experimentstmax = apply(tsim,1,max)If we draw the histogram of these, we have a very different picture with values that are much more often positive, and much larger than beforehist(tmax,cex.lab=1.5,cex.axis=1.5,100) abline(v=c(0,qt(0.95,5)))4Frequency0 20 40 60 80Histogram of tmax0246tmaxHow is this possible, when each column was generated the same way? Yes, but we got to choose which column to look at and we always chose the largest. Or, the maximum of five random numbers is going to tend to be larger than any individual one.If we examine how frequently we get to write a paper out of this, we havemean(tmax>qt(0.95,5)) ## [1] 0.232or about 18% percent of the time. This is what is know as p-value inflation – given that you are testing five hypotheses, you are really rejecting at the 0.13 level, rather than 0.05.Here we have independent tests, so it is best to apply a Bonferroni correction. That is we divide the α level by 5 and use the 0.01 quantile of the t distribution. If we do this, our rejection rate ismean(tmax>qt(0.99,5)) ## [1] 0.049much better.False Discovery Rates Case 1: A Simulation For Single t-testsThe previous simulation looked at p-value inflation when you are conducting a small number of tests, and we see that a Bonferroni correction is fairly reasonable. However, in modern genomic testing we have many more than five tests that we run at any one time.5Frequency0 10 30 50Here, we will examine a single simulated experiment involving 2000 t-tests based to 10 observations, all of which correspond to the null case (we’ll use this later). We’ll think of these as examining 2000 genes for differential expression.First we set up randomly-generated data:sim.data = matrix( rnorm(20000),10,2000)We can then construct the t-statistics all at once by taking means and standard deviations using the applyfunctionsim.tstats = apply(sim.data,2,mean)/(apply(sim.data,2,sd)/sqrt(10)) and calculate the corresponding p-valuessim.pvals = 1-pt(sim.tstats,9)If we draw a histogram of these, we see that they look approximately uniformly distributed.hist(sim.pvals,xlab= P-value ,cex.lab=1.5,cex.axis=1.5,col= blue )Histogram of sim.pvals0.0 0.2 0.4 0.6 0.8 1.0P−valueThis is no accident: if your p-value is generated under the null hypothesis, it should have a uniform distribution (and hence 5% of them should be less than 0.05).Indeed, if we ask how many of these test are rejected in this experiment, we see that about 5% of them are6Frequency050 150 sum(sim.pvals0) ## [1] 479The gene is differentially expressed and we would like to pick it up.We can now construct our t-statistics and p-values for these tests. Note I’ve added 0.5X after taking the mean here – this is exactly the same as adding 0.5X to each value in the corresponding column in Y and then taking the mean as we normally wouldIf we look at the p-values for this data set we see a “spike” at very small valuessim2.tstats = ( apply(Y,2,mean) + 0.5*X)/(apply(Y,2,sd)/sqrt(10)) sim2.pvals = 1-pt(sim2.tstats,9)7 hist(sim2.pvals,xlab= P-value ,cex.lab=1.5,cex.axis=1.5,col= blue )Histogram of sim2.pvals0.0 0.2 0.4 0.6 0.8 1.0P−valueThis corresponds to the approximately 25% of genes (this will be different each time you randomly generate a new X) for which the null hypothesis isn’t true – so we should have a pretty good shot of picking them up.Indeed, if we simply reject tests at the 0.05 level, and look only at the “true rejections” we getsum(sim2.pvals[X>0]0],20)Histogram of sim2.pvals[X > 0]0.0 0.2 0.4 0.6 0.8 1.0sim2.pvals[X > 0]9Frequency0 50 100 150 200 p2 = hist(sim2.pvals[X==0],20)Histogram of sim2.pvals[X == 0]0.0 0.2 0.4 0.6 0.8 1.0sim2.pvals[X == 0]plot(p1,xlab= P-value ,cex.lab=1.5,cex.axis=1.5,col=rgb(0,0,1,1),xlim=c(0,1)) plot(p2,,col=rgb(1,1,1,0.25),add=TRUE)10Frequency0 20 40 60 80 Frequency0 50 100 200sort.sim2pvals = sort(sim2.pvals,index.return=TRUE)Histogram of sim2.pvals[X > 0]0.0 0.2 0.4 0.6 0.8 1.0P−valueOf course, we don’t actually know these distributions and so must estimate them.On simple approach to this was suggested by Benjamini and Hochberg in 1995 – who originally coined the term False Discovery Rate. There are now many variations on this that improve its accuracy, but we can motivate the mathematics fairly easily:Suppose we conduct m tests and declare any test with a p-value less than p to be significant. Imagine that k tests are declared significant under this cutoff. However, we expect at most mp null tests to be declared significant (since we can have at most m null cases), so we have that the False Discovery Rate is estimated to be less than mp/k.(One of the ways this is improved on is by trying to estimate the number of null cases, m0 and using this instead of m).This gives us a way to estimate the FDR for any threshhold, but normally we want to control it. The usual default value is at 0.1 – 10% of the genes (or SNP’s or anything else) you report are false alarms.To do this, it’s helpful to first of all sort the p-values (so for the 57th p-value, there are 57 tests that would be declared significant at this threshhold)We can plot these sorted values where we see that the p-values tend to stay low for a while in comparison to the Case 1 p-values above which all have a uniform distribution and which therefore tend to lie along a straight line:plot(sort.sim2pvals$x,type= l ,col= red ,xlab= Index After Sorting ,cex.lab=1.5,cex.axis=1 lines(sort(sim.pvals),col= blue )abline(c(0,1/2000),col= blue ,lty=2)11.5,ylab= Sorte abline(c(0,0.5/2000),col= red ,lty=2)legend( topleft ,c( Case I P-values , Case II P-values , Uniform Quantiles , 50% False Discovery Line ),Sorted P−values0.0 0.4 0.80Case I P−valuesCase II P−valuesUniform Quantiles50% False Discovery Line500 1000 1500 2000Index After SortingNow for each p-value, we calculate the corresponding q-value, given by pm/k where k is the number of p-values less than this onesort.sim2qvals = sort.sim2pvals$x*2000/(1:2000)plot(sort.sim2qvals,xlab= Index After Sorting ,cex.lab=1.5,cex.axis=1.5,ylab= Sorted Q-values ) abline(h=0.1)12 Sorted Q−values0.0 0.4 0.8Of whichsum(sort.sim2qvals0] 0]0.0 0.2 0.4 0.6 0.8 1.0P−valueWithout going into details of how to guess at this separation, it does so and then in each bin of the histogram we can report the local false discovery rate in that bin as being[Expected null disbtrbution counts]/[Total counts in bin]The local fdr tends to be larger than the FDR given by Benjamini and Hochberg and a cut-off of 0.2 is generally used (less than 20% of p-values around this size correspond to the null distribution).In RFortunately, there is a library in R that makes these calculations for youlibrary(fdrtool)To use this, you give fdrtool a vector of p-values (you can also use a vector of correlations or a statistic thatshould be normal under the null hypothesis – you need to tell it which. Here we give it the p-values from aboveqvals = fdrtool(sim2.pvals,statistic= pvalue ) ## Step 1 determine cutoff point## Step 2 estimate parameters of null distribution and eta015Frequency0 50 100 200 ## Step 3 compute p-values and estimate empirical PDF/CDF## Step 4 compute q-values and local fdr## Step 5 prepare for plottingType of Statistic: p−Value (eta0 = 0.7921)0.00.00.0MixtureNull Component0.2 0.4Alternative Component0.6 0.8 1.01−pvalDensity (first row) and Distribution Function (second row)0.2 0.40.6 0.8 1.01−pval(Local) False Discovery RateFdr and fdr CDF Density0.0 1.0 0.0 1.0 0 30.2 0.40.6fdr (density−based) Fdr (tail area−based)0.8 1.01−pvalfdrtool calculates both the FDR q-value and the local FDR; but note that its q-value calculation is different from the simple method we produced above. For this we rejectsum(qvals$qval