本文的作者是某国际知名制药公司在华研究中心的工程师,今年8月他们部门接受了我们的R语言培训,这篇文章就是培训后他做的presentation.

1.背景与目标

作为制药公司的一名毒理科学家,我经常需要基于各种研究数据来撰写报告。我以前都是运用Excel来整理数据,在GraphPad上面进行统计分析,在Excel里画图,然后在Word里面起草我的报告。

一开始,因为所有软件都很容易操作,所以这个方法看起来挺不错的。

然而,这个工作流程有两个主要的问题。

i. 临床病理数据(40多个指标)的统计分析会非常的重复乏味且耗时间。
ii. 如果汇总数据或者统计分析发生变化,在Word里面的报告也需要做相应的修改。这是一个累人且容易出错的过程。

于是,利用手中的R,我尝试完成了以下步骤:

i. 进行数据处理
ii. 进行数据分析
iii. 生成整洁的报告

2. 我在这个方案中是如何操作的…

(为了简洁,我只在下面放上了部分重点代码。你也可以在这里查看全部的Rmd代码。)

2.1 数据导入

这个相当的简单。利用xlsx程序包,我从excel文件中导入了全部数据,包括一系列包含体重数据、摄食量数据、临床病理数据和研究信息等的工作表。(我已经附上了这个xlsx文件minitox。)

1
2
3
filename <- 'Minitox.xlsx'
studyinfo <- read.xlsx(filename, sheetIndex = 1)
bw <- read.xlsx(filename, sheetIndex = 2)

我会一个一个工作的表地进行数据处理(如有必要还有统计分析)以防混淆。

2.2 数据再处理和概述统计结果

以体重数据为例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# Importing body weight data from excel files
bw <- read.xlsx(filename, sheetIndex = 2)

# Creating a new dataframe with only main animal data
# And excluding null groups
bw_main <- subset(bw, TK == 'NO' & DAY_1 > 0)

# Obtaining valid group numbers
bw_group_no <- subset(bw, DAY_1 > 0)
group_no = length(unique(bw_group_no$DOSE_LEVEL))

# Obtaining information of study days
# Adding na.rm to include cases where animals die and
# no data are entered.
bw_days = 0
for (i in 1:(ncol(bw_main)-4)) {
  if (is.na(mean(bw_main[,(4+i)], na.rm = TRUE)) == 'FALSE') {
    if (mean(bw_main[,(4+i)], na.rm = TRUE) > 0) {
      bw_days <- bw_days + 1
    }
  }
  else if (is.na(mean(bw_main[,(4+i)], na.rm = TRUE)) == 'TRUE') {}
}

# Transforming the bw_main_desc_output into a neat table
library(plyr)
for (i in 1:length(bw_main_desc_output)){
  bw_main_desc_output[[i]]$avg = paste(as.character(round(bw_main_desc_output[[i]]$mean)),
                                       as.character(round(bw_main_desc_output[[i]]$sd)), sep=' ± ')
}
output_temp1 <- lapply(bw_main_desc_output, function(x) {x[,3]})
output_temp2 <- lapply(output_temp1, function(x) {t(x)})
bw_main_desc_output <- ldply(output_temp2)
colnames(bw_main_desc_output) <- c('DOSE', bw_dates)

2.3 统计分析

基本流程是

i. 利用Levene检验来检查数据分布的正态性

ii. 如果数据通过了Levene检验,那就做ANOVA。如果不通过,对数据进行变换(倒数、平方或者对数变换)然后再做一遍Levene检验。如果这三种变换都不能通过检验,那就做非参数检验(Kruskal-Waillis)。

iii. 检验之后:在ANOVA之后,使用Tukey方法;在Kruskal-Wallis检验之后,使用对应组比较方法。

以下的代码又长又啰嗦。我不是一个专业的程序员,所以请大家谅解。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
# Conducting Levene's Test between different dose groups
library(car)
bw_main_aov <- bw_main_desc_output
bw_main_aov[group_no+1,1] = c('LEVENE')
bw_main_aov[group_no+2,1] = c('ANOVA (p=)')
bw_main_aov[group_no+3,1] = c('KRUSKAL-WALLIS (p=)')
bw_main_aov[group_no+4,1] = c('- SUMMARY -')
count_aov = ncol(bw_main_aov) - 1
bw_main_aov <- data.frame(lapply(bw_main_aov, as.character), stringsAsFactors = FALSE)

# I'm not sure about what happened here. bw_main_aov was unintentionally
# changed into a dataframe with factor-type elements.
# Therefore, I had to manually change them back into characters.

for (i in 1:count_aov) {
  bw_day_levene <- leveneTest(bw_main[,4+i] ~ as.character(bw_main$DOSE_LEVEL), center = mean)
  # Add a consideration for null results of the Levene's test
  if (is.na(bw_day_levene[1,3] == TRUE)) {
    bw_main_aov[group_no+1,i+1] = c('-')
    bw_main_aov[group_no+2,i+1] = c('-')
    bw_main_aov[group_no+3,i+1] = c('-')
  }
  else if (bw_day_levene[1,3] > 0.05) {
    bw_main_aov[group_no+1,i+1] = c('PASS')
    bw_day_aov <- aov(bw_main[,4+i] ~ as.character(DOSE_LEVEL), data = bw_main)
    bw_main_aov[group_no+2,i+1] = round(summary(bw_day_aov)[[1]][1,5], digits = 4)
    bw_main_aov[group_no+3,i+1] = c('-')
  }
  else {
    bw_day_levene_log <- leveneTest(log(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL), center = mean)
    bw_day_levene_sqrt <- leveneTest(sqrt(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL), center = mean)
    bw_day_levene_recip <- leveneTest(1/(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL), center = mean)
    if (bw_day_levene_log[1,3] > 0.05){
      bw_day_aov <- aov(log(bw_main[,4+i]) ~ as.character(DOSE_LEVEL), data = bw_main)
      bw_main_aov[group_no+2,i+1] = round(summary(bw_day_aov)[[1]][1,5], digits = 4)
      bw_main_aov[group_no+3,i+1] = c('-')
      bw_main_aov[group_no+1,i+1] = c('~')
    }
    else if (bw_day_levene_sqrt[1,3] > 0.05) {
      bw_day_aov <- aov(sqrt(bw_main[,4+i]) ~ as.character(DOSE_LEVEL), data = bw_main)
      bw_main_aov[group_no+2,i+1] = round(summary(bw_day_aov)[[1]][1,5], digits = 4)
      bw_main_aov[group_no+3,i+1] = c('-')
      bw_main_aov[group_no+1,i+1] = c('~')
    }
    else if (bw_day_levene_recip[1,3] > 0.05) {
      bw_day_aov <- aov(1/bw_main[,4+i] ~ as.character(DOSE_LEVEL), data = bw_main)
      bw_main_aov[group_no+2,i+1] = round(summary(bw_day_aov)[[1]][1,5], digits = 4)
      bw_main_aov[group_no+3,i+1] = c('-')
      bw_main_aov[group_no+1,i+1] = c('~')
    }
    else { bw_main_aov[group_no+1,i+1] = c('FAIL')
           bw_main_aov[group_no+2,i+1] = c('-')
           bw_day_kruskal <- kruskal.test(bw_main[,4+i] ~ bw_main$DOSE_LEVEL)
           bw_main_aov[group_no+3,i+1] = round(bw_day_kruskal$p.value, digits = 4)
    }
  }
}

# Create a summary row
# 1. Beware of the intricate order of judgments.
# Go with the most stringent first.
for (i in 1:count_aov) {
  if (bw_main_aov[group_no+2,i+1] == "-") {
    if (as.numeric(bw_main_aov[group_no+3,i+1]) - 0.001 < 0) {
      bw_main_aov[group_no+4, i+1] = '***'
    }
    else if (as.numeric(bw_main_aov[group_no+3,i+1]) - 0.01 < 0) {
      bw_main_aov[group_no+4, i+1] = '**'
    }
    else if (as.numeric(bw_main_aov[group_no+3,i+1]) - 0.05 < 0) {
      bw_main_aov[group_no+4, i+1] = '*'
    }
    else {
      bw_main_aov[group_no+4, i+1] = '---'
    }
  }
  else if (as.numeric(bw_main_aov[group_no+2,i+1]) - 0.001 < 0) {
    bw_main_aov[group_no+4, i+1] = '***'
  }
  else if (as.numeric(bw_main_aov[group_no+2,i+1]) - 0.01 < 0) {
    bw_main_aov[group_no+4, i+1] = '**'
  }
  else if (as.numeric(bw_main_aov[group_no+2,i+1]) - 0.05 < 0) {
    bw_main_aov[group_no+4, i+1] = '*'
  }
  else {
    bw_main_aov[group_no+4, i+1] = '---'
  }
}

# Post-hoc statistical analysis for ANOVA (including non-significant results)
library(pgirmess)
bw_main_posthoc <- bw_main_aov
bw_main_posthoc[1,2:ncol(bw_main_posthoc)] = "---"
for (i in 1:count_aov) {
  # Add a consideration for failed Levene's Test
  if (paste(bw_main_posthoc[(group_no+1):(group_no+3),i+1], collapse='') == c('---')) {
    bw_main_posthoc[2:group_no,i+1] = rep('-', times = group_no - 1)
  }
  else if (bw_main_posthoc[group_no+2, i+1] >= 0 & bw_main_posthoc[group_no+1, i+1] == 'PASS') {
    bw_main_Tukey <- TukeyHSD(aov(bw_main[,4+i] ~ as.character(bw_main$DOSE_LEVEL)))
    bw_main_posthoc[2:group_no,i+1] = bw_main_Tukey[[1]][1:(group_no-1),4]
  }
  else if (bw_main_posthoc[group_no+2, i+1] >= 0 & bw_main_posthoc[group_no+1, i+1] == '~') {
    bw_day_levene_log <- leveneTest(log(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL), center = mean)
    bw_day_levene_sqrt <- leveneTest(sqrt(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL), center = mean)
    bw_day_levene_recip <- leveneTest(1/(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL), center = mean)
    if (bw_day_levene_log[1,3] > 0.05) {
      bw_main_Tukey <- TukeyHSD(aov(log(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL)))
      bw_main_posthoc[2:group_no, i+1] = bw_main_Tukey[[1]][1:(group_no-1),4]}
    else if (bw_day_levene_sqrt[1,3] > 0.05) {
      bw_main_Tukey <- TukeyHSD(aov(sqrt(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL)))
      bw_main_posthoc[2:group_no, i+1] = bw_main_Tukey[[1]][1:(group_no-1),4]}
    else if (bw_day_levene_recip[1,3] > 0.05) {
      bw_main_Tukey <- TukeyHSD(aov(1/(bw_main[,4+i]) ~ as.character(bw_main$DOSE_LEVEL)))
      bw_main_posthoc[2:group_no, i+1] = bw_main_Tukey[[1]][1:(group_no-1),4]}
  }
  else if (bw_main_posthoc[group_no+3, i+1] >= 0) {
    bw_main_Kruskal <- kruskalmc(bw_main[,4+i], as.character(bw_main$DOSE_LEVEL), prob = 0.05)
    bw_main_posthoc[2:group_no, i+1] = bw_main_Kruskal[[3]][1:group_no-1,3]
  }
}

# About kruskalmc: Its post-hoc p value can only be defined in the command line.
# To avoid "unnecessary excessive work", p is always defined as 0.05 here.
# Further work can be done to test cases where p is 0.01 or 0.001.

# Cleaning up the post-hoc table
for (i in 2:group_no) {
  for (j in 2:(count_aov+1)) {
    if (bw_main_posthoc[i,j] == 'FALSE') {bw_main_posthoc[i,j] = '---'}
    else if (bw_main_posthoc[i,j] == 'TRUE') {bw_main_posthoc[i,j] = 'p<0.05'}
    else if (bw_main_posthoc[i,j] == '---' | bw_main_posthoc[i,j] == '-') {bw_main_posthoc[i,j] = '---'}
    else if (as.numeric(bw_main_posthoc[i,j]) < 0.001) {bw_main_posthoc[i,j] = 'p<0.001'}
    else if (as.numeric(bw_main_posthoc[i,j]) < 0.01) {bw_main_posthoc[i,j] = 'p<0.01'}
    else if (as.numeric(bw_main_posthoc[i,j]) < 0.05) {bw_main_posthoc[i,j] = 'p<0.05'}
    else {bw_main_posthoc[i,j] = '---'}
  }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
# Creating line charts for body weight data
# Starting by re-creating mean+sd data
bw_main_desc <- describeBy(bw_main, bw_main$DOSE_LEVEL)
bw_main_mean = list()
bw_main_sd = list()
for (i in 1:group_no) {
  bw_main_mean[[i]] <- bw_main_desc[[i]][5:(4 + bw_days),3]
  bw_main_sd[[i]] <- round(bw_main_desc[[i]][5:(4 + bw_days),4], digits = 2)
}

bw_main_chart <- data.frame(unlist(bw_main_mean))
colnames(bw_main_chart) = 'mean'
bw_main_sd <- data.frame(unlist(bw_main_sd))
colnames(bw_main_sd) = 'sd'
bw_main_chart$sd = unlist(bw_main_sd)

# Note that without the 'unlist' function, 'bw_main_chart$sd' will
# become a datafrome with the datafrome of 'bw_main_chart', which
# will render plotting with errorbars below IMPOSSIBLE!!!
bw_main_chart$day = bw_dates
bw_main_chart$dose = rep(group_names, each = count_aov)

# Sanity check
bw_main_chart

# Creating faceted overview of body weight changes over all days for
# individual dose groups
library(ggplot2)
bw_main_chart_facet <- ggplot(bw_main_chart, aes(x = day, y = mean)) + geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.5, size = 0.5) + geom_bar(stat = 'identity', width = 0.75, fill = 'burlywood', color = 'black') + facet_grid(~ dose) + scale_y_continuous(limits = c(0,350), breaks = c(0,50,100,150,200,250,300), name = 'Mean Body Weight (g)') + theme(axis.title.x = element_blank())
bw_main_chart_facet

# For a quick reference of designated colors in ggplot, check:
# http://sape.inf.usi.ch/quick-reference/ggplot2/colour

# Creating a line plot to compare group body weight data
pd <- position_dodge(0.1)

# Predefining the position dodge value so that
# the plot command below gets neater
bw_main_chart_allinone <- ggplot(bw_main_chart, aes(x = day, y = mean, group = dose)) + 
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, size = 0.5, position = pd) + 
  geom_line(stat = 'identity', size =0.5, aes(color = dose)) + 
  scale_y_continuous(name = "Mean Body Weight (g)") + 
  geom_point(aes(color = dose), size = 4, position = pd) + 
  theme(axis.title.x = element_blank()) + 
  guides(color = guide_legend(title = ""))

# Placing the errorbar parts before lines and points
bw_main_chart_allinone        

2.4 数据输出和概述撰写

对摄食量和病理数据也做相似的分析流程。

然后我生成一个R markdown文件,将所有代码复制到这个markdown文件里面,框住代码并用以下代码来禁止所有系统响应。

1
2
```{r echo=FALSE, message=FALSE, warning=FALSE, comment="", results='hide', fig.show='hide'}
```

在这之后,我又写了额外的代码,选取了我需要的部分(数据框格式的结果,图形等等),并用Knitr进行提交。

针对临床病理数据,我利用以下代码将它从多列的输出结果划分为更窄但是更简洁的小部分。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
```{r echo=FALSE, message=FALSE, warning=FALSE, comment=''}
{length = length(cp_main_desc_output)
 rep = floor(length/4)
 i = 1
 while (i <= rep) {
   print(cp_main_desc_output[,c(1,((4*i-2):(4*i+1)))], row.names = FALSE, right = FALSE)
   cat('', fill = TRUE)
   cat('---', fill = TRUE)
   cat('', fill = TRUE)
   i = i + 1}
 if ((length-1)%%4 == 0) {
   cat('', fill = TRUE)
 } else {
   print(cp_main_desc_output[,c(1,((rep*4+2):length))], row.names = FALSE, right = FALSE)
 }
}
```

```{r echo=FALSE, message=FALSE, warning=FALSE, comment=''}
{length_cp = length(cp_main_posthoc_clean)
 if (length_cp > 5) {
   rep_cp = floor(length_cp/5)
   mod_cp = length_cp%%5
   i = 1
   while (i <= rep_cp) {
     print(cp_main_posthoc_clean[,c(1,((5*i-3):(5*i+1)))], row.names = FALSE)
     cat('', fill = TRUE)
     cat('---', fill = TRUE)
     cat('', fill = TRUE)
     i = i + 1}
   if (mod_cp == 1) {
     cat('', fill = TRUE)
   } else {
     print(cp_main_posthoc_clean[,c(1,((length_cp-mod_cp+2):length_cp))], row.names = FALSE) }
 } else {
   print(cp_main_posthoc_clean, row.names = FALSE)
 }
}
```

利用以下的源于R文件的条款,和建立起来的自定义css工作表(用于在Windows系统中用Consolas字体展示数值数据),

options(rstudio.markdownToHTML =
          function(inputFile, outputFile) {
            require(markdown)
            markdownToHTML(inputFile, outputFile, stylesheet='custom.css')
          }
)

现在我只需要大概一两分钟就能生成一个概述文档了。


下面你能看到原数据是怎样的,而最后产物又是怎样的。

raw data in excel

neat output-body weight

neat output-body weight1

neat output-pathology statistical analysis

3. 后记

最后,我感谢SupStat的团队,以及 stackoverflow ,帮助我完成这个项目。

我花了好几个小时来编写和检查这些代码,最后证明这些都是值得的。现在我能边喝咖啡边生成概述让经理查阅了。

还有,在R的世界中,总有更多的值得我们去挖掘。