Hot questions for using ggplot2 with ggpmisc

Question:

A few years ago, a poster asked how to add a regression line equation and R2 to ggplot graphs at the link below.

Adding Regression Line Equation and R2 on graph

The top solution was this:

lm_eqn <- function(df){
    m <- lm(y ~ x, df);
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
         list(a = format(coef(m)[1], digits = 2), 
              b = format(coef(m)[2], digits = 2), 
             r2 = format(summary(m)$r.squared, digits = 3)))
    as.character(as.expression(eq));                 
}

p1 <- p + geom_text(x = 25, y = 300, label = lm_eqn(df), parse = TRUE)

I am using this code and it works great. However, I was wondering if it is at all possible to make this code have the R2 value and regression line equation on separate lines, instead of being separated by a comma.

Instead of this (equation and R2 on one line, separated by a comma), I would like something like this (equation and R2 on separate lines).

Thanks in advance for your help!


Answer:

EDIT:

In addition to inserting the equation, I have fixed the sign of the intercept value. Setting the RNG seed with set.seed(2L) gives a positive intercept, while the example below produces a negative intercept.

I also fixed the overlapping text in the geom_text

set.seed(3L)
library(ggplot2)
df <- data.frame(x = c(1:100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)

lm_eqn <- function(df){
  # browser()
  m <- lm(y ~ x, df)
  a <- coef(m)[1]
  a <- ifelse(sign(a) >= 0, 
              paste0(" + ", format(a, digits = 4)), 
              paste0(" - ", format(-a, digits = 4))  )
  eq1 <- substitute( paste( italic(y) == b, italic(x), a ), 
                     list(a = a, 
                          b = format(coef(m)[2], digits = 4)))
  eq2 <- substitute( paste( italic(R)^2 == r2 ), 
                     list(r2 = format(summary(m)$r.squared, digits = 3)))
  c( as.character(as.expression(eq1)), as.character(as.expression(eq2)))
}

labels <- lm_eqn(df)


p <- ggplot(data = df, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se=FALSE, color="red", formula = y ~ x) +
  geom_point() +
  geom_text(x = 75, y = 90, label = labels[1], parse = TRUE,  check_overlap = TRUE ) +
  geom_text(x = 75, y = 70, label = labels[2], parse = TRUE, check_overlap = TRUE )

print(p)
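
As an alternative sketch, not part of the original answer, the two lines can also be produced in a single label with plotmath's atop(), which stacks one expression over another. The helper name two_line_label is hypothetical, and the data are the same simulated values as above.

```r
# Same simulated data as in the answer above.
set.seed(3L)
df <- data.frame(x = 1:100)
df$y <- 2 + 3 * df$x + rnorm(100, sd = 40)

# Hypothetical helper: returns one plotmath string with the equation
# stacked over R^2 via atop().
two_line_label <- function(df) {
  m <- lm(y ~ x, df)
  a <- coef(m)[[1]]  # intercept
  b <- coef(m)[[2]]  # slope
  eq <- sprintf("italic(y) == %s * italic(x) %s %s",
                format(b, digits = 4),
                if (a >= 0) "+" else "-",
                format(abs(a), digits = 4))
  r2 <- sprintf("italic(R)^2 == %s",
                format(summary(m)$r.squared, digits = 3))
  sprintf("atop(%s, %s)", eq, r2)
}

label <- two_line_label(df)
# The string must parse as an R expression for parse = TRUE to work.
parsed <- parse(text = label)
```

The resulting string can then be drawn with a single layer such as annotate("text", x = 75, y = 80, label = label, parse = TRUE); the coordinates are illustrative only.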

Question:

I'm using the R package ggpmisc. I wonder how to put a hat on y in the regression equation, or how to use custom response and explanatory variable names in the regression equation on the graph.

library(ggplot2)
library(ggpmisc)

df <- data.frame(x1 = c(1:100))
set.seed(12345)
df$y1 <- 2 + 3 * df$x1 + rnorm(100, sd = 40)

p <- ggplot(data = df, aes(x = x1, y = y1)) +
  geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
  stat_poly_eq(formula = y ~ x, 
               aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
               parse = TRUE) +         
  geom_point()
p


Answer:

I would turn off the default value for y that is pasted in and build your own formula. For example

ggplot(data = df, aes(x = x1, y = y1)) +
  geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x) +
  stat_poly_eq(formula = y ~ x, eq.with.lhs=FALSE,
      aes(label = paste("hat(italic(y))","~`=`~",..eq.label..,"~~~", ..rr.label.., sep = "")), 
      parse = TRUE) +         
  geom_point()

We use eq.with.lhs=FALSE to turn off the automatic inclusion of y= and then we paste() the hat(y) on to the front (with the equals sign). Note that the formatting comes from the ?plotmath help page.

Question:

I am trying to annotate a line plot with an arrow pointing to the highest point and a label showing the maximum value. I am using the mtcars dataset as my reference. Below is my code.

e <- df$mpg
ggplot(df, aes(x=e, y=df$hp)) + 
  geom_line() + 
  annotate("segment", color="blue", x=max(e), xend = max(e), y=max(df$hp), 
            yend=max(df$hp), arrow=arrow())

Thanks in advance,


Answer:

Are you looking for something like this:

labels <- data.frame(
    mpg  = mtcars[which(mtcars$hp == max(mtcars$hp)), "mpg"] + 7,
    hp   = mtcars[which(mtcars$hp == max(mtcars$hp)), "hp"],
    text = paste0("Max value at mpg = ",
                  mtcars[which(mtcars$hp == max(mtcars$hp)), "mpg"],
                  " and hp = ", max(mtcars$hp)))


ggplot(mtcars, aes(mpg, hp))+
    geom_line()+
    geom_text(data = labels, aes(label = text))+
    annotate("segment", 
        x=mtcars[which(mtcars$hp == max(mtcars$hp)), "mpg"]+2,
        xend=mtcars[which(mtcars$hp == max(mtcars$hp)), "mpg"]+.2, 
        y= mtcars[which(mtcars$hp == max(mtcars$hp)), "hp"],
        yend= mtcars[which(mtcars$hp == max(mtcars$hp)), "hp"], 
        arrow=arrow(), color = "blue")

Explanation: In order to annotate the maximum, we need to find the mpg value at which hp reaches its maximum. To do this we use mtcars[which(mtcars$hp == max(mtcars$hp)), "mpg"]. The which() call gives us the row position of that maximum so that we can get the correct value of mpg. Next we annotate at this position, adding a little bit of space (i.e., the +2 and +.2) so that it looks nicer. Lastly, we construct a data frame with the same positions (but a different offset) and use geom_text() to add the data label.
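
The row lookup is repeated many times in the answer above. A slightly tidier sketch, assuming the maximum hp is unique, finds the row once with which.max() and reuses it. The names peak and peak_label are introduced here for illustration, and the plot is only drawn if ggplot2 is installed.

```r
# Locate the row with the highest hp once and reuse it.
# which.max() returns the first maximum, so ties would be ignored.
peak <- mtcars[which.max(mtcars$hp), c("mpg", "hp")]
peak_label <- paste0("Max value at mpg = ", peak$mpg, " and hp = ", peak$hp)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(mpg, hp)) +
    geom_line() +
    # Text sits to the right of the peak; the arrow points back at it.
    annotate("text", x = peak$mpg + 7, y = peak$hp, label = peak_label) +
    annotate("segment", x = peak$mpg + 2, xend = peak$mpg + 0.2,
             y = peak$hp, yend = peak$hp, arrow = arrow(), colour = "blue")
  print(p)
}
```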

Question:

Based on the example here, Adding Regression Line Equation and R2 on graph, I am struggling to include the regression line equation for my model in each facet. However, I can't figure out why it changes the limits of my x axis.

library(ggplot2)
library(reshape2)

df <- data.frame(year = seq(1979,2010), M02 = runif(32,-4,6), 
M06 = runif(32, -2.4, 5.1), M07 = runif(32, -2, 7.1))
df <- melt(df, id = c("year"))


ggplot(data = df, mapping = aes(x = year, y = value)) +
geom_point() +
scale_x_continuous() + 
stat_smooth_func(geom = 'text', method = 'lm', hjust = 0, parse = T) +
geom_smooth(method = 'lm', se = T) +
facet_wrap(~ variable) # as you can see, the scale_x_axis goes back to 1800

If I set limits on the x axis,

scale_x_continuous(limits = c(1979,2010)) 

it does not show the regression coefficient anymore. What am I doing wrong here?

stat_smooth_func available here: https://gist.github.com/kdauria/524eade46135f6348140


Answer:

You can use stat_poly_eq function from the ggpmisc package.

library(reshape2)
library(ggplot2)
library(ggpmisc)
#> For news about 'ggpmisc', please, see https://www.r4photobiology.info/
#> For on-line documentation see https://docs.r4photobiology.info/ggpmisc/

df <- data.frame(year = seq(1979,2010), M02 = runif(32,-4,6), 
                 M06 = runif(32, -2.4, 5.1), M07 = runif(32, -2, 7.1))
df <- melt(df, id = c("year"))

formula1 <- y ~ x

ggplot(data = df, mapping = aes(x = year, y = value)) +
  geom_point() +
  scale_x_continuous() + 
  geom_smooth(method = 'lm', se = TRUE) +
  stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~")), 
               label.x = "left", label.y = "top",
               formula = formula1, parse = TRUE, size = 3) +
  facet_wrap(~ variable) 

ggplot(data = df, mapping = aes(x = year, y = value)) +
  geom_point() +
  scale_x_continuous() + 
  geom_smooth(method = 'lm', se = TRUE) +
  stat_poly_eq(aes(label = paste(..eq.label.., sep = "~~~")), 
               label.x = "left", label.y = 0.15,
               eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = "~italic(x)",
               formula = formula1, parse = TRUE, size = 4) +
  stat_poly_eq(aes(label = paste(..rr.label.., sep = "~~~")), 
               label.x = "left", label.y = "bottom",
               formula = formula1, parse = TRUE, size = 4) +
  facet_wrap(~ variable) 

Created on 2019-01-10 by the reprex package (v0.2.1.9000)

Question:

I'm trying to display the equations on the plot using the stat_poly_eq function of ggpmisc.

My problem is how to change the y = ... in the equation to y1 = ... and y2 = ..., depending on the key variable.

I tried to add the eq.with.lhs argument in the mapping but it does not recognize the argument. I tried to pass a vector to the eq.with.lhs argument but it overlapped both elements in each equation...

Do you have a better idea?

As a last resort, I could use geom_text after calculating the equation coefficients myself, but that seems a less efficient way to solve the problem.

Here is a reprex of my problem.

data <- data.frame(x = rnorm(20)) %>% 
    mutate(y1 = 1.2*x + rnorm(20, sd=0.2),
           y2 = 0.9*x + rnorm(20, sd=0.3)) %>%
    gather(value = value, key = key, -x)  

ggplot(data, aes(x = x, y = value)) +
    geom_point(aes(shape = key, colour = key)) + 
    stat_poly_eq(aes(label = ..eq.label.., colour = key), 
                 formula = y  ~ poly(x, 1, raw = TRUE),
                 eq.x.rhs = "x",
                 # eq.with.lhs = c(paste0(expression(y[1]), "~`=`~"),
                 #                 paste0(expression(y[2]), "~`=`~")),
                 eq.with.lhs = paste0(expression(y[ind]), "~`=`~"),
                 parse = TRUE) +
    ylab(NULL)

Answer:

I'm not really sure if it's possible to do it through ggpmisc, but you can change the data once the plot is built, like so:

library(tidyverse)
library(ggpmisc)

data <- data.frame(x = rnorm(20)) %>% 
    mutate(y1 = 1.2*x + rnorm(20, sd=0.2),
           y2 = 0.9*x + rnorm(20, sd=0.3)) %>%
    gather(value = value, key = key, -x)  

p <- ggplot(data, aes(x = x, y = value)) +
    geom_point(aes(shape = key, colour = key)) + 
    stat_poly_eq(aes(label = ..eq.label.., colour = key), 
                 formula = y  ~ poly(x, 1, raw = TRUE),
                 eq.x.rhs = "x",
                 eq.with.lhs = paste0(expression(y), "~`=`~"),
                 parse = TRUE) +
    ylab(NULL)
temp <- ggplot_build(p)
temp$data[[2]]$label <- temp$data[[2]]$label %>% 
    fct_relabel(~ str_replace(.x, "y", paste0("y[", 1:2, "]")))
grid::grid.newpage()
grid::grid.draw(ggplot_gtable(temp))

Question:

I extracted some longitudinal temperature data from a .nc weather dataset (ncdf4 package) and would like to label the local extrema with their respective dates from x-axis using ggplot2 and its extension ggpmisc that includes stat_peaks/stat_valleys. Oddly, all the labels read the same: "Dec 1969".

I figured the most likely culprit was that my data used for the x-axis was not formatted correctly as Date, but the x-axis displays correctly and I have checked the class of the input data to confirm. I also tried applying group=1 which resulted in no change -- I admit I am new to R and ggplot2 (more familiar with Python/Pandas) and do not completely understand what group=1 does, though it was necessary to get the line to display correctly. Perhaps this is the result of a bug?

ggplot(df_denver, aes(x=Date, y=Temp..C., group=1)) + 
  geom_line() +
  scale_x_date(date_labels="%b %Y", date_breaks = "10 years", expand=c(0,0)) +
  stat_peaks(span=24, ignore_threshold = 0.80, color="red") +
  stat_peaks(geom="text", span=24, ignore_threshold = 0.80, x.label.fmt = "%b %Y", color="red", angle=90, hjust=-0.1) +
  stat_valleys(span=24, ignore_threshold = 0.55, color="blue") +
  stat_valleys(geom="text", span=24, ignore_threshold = 0.55, x.label.fmt = "%b %Y", color="blue", angle=90, hjust=1.1) +
  labs(x="Date", y="Temp (C)", title="Monthly Air Surface Temp for Denver from 1880 on")

Here are the first 100 rows of my dataset that produce 3 peaks and 3 valleys to illustrate:

          Date    Temp..C.
1   1880-01-01  2.91287017
2   1880-02-01 -2.73586297
3   1880-03-01 -2.04185677
4   1880-04-01  0.37948364
5   1880-05-01  0.78548384
6   1880-06-01  0.44176754
7   1880-07-01 -1.06966007
8   1880-08-01 -0.53162575
9   1880-09-01 -0.29665694
10  1880-10-01 -2.08401608
11  1880-11-01 -9.46955109
12  1880-12-01 -1.52052176
13  1881-01-01 -2.53366208
14  1881-02-01 -1.88263988
15  1881-03-01 -0.06864686
16  1881-04-01  3.32321167
17  1881-05-01  1.75613177
18  1881-06-01  2.82765651
19  1881-07-01  1.76543093
20  1881-08-01  1.39409852
21  1881-09-01 -0.98141575
22  1881-10-01 -0.63346595
23  1881-11-01 -1.95676208
24  1881-12-01  3.28983855
25  1882-01-01 -0.64792717
26  1882-02-01  2.15854502
27  1882-03-01  2.91465187
28  1882-04-01  0.56616443
29  1882-05-01 -1.89441001
30  1882-06-01 -0.63149375
31  1882-07-01 -0.64883423
32  1882-08-01  0.82802373
33  1882-09-01  0.66150969
34  1882-10-01 -0.54113626
35  1882-11-01 -1.21310496
36  1882-12-01  1.30559540
37  1883-01-01 -1.41802752
38  1883-02-01 -6.39232874
39  1883-03-01  2.96320987
40  1883-04-01 -0.48122203
41  1883-05-01 -0.99614143
42  1883-06-01 -0.67229420
43  1883-07-01 -0.56595141
44  1883-08-01  0.52161294
45  1883-09-01  0.09190032
46  1883-10-01 -2.65115738
47  1883-11-01  1.88332438
48  1883-12-01 -0.19942272
49  1884-01-01 -0.34669495
50  1884-02-01 -2.21085262
51  1884-03-01  0.55254096
52  1884-04-01 -1.21859336
53  1884-05-01 -0.40969065
54  1884-06-01  0.44454563
55  1884-07-01  1.28881764
56  1884-08-01 -1.09331822
57  1884-09-01  1.52377772
58  1884-10-01  1.76569140
59  1884-11-01  0.72411090
60  1884-12-01 -4.64927006
61  1885-01-01 -1.03242493
62  1885-02-01 -0.79325873
63  1885-03-01  0.65910935
64  1885-04-01 -0.10181000
65  1885-05-01 -1.50702798
66  1885-06-01 -1.25801849
67  1885-07-01 -0.88433135
68  1885-08-01 -1.18410277
69  1885-09-01  0.15284735
70  1885-10-01 -0.91721576
71  1885-11-01  1.82403481
72  1885-12-01  1.68553519
73  1886-01-01 -4.21202993
74  1886-02-01  2.43953681
75  1886-03-01 -2.24947429
76  1886-04-01 -1.22557247
77  1886-05-01  2.66594267
78  1886-06-01 -0.21662886
79  1886-07-01  1.09909940
80  1886-08-01  0.63720244
81  1886-09-01 -0.11845125
82  1886-10-01  0.49225059
83  1886-11-01 -3.16969180
84  1886-12-01  2.18220520
85  1887-01-01  0.51427501
86  1887-02-01 -0.69656581
87  1887-03-01  3.96693182
88  1887-04-01  0.92614591
89  1887-05-01  1.66550291
90  1887-06-01  1.88668025
91  1887-07-01 -1.48990893
92  1887-08-01 -0.98355341
93  1887-09-01  0.93172997
94  1887-10-01 -1.12551820
95  1887-11-01  1.07798636
96  1887-12-01 -2.15758419
97  1888-01-01 -1.69266903
98  1888-02-01  2.55955243
99  1888-03-01 -1.83599913
100 1888-04-01  3.63450384

As you can see, the labels produced by stat_peaks and stat_valleys are identical and not even within the range of the abbreviated data, rather than the correct dates corresponding to the x-axis.

(Figure: Monthly Air Surface Temp for Denver from 1880 on)


Answer:

stat_peaks and stat_valleys labels will work with dates in POSIXct format:

df_denver$Date <- as.POSIXct(df_denver$Date, format = "%Y-%m-%d")

ggplot(df_denver, aes(x=Date, y=Temp..C.)) + 
  geom_line() +
  scale_x_datetime(date_labels="%b %Y", date_breaks = "1 year", expand=c(0,0)) +
  stat_peaks(span=24, ignore_threshold = 0.80, color="red") +
  stat_peaks(geom="text", span=24, ignore_threshold = 0.80, x.label.fmt = "%b %Y", color="red", angle=90, hjust=-0.1) +
  stat_valleys(span=24, ignore_threshold = 0.55, color="blue") +
  stat_valleys(geom="text", span=24, ignore_threshold = 0.55, x.label.fmt = "%b %Y", color="blue", angle=90, hjust=1.1) +
  labs(x="Date", y="Temp (C)", title="Monthly Air Surface Temp for Denver from 1880 on") +
  expand_limits(y = 6)

Note: scale_x_date was changed to scale_x_datetime. In addition, changed date_breaks to 1 year to demonstrate x-axis labels for example data, and expand_limits to ensure peak labels are readable. group=1 should not be needed.
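
A likely explanation for the "Dec 1969" labels, offered here as an assumption rather than something the answer states: a Date is stored as a number of days since 1970-01-01, so if that number is read as seconds (the POSIXct unit), every date in this data set lands within a few hours before the epoch. The base R sketch below illustrates only that arithmetic; it does not reproduce ggpmisc internals.

```r
d <- as.Date("1880-01-01")
days_since_epoch <- as.numeric(d)        # Date counts days: -32872

# Misreading the day count as seconds gives a time just before 1970-01-01.
misread <- as.POSIXct(days_since_epoch, origin = "1970-01-01", tz = "UTC")
format(misread, "%Y-%m")                 # "1969-12"

# Converting properly keeps the calendar date.
proper <- as.POSIXct("1880-01-01", tz = "UTC")
format(proper, "%Y-%m")                  # "1880-01"
```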

Question:

I am trying to create some correlation plots based on a data frame that I created using tidyr's spread() function. When I used spread(), it created NAs in the new data frame. This makes sense because the data frame had concentration values for different parameters at different time periods.

Here is an example screenshot of the original data frame:

When I used the spread function it gave me a data frame like this (sample data):

structure(list(orgid = c("11NPSWRD", "11NPSWRD", "11NPSWRD", 
"11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", 
"11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", 
"11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD"), 
    locid = c("11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2"
    ), stdate = structure(c(9891, 9891, 9891, 9920, 9920, 9920, 
    9949, 9949, 9949, 9978, 9978, 9978, 10011, 10011, 10011, 
    10067, 10067, 10073, 10073, 10073), class = "Date"), sttime = structure(c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), class = c("hms", 
    "difftime"), units = "secs"), valunit = c("uS/cm", "mg/l", 
    "mg/l", "uS/cm", "mg/l", "mg/l", "uS/cm", "mg/l", "mg/l", 
    "uS/cm", "mg/l", "mg/l", "uS/cm", "mg/l", "mg/l", "uS/cm", 
    "mg/l", "uS/cm", "mg/l", "mg/l"), swqs = c("FW2-TP", "FW2-TP", 
    "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", 
    "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", 
    "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP"
    ), WMA = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
    6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), year = c(1997L, 1997L, 1997L, 
    1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 
    1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L), 
    Chloride = c(NA, 35, NA, NA, 45, NA, NA, 30, NA, NA, 30, 
    NA, NA, 30, NA, NA, NA, NA, 35, NA), `Specific conductance` = c(224, 
    NA, NA, 248, NA, NA, 204, NA, NA, 166, NA, NA, 189, NA, NA, 
    119, NA, 194, NA, NA), `Total dissolved solids` = c(NA, NA, 
    101, NA, NA, 115, NA, NA, 96, NA, NA, 79, NA, NA, 89, NA, 
    56, NA, NA, 92)), .Names = c("orgid", "locid", "stdate", 
"sttime", "valunit", "swqs", "WMA", "year", "Chloride", "Specific conductance", 
"Total dissolved solids"), row.names = c(NA, 20L), class = "data.frame")

The problem I am having is that when I try to create the correlation plot, it gives me a plot with only one point. I'm guessing this is because there are NAs in the data frame, but when I try to filter out the NAs I get a data frame with 0 observations. Any help would be greatly appreciated!

Example code to create correlation plot:

plot1<-ggplot(data=df,aes(x="Specific conductance",y="Chloride"))+
  geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x)+
  geom_point()

I would like to create a plot like this:


Answer:

You need to remove NAs & collapse rows which have the same Date

library(tidyverse)

# clean up column names by removing spaces
df <- df %>% 
  select_all(~str_replace_all(., " ", "_"))

# removing NAs & collapsing rows which have the same Date 
require(data.table)
DT <- data.table(df)
DT2 <- unique(DT[, lapply(.SD, na.omit), by = stdate], by = "stdate")

library(ggpmisc)
formula1 <- y ~ x

ggplot(data = DT2, aes(x = Specific_conductance, y = Chloride)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = formula1) +
  stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~")), 
               label.x.npc = "left", label.y.npc = "top",
               formula = formula1, parse = TRUE, size = 6) +
  theme_bw(base_size = 14)

Created on 2018-09-10 by the reprex package (v0.2.0.9000).
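
The collapsing step can also be sketched in base R, assuming each date has at most one non-missing value per parameter. The toy rows below mirror the first two dates of the sample, and first_non_na is a hypothetical helper. Separately, quoting column names inside aes(), as in aes(x = "Specific conductance"), maps a constant string rather than the column, which may also explain the single plotted point.

```r
# Toy rows mirroring the sample: one measurement per row, NA elsewhere.
df <- data.frame(
  stdate = as.Date(c("1997-01-30", "1997-01-30", "1997-01-30",
                     "1997-02-28", "1997-02-28", "1997-02-28")),
  Chloride = c(NA, 35, NA, NA, 45, NA),
  Specific_conductance = c(224, NA, NA, 248, NA, NA),
  Total_dissolved_solids = c(NA, NA, 101, NA, NA, 115)
)

# Hypothetical helper: first non-missing value, or NA if none exists.
first_non_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) > 0) v[1] else NA_real_
}

# One row per date, keeping the non-missing value of each parameter.
value_cols <- c("Chloride", "Specific_conductance", "Total_dissolved_solids")
collapsed <- aggregate(df[value_cols], by = list(stdate = df$stdate),
                       FUN = first_non_na)
collapsed
```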

Question:

I have the following working-toy example:

trunctiris <- iris [1:102,] 
analysis <- trunctiris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(model = map(data, ~lm(Sepal.Length ~ Sepal.Width, data = .)),
         cor = map(data, ~tidy(cor.test(.x$Sepal.Length, .x$Sepal.Width), 3)))

stats <- analysis %>%
  unnest(cor)

ggplot(trunctiris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(shape = 21) +
  geom_text(data = stats, aes(label = sprintf("r = %s", round(estimate, 3)), x = 7, y = 4)) +
  geom_text(data = stats, aes(label = sprintf("p = %s", round(p.value, 3)),  x = 7, y = 3.8)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~")),
               formula = y ~ x,
               parse = TRUE) +
  facet_wrap(~Species)

The code was provided in another question. However, I haven't been able to make it work with my data. The problem is that some (not all) of my groups have fewer than 3 observations, so in the "analysis" part R returns:

Error in mutate_impl(.data, dots) : not enough finite observations

which is in relation to the fact that there are not enough observations in the group (in this case: virginica). I want to get around this, I've tried 'try(if nrow(data) >= 2)' or similar.. like the following:

analysis <- iris %>% 
group_by(Species) %>% 
nest() %>% mutate(model = map(data, ~lm (Sepal.Length ~ Sepal.Width, data = .)), 
    cor = if_else( nrow(data) <= 2 , warning ("Must have at least 3 rows of data"), 
        (map(data, ~tidy(cor.test(.x$Sepal.Length, .x$Sepal.Width), 3)))))

which returns:

Error in mutate_impl(.data, dots) : not enough finite observations
In addition: Warning message:
In if_else(nrow(list(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, :
  Must have at least 3 rows of data

Does anyone know an easy way to get around this? I'd like to skip the problematic group and keep on going.

Many thanks and sorry for my very basic R skills.


Answer:

purrr::safely or purrr::possibly allow for easy guarding against errors when you are mapping. In this case, a good strategy is to wrap the call to tidy(cor.test(... in possibly and return an empty data.frame if an error occurs

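library(dplyr)   # group_by(), mutate(), %>%
library(tidyr)   # nest(), unnest()
library(purrr)   # map(), possibly()
library(broom)   # tidy()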
analysis <- trunctiris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(
    model = map(data, ~lm(Sepal.Length ~ Sepal.Width, data = .)),
    cor = map(data, possibly(
      ~tidy(cor.test(.x$Sepal.Length, .x$Sepal.Width), 3), otherwise = data.frame())
    )
  )
# A tibble: 3 × 4
     Species              data    model                  cor
      <fctr>            <list>   <list>               <list>
1     setosa <tibble [50 × 4]> <S3: lm> <data.frame [1 × 8]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [1 × 8]>
3  virginica  <tibble [2 × 4]> <S3: lm> <data.frame [0 × 0]> #<- Note the empty df here

Which becomes:

unnest(analysis)
# A tibble: 2 × 9
     Species  estimate statistic      p.value parameter  conf.low conf.high
      <fctr>     <dbl>     <dbl>        <dbl>     <int>     <dbl>     <dbl>
1     setosa 0.7425467  7.680738 6.709843e-10        48 0.5851391 0.8460314
2 versicolor 0.5259107  4.283887 8.771860e-05        48 0.2900175 0.7015599
# ... with 2 more variables: method <fctr>, alternative <fctr>

And so the group that gave an error is successfully removed from the end result.
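
The same guarding idea can be sketched in base R with tryCatch() instead of purrr::possibly. This is an illustration only: safe_cor is a hypothetical helper, and it returns NA for a failing group rather than dropping it.

```r
trunctiris <- iris[1:102, ]                     # virginica has only 2 rows
groups <- split(trunctiris, trunctiris$Species)

# Hypothetical helper: correlation estimate, or NA if cor.test() fails.
safe_cor <- function(d) {
  tryCatch(
    cor.test(d$Sepal.Length, d$Sepal.Width)$estimate[["cor"]],
    error = function(e) NA_real_
  )
}

r_by_species <- vapply(groups, safe_cor, numeric(1))
r_by_species
```

Groups with NA can then be filtered out with r_by_species[!is.na(r_by_species)] before plotting.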

Question:

I have run the script below numerous times and it has worked until this morning, when it suddenly produced the error message:

Error in terms.formula(formula, data = data) : 'data' argument is of the wrong type.

I have not changed anything and I need to find out why it suddenly doesn't seem to work. Previous answers to similar questions have not helped.

My data:

dput(harvest2)
structure(list(Year = c(1971, 1972, 1973, 1974, 1975, 1976, 1977, 
1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 
1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 
2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 
2011, 2012, 2013, 2014, 2015, 2016), Count = c(750, 757, 592, 
693, 667, 757, 719, 670, 733, 796, 923, 921, 944, 1010, 825, 
762, 825, 844, 809, 830, 768, 823, 749, 675, 700, 637, 708, 697, 
754, 694, 636, 717, 786, 731, 769, 732, 710, 610.5, 593, 529, 
664, 788, 731, 644, 653, 656), SexRat = c(1.91812865497076, 2.34567901234568, 
1.69178082191781, 1.46766169154229, 1.30396475770925, 
1.4364406779661, 1.32098765432099, 1.48584905660377, 1.5906976744186, 
1.91414141414141, 1.48905109489051, 1.61382113821138, 1.52380952380952, 
1.87777777777778, 1.75438596491228, 1.6695652173913, 1.81566820276498, 
1.79295154185022, 1.85024154589372, 1.75446428571429, 1.83163265306122, 
1.92857142857143, 1.76635514018692, 1.5, 2.26190476190476, 1.76704545454545, 
2.38125, 1.80924855491329, 2.33333333333333, 1.81182795698925, 
2.20446096654275, 2.02790697674419, 2.1140350877193, 2.05, 2.20183486238532, 
1.90983606557377, 2.02262443438914, 1.75116279069767, 1.86842105263158, 
1.87951807228916, 2.08542713567839, 2.01724137931034, 1.95833333333333, 
1.81165919282511, 2.12135922330097, 1.97260273972603)), class = "data.frame", 
row.names = c(NA, -46L))

My script:

# Function for the equation

lm_eqn = function(df){
  m = lm(y ~ poly(x, 3), df) #3rd degree polynomial
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
                   list(a = format(coef(m)[1], digits = 2),
                        b = format(coef(m)[2], digits = 2),
                        r2 = format(summary(m)$r.squared, digits = 4)))

  as.character(as.expression(eq))
}

# Make the plot

library(ggplot2)
ggplot(harvest2, aes(x = Year, y = Count)) +  
  scale_y_continuous(minor_breaks = seq(500, 1100, by = 50), 
                     breaks = seq(500, 1100, by = 100),
                     limits = c(500, 1100), expand = c(0, 0)) +  
  scale_x_continuous(minor_breaks = seq(1970, 2018, by = 1), 
                     breaks = seq(1970, 2018, by = 5), limits = c(1970, 2018)) +
  geom_point(stat = 'identity', size=2) +
  stat_smooth(method = "lm", se = TRUE, fill = NA, size = 1.3,
              formula = y ~ poly(x, 3, raw = TRUE), col = "red") +
  annotate("text", x = 1975, y = 1075, label = lm_eqn(df), 
           hjust = 0, size = 3.5, parse = TRUE) +
  xlab(" ") + 
  ylab("Count") +
  theme_light() +
  ggtitle(" ")

Any help much appreciated.


Answer:

How about using stat_poly_eq from the ggpmisc package? See this if you want to separate the equation and R2 into two lines.

library(ggplot2)
library(ggpmisc)

# define formula
formula1 <- y ~ poly(x, 3, raw = TRUE)

ggplot(harvest2, aes(x = Year, y = Count)) +
  scale_y_continuous(
    minor_breaks = seq(500, 1100, by = 50), breaks = seq(500, 1100, by = 100),
    limits = c(500, 1100), expand = c(0, 0)) +
  scale_x_continuous(
    minor_breaks = seq(1970, 2018, by = 1), breaks = seq(1970, 2018, by = 5),
    limits = c(1970, 2018)) +
  geom_point(stat = "identity", size = 2) +
  stat_smooth(
    method = "lm", se = TRUE, fill = NA, size = 1.3,
    formula = formula1, col = "red") +
  # show the equation and R2
  stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~")),
    label.x.npc = "left", label.y.npc = "top",
    formula = formula1, parse = TRUE, size = 5) +
  xlab(" ") + ylab("Count") +
  theme_light() +
  ggtitle(" ")

Created on 2019-02-12 by the reprex package (v0.2.1.9000)
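
One likely cause of the original error, which the answer above does not discuss and which is therefore an assumption here: the plotting code calls lm_eqn(df), but the data frame is named harvest2. If no object called df exists in the workspace, R finds the function stats::df instead, and lm() cannot use a function as data. A minimal base R sketch:

```r
# In a fresh session no data frame called `df` exists, so the name
# resolves to the base function stats::df (the F density).
is.function(stats::df)                    # TRUE

# Passing that function as `data` makes lm() fail. The exact message
# may differ between R versions, so it is captured rather than assumed.
res <- tryCatch(lm(y ~ x, data = stats::df), error = function(e) e)
inherits(res, "error")                    # TRUE
conditionMessage(res)
```

If that is the cause, calling the helper with harvest2, and with a formula that uses its column names, would be the direct fix.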

Question:

I have to create a large (100+) number of ggplots of linear models. I would like to add the p-value (and potentially R2) to each plot. I know it is possible to do this using ggpmisc. Here, I employ stat_fit_glance to add the p-value. My 'problem' is that both of these require me to run lm first to be inserted as formula = my_lm.

As I have to create a large number of plots, I was wondering if there is a way to avoid creating the lm object first, and simply have it calculated while producing the ggplot? I can do it for t-tests for boxplots using stat_compare_means, and really hope to find a way to do it with lm's as well.

My code is present below. I would like to be able to skip the first line of code:

my_lm <- lm(y ~ x)


ggplot(data = complete, aes(x= x, y = y))+  
geom_point()+
theme_classic()+
geom_smooth(method = "lm")+
labs(x="Ellenberg F", y = "Species richness")+
stat_fit_glance(method = 'lm',
              method.args = list(data = complete, formula = my_lm),
              geom = 'text',
              aes(label = paste("p-value = ", signif(..p.value.., digits = 4), sep = "")),
              label.x = 8.5, label.y = 25, size = 3)

I have tried simply putting formula = y ~ x with no luck.


Answer:

From the help of ggpmisc::stat_fit_glance: method.args = list(formula = y ~ x). This means that you don't need to run lm first; you only need to specify the formula for the linear model.

library(ggpmisc)
set.seed(1)
n <- 100
x <- 8+rnorm(n)
y <- 11+x+2*rnorm(n)
complete <- data.frame(x, y)

summary(lm(y~x))
ggplot(data = complete, aes(x= x, y = y))+  
geom_point()+
theme_classic()+
geom_smooth(method = "lm")+
labs(x="Ellenberg F", y = "Species richness")+
stat_fit_glance(method = 'lm',
       method.args = list(formula = y ~ x),  geom = 'text', 
       aes(label = paste("p-value=", signif(..p.value.., digits = 4), 
                      "   R-squared=", signif(..r.squared.., digits = 3), sep = "")),
       label.x = 8.5, label.y = 25, size = 5)
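
To sanity-check the numbers printed on the plot, the same p-value and R-squared can be computed directly in base R. This is a sketch using the simulated data above; for a simple linear regression the overall F-test p-value equals the slope's t-test p-value.

```r
set.seed(1)
n <- 100
x <- 8 + rnorm(n)
y <- 11 + x + 2 * rnorm(n)

fit <- lm(y ~ x)
s <- summary(fit)

# Overall model p-value from the F statistic.
f <- s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
r_squared <- s$r.squared

# Same text layout as the label built inside aes() above.
label <- paste0("p-value=", signif(p_value, 4),
                "   R-squared=", signif(r_squared, 3))
label
```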

Question:

I want to show the linear equation and the R-squared in each plot in facet mode. This is my code so far.

library("ggplot2")
datos <- read.table("~/Documents/master2/plots/dosis_todos/datos.dat", header=TRUE, quote="\"")
ggplot(datos, aes(x = corriente, y = dosis, colour = cristal)) +    
geom_point() + geom_smooth(method="lm", se=F) + 
facet_wrap(~datos$cristal)

After reading about ggpmisc in this answer, I tried

my.formula <- y ~ x
library("ggpmisc")
ggplot(datos, aes(x = corriente, y = dosis, colour = cristal)) +    
geom_point() + 
geom_smooth(method="lm", se=F, formula=my.formula) +
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), formula = my.formula, parse = TRUE) +
facet_wrap(~datos$cristal)

Which kind of works, except that the position of the equation moves down in every panel until it disappears...

If I save my plot big enough, I can see all my text in the 9 plots ....going down.

So I guess the question is how to keep the position of the equation and the R-squared information fixed?

Thanks

Ps. Yes, I know N57 has only 3 points :(

Ps. Here is the link to my data


Answer:

@murpholinox Yes, you are correct, the code in 'ggpmisc' is not smart enough (yet) to detect when aesthetic values like the different colours are unique to each panel. However, it is possible to position the equations manually by passing a position in data units to the parameters label.y and/or label.x. So, there is a work-around.

library("ggplot2")
library("ggpmisc")
datos <- read.table("datos.dat", header=TRUE, quote="\"")
my.formula <- y ~ x
ggplot(datos, aes(x = corriente, y = dosis, colour = cristal)) +
geom_point() +
geom_smooth(method="lm", se=F, formula=my.formula) +
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
             formula = my.formula, parse = TRUE, label.y = 0.9) +
ylim(0, 1) +
facet_wrap(~cristal)

It is also possible to pass a vector to label.y and label.x, so that each equation can be manually positioned for each panel.

ggplot(datos, aes(x = corriente, y = dosis, colour = cristal)) +
geom_point() +
geom_smooth(method="lm", se=F, formula=my.formula) +
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
             formula = my.formula, parse = TRUE, 
             label.y = c(rep(0.9, 6), rep(0.15, 2), 0.9)) +
ylim(0, 0.95) +
facet_wrap(~cristal)

Question:

I have the following data frame:

     expected observed group
 1: 0.5371429   0.0000     1
 2: 1.3428571   1.3736     1
 3: 2.6857143   2.4554     1
 4: 5.3714286   3.6403     1
 5: 0.5294118   0.0000     2
 6: 1.3235294   1.1494     2
 7: 2.6470588   1.1364     2
 8: 5.2941176   4.9774     2
 9: 0.5201207   0.0000     3
10: 1.3003018   1.4327     3
11: 2.6006036   2.5918     3
12: 5.2012072   8.0769     3
13: 0.5155039   1.4851     4
14: 1.2887597   1.0638     4
15: 2.5775194   3.1700     4
16: 5.1550388   6.2500     4
17: 0.4976959   0.0000     5
18: 1.2442396   1.2384     5
19: 2.4884793   3.1073     5
20: 4.9769585   4.8148     5

I would like to scatter plot each dataset according to group, so I have the following code:

sp <- ggplot(new_df, aes(x = expected, y = observed, colour = group)) + geom_point()

sp + scale_color_gradientn(colours = rainbow(5)) 

and receive the below plot:

My question is: how do I add a straight line through the origin (intercept at 0,0) for each of the groups? In the end, I would like to have 5 lines in different colours, one per group, on the same plot.

And, is there a way to show the equation for each line in a legend?


Answer:

You can get the regression lines and the equation/R2 text with geom_smooth from ggplot2 and stat_poly_eq from the ggpmisc package.

    dat <- "expected    observed    group
    0.5371429   0   1
    1.3428571   1.3736  1
    2.6857143   2.4554  1
    5.3714286   3.6403  1
    0.5294118   0   2
    1.3235294   1.1494  2
    2.6470588   1.1364  2
    5.2941176   4.9774  2
    0.5201207   0   3
    1.3003018   1.4327  3
    2.6006036   2.5918  3
    5.2012072   8.0769  3
    0.5155039   1.4851  4
    1.2887597   1.0638  4
    2.5775194   3.17    4
    5.1550388   6.25    4
    0.4976959   0   5
    1.2442396   1.2384  5
    2.4884793   3.1073  5
    4.9769585   4.8148  5
    "  
    library(ggplot2)
    library(ggpmisc)

    df <- read.table(text = dat, header = TRUE)
    df$group <- factor(df$group)

    formula <- y ~ x # needed for ggpmisc's equation and R2 text

    # Put equation & R2 coef to the top left corner
    ggplot(df, aes(expected, observed, colour = group)) +
      geom_point(size = 2, alpha = 0.3) +
      geom_smooth(method = "lm", formula = formula, se = FALSE) +
      stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~~")), 
                   label.x.npc = "left", label.y.npc = "top",
                   formula = formula, parse = TRUE, size = 4) +
      scale_color_brewer(palette = "Dark2") +
      theme_bw(base_size = 16)
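
The code above fits y ~ x, which includes an intercept, whereas the question asked for lines through (0, 0). A minimal base-R sketch of a through-origin fit per group follows. It assumes a no-intercept formula (`observed ~ 0 + expected`); `df_demo`, `slopes` and `eq_labels` are hypothetical names used only for illustration, and only the first two groups from the question are included to keep it short.

```r
# Sketch only: per-group least-squares fits forced through the origin.
# `df_demo` is a hypothetical stand-in holding groups 1 and 2 from the question.
df_demo <- data.frame(
  expected = c(0.5371429, 1.3428571, 2.6857143, 5.3714286,
               0.5294118, 1.3235294, 2.6470588, 5.2941176),
  observed = c(0, 1.3736, 2.4554, 3.6403,
               0, 1.1494, 1.1364, 4.9774),
  group = factor(rep(1:2, each = 4))
)

# `0 + expected` drops the intercept, so each fitted line passes through (0, 0)
slopes <- sapply(split(df_demo, df_demo$group), function(d) {
  unname(coef(lm(observed ~ 0 + expected, data = d)))
})

# One text label per group, e.g. for a legend or an annotation
eq_labels <- sprintf("group %s: y = %.3f x", names(slopes), slopes)
print(eq_labels)
```

In the plot itself, the equivalent formula `y ~ 0 + x` can be passed to geom_smooth(method = "lm"). Passing the same formula to stat_poly_eq() should keep the displayed equation consistent with the line, but that is worth checking against the ggpmisc version you have installed.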

Question:

I have ggplot2 installed and library(ggplot2) runs. However, I get the following error when I try to run library(ggpmisc). I have tried restarting R and reinstalling ggplot2, to no avail. In addition, I can generate plots using ggplot2 but can't access functions like stat_poly_eq, since I can't load ggpmisc. I have the current version of RStudio installed.

library(ggplot2)
library(ggpmisc)

Error in library(ggpmisc) : there is no package called ‘ggpmisc’


Answer:

Have you installed the ggpmisc package? If you haven't, please run:

install.packages("ggpmisc")

and then run

library("ggpmisc")

and now everything should work!

As stated on its CRAN webpage, ggpmisc is a set of "Extensions to 'ggplot2' respecting the grammar of graphics paradigm.", i.e. it is not a part of ggplot2. That means the two packages are not bundled together, so each one has to be installed individually.
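
To see which packages are actually missing before calling library(), a small check along these lines may help. It is only a sketch: `needed`, `installed` and `missing_pkgs` are hypothetical variable names, and the install step is left commented out so that nothing is downloaded unless you choose to.

```r
# Sketch: report which of the required packages are not installed yet
needed <- c("ggplot2", "ggpmisc")

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```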

Question:

df%>%
  group_by(approved_date)%>%
  summarise(rev=sum(gmv))%>%
  ggplot(aes(x = approved_date, y = rev)) + 
    geom_line() + 
    geom_smooth(method = 'auto', se = FALSE) + 
    labs(x = 'Date', y = 'Revenue', title = 'Revenue by Date') +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
    stat_peaks(colour = "red", span = NULL) + 
    stat_valleys(colour = "blue", span = NULL) + 
    geom_text(aes(label = round(rev, 0)),
              vjust = "inward", 
              hjust = "inward",
              show.legend = FALSE,
              check_overlap = TRUE)

When I run this code, it labels the values of all local maxima and minima. I want only the values of the global maximum and the global minimum. How can I do that?


Answer:

As the code in the question cannot be run for lack of data, I show an example slightly modified from the package User Guide. This different example should still be enough to work out the solution.

library(ggpmisc)
ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", vjust = -0.5, 
             check_overlap = TRUE, span = NULL) +
  ylim(-100, 7300)

In other words, geom "text" should be passed as an argument to stat_peaks(), together with span = NULL, to get a single label. If you add geom_text() directly, peaks are not selected; instead, all values stored in the variable mapped to the label aesthetic are added to the plot.
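
For comparison, the global extremes can also be located without any stat, using base R on the same built-in `lynx` series. This is only a sketch of the idea; `i_max`, `i_min` and `extremes` are hypothetical names, not part of ggpmisc.

```r
# Sketch: locate the global maximum and minimum of the built-in `lynx` series
i_max <- which.max(lynx)  # position of the largest observation
i_min <- which.min(lynx)  # position of the smallest observation

extremes <- data.frame(
  year  = as.numeric(time(lynx))[c(i_max, i_min)],
  value = as.numeric(lynx)[c(i_max, i_min)],
  kind  = c("global max", "global min")
)
print(extremes)
```

A two-row data frame like this could be supplied as the data argument of a separate geom_text() layer. Within ggpmisc, the counterpart of the call shown in the answer would presumably be stat_valleys(geom = "text", span = NULL) for the global minimum.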