Hot questions for Using Ggplot2 in ggbiplot

Question:

Using ggord one can make nice linear discriminant analysis ggplot2 biplots (cf chapter 11, Fig 11.5 in "Biplots in practice" by M. Greenacre), as in

library(MASS)
install.packages("devtools")
library(devtools)
install_github("fawda123/ggord")
library(ggord)
data(iris)
ord <- lda(Species ~ ., iris, prior = rep(1, 3)/3)
ggord(ord, iris$Species)

I would also like to add the classification regions (shown as solid regions of the same colour as their respective group with say alpha=0.5) or the posterior probabilities of class membership (with alpha then varying according to this posterior probability and the same colour as used for each group) (as can be done in BiplotGUI, but I am looking for a ggplot2 solution). Would anyone know how to do this with ggplot2, perhaps using geom_tile ?

EDIT: below someone asks how to calculate the posterior classification probabilities & predicted classes. This goes like this:

library(MASS)
library(ggplot2)
library(scales)
fit <- lda(Species ~ ., data = iris, prior = rep(1, 3)/3)
datPred <- data.frame(Species=predict(fit)$class,predict(fit)$x)
#Create decision boundaries
fit2 <- lda(Species ~ LD1 + LD2, data=datPred, prior = rep(1, 3)/3)
ld1lim <- expand_range(c(min(datPred$LD1),max(datPred$LD1)),mul=0.05)
ld2lim <- expand_range(c(min(datPred$LD2),max(datPred$LD2)),mul=0.05)
ld1 <- seq(ld1lim[[1]], ld1lim[[2]], length.out=300)
ld2 <- seq(ld2lim[[1]], ld1lim[[2]], length.out=300)
newdat <- expand.grid(list(LD1=ld1,LD2=ld2))
preds <-predict(fit2,newdata=newdat)
predclass <- preds$class
postprob <- preds$posterior
df <- data.frame(x=newdat$LD1, y=newdat$LD2, class=predclass)
df$classnum <- as.numeric(df$class)
df <- cbind(df,postprob)
head(df)

           x        y     class classnum       setosa   versicolor virginica
1 -10.122541 -2.91246 virginica        3 5.417906e-66 1.805470e-10         1
2 -10.052563 -2.91246 virginica        3 1.428691e-65 2.418658e-10         1
3  -9.982585 -2.91246 virginica        3 3.767428e-65 3.240102e-10         1
4  -9.912606 -2.91246 virginica        3 9.934630e-65 4.340531e-10         1
5  -9.842628 -2.91246 virginica        3 2.619741e-64 5.814697e-10         1
6  -9.772650 -2.91246 virginica        3 6.908204e-64 7.789531e-10         1

colorfun <- function(n,l=65,c=100) { hues = seq(15, 375, length=n+1); hcl(h=hues, l=l, c=c)[1:n] } # default ggplot2 colours
colors <- colorfun(3)
colorslight <- colorfun(3,l=90,c=50)
ggplot(datPred, aes(x=LD1, y=LD2) ) +
    geom_raster(data=df, aes(x=x, y=y, fill = factor(class)),alpha=0.7,show_guide=FALSE) +
    geom_contour(data=df, aes(x=x, y=y, z=classnum), colour="red2", alpha=0.5, breaks=c(1.5,2.5)) +
    geom_point(data = datPred, size = 3, aes(pch = Species,  colour=Species)) +
    scale_x_continuous(limits = ld1lim, expand=c(0,0)) +
    scale_y_continuous(limits = ld2lim, expand=c(0,0)) +
    scale_fill_manual(values=colorslight,guide=F)

(well not totally sure this approach for showing classification boundaries using contours/breaks at 1.5 and 2.5 is always correct - it is correct for the boundary between species 1 and 2 and species 2 and 3, but not if the region of species 1 would be next to species 3, as I would get two boundaries there then - maybe I would have to use the approach used here where each boundary between each species pair is considered separately)

This gets me as far as plotting the classification regions. I am looking for a solution though to also plot the actual posterior classification probabilities for each species at each coordinate, using alpha (opaqueness) proportional to the posterior classification probability for each species, and a species-specific colour. In other words, with a stack of three images superimposed. As alpha blending in ggplot2 is known to be order-dependent, I think the colours of this stack would have to calculated beforehand though, and plotted using something like

qplot(x, y, data=mydata, fill=rgb, geom="raster") + scale_fill_identity() 

Here is a SAS example of what I am after:

Would anyone know how to do this perhaps? Or does anyone have any thoughts on how to best represent these posterior classification probabilities?

Note that the method should work for any number of groups, not just for this specific example.


Answer:

I suppose the easiest way will be to show the posterior probabilities. It is pretty straightforward for your case:

datPred$maxProb <- apply(predict(fit)$posterior, 1, max)
ggplot(datPred, aes(x=LD1, y=LD2) ) +
  geom_raster(data=df, aes(x=x, y=y, fill = factor(class)),alpha=0.7,show_guide=FALSE) +
  geom_contour(data=df, aes(x=x, y=y, z=classnum), colour="red2", alpha=0.5, breaks=c(1.5,2.5)) +
  geom_point(data = datPred, size = 3, aes(pch = Species,  colour=Species, alpha = maxProb)) +
  scale_x_continuous(limits = ld1lim, expand=c(0,0)) +
  scale_y_continuous(limits = ld2lim, expand=c(0,0)) +
  scale_fill_manual(values=colorslight, guide=F)

You can see the points blend in at blue-green border.

Question:

I am working on an ordination package using ggplot2. Right now I am constructing biplots in the traditional way, with loadings being represented with arrows. I would also be interested though to use calibrated axes and represent the loading axes as lines through the origin, and with loading labels being shown outside the plot region. In base R this is implemented in

library(OpenRepGrid)
biplot2d(boeker)

but I am looking for a ggplot2 solution. Would anybody have any thoughts how to achieve something like this in ggplot2? Adding the variable names outside the plot region could be done like here I suppose, but how could the line segments outside the plot region be plotted?

Currently what I have is

install.packages("devtools")
library(devtools)
install_github("fawda123/ggord")
library(ggord)
data(iris)
ord <- prcomp(iris[,1:4],scale=TRUE)
ggord(ord, iris$Species)

The loadings are in ord$rotation

                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

How could I add the lines through the origin, the outside ticks and the labels outside the axis region (plossibly including the cool jittering that is applied above for overlapping labels)?

NB I do not want to turn off clipping, since some of my plot elements could sometimes go outside the bounding box

EDIT: Someone else apparently asked a similar question before, though the question is still without an answer. It points out that to do something like this in base R (though in an ugly way) one can do e.g.

plot(-1:1, -1:1, asp = 1, type = "n", xaxt = "n", yaxt = "n", xlab = "", ylab = "")
abline(a = 0, b = -0.75)
abline(a = 0, b = 0.25)
abline(a = 0, b = 2)
mtext("V1", side = 4, at = -0.75*par("usr")[2])
mtext("V2", side = 2, at = 0.25*par("usr")[1])
mtext("V3", side = 3, at = par("usr")[4]/2)

Minimal workable example in ggplot2 would be

library(ggplot2)
df <- data.frame(x = -1:1, y = -1:1)
dfLabs <- data.frame(x = c(1, -1, 1/2), y = c(-0.75, -0.25, 1), labels = paste0("V", 1:3))
p <- ggplot(data = df, aes(x = x, y = y)) +  geom_blank() +
  geom_abline(intercept = rep(0, 3), slope = c(-0.75, 0.25, 2)) +
  theme_bw() + coord_cartesian(xlim = c(-1, 1), ylim = c(-1, 1)) +
  theme(axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(),
        panel.grid = element_blank())
p + geom_text(data = dfLabs, mapping = aes(label = labels))

but as you can see no luck with the labels, and I am looking for a solution that does not require one to turn off clipping.

EDIT2: bit of a related question is how I could add custom breaks/tick marks and labels, say in red, at the top of the X axis and right of the Y axis, to show the coordinate system of the factor loadings? (in case I would scale it relative to the factor scores to make the arrows clearer, typically combined with a unit circle)


Answer:

Maybe as an alternative, you could remove the default panel box and axes altogether, and draw a smaller rectangle in the plot region instead. Clipping the lines not to clash with the text labels is a bit tricky, but this might work.

df <- data.frame(x = -1:1, y = -1:1)
dfLabs <- data.frame(x = c(1, -1, 1/2), y = c(-0.75, -0.25, 1), 
                     labels = paste0("V", 1:3))
p <- ggplot(data = df, aes(x = x, y = y)) +  
  geom_blank() +
  geom_blank(data=dfLabs, aes(x = x, y = y)) +
  geom_text(data = dfLabs, mapping = aes(label = labels)) +
  geom_abline(intercept = rep(0, 3), slope = c(-0.75, 0.25, 2)) +
  theme_grey() +
  theme(axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid = element_blank()) + 
  theme()

library(grid)
element_grob.element_custom <- function(element, ...)  {
  rectGrob(0.5,0.5, 0.8, 0.8, gp=gpar(fill="grey95"))
}

panel_custom <- function(...){ # dummy wrapper
  structure(
    list(...), 
    class = c("element_custom","element_blank", "element") 
  ) 

}

p <- p + theme(panel.background=panel_custom())


clip_layer <- function(g, layer="segment", width=1, height=1){
  id <- grep(layer, names(g$grobs[[4]][["children"]]))
  newvp <- viewport(width=unit(width, "npc"), 
                    height=unit(height, "npc"), clip=TRUE)
  g$grobs[[4]][["children"]][[id]][["vp"]] <- newvp

  g
}

g <- ggplotGrob(p)
g <- clip_layer(g, "segment", 0.85, 0.85)
grid.newpage()
grid.draw(g)

Question:

By default, the ggbiplot function gives a graph with loadings as red arrows and unit labels in black:

library(ggbiplot)
data("USArrests")
us <- princomp(USArrests)
ggbiplot(us, labels = rownames(us$scores))

is the result of that code

How can I change the colour of these labels, and incidentally their size or font?


Answer:

library(ggbiplot)
library(grid)
data("USArrests")
us <- princomp(USArrests)

# Cut the third score into 4 intervals using quantiles
cols <- cut(us$scores[,3], quantile(us$scores[,3], probs=seq(0,1,0.25)), include.lowest=T)

# Change label colors using the "group" option
# Change label font size using the "label.size" option
p <- ggbiplot(us, labels = rownames(us$scores), groups=cols, labels.size=4)

# Change label font family 
g <- ggplotGrob(p)
g$grobs[[6]]$children[[4]]$gp$fontfamily <- "mono"
grid.draw(g)

To change label colors as a whole:

p <- ggbiplot(us, labels = rownames(us$scores), groups=1, labels.size=4) +
     theme(legend.position = "none")

# Change label colors 
g <- ggplotGrob(p)
g$grobs[[6]]$children[[4]]$gp$col <- "#FF9900"
grid.draw(g)

Question:

Is it possible to change the type of lines of the normal probability ellipsoids in ggbiplot, e.g. have them dashed and dotted lines instead of or additional to the different colors? I couldn't find anything in the documentation of ggbiplot except this to be used as MWE:

library(ggbiplot)
data(wine)
wine.pca <- prcomp(wine, scale. = TRUE)
print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class, ellipse = TRUE, circle = TRUE))

Answer:

To the best of my knowledge it isn't possible with any or the arguments passed to ggbiplot. Luckily ggbiplot is a pretty simple wrapper for some ggplot2 commands and data massaging. You can copy the source code make a custom function and change line 124 of the original source from:

g <- g + geom_path(data = ell, aes(color = groups, group = groups))

to:

g <- g + geom_path(data = ell, aes(color = groups, group = groups, linetype = groups))

Because of the plot scale it's hard to tell the lines apart without changing the size outside of the aes() statement.

Question:

I have the following PCA plot with 7 variables (see data and code below), where I want the variable names to be put in subscript. In ggbiplot() however, the variable names are automatically taken from the column names of the matrix that was used to generate the PCA, and (as far as I know) no options are available to add these manually.

Therefore I tried to create subscripts in my actual column names by using

c(expression("j"["d"]),expression("k"["1"]), etc.)

which only results in the variable names turning into "j"["d"],"k"["1"] on the PCA plot.

Is there a workaround for this problem? e.g. making the second character smaller or something alike?

The plot was generated using the data:

Param.clean <- structure(c(0.314689287410068, 0.279479887056815, 0.448790689537864, 
0.25336455455925, 0.289008161177184, 0.314501392595033, 0.291144087910652, 
0.30630205659933, 0.283940162753961, 0.293902791758693, 0.384490609026053, 
0.287376099118374, 0.308181312257512, 0.295516976083076, 0.299962079750977, 
0.377418190053724, 0.577708482228635, 0.548861542714413, 0.445100820783724, 
0.454234613160057, 0.509303280474031, 0.485557486397235, 0.512794671011103, 
0.438809386918853, 0.584488774396788, 0.485077847448695, 0.54242611837146, 
0.50295109738886, 0.462140343335828, 0.482811895512131, 0.505123975906175, 
0.454883826310242, 0.270198900375524, 0.375171915727647, 0.415330703464691, 
0.335520417496773, 0.337301727971001, 0.414448624199441, 0.413407199816623, 
0.480251511263297, 0.381597639419929, 0.299997369191106, 0.716363262901133, 
0.334190348030789, 0.339749957548872, 0.615860520668703, 0.399995858493781, 
0.290916481482953, 0.271565709782391, 0.434971269297489, 0.426102875513371, 
0.245620918873499, 0.350757895503111, 0.314711831388176, 0.233155192989713, 
0.248289747469244, 0.260316266382216, 0.435133707574921, 0.270357707406285, 
0.460714026041302, 0.300202262584418, 0.283894965208827, 0.286731230742906, 
0.476076636930455, 0.496419713185873, 0.449716335449948, 0.414301742272173, 
0.517741851053675, 0.452096411019313, 0.503619428840069, 0.46590684730812, 
0.426164828533173, 0.526710272381684, 0.60162289738927, 0.501571299694807, 
0.5204289094248, 0.501944294148778, 0.420989057029306, 0.486676941655541, 
0.581232141136961, 0.388451077003248, 0.348575471498664, 0.541637601056563, 
0.280529225927791, 0.474527715587717, 0.427368925204716, 0.247233045036043, 
0.371205006512827, 0.350378420436755, 0.334934610173675, 0.672485054825788, 
0.387370130421541, 0.442394016641109, 0.410245000087745, 0.554591956294395, 
0.292541225836523, 4.53967830833669, 4.07671677879989, 6.31220323958745, 
5.53951495559886, 6.14831664270411, 2.49250044245273, 3.56566691364472, 
0.905669305473566, 5.65883946946512, 2.64297457644716, 3.05508165226008, 
3.18054619676744, 6.40823390030613, 3.37302493648604, 3.66350222987433, 
2.54057804516827, 8.20203100920965, 2.71440979468947, 2.91087136765321, 
1.99383332931126, 2.54861955298111, 2.76731871856997, 2.8168199171933, 
1.87471640761942, 2.36744814319536, 2.70068269921467, 3.73984078429639, 
3.14727020682767, 2.44860090082511, 2.96811391366646, 3.15924824448302, 
2.75038693333045, 0.571991651163747, 0.0795687369691844, 0.199285715352744, 
0.278876264579594, 0.0878147967159748, 0.776516450569034, 0.109050695318729, 
0.214585168287158, 0.0733157177455723, 4.99666969059035, 0.681611322797835, 
0.511083496330927, 0.103351746220142, 2.09765716294448, 0.163525143420944, 
1.15420421690991, 3.94337840843946, 3.16239102048179, 5.81127347030366, 
5.75235136287908, 2.48392526193832, 2.46317023644224, 1.81044878205284, 
4.65439113788307, 5.15721979085356, 2.87866174476221, 2.5224589696154, 
5.64499958222732, 1.54218871717652, 2.67105548409745, 3.03890538634732, 
4.12773523628712, 2.30713835917413, 1.77077499361088, 2.28220948716626, 
2.27822186658159, 2.22687301691622, 2.78089357145751, 2.10296940756962, 
1.62778628757223, 2.64038743038351, 2.90401477599517, 2.85171007970348, 
2.15843657962978, 2.22618574043736, 2.01146884588525, 7.50444248911614, 
2.78893498703837, 0.369707935024053, 0.640407053288072, 1.16690336726606, 
1.19069673353806, 0.369720328599215, 0.222492094617337, 1.97003725056226, 
0.0771822298566498, 0.0491951033473015, 1.58942595124245, 0.56345314020291, 
3.61783248605207, 0.470429616980255, 4.25140980218227, 0.416233133369436, 
0.838344213552773, 2.73752116598189, 3.02740207407624, 1.11222450388595, 
2.30047973571345, 2.45276295114309, 2.64581680834914, 2.78265510452911, 
3.27471463242546, 3.1145193031989, 2.56884276599934, 3.15895244479179, 
3.14939826028422, 2.39397626441593, 3.8150685878781, 2.75754480964194, 
2.56850058445707, 1.10134650183221, 1.90155507440989, 1.71837465837598, 
2.29802527697757, 1.24960316416497, 1.95164571283385, 1.49695883272216, 
2.51760663697496, 1.04954799870029, 1.59203378250822, 1.33138975808397, 
1.30761714885011, 1.6880701482296, 1.33289409801364, 1.18721167324111, 
2.10289117880166, 4.97532397741452, 4.08132408546905, 4.83296488539005, 
5.65673472359776, 1.52374835970501, 4.90433675702661, 6.09970323707287, 
4.9611738567551, 4.98475494328886, 1.58062443907062, 6.51368894614279, 
5.82539446347704, 4.37618843046948, 6.41044923532754, 1.36502551613376, 
2.11988553637639, 2.87185357650742, 1.99067048101375, 2.6186517810449, 
1.56325603065391, 3.18718514405191, 2.67383103277534, 3.39795261013011, 
3.97267814834292, 3.7929962793986, 2.75519806658849, 2.67046403462688, 
1.61068025619412, 3.21088001991933, 2.72053461754695, 2.89008031133562, 
2.44437138317153, 1.87622146913782, 2.30537068654473, 1.77740855213876, 
1.46343538770452, 2.23820108221844, 1.83238286990672, 1.70535395620391, 
3.20611382601783, 1.42552105011418, 1.35411406867206, 1.67505901074037, 
1.27963638935859, 1.50879823369905, 3.20626547553887, 1.34457757696509, 
1.4350421866402, 3.01776180580879, 4.42168406192213, 2.68915122840554, 
5.10021819965914, 5.7995775481686, 3.69961343705654, 5.6582765411896, 
5.42690238449723, 4.12574668414891, 2.88072699628149, 7.37651169135546, 
0.877918794285506, 3.48383912583813, 7.16048748052369, 1.10681456684445, 
2.0056554214408, 0.843951161950827, 0.892569730058312, 1.82299897540361, 
1.4852886760508, 1.34192016645273, 1.12837524153292, 0.788718643598258, 
2.59369957595567, 1.03540072683245, 0.767321502789855, 1.23622662722568, 
0.784033099189401, 0.784212271682918, 1.25203002375116, 0.76296656858176, 
0.985467154532671, -0.220396793447435, 0.106205735355616, 0.124459093059103, 
-0.00892141033460625, -0.10377890299509, 0.0926174888437004, 
-0.0549866305664182, 0.278899986296892, 0.0474233353460836, -0.109990653581917, 
0.0474225463395319, 0.0338427349925041, 0.0972763542085886, 0.114185092970729, 
-0.00886052381247282, -0.0589170539751649, -2.3298938000525, 
-1.84932016064972, -3.90579685030132, -6.07330848347147, 1.8304963277777, 
-2.75582764982432, -4.77531149517745, -3.8228796642224, 1.27799427995086, 
-5.45959737020979, -2.00253919902196, -2.22745593055834, -2.6610222319141, 
-3.21496389806271, -6.01237997648368, -3.13309627585113, 0.733061715583007, 
1.04009207207213, 1.51182846035312, 1.92732659168541, 1.29081471823156, 
0.809032674878836, 0.954354595015446, 1.5407008677721, 0.961584224718311, 
0.949504360179106, 1.17135864682496, 0.758904917165637, 1.27744460012764, 
1.13802453503013, 1.12143621314317, 0.923086409457028, -0.207958693792422, 
-0.112425229889651, -0.114883695351581, -0.0418341891790419, 
0.00953924376517534, 0.0761805102229118, 0.06189932115376, -0.279747098684311, 
-0.139795616269112, -0.0957655208185315, -0.211875131353736, 
0.15934879556795, -0.307479116310674, -0.171517195557555, -0.147559982724488, 
-0.151702696457505, -3.2736393770054, -3.77255320500831, -2.50166923739016, 
-3.38403152301908, -3.54271008856346, -5.98730765748769, -3.46145537216216, 
4.08535061942786, -5.5706409165586, -5.12871207815409, -2.00980313849697, 
-1.74987012389551, -4.99012113984178, -5.52220372986048, -4.09768380224705, 
-2.65192667357127, 0.368457175791264, 0.276677635120849, 0.660357913002372, 
0.536901503801346, 0.122066596522927, 0.248441367586455, 0.326834246516228, 
0.414462564513087, 0.374505277723074, 0.157977417111397, 0.147561579942703, 
0.327407247386873, 0.481645831460754, 0.471423282288015, 0.402634991332889, 
0.600018523255984, -0.754529740661383, -0.333761049988369, -0.479209786502022, 
-0.338527356651923, -1.15156446321805, -0.187009451910853, -1.59113974179576, 
-0.26058504227052, -0.629410240799189, -1.30694395600756, -0.916838761041562, 
-1.07011808082461, -1.22627768199891, -0.492068426683545, -1.54347744450718, 
-0.104520670138299, 1.40759190544486, 7.8479420303603, 2.75170721710225, 
3.93423731307934, 9.85715725720922, 5.37083122879267, 6.15668724537641, 
4.96914687916388, 6.46718368138125, 1.0871206430619, 9.84838488511741, 
9.75057146977633, 1.26769720241924, 9.92010527290404, 9.98543933499604, 
9.69677802640945, 0.169932479038835, 0.409494925844172, 0.409328970126808, 
0.740250064991415, 0.193447070196271, 0.51206505260865, 0.250175041146576, 
0.391306541860104, 0.295244073495269, 0.243191891349852, 0.293793072924018, 
0.408511371351779, 0.281213674316803, 0.680068365474543, 0.41473192628473, 
0.371123384684324, -0.753847653398911, -0.768953422084451, -1.10839967615902, 
-0.771105247549713, -0.155546843451758, -0.282887876386441, -0.444554830901325, 
-0.305730849504471, -0.499526713974774, -0.561585665126642, -0.731248073279858, 
-0.888600847683847, -0.92906893696636, -0.32135354558875, -0.635939710773528, 
-0.817948141487935, 1.6278464008073, 3.22130429296692, 3.22974945083757, 
7.91888088977089, 9.00321571025501, 2.45285976212472, 6.25077115575721, 
5.03100211266428, 2.74867731204381, 1.44042827261984, 5.0080717401579, 
9.99481210485101, 1.47605382240812, 5.14414160776883, 7.90287723037352, 
9.74082155153155, 6.49911883796255, 4.32981192308168, 8.23767292803774, 
4.77091148030013, 9.32961360147844, 2.72199803938468, 3.16884732668598, 
0.829324850315849, 5.05701699654261, 2.34405943658203, 3.72586945071816, 
2.25735736116767, 6.09207524452358, 4.18884709620227, 3.34469041476647, 
2.58207156453282, 0.749518116253117, 10.288818734698, 4.99554592402031, 
12.6404861267656, 7.78458069656044, 5.22634824855874, 19.4536860734845, 
18.1442198018854, 2.79630955830216, 9.58980514369905, 7.07662330872069, 
2.31912315823138, 4.37504726642122, 10.6357821771254, 4.86855507971098, 
15.2487045279394, 2.38136078696698, 2.2241358395045, 2.6578728514413, 
2.00018415475885, 14.3105512278154, 2.79016453381628, 2.82581362407655, 
3.00503022968769, 0.867751725018024, 19.840897029452, 2.76507987820854, 
2.18780075665563, 2.29634886607528, 5.24806297849864, 3.00219499040395, 
3.75070014856756, 4.70993515464167, 2.90138749877612, 7.88213442545384, 
6.81599063475927, 2.61536476109177, 2.09377944457034, 1.96296699593464, 
5.57947221212089, 5.20963124415527, 3.33875400945544, 2.79699660371989, 
4.58355573223283, 1.3866287143901, 2.73741390556097, 3.83854200132191, 
3.64578404258937, 8.6152262147516, 7.4471927308167, 10.7907397029921, 
3.60064423537503, 2.21575071569532, 13.2584175148358, 6.69871457573026, 
7.57425001679609, 10.597095658568, 6.29717063158751, 19.982005469501, 
4.82507357889165, 12.0198037318264, 12.1128156222403, 0.971864601969719, 
7.90290065680941, 2.69096855397026, 3.41408452807615, 3.34672867252181, 
3.73752327635884, 2.59765107041846, 3.69969989638776, 6.49256678763777, 
1.34200508563469, 2.58853476804991, 6.08448572133978, 2.86518771201372, 
8.80958803463727, 3.21931545517097, 13.9356314451744, 3.31248219124973, 
2.6213849833856, 6.81703055885931, 4.57180783512692, 0.800920105539262, 
1.6915382258594, 1.97274463716894, 8.35091653894633, 4.07009290065616, 
4.70143776244173, 3.73712390195578, 3.32298003757993, 2.70389301739633, 
9.40504621323198, 5.22653986488779, 6.91535883117467, 4.01246708252778, 
6.83368936007222, 5.52582112283756, 9.52527065730343, 5.19140504300594, 
8.41044230709473, 6.70183763448149, 6.82530618272722, 9.10367484707385, 
14.269079283004, 4.7895190725103, 6.59831102186193, 5.62791180796921, 
6.17603897117078, 7.14836313854903, 5.96534776215752, 6.33691875052949, 
4.8933652261893, 13.8731955783442, 17.7908931604276, 13.79737106661, 
12.8477370319888, 1.87629146464169, 19.782149810344, 19.4288346171379, 
18.9387338103106, 18.204667886657, 1.15545298295716, 19.7008843562255, 
17.4563148757443, 14.7906976851945, 17.3621598140026, 1.16834398824722, 
1.00026320200414, 15.0063681416214, 3.00477131232619, 5.1401701791212, 
0.385864693671465, 4.36085817695906, 8.55562331421922, 5.70195167902857, 
5.37442221703629, 4.73609448876232, 3.56490389804045, 5.62534291626265, 
1.87125703754524, 4.91213823029151, 5.71994946518292, 5.81859005757918, 
4.93084813430905, 8.53895935813586, 8.24653955462078, 5.77503573757907, 
8.75251863257339, 6.42709499839693, 6.27165456948181, 6.44690599292517, 
16.0425839032357, 5.03840921912342, 5.87823822697004, 4.41305622781316, 
5.94650622780124, 5.56597112355133, 1.1121899643292, 5.33032094594091, 
5.42635044082999, 6.6131685036545, 11.4281271025538, 6.12669843180726, 
14.5658381014441, 18.4343010187149, 10.0486588804051, 19.8108604625488, 
19.3252983829007, 9.68507033698261, 7.62796785042932, 14.9589012558262, 
0.670023332349956, 11.1040308237076, 17.0357564882065, 1.06506975100686, 
0.740390195945899), .Dim = c(96L, 7L), .Dimnames = list(NULL, 
    c("jd", "k1", "k2", "b1", "b2", "p1", "p2")))

and code:

library(ggbiplot)
Param.pca <- prcomp(Param.clean,scale=TRUE)
groups <- factor(c(rep("Cercis L",16),rep("Cornus L",16),rep("Ginkgo L",16),rep("Cercis R",16),rep("Cornus R",16),rep("Ginkgo R",16)))
g <- ggbiplot(Param.pca, choices=c(1,2), obs.scale = 1, var.scale = 1, ellipse.prob=0.95, 
              groups = groups, ellipse = TRUE, alpha=0, varname.size=6,labels.size = 10)
g <- g + geom_point(aes(shape=groups,col=groups))
g <- g + theme(legend.direction = 'vertical', 
               legend.position = 'right',
               axis.text=element_text(size=14),axis.title=element_text(size=14),
           legend.text=element_text(size=14),legend.title=element_text(size=14))
plot(g)

Answer:

Following up on @RichardTelford's comment, you can get the code for ggbiplot by typing ggbiplot in your console. Then copy and paste this into a script file and assign it to a new name, say, my_ggbiplot. Then change the last if statement in the function from this:

if (var.axes) {
    g <- g + geom_text(data = df.v, 
                       aes(label = varname, x = xvar, y = yvar, 
                           angle = angle, hjust = hjust), 
                       color = "darkred", size = varname.size)
} 

to this:

if (var.axes) {
    df.v$varname = paste0(substr(df.v$varname,1,1),"[",substr(df.v$varname,2,2),"]")

    g <- g + geom_text(data = df.v, 
                       aes(label = varname, x = xvar, y = yvar, 
                           angle = angle, hjust = hjust), 
                       color = "darkred", size = varname.size, parse=TRUE)
}

Note the addition of the line that creates the new text labels and then the addition of parse=TRUE to geom_text. This hard-codes the labels to work specifically for your case, but you can make the function more general if you're going to do this a lot.

To run the function (once you've loaded it into your environment):

my_ggbiplot(Param.pca, choices=c(1,2), obs.scale = 1, var.scale = 1, ellipse.prob=0.95, 
            groups = groups, ellipse = TRUE, alpha=0, varname.size=6, labels.size = 10) +
  geom_point(aes(shape=groups, col=groups)) + 
  theme(axis.text=element_text(size=14),
        axis.title=element_text(size=14),
        legend.text=element_text(size=14),
        legend.title=element_text(size=14)) 

And here's the result:

Question:

I am learning biplot with wine data set. How does R know Barolo, Grignolino and Barbera are wine.class while we don't see the wine class column in the data set?

More details about the wine data set are in the following links

ggbiplot - how not to use the feature vectors in the plot

https://github.com/vqv/ggbiplot

Thanks very much


Answer:

In the wine dataset, you have 2 objects, one data.frame wine with 178 observations of 13 quantitative variables:

str(wine)
'data.frame':   178 obs. of  13 variables:
 $ Alcohol       : num  14.2 13.2 13.2 14.4 13.2 ...
 $ MalicAcid     : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash           : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ AlcAsh        : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ Mg            : int  127 100 101 113 118 112 96 121 97 98 ...
 $ Phenols       : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ Flav          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ NonFlavPhenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ Proa          : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ Color         : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ Hue           : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ OD            : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ Proline       : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

There is also one vector wine.class that contains 178 observations of the qualitative wine.class variable:

str(wine.class)
 Factor w/ 3 levels "barolo","grignolino",..: 1 1 1 1 1 1 1 1 1 1 ...

The 13 quantitative variables are used to compute the PCA:

wine.pca <- prcomp(wine, scale. = TRUE)

while the wine.class variable is just used to color the points on the plot

Question:

I have created a PCA plot in using the ggbiplot() function, form the ggbiplot package, which is built on top of ggplot2. Here is an analogous, reproducible example:

library(ggbiplot)
data(wine)
wine.pca <- prcomp(wine, scale. = TRUE)
print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class, ellipse = TRUE, circle = TRUE))

So far so good. I want to change the legend order to grginolino, barbera, barolo (rather than barolo, grignolino, barbera as it is now). The names are stored in the environmental factor variable wine.class.

Apologies for such a simple question, but I cannot find a straight forward answer from the ggplot2 help that generalizes to this case.


Answer:

You need to reorder wine.class levels; look below:

library(ggbiplot)

data(wine, package = "ggbiplot")
wine.pca <- prcomp(wine, scale. = TRUE)
wine.class.reorder <-  factor(wine.class, levels = c("grignolino", "barbera", "barolo"))

ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class.reorder, 
                   ellipse = TRUE, circle = TRUE)

Question:

My dataset has 100 samples and 17000 variables. I would use PCA and visualize data. But the problem is that the plot is not good. How I can control the number of arrows in ggbiplot or biplot, in fact select the most contributed variables? Some sample codes are as below:

data <- matrix(rnorm(1700000), nrow=100, ncol=17000)
colnames(data) <- paste("X", 1:ncol(data), sep="")
pca <- prcomp(data, scale=T, center=T)

biplot(pca)
print(ggbiplot(pca, obs.scale = 1, var.scale = 1, 
               groups = c(rep('a',30), rep('b',70))))


Answer:

I assumed you got a recent version of ggbiplot from github (19 Jun 2015 https://github.com/vqv/ggbiplot). In this one, I don't think there's a clean way to reduce the number of arrows. You'd have to modify the original function by subsetting the df.v in two plotting calls:

around line 89:

g <- g + geom_segment(data = df.v[1:5,], # SUBSET HERE
aes(x = 0, y = 0, xend = xvar, yend = yvar), arrow = arrow(length = unit(1/2, "picas")), color = muted("red"))

and around line 127:

g <- g + geom_text(data = df.v[1:5,], # SUBSET HERE
aes(label = varname, x = xvar, y = yvar, angle = angle, hjust = hjust), color ="darkred", size = varname.size)

Question:

I am trying to use ggbiplotfrom ggfortify package. It seems its working fine but I am getting warning message as follows,

mdl <- pls::plsr(mpg ~ ., data = mtcars, scale = T)
scrs <- data.frame(pls::scores(mdl)[])
loads <- data.frame(pls::loadings(mdl)[])

ggfortify::ggbiplot(scrs, loads, 
  label.label = rownames(scrs), asp = 1, label = T, label.size = 3,
  loadings = T, loadings.label = T, loadings.label.label = rownames(loads))

Warning messages:
1: In if (value %in% columns) { :
  the condition has length > 1 and only the first element will be used
2: In if (value %in% columns) { :
  the condition has length > 1 and only the first element will be used

Have I taken any wrong step or is it a bug.


Answer:

According to the ggbiplot documentation, the label.label= parameter expects the column names from which to pull the names; it does not expect a vector of names. Same goes for loadings.label.label=. (ggplot and most tidyverse functions don't like rownames very much -- better to make them a proper column)

scrs$ID <- rownames(scrs)
loads$ID <- rownames(loads)
ggfortify::ggbiplot(scrs, loads, 
  label.label = "ID", asp = 1, label = T, label.size = 3,
  loadings = T, loadings.label = T, loadings.label.label = "ID")

Question:

I would like to mention the count of each discrete values in the legend of a PCA plot I made with ggbiplot (I used ggbiplot because it's simple to draw ellipses).

My code:

# dummy dataset
df <- iris[-5]

# the labels to display in the legend
count_legend <- paste0(levels(iris$Species)," (", table(iris$Species),")")

# the PCA plot
ggbiplot(prcomp(df), obs.scale = 1, var.scale = 1, groups=iris$Species, ellipse=T, varname.size=0, var.axes=F) +
scale_fill_discrete(labels=count_legend)

The scale_fill_discrete part usually works with other type of plots (e.g. barplot, etc), but not with ggbiplot.


Answer:

do you want something like this?

# the PCA plot
ggbiplot(prcomp(df), obs.scale = 1, var.scale = 1, groups=iris$Species, ellipse=T, varname.size=0, var.axes=F)
+scale_color_discrete(labels=count_legend)

Question:

So I do a PCA analysis, and I usually plotted the results with ggplot2, but I just recently discovered ggbiplot which can show arrows with the variables.

ggbiplot seems to be working ok, though it shows some problems (like the imposibility of changing point size, hence the whole layer thing I do in the MWE).

The problem I am facing now is that, while ggplot2 plots adjust the plot width to the plotting area, ggbiplot does not. With my data, the ggbiplot is horribly narrow and leaves horribly wide vertical margins, even though it expands the same x axis interval as the ggplot2 plot (it is, in fact, the same plot).

I am using the iris data here, so I had to make the png width extra large so the problem I am facing becomes evident. Please check the MWE below:

data(iris)
head(iris)
pca.obj <- prcomp(iris[,1:4],center=TRUE,scale.=TRUE)
pca.df <- data.frame(Species=iris$Species, as.data.frame(pca.obj$x))
rownames(pca.df) <- NULL
png(filename="test1.png", height=500, width=1000)
print(#or ggsave()
  ggplot(pca.df, aes(x=PC1, y=PC2)) +
  geom_point(aes(color=Species), cex=3)
)
dev.off()
P <- ggbiplot(pca.obj,
         obs.scale = 1, 
         var.scale=1,
         ellipse=T,
         circle=F,
         varname.size=3,
         groups=iris$Species, #no need for coloring, I'm making the points invisible
         alpha=0) #invisible points, I add them below
P$layers <- c(geom_point(aes(color=iris$Species), cex=3), P$layers) #add geom_point in a layer underneath (only way I have to change the size of the points in ggbiplot)
png(filename="test2.png", height=500, width=1000)
print(#or ggsave()
    P
)
dev.off()

This code produces the following two images.

ggplot2 output (desired plot width):

ggbiplot output (plot too narrow for plotting area):

See how, while ggplot2 adjusts the plot width, to the plot area, ggbiplot does not. With my data, the ggbiplot plot is extremely narrow and leaves large vertical margins.

My question here is: How to make ggbiplot behave as ggplot2? How can I adjust the plot width to my desired plotting area (png size) with ggbiplot? Thanks!


Answer:

Change the ratio argument in coord_equal() to some value smaller than 1 (default in ggbiplot()) and add it to your plot. From the function description: "Ratios higher than one make units on the y axis longer than units on the x-axis, and vice versa."

P + coord_equal(ratio = 0.5)

NOTE: as @Brian noted in the comments, "changing the aspect ratio would bias the interpretation of the length of the principal component vectors, which is why it's set to 1."

Question:

I am doing a PCA on plots in 2 habitat types in which I collected data on multiple environmental variables. I was able to change the colors of the points from the ggbiplot defaults. I want the size of each point to depend on canopy cover in that plot, and I was able to do that by:

point.size = df$canopy.cover * 0.1  

where df$canopy.cover values range from 0-100 and 0.1 because a canopy cover of 100%=point size 10.

The problem: I can't maintain the colors associated with the two groups after changing the size of points. Using the following pseudo-data:

env.vars<-data.frame(replicate(5,sample(0:10,20,rep=TRUE)))

cover<-c(89, 92, 72, 53, 88, 89, 71, 83, 71, 66, 23, 30,  5, 15, 57, 54,0, 23, 9, 16)

habitat<-c("habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2")

point.size<-cover*0.1

nest.env.pca <- prcomp(env.vars, center = TRUE, scale. = TRUE) 

g <- ggbiplot(nest.env.pca, obs.scale = 1, var.scale = 1, 
     group = habitat, ellipse = TRUE, 
     circle = TRUE, varname.size=3)+
     scale_colour_manual(values=c("blue", "red")) +
     geom_point(size=point.size)

print(g)

I get something similar to:

When I replace:

geom_point(size=point.size)

with:

geom_point(aes(colour=habitat), size=point.size)

as per ggbiplot - change the point size I get the following following error:

Error: Incompatible lengths for set aesthetics: size

Any suggestions? Thanks.

EDIT: some PseudoData to try it out with:


Answer:

This is a general ggplot2 syntax issue, I think, as it occurs with simpler examples (e.g., ggplot(mtcars) + geom_point(aes(disp, hp, color=cyl), size =mtcars$model))

But, I think this will work for you:

geom_point(aes(X1, X2, color=habitat, size = point.size)) + scale_size_identity()

So the full answer with the code in the question is (from Emilio's comment below):

env.vars<-data.frame(replicate(5,sample(0:10,20,rep=TRUE)))

cover<-c(89, 92, 72, 53, 88, 89, 71, 83, 71, 66, 23, 30,  5, 15, 57, 54,0, 23, 9, 16)

habitat<-c("habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2", "habitat1", "habitat2")

point.size<-cover*0.1

nest.env.pca <- prcomp(env.vars, center = TRUE, scale. = TRUE) 

g <- ggbiplot(nest.env.pca, obs.scale = 1, var.scale = 1, 
     group = habitat, ellipse = TRUE, 
     circle = TRUE, varname.size=3)+
     scale_colour_manual(values=c("blue", "red")) +
     geom_point(aes(color=habitat, size = point.size)) + scale_size_identity() 

print(g)