Single Cell Publication - Data Plotting

Author(s): Helena Rasche
Time estimation: 1 hour
Level: Advanced
Published: Nov 7, 2024
Last modification: Nov 7, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00470
Revision: 1
Best viewed in RStudio

This tutorial is available as an RMarkdown file and best viewed in RStudio! You can load this notebook in RStudio on one of the UseGalaxy.* servers.

Launching the notebook in RStudio in Galaxy

  1. Instructions to Launch RStudio
  2. Access the R console in RStudio (bottom left quarter of the screen)
  3. Run the following code:
    download.file("https://training.galaxyproject.org/training-material/topics/contributing/tutorials/meta-analysis-plot/contributing-meta-analysis-plot.Rmd", "contributing-meta-analysis-plot.Rmd")
    download.file("https://training.galaxyproject.org/training-material/assets/css/r-notebook.css", "gtn.css")
    
  4. Double click the RMarkdown document that appears in the list of files on the right.

Downloading the notebook

  1. Right click this link: tutorial.Rmd
  2. Save Link As...

Alternative Formats

  1. This tutorial is also available as a Jupyter Notebook (With Solutions) or a Jupyter Notebook (Without Solutions)
Hands-on: Learning with RMarkdown in RStudio

Learning with RMarkdown is a bit different from what you might be used to. Instead of copying and pasting code from the GTN into a document, you'll be able to run the code directly as it was written, inside RStudio! You can focus on just the code and the reading, all within RStudio.

  1. Load the notebook if you have not already, following the tip box at the top of the tutorial

    Screenshot of the Console in RStudio. There are three lines visible of not-yet-run R code with the download.file statements which were included in the setup tip box.

  2. Open it by clicking on the .Rmd file in the file browser (bottom right)

    Screenshot of Files tab in RStudio, here there are three files listed, a data-science-r-dplyr.Rmd file, a css and a bib file.

  3. The RMarkdown document will appear in the document viewer (top left)

    Screenshot of an open document in RStudio. There is some yaml metadata above the tutorial showing the title of the tutorial.

You’re now ready to view the RMarkdown notebook! Each notebook starts with a lot of metadata about how to build the notebook for viewing, but you can ignore this for now and scroll down to the content of the tutorial.

You can switch to the visual mode, which is much easier to read: just click on the gear icon and select Use Visual Editor.

Screenshot of dropdown menu after clicking on the gear icon. The first option is `Use Visual Editor`.

You’ll see codeblocks scattered throughout the text, and these are all runnable snippets that appear like this in the document:

Screenshot of the RMarkdown document in the viewer, a cell is visible between markdown text reading library tidyverse. It is slightly more grey than the background region, and it has a run button at the right of the cell in a contextual menu.

And you have a few options for how to run them:

  1. Click the green arrow
  2. Press Ctrl+Enter
  3. Use the Run menu at the top to run all chunks

    Screenshot of the run dropdown menu in R, the first item is run selected lines showing the mentioned shortcut above, the second is run next chunk, and then it also mentions a 'run all chunks below' and 'restart r and run all chunks' option.

When you run cells, the output will appear below in the Console. RStudio essentially copies the code from the RMarkdown document to the console and runs it, just as if you had typed it out yourself!

Screenshot of a run cell, its output is included below in the RMarkdown document and the same output is visible below in the console. It shows a log of loading the tidyverse library.

One of the best features of RMarkdown documents is that they include a very nice table browser which makes previewing results a lot easier! Instead of needing to use head every time to preview the result, you get an interactive table browser for any step which outputs a table.

Screenshot of the table browser. Below a code chunk is a large white area with two images, the first reading 'r console' and the second reading 'tbl_df'. The tbl_df is highlighted like it is active. Below that is a pretty-printed table with bold column headers like name and genus and so on. At the right of the table is a small arrow indicating you can switch to seeing more columns than just the initial three. At the bottom of the table is 1-10 of 83 rows written, and buttons for switching between each page of results.

We'll use some ggplot and tidyverse code to plot the data we collected in part 1.

Agenda

In this tutorial, we will cover:

  1. Data Cleaning
  2. Plot: Lines added/removed by date/time period
  3. Contributions over time
  4. New X over Time
  5. Pageviews

We'll use tidyverse (which includes packages like magrittr (providing %>%) and ggplot2) to load our data. reshape2 provides the cast/melt functions, which can be used to reshape specific datasets into formats that are easier to plot.

library(tidyverse)
library(reshape2)
library(lubridate) # for floor_date(); attached with recent tidyverse versions, but loaded explicitly here
# This path will probably need to be changed depending on where you downloaded that dataset.
data = read_tsv("sc.tsv")
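
If you'd like a quick sanity check of what was loaded, an optional peek at the column names, types and first few rows is enough (the columns used below are num, class, additions, deletions and mergedAt):

# Optional: inspect the columns and the first few rows of the loaded data
glimpse(data)
data %>% head()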

Data Cleaning

Let’s start by cleaning our data a bit. We’ve got a couple problems with it:

  • there might be some null mergedAt dates, which we'll want to remove
  • we probably want to ignore a couple of classes of datasets that were added/removed (images, and the explicit 'ignore' type).

clean = data %>% 
	select(num, class, additions, deletions, mergedAt) %>% 
	filter(!is.na(mergedAt)) %>% 
	group_by(num, class, mergedAt, month=floor_date(mergedAt, 'month'), quarter=floor_date(mergedAt, 'quarter')) %>% 
	summarise(additions=sum(additions), deletions=-sum(deletions)) %>%
	filter(class != "ignore") %>%
	filter(class != "image") %>%
	arrange(mergedAt) %>%
	as_tibble()

We'll set up a 'theme' for our plots that mainly consists of using the black and white theme, which is quite elegant and readable, and then making some font sizes a wee bit larger:

theme = theme_bw() + theme(
  axis.text=element_text(size=14), 
  plot.title=element_text(size=18), 
  axis.title=element_text(size=16),
  legend.title=element_text(size=14),
  legend.text=element_text(size=14))

We want all of our plots to look the same, which is why we use this trick: it helps us keep a consistent aesthetic without having to re-type the configuration every single time.
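
For example, any later plot can pick up this shared styling simply by adding + theme; here is a minimal sketch using the clean tibble from above:

# Minimal sketch: reuse the shared theme object on an arbitrary plot
clean %>%
  ggplot(aes(x=month, y=additions)) +
  geom_col() +
  theme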

Let’s get to plotting!

Plot: Lines added/removed by date/time period

We have a dataset that looks like:

clean %>% head(10)

We'll want to plot the changes versus time (either quarter or month), and then maybe plot them differently based on the class of the addition/removal. That translates into an aesthetics statement like aes(x=time, y=additions, fill=class).
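
To convince ourselves this mapping makes sense before building the full plot, here is a minimal sketch using the clean tibble from above, with quarters on the x axis:

# Minimal sketch: the aesthetic mapping described above, in its simplest form
clean %>%
  ggplot(aes(x=quarter, y=additions, fill=class)) +
  geom_col()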

In ggplot2 you can provide the data in a couple of different ways. If you provide the dataset upfront, you'll generally write something like data %>% ggplot(aes(x=a, y=b)) + geom_something(), and geom_something() will take the data from the ggplot() call. However, if you want to plot multiple series, you can also provide the data directly to the geom_* functions like so:

ggplot() +
  geom_col(data=clean, aes(x=quarter, y=additions, fill=class)) +
  geom_col(data=clean, aes(x=quarter, y=deletions, fill=class)) + scale_fill_brewer(palette = "Paired") +
  geom_point(data=clean, aes(x=mergedAt, y=0), shape=3, alpha=0.3, color="black") +
  theme +
  xlab("Quarter") + ylab("Lines added or removed") + guides(fill=guide_legend(title="Category")) +
  ggtitle("Lines added or removed by file type across GTN Single Cell Learning Materials")
ggsave("sc-lines-by-quarter.png", width=12, height=5)

ggplot() +
  geom_col(data=clean, aes(x=month, y=additions, fill=class)) + 
  geom_col(data=clean, aes(x=month, y=deletions, fill=class)) + scale_fill_brewer(palette = "Paired") +
  theme +
  geom_point(data=clean, aes(x=mergedAt, y=0), shape=3, alpha=0.3, color="black") +
  xlab("Month") + ylab("Lines added or removed") + guides(fill=guide_legend(title="Category")) +
  ggtitle("Lines added or removed by file type across GTN Single Cell Learning Materials") 
ggsave("sc-lines-by-month.png", width=12, height=5)

The resulting plot shows very inconsistent lines added by quarter: some quarters have far more activity than others, but it is mostly workflow and then tutorial lines being added or removed.

In this case we plotted the data for additions and deletions separately, and we additionally added points based on the actual date of the PRs to visualise their density.

Let's also produce Wendi Bacon's favourite running sum plots. We'll start by reshaping our data. Currently we have data that looks like:

time       variable  value
today      measure1      1
today      measure2     10
yesterday  measure1     30
yesterday  measure2      5

And we’ll re-shape this to look like this, which will make it easier to calculate changes over the course of a specific series:

time       measure1  measure2
today             1        10
yesterday        30         5
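
As a self-contained sketch of that reshape, the toy tables above can be converted back and forth with reshape2 (dcast to widen, melt to return to the long shape):

# Toy data in the "long" shape shown above
toy = tibble(
  time=c("today", "today", "yesterday", "yesterday"),
  variable=c("measure1", "measure2", "measure1", "measure2"),
  value=c(1, 10, 30, 5)
)

# Widen: one row per time value, one column per measure
wide = toy %>% dcast(time ~ variable, value.var="value")
wide

# melt() turns it back into the long shape
wide %>% melt(id.var="time")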

We'll use the dcast function to do exactly that on our real data:

clean %>% select(month, class, additions)  %>% dcast(month ~ class, value.var="additions") %>% head()

That doesn't look quite right, so let's change how the data is aggregated:

# cumulative
clean %>% select(month, class, additions)  %>% dcast(month ~ class, value.var="additions", fun.aggregate = sum)

Let’s do it for real now:

cumulative = clean %>% select(month, class, additions)  %>% 
  dcast(month ~ class, value.var="additions", fun.aggregate = sum) %>%
  mutate(across(bibliography:workflows, cumsum)) %>% 
  reshape2::melt(id.var="month")

cumulative %>% ggplot(aes(x=month, y=value, color=variable)) + geom_line() + 
  theme_bw() + theme +
  xlab("Month") + ylab("Lines added") + guides(color=guide_legend(title="Category")) +
  ggtitle("Cumulative lines added by category, across GTN Single Cell materials")
ggsave("sc-lines-cumulative.png", width=12, height=8)

In the resulting cumulative plot, workflows and tutorials are the highest categories, on the order of 25k lines added. The curves used to be quite jagged, with code being thrown over the wall, but are now rather smooth, reflecting consistent, constant updates.

Contributions over time

Let's again start with some cleaning, namely removing all rows with 0 records, and removing future records (at the time of writing).

roles = read_tsv("sc-roles.tsv")
w = roles %>% 
  filter(count != 0) %>% 
  filter(!grepl("2025", date)) %>%
  filter(!grepl("2024-12-01", date))

w %>% 
  ggplot(aes(x=date, y=count, color=area)) + 
  theme +
  xlab("Date") + ylab("Unique Contributors") + guides(color=guide_legend(title="Contribution Area")) +
  ggtitle("Contributors over time to GTN Single Cell Materials") +
  geom_line()
ggsave("sc-contribs.png", width=12, height=6)

The resulting plot shows several lines increasing over time in a stepwise fashion; authorship remains the highest contribution area.

New X over Time

Let's plot all of the new single cell things added over time, i.e. all the new FAQs, tutorials, slides, and so on:

added_by_time = read_tsv("single-cell-over-time.tsv")
added_by_time %>% dcast(date ~ `type`) %>% 
  arrange(date) %>% 
  mutate(across(event:workflow, cumsum)) %>% 
  melt(id.var="date") %>% 
  as_tibble() %>% arrange(date) %>% 
  ggplot(aes(x=date, y=value, color=variable)) + geom_line() + 
  theme_bw() + theme +
  xlab("Date") + ylab("New Single Cell Items") + guides(color=guide_legend(title="Contribution Type")) +
  ggtitle("New Single Cell events, FAQs, news, slides, tutorials, videos and workflows in the GTN")
ggsave("sc-files-cumulative.png", width=6, height=6)

We may also want to know how many changes there have been since the start date of this study, e.g. October 1st, 2020:

since_oct = added_by_time %>% dcast(date ~ `type`) %>% 
  arrange(date) %>% 
  mutate(across(event:workflow, cumsum)) %>% 
  filter(date > as.Date("2020-10-01"))

# Pull out the first/last date as our start and end
start_date = (since_oct %>% head(n=1))$date
end_date = (since_oct %>% tail(n=1))$date

# Table of our changes.
since_oct %>% 
  filter(date == start_date | date == end_date) %>%  # Just those rows
  mutate(date = case_when(date == start_date ~ 'start', date == end_date ~ 'end')) %>% # Relabel the start/end as literal string start and end
  pivot_longer(-date) %>% pivot_wider(names_from=date, values_from=value) %>% # Transpose the data
  mutate(increase=end - start) %>% # And calculate our increase
  select(name, increase) %>% arrange(-increase)

Pageviews

GTN uses the Galaxy Europe Plausible server for collecting metrics (you can change your preferences in your GTN privacy preferences). We can download the data from the server and plot it, filtering by our preferred start/end dates and by page (namely that the page includes /topics/single-cell/). Unfortunately the data is downloaded as a zip file, which we'll then need to extract data from:

system("wget 'https://plausible.galaxyproject.eu/training.galaxyproject.org/export?period=custom&date=2024-11-07&from=2022-08-01&to=2024-11-07&filters=%7B%22page%22%3A%22~%2Ftopics%2Fsingle-cell%2F%22%7D&with_imported=true&interval=date' -O sc-stats.zip")

We can use the unzip function to read a single file directly from the zip:

views = read_csv(unzip("sc-stats.zip", "visitors.csv"))

With that we're ready to plot. We're going to use a new feature for our plot: annotate. Annotation allows you to draw arbitrary features atop your plot; in this case we're going to draw rectangles to indicate outages and events that might have affected our data.

y = 3300 # upper edge (ymax) for the annotation rectangles
xoff = 3 # x offset, in days, used to nudge some of the annotation rectangles
views %>% group_by(date=floor_date(date, 'week')) %>% 
  summarise(date, visitors=sum(visitors), pageviews=sum(pageviews)) %>% 
  filter(visitors != 0 | pageviews != 0) %>%
  melt(id.var="date")  %>% 
  ggplot(aes(x=date, y=value, color=variable)) + geom_line() + 
  annotate("rect", xmin = as.Date("2023-10-01"), xmax = as.Date("2023-10-27"), ymin = 0, ymax = y,  alpha = .2) + # Outage
  annotate("rect", xmin = as.Date("2023-05-22") - xoff, xmax = as.Date("2023-05-26") - xoff, ymin = 0, ymax = y,  alpha = .2) + # Smorg3
  annotate("rect", xmin = as.Date("2024-10-07") - xoff, xmax = as.Date("2024-10-11") - xoff, ymin = 0, ymax = y,  alpha = .2) + # GTA
  annotate("rect", xmin = as.Date("2024-09-16") - xoff, xmax = as.Date("2024-09-20") - xoff, ymin = 0, ymax = y,  alpha = .2) + # Bootcamp
  theme_bw() + theme +
  scale_y_continuous(expand = c(0, 0), limits = c(0, NA)) +
  xlab("Date") + ylab("Count") + guides(color=guide_legend(title="Metric")) +
  ggtitle("GTN Single Cell Visits")
ggsave("sc-pageviews.png", width=14, height=4)

And there we have our lovely pageview plot!

The pageviews over time are relatively stable, except for the large annotated outage and a large spike around the bootcamp date.

Discuss this with us and we can perhaps generalise this analysis to reduce the amount of data processing you need to do, and make it more accessible for everyone!