archivist.github

Have you ever suffered because it was impossible to reproduce graphs, tables or analysis results in R? Have you ever been frustrated at not being able to share R objects (e.g. plots or final analysis models) within your reports, posters or articles? Or maybe you simply have too many objects to store in a convenient and handy way? Now you can share partial results of an analysis, provide hooks to valuable R objects within articles, manage analysis results and restore objects' pedigree with the archivist package and its extension archivist.github. All automatically through GitHub, without closing RStudio. If you are tired of archiving results by yourself, then read this tutorial.

1 Introduction

Open science needs not only reproducible research but also accessible final and partial results.

library(archivist.github)
# the described functionalities are implemented in version 2.0 of archivist and 0.1 of archivist.github
# devtools::install_github('MarcinKosinski/archivist.github') 
# install.packages('archivist.github')

The archivist is an R package for data analysis results management, which helps in managing, sharing, storing, linking and searching for R objects. The archivist package automatically retrieves the object’s meta-data and creates a rich structure that allows for easy management of calculated R objects. It also extends the reproducible research paradigm by creating new ways to retrieve and validate previously calculated objects.
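
For instance, a single hook of the form user/repo/md5hash is enough to restore an archived artifact on any machine. A minimal sketch (the md5hash below is one of the lm models from the pbiecek/graphGallery Repository explored later in this tutorial):

library(archivist)
# read an archived linear model directly from a public GitHub Repository
model <- aread('pbiecek/graphGallery/2a6e492cb6982f230e48cf46023e2e4f')
summary(model)$r.squared # the retrieved object behaves like any other lm object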

This use case describes how archivist can be integrated with GitHub so that one can share (partial) results of an analysis in a simpler, more automatic way. GitHub is a platform on which collaborators can share their analysis code, figures and reports. It is also possible to share R objects which are crucial to the analysis, whose calculations took a great amount of time or which required special software. Such objects are referred to as artifacts. One might be interested only in a partial or final result, without executing the whole analysis, which sometimes is impossible due to missing software or changes in R packages, or simply takes too much time. If you are not using any version control system (like Git, hence GitHub), then get motivated here and here.

The archivist is a tool that makes sharing R objects more convenient and transparent. Such objects can even be added to StackOverflow questions to make providing reproducible examples easier (like here).

2 Working with GitHub repository

GitHub API OAuth open authorization and the archivist.github functions that are integrated with GitHub are described in an R documentation page accessed with ?agithub. The information below provides a broader explanation. The new functions from the archivist.github extension can be seen in the workflow below.

(Workflow diagram: archivist.github)

2.1 OAuth open authorization

To start sharing code and analysis results on GitHub, a data scientist first needs to create a repository on GitHub. It can be done manually under this link https://github.com/new, or one can use the createGitHubRepo() function to do it automatically. The createGitHubRepo() function integrates with the GitHub API, which enables performing operations on GitHub with simple curl ("see URL") requests (easy for a data scientist with an IT background, but not so obvious for others). If you haven't worked with APIs before and are wondering why they are so important and broadly used, then have a look here.
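
Under the hood this boils down to a single authorized HTTP request to the GitHub API. A rough sketch of the equivalent raw request with httr (assuming a github_token created as described below in this subsection):

library(httr)
# POST /user/repos creates a repository for the authenticated user
POST("https://api.github.com/user/repos",
     config(token = github_token), # the OAuth token created below
     body = list(name = "Gallery"),
     encode = "json")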

Working with the GitHub API requires creating a simple developer application (you can create it under this link https://github.com/settings/applications/new) which will be used to authenticate your curl requests (created via the httr package). This needs to be done only once and can be reused in future work. When the application is created, copy its Client ID and Client Secret and authorize your computer with this application by running

library(httr)
myapp <- oauth_app("github",
                   key = app_key,       # Client ID of the application created above
                   secret = app_secret) # Client Secret of the application created above
github_token <- oauth2.0_token(oauth_endpoints("github"),
                               myapp,
                               scope = c("public_repo",
                                         "delete_repo"))
# or use the wrapper
authoriseGitHub(ClientID, ClientSecret)

The above commands create a github_token that uses the OAuth open authorization system.

OAuth allows notifying a resource provider (e.g. Facebook) that the resource owner (e.g. you) grants permission to a third-party (e.g. a Facebook Application) access to their information (e.g. the list of your friends).

More about OAuth is explained in that StackOverflow answer.

The scope parameter in the oauth2.0_token function lets you specify exactly what type of access you need. Scopes limit access for OAuth tokens. They do not grant any additional permissions beyond those the user already has. In this example we requested read/write access to code, commit statuses, collaborators, and deployment statuses for public repositories and organizations, which is also required for starring public repositories (public_repo). We also requested access to delete adminable repositories (delete_repo). More possible values of scope for the GitHub API OAuth token can be found in this table.

2.2 Create or clone your repository

When the github_token is created, one can set it as a global github_token visible to most archivist functions with

aoptions("github_token", github_token)
<Token>
<oauth_endpoint>
 authorize: https://github.com/login/oauth/authorize
 access:    https://github.com/login/oauth/access_token
<oauth_app> github
  key:    1fab1e77d27079c0717d
  secret: <hidden>
<credentials> access_token, scope, token_type
---
aoptions("user", user) # user = 'MarcinKosinski'
[1] "MarcinKosinski"
invisible(aoptions("password", password))

What is more, one can specify GitHub's user.name and user.password globally, so that future integration with Git/GitHub (i.e. commits or pushes) can be performed (thanks to the great git2r package).

One can create a GitHub repository containing an archivist-like Repository of artifacts with

createGitHubRepo(repo = "Gallery", default = TRUE) 
[1] "MarcinKosinski"
# -> https://github.com/MarcinKosinski/Gallery

A Repository is a folder with an SQLite database stored in a file named backpack.db and a subdirectory named gallery with a collection of objects saved as .rda files. To learn more about it, visit the archivist Wiki or run ?archivist::Repository in the R console.
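
A quick way to see this structure for the freshly created Repository (a sketch, assuming the Local Repository was created in a directory called Gallery, as above):

# list the content of the Local Repository created above
list.files("Gallery", recursive = TRUE)
# backpack.db plus, after archiving, gallery/<md5hash>.rda files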

The default = TRUE option sets the Repository under https://github.com/user.name/repo as the default Repository for future archivist.github functions. This way one won't need to pass additional parameters to every function call.

If one already has a GitHub repository with an archivist Repository, it can be cloned and set as the default.

# eval = FALSE - this line is not evaluated in this tutorial
cloneGitHubRepo(repoURL = 'https://github.com/MarcinKosinski/Museum/',
                repoDir = any_local_path_or_NULL,
                default = TRUE)

When the option default = TRUE is used in the above functions, the global parameters repo and user become visible to the archivist.github functions. One can check whether a parameter is set globally for the archivist functions with

aoptions('repo')
[1] "Gallery"
aoptions('user')
[1] "MarcinKosinski"

The globally set Repository stored on GitHub is synchronized with a Local Repository whose location can be set (or is already set when default = TRUE) with the repoDir parameter

aoptions('repoDir')
[1] "Gallery"

If a user did not specify the repoDir argument during the Repository creation with createGitHubRepo(), it is by default set to the same value as the repo parameter.

2.3 Archiving and exploring example

Now, when the GitHub and archivist Repositories are created, one can archive and share artifacts (crucial objects) with the archive function, which also allows embedding a hook to the artifact in the report. We are creating an .html report in rmarkdown, so the default markdown formatting for hooks is used. For \(\LaTeX\)-like hooks use format = "latex".

Let us prepare a linear model with iris data

iris.lm.model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                    data = iris)

and let’s archive the desired model automatically to the created Repository (https://github.com/MarcinKosinski/Gallery).

# results = 'asis' - so that a text can be understood as a URL
archive(iris.lm.model, alink = TRUE)

archivist::aread('MarcinKosinski/Gallery/caab8f8e72045c93f92091aebcb74f2d')

The archive function created a hook to the artifact. One can download this shared artifact by clicking the link or by copying and pasting its value into the R console. It is a great way to provide a hook to figures in posters or publications that are really R objects (such as objects of class ggplot). The object named iris.lm.model was archived to the GitHub (and synchronized Local) Repository. This is the final implementation of the archive prototype we presented at the BI FORUM conference in Budapest in Oct 2015.
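
If the report were compiled to \(\LaTeX\)/pdf instead, the same hook could be produced in LaTeX format. A sketch, assuming the globally set user and repo and the format parameter of archivist's alink() helper:

alink('caab8f8e72045c93f92091aebcb74f2d', format = "latex")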

One can check that the artifact is really on GitHub with

showRemoteRepo(repo = 'Gallery', user = 'MarcinKosinski')
                           md5hash                             name         createdDate
1 caab8f8e72045c93f92091aebcb74f2d                    iris.lm.model 2016-02-24 19:12:15
2 b0f870835c546853b70f921f57e100b4 b0f870835c546853b70f921f57e100b4 2016-02-24 19:12:15
3 7a761a2ae54f3d90060a9f6ca04b3506 7a761a2ae54f3d90060a9f6ca04b3506 2016-02-24 19:12:15
# showRemoteRepo() would also work since `user` and `repo` are set globally

The first row corresponds to the archived artifact; the other rows correspond to the session info and the data extracted from the artifact, which are archived alongside it. The md5hash column gives the MD5 hash (see ?digest::digest) assigned to each artifact. Artifacts are archived along with a special attribute named md5hash. For each artifact, the md5hash is a unique string of length 32 produced by the digest() function from the digest package, which uses the MD5 cryptographic hash algorithm. The md5hash of each artifact archived to the Repository is saved in the Repository along with the artifact's Tags (see Tags on the archivist Wiki). It makes it possible to distinguish objects in the Repository and facilitates searching for and loading them.
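
Because the md5hash is simply the MD5 digest of the artifact, it can be recomputed locally and used to address the artifact in the Repository, for example

digest::digest(iris.lm.model)
[1] "caab8f8e72045c93f92091aebcb74f2d"
aread(paste0('MarcinKosinski/Gallery/', digest::digest(iris.lm.model)))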

Each artifact can be archived with its unique Tags, which are attributes of the artifact. They can be the artifact's name, class or archiving date. Furthermore, different Tags are available for different artifact classes. Let us archive one more artifact to explain how various Tags can be extracted during archiving, and let us also specify our own userTags.

# results = 'asis' - so that a text can be understood as a URL
iris.lm.model.smry <- summary(iris.lm.model)
archive(iris.lm.model.smry, alink = TRUE, 
        userTags = paste0("summaryOf:", digest::digest(iris.lm.model)))

archivist::aread('MarcinKosinski/Gallery/067896394acf47242b30b07ac300dc48')

One can check what Tags have been extracted so far

Sys.sleep(300) 
# After a commit GitHub sometimes does not react immediately,
# so we need to give it some time.
# Immediate access depends on github.com performance
showRemoteRepo(method = "tags")[, -3]
                           artifact                                           tag
1  caab8f8e72045c93f92091aebcb74f2d                                    format:rda
2  caab8f8e72045c93f92091aebcb74f2d                            name:iris.lm.model
3  caab8f8e72045c93f92091aebcb74f2d                                      class:lm
4  caab8f8e72045c93f92091aebcb74f2d                          coefname:(Intercept)
5  caab8f8e72045c93f92091aebcb74f2d                          coefname:Sepal.Width
6  caab8f8e72045c93f92091aebcb74f2d                         coefname:Petal.Length
7  caab8f8e72045c93f92091aebcb74f2d                          coefname:Petal.Width
8  caab8f8e72045c93f92091aebcb74f2d                                        rank:4
9  caab8f8e72045c93f92091aebcb74f2d                               df.residual:146
10 caab8f8e72045c93f92091aebcb74f2d                      date:2016-02-24 19:12:15
11 b0f870835c546853b70f921f57e100b4                                    format:rda
12 caab8f8e72045c93f92091aebcb74f2d session_info:b0f870835c546853b70f921f57e100b4
13 7a761a2ae54f3d90060a9f6ca04b3506                                    format:rda
14 7a761a2ae54f3d90060a9f6ca04b3506                                    format:txt
15 7a761a2ae54f3d90060a9f6ca04b3506 relationWith:caab8f8e72045c93f92091aebcb74f2d
16 caab8f8e72045c93f92091aebcb74f2d                                    format:txt
17 067896394acf47242b30b07ac300dc48                                    format:rda
18 067896394acf47242b30b07ac300dc48                       name:iris.lm.model.smry
19 067896394acf47242b30b07ac300dc48                              class:summary.lm
20 067896394acf47242b30b07ac300dc48                                  sigma:0.3145
21 067896394acf47242b30b07ac300dc48                                          df:4
22 067896394acf47242b30b07ac300dc48                                        df:146
23 067896394acf47242b30b07ac300dc48                                          df:4
24 067896394acf47242b30b07ac300dc48                                    R^2:0.8586
25 067896394acf47242b30b07ac300dc48                           adjusted R^2:0.8557
26 067896394acf47242b30b07ac300dc48                              fstatistic:295.5
27 067896394acf47242b30b07ac300dc48                               fstatistic.df:3
28 067896394acf47242b30b07ac300dc48                             fstatistic.df:146
29 067896394acf47242b30b07ac300dc48                      date:2016-02-24 19:12:48
30 067896394acf47242b30b07ac300dc48    summaryOf:caab8f8e72045c93f92091aebcb74f2d
31 dfb4355fdd4baa980368a13c4bf8ef3f                                    format:rda
32 067896394acf47242b30b07ac300dc48 session_info:dfb4355fdd4baa980368a13c4bf8ef3f

or in a more convenient form for dplyr grouping and aggregation operations (thanks to @eliotmcintire for the suggestion and @wchodor for the implementation - issue).

splitTagsRemote(repo = 'Gallery', user = 'MarcinKosinski')[,-4]
                           artifact        tagKey                         tagValue
1  caab8f8e72045c93f92091aebcb74f2d        format                              rda
2  caab8f8e72045c93f92091aebcb74f2d          name                    iris.lm.model
3  caab8f8e72045c93f92091aebcb74f2d         class                               lm
4  caab8f8e72045c93f92091aebcb74f2d      coefname                      (Intercept)
5  caab8f8e72045c93f92091aebcb74f2d      coefname                      Sepal.Width
6  caab8f8e72045c93f92091aebcb74f2d      coefname                     Petal.Length
7  caab8f8e72045c93f92091aebcb74f2d      coefname                      Petal.Width
8  caab8f8e72045c93f92091aebcb74f2d          rank                                4
9  caab8f8e72045c93f92091aebcb74f2d   df.residual                              146
10 caab8f8e72045c93f92091aebcb74f2d          date              2016-02-24 19:12:15
11 b0f870835c546853b70f921f57e100b4        format                              rda
12 caab8f8e72045c93f92091aebcb74f2d  session_info b0f870835c546853b70f921f57e100b4
13 7a761a2ae54f3d90060a9f6ca04b3506        format                              rda
14 7a761a2ae54f3d90060a9f6ca04b3506        format                              txt
15 7a761a2ae54f3d90060a9f6ca04b3506  relationWith caab8f8e72045c93f92091aebcb74f2d
16 caab8f8e72045c93f92091aebcb74f2d        format                              txt
17 067896394acf47242b30b07ac300dc48        format                              rda
18 067896394acf47242b30b07ac300dc48          name               iris.lm.model.smry
19 067896394acf47242b30b07ac300dc48         class                       summary.lm
20 067896394acf47242b30b07ac300dc48         sigma                           0.3145
21 067896394acf47242b30b07ac300dc48            df                                4
22 067896394acf47242b30b07ac300dc48            df                              146
23 067896394acf47242b30b07ac300dc48            df                                4
24 067896394acf47242b30b07ac300dc48           R^2                           0.8586
25 067896394acf47242b30b07ac300dc48  adjusted R^2                           0.8557
26 067896394acf47242b30b07ac300dc48    fstatistic                            295.5
27 067896394acf47242b30b07ac300dc48 fstatistic.df                                3
28 067896394acf47242b30b07ac300dc48 fstatistic.df                              146
29 067896394acf47242b30b07ac300dc48          date              2016-02-24 19:12:48
30 067896394acf47242b30b07ac300dc48     summaryOf caab8f8e72045c93f92091aebcb74f2d
31 dfb4355fdd4baa980368a13c4bf8ef3f        format                              rda
32 067896394acf47242b30b07ac300dc48  session_info dfb4355fdd4baa980368a13c4bf8ef3f
library(dplyr)
splitTagsRemote(repo = 'graphGallery', user = 'pbiecek') %>%
    group_by(tagKey) %>%
    summarise(count = n()) %>%
    arrange(desc(count))
Source: local data frame [13 x 2]

         tagKey count
          (chr) (int)
1       varname  1229
2         class   508
3          date   500
4      coefname   493
5          name   299
6           LHS   233
7           RHS   233
8  relationWith    35
9        format    15
10       labelx     7
11       labely     7
12 session_info     4
13         data     3
library(ggplot2)
splitTagsRemote(repo = 'graphGallery', user = 'pbiecek') %>%
    group_by(tagKey) %>%
    summarise(count = n()) %>%
    arrange(desc(count)) %>%
    ggplot(aes(reorder(tagKey, count, max), count)) +
    geom_bar(stat = "identity") +
    theme_minimal() + xlab('Tags') + ylab('Number of entries with this Tag') +
    ggtitle('Barplot of counts of Tags\' types in pbiecek/graphGallery Repository')

Extracting Tags for a specific artifact can be done with

getTagsRemote(md5hash = digest::digest(iris.lm.model.smry),
              tag = "", user = 'MarcinKosinski', repo = 'Gallery' ) %>% data.frame()
                                               .
1                                     format:rda
2                        name:iris.lm.model.smry
3                               class:summary.lm
4                                   sigma:0.3145
5                                           df:4
6                                         df:146
7                                     R^2:0.8586
8                            adjusted R^2:0.8557
9                               fstatistic:295.5
10                               fstatistic.df:3
11                             fstatistic.df:146
12                      date:2016-02-24 19:12:48
13    summaryOf:caab8f8e72045c93f92091aebcb74f2d
14 session_info:dfb4355fdd4baa980368a13c4bf8ef3f

After the Repository is created and the crucial objects are archived, anyone can explore the public archivist Repository shared on GitHub. Knowing an object's md5hash, one can download it with

loadFromRemoteRepo(md5hash = digest::digest(iris.lm.model.smry),
                   value = TRUE) -> ddl.iris.lm.model.smry
ddl.iris.lm.model.smry$sigma
[1] 0.3145491

or explore the Repository for various objects using their Tags.

When one is interested in all objects of class lm that were created with the Species explanatory variable (with the versicolor factor level) in the pbiecek/graphGallery repository, then one might download them, extract the R-squared statistics and coefficients, bind them into one data frame and sort the rows by the R-squared statistic to find the best model, with the following code

mm <- asearch(patterns = c('class:lm',
                     'coefname:Speciesversicolor'),
        repo = 'pbiecek/graphGallery')

mm %>%
  lapply(function(x) {
    c(r.squared = summary(x)$r.squared,
      x$coef) %>%
      # extract coeffs and R-squared statistic
      t %>%
      as.data.frame # transpose for binding
  }) %>% 
  do.call(dplyr::bind_rows, .) %>% # apply bind_rows to all list elements
  cbind(data.frame(md5hash = names(mm))) %>%
  arrange(r.squared) %>% # arrange rows by r.squared
  unique() %>%
  select(-Sepal.Length) # to fit in the output html
  r.squared (Intercept) Speciesversicolor Speciesvirginica                          md5hash
1 0.6187057    5.006000          0.930000         1.582000 0e213ac68a45b6cd454d06b91f991bc7
2 0.6187057    5.006000          0.930000         1.582000 e58d2f9d50b67ce4d397bf015ec1259c
3 0.9413717    1.462000          2.798000         4.090000 0a82efeb8250a47718cea9d7f64e5ae7
4 0.9413717    1.462000          2.798000         4.090000 378237103bb60c58600fe69bed6c7f11
5 0.9413717    1.462000          2.798000         4.090000 7f11e03539d48d35f7e7fe7780527ba7
6 0.9413717    1.462000          2.798000         4.090000 c1b1ef7bcddefb181f79176015bc3931
7 0.9604273    1.462000          2.507231         3.509429 990861c7c27812ee959f10e5f76fe2c3
8 0.9748944   -1.702342          2.210138         3.090002 2a6e492cb6982f230e48cf46023e2e4f

This might be used to extract the most valuable model (in terms of the R-squared statistic) within the Repository. Imagine a Repository with hundreds of classification models and their potential blends, where each of them is archived with additional Tags describing its performance. This gives a great tool for searching for the best classifier among dozens of models.
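
A hedged sketch of such a search, assuming the classifiers were archived with a hypothetical userTag of the form AUC:value, e.g. archive(model, userTags = paste0('AUC:', auc)):

splitTagsRemote(repo = 'Gallery', user = 'MarcinKosinski') %>%
    filter(tagKey == "AUC") %>%                           # keep only the performance Tags
    mutate(AUC = as.numeric(as.character(tagValue))) %>%  # Tag values are stored as text
    arrange(desc(AUC)) -> classifiers.by.AUC
# download the best classifier by its md5hash
loadFromRemoteRepo(classifiers.by.AUC$artifact[1],
                   repo = 'Gallery', user = 'MarcinKosinski',
                   value = TRUE) -> best.classifier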

If one would like to extract only the md5hashes for given Tags instead of whole artifacts, then the searchInRemoteRepo() function can be used

searchInRemoteRepo(pattern = 'name', fixed = FALSE, repo = 'graphGallery', user = 'pbiecek') %>% length
[1] 51
# return md5hashes of artifacts that have a tag containing a substring `name` 
searchInRemoteRepo(pattern = c('class:ggplot', 'class:lm'),
                        repo = 'graphGallery', user = 'pbiecek',
                        intersect = FALSE) %>% length
[1] 15

Desired artifacts can even be copied to our Local Repository from GitHub

searchInRemoteRepo(pattern = c('class:ggplot', 'class:lm'),
                        repo = 'graphGallery', user = 'pbiecek',
                        intersect = FALSE)[1:3] %>%
    copyRemoteRepo(md5hashes = ., user = 'pbiecek', repo = 'graphGallery',
                   repoTo = 'Gallery')
showLocalRepo(repoDir = 'Gallery')[, -3]
                           md5hash                                    name
1 caab8f8e72045c93f92091aebcb74f2d                           iris.lm.model
2 b0f870835c546853b70f921f57e100b4        b0f870835c546853b70f921f57e100b4
3 7a761a2ae54f3d90060a9f6ca04b3506        7a761a2ae54f3d90060a9f6ca04b3506
4 067896394acf47242b30b07ac300dc48                      iris.lm.model.smry
5 dfb4355fdd4baa980368a13c4bf8ef3f        dfb4355fdd4baa980368a13c4bf8ef3f
6 b6f183dfd0efdc8c33fa89b9038716c5                                      pl
7 b6f183dfd0efdc8c33fa89b9038716c5                                      pl
8 fcd70d55b874201d2bece12f591a2ec4                                      pl
9 4cc15e46b6008f5867a92364fe36e835 qplot(time, arr_delay, data = per_hour)

3 Advanced example with RTCGA

Let's have a look at a more advanced example. Suppose we would like to create a graph similar to this one but with slightly different input data.

aread('MarcinKosinski/coxphSGD/db03267b063709277e50bd4c0c1ddb04') -> survival.egfr.plot
class(survival.egfr.plot)
[1] "tableAndPlot" "list"        
# this object is a result of survMisc:::autoplot.survfit function which mainly
# produces a list containing 2 ggplot objects

survival.egfr.plot$plot <- survival.egfr.plot$plot +
    ggtitle('Survival vs mutation in EGFR gene')  # the previous title was in Polish
survMisc::autoplot(survival.egfr.plot) # plots the survival curves and the risk set table

The plot presents the Kaplan-Meier estimates of the survival curves for patients suffering from cancer, divided into 2 groups: one with a mutation in the EGFR gene (EGFR=1) and the other without it (EGFR=0). The mutation can be a deletion, amplification, etc.

3.1 About RTCGA

The data used to produce the previous plot come from The Cancer Genome Atlas study.

The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

The data can be downloaded through R with the RTCGA package, but most of the useful datasets are already converted and available in the RTCGA family of R data packages.

3.2 Partial results archiving and objects’ pedigree restoration

The main plot of this use case was made from the data of patients suffering from one of the cancer types available in the TCGA study (there are 38 cohorts available to download). Let us prepare such a plot only for patients suffering from Breast invasive carcinoma (Breast Cancer - BRCA), divided into groups defined by the existence of a mutation in the PIK3CA gene and by the expression of the TP53 gene (above and below the median in the BRCA cohort).

To do so, one would need to perform a few data operations such as filtering observations, transforming columns and merging tables. A very convenient set of tools for munging data is dplyr together with the forward-pipe operator from the magrittr package. We have borrowed the %>% forward-pipe operator from magrittr version 1.0.1 and created our own archivist forward-pipe operator %a%, which not only passes the result of the previous function to the next one, but also archives the inputs of all operations, so that one can create a hook to every partial result of the forward-pipe operation/analysis.
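
A minimal sketch of %a% in action (assuming the default Local Repository set earlier, e.g. Gallery): every intermediate result of the pipe is archived, so its pedigree can later be restored with ahistory(), as shown at the end of this subsection.

library(archivist)
library(dplyr)
iris %a%
    filter(Sepal.Length > 6) %a%
    summarise(mean.petal = mean(Petal.Length)) -> iris.petal.mean
# each step above has been archived to the Local Repository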

I am really keen on long forward-pipe operations, so I'll try my best to create a good one, which will join information about the clinical state of a patient with information about the existence of PIK3CA gene mutations (the most common in BRCA) and information about the expression of the TP53 gene (which is believed to be a guardian of the genome).

library(RTCGA.rnaseq); data(BRCA.rnaseq) # information about genes' expressions
library(RTCGA.mutations); data(BRCA.mutations) # information about genes' mutations
library(RTCGA.clinical); data(BRCA.clinical) # patients' clinical data

aoptions('silent', TRUE) # This sets `silent = TRUE` in saveToRepo, which is used by %a%. No warning will be printed about archiving the same artifact or its data twice.
[1] TRUE
BRCA.rnaseq %a%
    select(`TP53|7157`, bcr_patient_barcode) %a%
    # bcr_patient_barcode contains a key to merge patients between various datasets
    rename(TP53 = `TP53|7157`) %a%
    filter(substr(bcr_patient_barcode, 14, 15) == "01" ) %a% 
    # 01 at the 14-15th position tells these are cancer sample
    mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode),1,12)) -> 
    # in clinical info bcr_patient_barcode is only of length 12
    BRCA.rnaseq.TP53

BRCA.mutations %a%
    select(Hugo_Symbol, bcr_patient_barcode) %a%
    # Hugo_Symbol tells to which gene the row corresponds.
    # If a row exists for a gene, there was a mutation in this gene for this patient.
    filter(nchar(bcr_patient_barcode)==15) %a%
    # sometimes this code has an improper length
    filter(substr(bcr_patient_barcode, 14, 15)=="01") %a%   
    # 01 at the 14-15th position tells these are cancer sample
    filter(Hugo_Symbol == 'PIK3CA') %a%
    # we are interested only in the mutations of PIK3CA
    unique() %a% 
    # sometimes there are a few mutations in the same gene
    mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode),1,12)) -> 
    # in clinical info bcr_patient_barcode is only of length 12
    BRCA.mutations.PIK3CA

BRCA.clinical %a%
    select(patient.bcr_patient_barcode,
           patient.vital_status, # whether the patient is still alive
           patient.days_to_last_followup, # how many days the patient has been observed, if alive
           patient.days_to_death) %a% # how many days the patient was observed, if deceased
    mutate(bcr_patient_barcode = toupper(as.character(patient.bcr_patient_barcode))) %a%
    # in clinical datasets the key column is in lower case and with different name
    mutate(status = ifelse(as.character(patient.vital_status) == "dead",1,0),
           times = ifelse( 
                        !is.na(patient.days_to_last_followup),
                        as.numeric(as.character(patient.days_to_last_followup)),
                        as.numeric(as.character(patient.days_to_death))
            )) %a%
    # if the patient does not have a days_to_last_followup time, it means
    # they have a days_to_death time
    filter(!is.na(times)) %a% 
    # sometimes a patient does not have any time
    filter(times > 0) -> BRCA.clinical.survival
    # sometimes, by mistake, patients have non-positive times (a few cases)

BRCA.rnaseq.TP53 %a%
    left_join(y = BRCA.mutations.PIK3CA,
                        by = "bcr_patient_barcode") %a%
    left_join(y = BRCA.clinical.survival,
                        by = "bcr_patient_barcode") %a%
    mutate(TP53_HighExpr = ifelse(TP53 >= median(TP53), "1", "0")) %a%
    mutate(PIK3CA_Mut = as.integer(!is.na(Hugo_Symbol))) %a%
    select(times, status, TP53_HighExpr, PIK3CA_Mut)  -> BRCA.2survfit

And then one can print the artifact's history/pedigree with the ahistory() function, which works in 3 variants

ahistory(BRCA.rnaseq.TP53) # regular format
   env[[nm]]                                                                           [63678e012c5b7f40966c32eec91f828b]
-> select(`TP53|7157`, bcr_patient_barcode)                                            [4a85ce61229dd743b911d7edab0310b3]
-> rename(TP53 = `TP53|7157`)                                                          [103f2b82c41956e9f6437b3a0cd68679]
-> filter(substr(bcr_patient_barcode, 14, 15) == "01")                                 [1da5a026aae19e0d0467ba3773679e28]
-> mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode),     1, 12))  [2001f888c6262e30154688876c91cc50]
# additional chunk options: results='asis'
ahistory(BRCA.mutations.PIK3CA, format = "kable") # uses knitr::kable()
call md5hash
7 env[[nm]] b2c7c1b633de515d59c9af635805ed29
6 select(Hugo_Symbol, bcr_patient_barcode) 9e4936fa0de8ca0ebc50fa310e845473
5 filter(nchar(bcr_patient_barcode) == 15) fc93fb6885a6387d1495ce0a6456d4f6
4 filter(substr(bcr_patient_barcode, 14, 15) == "01") 2cee0b51527271264f70aa91ee4f33e5
3 filter(Hugo_Symbol == "PIK3CA") e9a920ae11aa23e0b842f5e64c446a9d
2 unique() 194214bafe8287c73f44bd660b74e199
1 mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode), 1, 12)) c280e394539be64c73ea022b9ea9fa05
ahistory(BRCA.2survfit, format = "kable", alink = TRUE ) # give hooks to objects
call md5hash
10 env[[nm]] 63678e012c5b7f40966c32eec91f828b
9 select(`TP53|7157`, bcr_patient_barcode) 4a85ce61229dd743b911d7edab0310b3
8 rename(TP53 = `TP53|7157`) 103f2b82c41956e9f6437b3a0cd68679
7 filter(substr(bcr_patient_barcode, 14, 15) == "01") 1da5a026aae19e0d0467ba3773679e28
6 mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode), 1, 12)) 2001f888c6262e30154688876c91cc50
5 left_join(y = BRCA.mutations.PIK3CA, by = "bcr_patient_barcode") 8a615a0954e03d11a0ef2d705c59766e
4 left_join(y = BRCA.clinical.survival, by = "bcr_patient_barcode") a8b097ee7010673b2078e8c4d47079df
3 mutate(TP53_HighExpr = ifelse(TP53 >= median(TP53), "1", "0")) 73eef46c8dfe31a99beb47ee6edfb4f3
2 mutate(PIK3CA_Mut = as.integer(!is.na(Hugo_Symbol))) 7a4625c0521157f81a3db11ddd196779
1 select(times, status, TP53_HighExpr, PIK3CA_Mut) 684d08dfe5d3a6dff389ad210f69aa36
# Note that they are not yet available on GitHub, 
# because the archiving was only to the Local Repository.

3.2.1 Overloading print() function

The final data will be used to plot the Kaplan-Meier estimates of the survival curves with the my.customized.km() function.

tail(BRCA.2survfit)
     times status TP53_HighExpr PIK3CA_Mut
1088  1550      0             1          0
1089   791      0             1          0
1090   292      0             1          0
1091   278      0             0          0
1092  3042      0             1          0
1093  2800      0             0          0
BRCA.2survfit %>%
    select(TP53_HighExpr, PIK3CA_Mut, status) %>%
    table() 
, , status = 0

             PIK3CA_Mut
TP53_HighExpr   0   1
            0 359 118
            1 298 161

, , status = 1

             PIK3CA_Mut
TP53_HighExpr   0   1
            0  35   9
            1  47  13

The my.customized.km() function is based on the survMisc package, which was suddenly moved to the CRAN archive. Its implementation/body is too long to write down here (and unnecessary), but it can be imported into R with

# Instead of the whole md5hash an abbreviation can be specified.
# If more than one md5hash matches this abbreviation, then every corresponding artifact is loaded.
aread('MarcinKosinski/Museum/30efaaedb') -> my.customized.km
# or loadFromRemoteRepo(md5hash = '30efaaedb', user = 'MarcinKosinski', repo = 'Museum')

my.customized.km('times', 'status', c('TP53_HighExpr', 'PIK3CA_Mut'), BRCA.2survfit,
                                 'Survival vs TP53 expression and mutations in PIK3CA') -> km_plot
class(km_plot)
[1] "tableAndPlot" "list"        

One can even overload the print function for a specific class to first perform the archive operation, then extract a hook in a way compatible with the report format, and lastly print the object. The original idea came from here and here.

print.tableAndPlot  <- function(x, ...) {
    saveToRepo(x) -> hash
    cat(alink(hash)) # uses globally set `user` and `repo`
    survMisc:::autoplot.tableAndPlot(x, ...) 
}
# results = 'asis'
archive(print.tableAndPlot, commitMessage = 'print.tableAndPlot function', alink = TRUE)

archivist::aread('MarcinKosinski/Gallery/fcd394ab60e8c14545028ab8f68eb9bc')

# results = 'asis'
print(km_plot)

archivist::aread('MarcinKosinski/Gallery/41eb3f66d56dd9df86554c9ee6022b43')

An easier version of this functionality is described in this Use Case about addHooksToPrint.

3.2.2 Pushing Local Repository to GitHub

One might have noticed that the %a% operator and the saveToLocalRepo() function saved objects only to the Local Repository that is synchronized with GitHub. This differs from the archive function, which archives to the Local and GitHub Repository simultaneously. It is possible to push (a Git command) artifacts that are present only in the Local Repository to its GitHub equivalent with the pushGitHubRepo function

# number of artifacts before push
searchInRemoteRepo(pattern = "name", fixed = FALSE) %>% length
[1] 2
# one can check how many commits have been performed so far
length(jsonlite::fromJSON(rawToChar(GET('https://api.github.com/repos/MarcinKosinski/Gallery/commits')$content))$sha)
[1] 4
pushGitHubRepo() # uses globally set parameters when none are provided
# number of artifacts after push
Sys.sleep(300) # to be sure my request isn't faster than GitHub platform after push
searchInRemoteRepo(pattern = "name", fixed = FALSE) %>% length
[1] 30
# one can check how many commits have been performed so far
length(jsonlite::fromJSON(rawToChar(GET('https://api.github.com/repos/MarcinKosinski/Gallery/commits')$content))$sha)
[1] 5

This operation might be troublesome when another collaborator has pushed their changes to the remote GitHub Repository. Sometimes it's better to first pull (a Git command) changes (new artifacts) from the synchronized GitHub Repository. Working with too many collaborators may result in Git conflicts in the backpack.db file. If you have any ideas or suggestions on how this can be handled, please write in this issue. pullGitHubRepo() and pushGitHubRepo() both have an additional ... parameter, enabling passing more sophisticated options to git2r::pull and git2r::push when more complex conflicts occur.
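
A hedged sketch of a safer collaborative workflow, assuming pullGitHubRepo(), like pushGitHubRepo(), falls back to the globally set repo, user and repoDir:

pullGitHubRepo() # fetch collaborators' new artifacts into the Local Repository first
# ... archive your own artifacts locally, e.g. with saveToRepo() or %a% ...
pushGitHubRepo() # and only then push them to the GitHub Repository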

5 Repository deletion

One can easily delete an existing Local or GitHub Repository. For GitHub, by default only the archivist-like Repository (the gallery folder and the backpack.db file) is deleted; the whole GitHub repository can be removed with deleteRoot = TRUE. (Yes, I have used these many times while writing this Use Case :)).

# eval = FALSE
deleteLocalRepo(repoDir = 'Gallery', deleteRoot = TRUE)
deleteGitHubRepo('Gallery', deleteRoot = TRUE)

6 Feedback and Notes

So far archivist extracts extra Tags only for the following object classes

# http://stackoverflow.com/a/11005886/3857701
methodsTable <- ls(envir = asNamespace("archivist"),
                   all.names = TRUE)
grep('extractTags', methodsTable, value = TRUE) %>%
    data.frame(extractTags_methods = .) %>%
    pander::pandoc.table()
extractTags_methods
extractTags
extractTags.data.frame
extractTags.default
extractTags.ggplot
extractTags.glmnet
extractTags.htest
extractTags.lda
extractTags.lm
extractTags.partition
extractTags.qda
extractTags.summary.lm
extractTags.survfit
extractTags.trellis
extractTags.twins

If you would like to create wrappers for other classes, please open an issue or write on Gitter, also with any other suggestions, comments or feature requests such as integration with Bitbucket, support for other languages, support for json/csv files, or summary plots of the Repository.
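
Before writing a wrapper for a new class, an existing (internal) method can be inspected as a template; a sketch (the method names come from the table above):

# extractTags methods are internal to archivist, hence the ::: access
archivist:::extractTags.lm
# a wrapper for a new class would be another S3 method of the same generic,
# e.g. extractTags.myclass (with the argument list mirrored from the existing
# methods), presumably returning the character Tags of the form "key:value"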

6.1 Session Info

devtools::session_info()
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, linux-gnu           
 ui       X11                         
 language English                     
 collate  pl_PL.UTF-8                 
 tz       <NA>                        
 date     2016-02-24                  

 package          * version      date       source                                
 acepack            1.3-3.3      2014-11-24 CRAN (R 3.2.0)                        
 archivist        * 2.0.1        2016-02-21 CRAN (R 3.2.2)                        
 archivist.github * 0.1          2016-02-24 CRAN (R 3.2.2)                        
 assertthat         0.1          2013-12-06 CRAN (R 3.2.0)                        
 bitops             1.0-6        2013-08-17 CRAN (R 3.2.0)                        
 chron              2.3-47       2015-06-24 CRAN (R 3.2.1)                        
 cluster            2.0.3        2015-07-21 CRAN (R 3.2.1)                        
 codetools          0.2-14       2015-07-15 CRAN (R 3.2.1)                        
 colorspace         1.2-6        2015-03-11 CRAN (R 3.2.2)                        
 combinat           0.0-8        2012-10-29 CRAN (R 3.2.0)                        
 curl               0.9.6        2016-02-17 CRAN (R 3.2.2)                        
 data.table         1.9.7        2015-12-29 Github (Rdatatable/data.table@405f115)
 DBI                0.3.1        2014-09-24 CRAN (R 3.2.2)                        
 devtools           1.10.0       2016-01-23 CRAN (R 3.2.2)                        
 digest             0.6.9        2016-01-08 CRAN (R 3.2.2)                        
 dplyr            * 0.4.3        2015-09-01 CRAN (R 3.2.2)                        
 evaluate           0.8          2015-09-18 CRAN (R 3.2.1)                        
 foreach            1.4.3        2015-10-13 CRAN (R 3.2.2)                        
 foreign            0.8-66       2015-08-19 CRAN (R 3.2.1)                        
 formatR            1.2.1        2015-09-18 CRAN (R 3.2.1)                        
 Formula            1.2-1        2015-04-07 CRAN (R 3.2.0)                        
 gam                1.12         2015-05-13 CRAN (R 3.2.0)                        
 ggbiplot           0.55         2015-09-23 Github (vqv/ggbiplot@7325e88)         
 ggplot2          * 2.0.0        2015-12-18 CRAN (R 3.2.2)                        
 ggthemes           3.0.1        2016-01-10 CRAN (R 3.2.2)                        
 git2r              0.13.1       2015-12-10 CRAN (R 3.2.2)                        
 gridExtra          2.0.0        2015-07-14 CRAN (R 3.2.1)                        
 gtable             0.1.2        2012-12-05 CRAN (R 3.2.0)                        
 highr              0.5.1        2015-09-18 CRAN (R 3.2.1)                        
 Hmisc              3.17-1       2015-12-18 CRAN (R 3.2.2)                        
 htmltools          0.3          2015-12-29 CRAN (R 3.2.2)                        
 httr             * 1.1.0        2016-01-28 CRAN (R 3.2.2)                        
 iterators          1.0.8        2015-10-13 CRAN (R 3.2.1)                        
 jsonlite           0.9.19       2015-11-28 CRAN (R 3.2.2)                        
 km.ci              0.5-2        2009-08-30 CRAN (R 3.2.0)                        
 KMsurv             0.1-5        2012-12-03 CRAN (R 3.2.0)                        
 knitr            * 1.12.3       2016-01-22 CRAN (R 3.2.2)                        
 labeling           0.3          2014-08-23 CRAN (R 3.2.0)                        
 lattice            0.20-33      2015-07-14 CRAN (R 3.2.1)                        
 latticeExtra       0.6-26       2013-08-15 CRAN (R 3.2.0)                        
 lazyeval           0.1.10       2015-01-02 CRAN (R 3.2.2)                        
 lubridate          1.5.0        2015-12-03 CRAN (R 3.2.2)                        
 magrittr         * 1.5          2014-11-22 CRAN (R 3.2.0)                        
 memoise            1.0.0        2016-01-29 CRAN (R 3.2.2)                        
 munsell            0.4.3        2016-02-13 CRAN (R 3.2.2)                        
 nnet               7.3-11       2015-08-30 CRAN (R 3.2.1)                        
 pander             0.6.0        2015-11-23 CRAN (R 3.2.2)                        
 plyr               1.8.3        2015-06-12 CRAN (R 3.2.1)                        
 purrr              0.2.1        2016-02-13 CRAN (R 3.2.2)                        
 R6                 2.1.2        2016-01-26 CRAN (R 3.2.2)                        
 RColorBrewer       1.1-2        2014-12-07 CRAN (R 3.2.0)                        
 Rcpp               0.12.3       2016-01-10 CRAN (R 3.2.2)                        
 RCurl              1.95-4.7     2015-06-30 CRAN (R 3.2.1)                        
 rmarkdown          0.9.5        2016-02-22 CRAN (R 3.2.2)                        
 rpart              4.1-10       2015-06-29 CRAN (R 3.2.1)                        
 RSQLite            1.0.0        2014-10-25 CRAN (R 3.2.1)                        
 rstudioapi         0.5          2016-01-24 CRAN (R 3.2.2)                        
 RTCGA            * 1.1.14       2016-02-24 Github (RTCGA/RTCGA@7d6d667)          
 RTCGA.clinical   * 20151101.1.0 2016-02-22 Github (RTCGA/RTCGA.clinical@0239210) 
 RTCGA.mutations  * 20151101.0.0 2016-02-22 Github (RTCGA/RTCGA.mutations@3c3b83b)
 RTCGA.rnaseq     * 20151101.0.0 2016-02-22 Github (RTCGA/RTCGA.rnaseq@196d7d2)   
 rvest              0.3.1        2015-11-11 CRAN (R 3.2.2)                        
 scales             0.3.0        2015-08-25 CRAN (R 3.2.1)                        
 stringi            1.0-1        2015-10-22 CRAN (R 3.2.1)                        
 stringr            1.0.0        2015-04-30 CRAN (R 3.2.0)                        
 survival         * 2.38-3       2015-07-02 CRAN (R 3.2.1)                        
 survminer          0.2.0.9001   2016-02-24 local                                 
 survMisc         * 0.4.6        2015-04-15 CRAN (R 3.2.0)                        
 XML                3.98-1.3     2015-06-30 CRAN (R 3.2.1)                        
 xml2               0.1.2        2015-09-01 CRAN (R 3.2.1)                        
 yaml               2.1.13       2014-06-12 CRAN (R 3.2.0)                        
 zoo                1.7-12       2015-03-16 CRAN (R 3.2.0)