Have you ever suffered because you could not reproduce graphs, tables or analysis results in R? Have you ever been frustrated at not being able to share R objects (e.g. plots or final analysis models) within your reports, posters or articles? Or maybe you simply have too many objects to store in a convenient and handy way? Now you can share partial results of an analysis, provide hooks to valuable R objects within articles, manage analysis results and restore objects’ pedigree with the archivist package and its extension archivist.github. All automatically through GitHub, without closing RStudio. If you are tired of archiving results by yourself, then read this tutorial.
Open science needs not only reproducible research but also accessible final and partial results.
library(archivist.github)
# the described functionalities are implemented in version 2.0 of archivist and version 0.1 of archivist.github
# devtools::install_github('MarcinKosinski/archivist.github')
# install.packages('archivist.github')
The archivist
is an R package for data analysis results management, which helps in managing, sharing, storing, linking and searching for R objects. The archivist
package automatically retrieves the object’s meta-data and creates a rich structure that allows for easy management of calculated R objects. It also extends the reproducible research paradigm by creating new ways to retrieve and validate previously calculated objects.
This use case describes how archivist can be integrated with GitHub so that one can share (partial) results of an analysis in a simpler and more automatic way. GitHub is a platform on which collaborators can share their analysis code, figures and reports. It is also possible to share R objects which are crucial to the analysis, or whose calculation took a great amount of time or required special software. Such objects are referred to as artifacts. One might be interested only in a partial or final result without executing the whole analysis, which sometimes might be impossible due to a lack of software or changes in R packages, or might simply take too much time. If you are not using any version control system (like Git, and hence GitHub), then get motivated here and here.
The archivist package is a tool that makes sharing R objects more convenient and transparent. Such objects can even be added to StackOverflow questions to make examples easier to reproduce (like here).
The GitHub API OAuth open authorization and the archivist.github functions that are integrated with GitHub are described in an R documentation page accessed with ?agithub. The information below provides a broader explanation. New functions from the archivist.github extension can be seen in the workflow below.
OAuth open authorization
To start sharing code and analysis results on GitHub, a data scientist first needs to create a repository on GitHub. It can be done manually under this link https://github.com/new or automatically with the createGitHubRepo() function. The createGitHubRepo() function integrates with the GitHub API, which enables performing operations on GitHub with simple curl ("see URL") requests (easy for a data scientist with an IT background, but not so obvious for someone whose background is different). If you haven’t worked with an API before and are wondering why APIs are so important and broadly used, then read more here.
Working with the GitHub API requires creating a simple developer application (you can create it under this link https://github.com/settings/applications/new), which will be used to authenticate your curl requests (created via the httr package). This has to be done only once and the application can be reused in future work. When the application is created, one has to copy its Client ID and Client Secret to authorize one’s computer with this application by running
library(httr)
myapp <- oauth_app("github",
key = app_key,
secret = app_secret)
github_token <- oauth2.0_token(oauth_endpoints("github"),
myapp,
scope = c("public_repo",
"delete_repo"))
# or use the wrapper function
authoriseGitHub(ClientID, ClientSecret)
The above command created a github_token that uses the OAuth open authorization system.
OAuth allows notifying a resource provider (e.g. Facebook) that the resource owner (e.g. you) grants permission to a third-party (e.g. a Facebook Application) access to their information (e.g. the list of your friends).
More about OAuth
is explained in that StackOverflow answer.
The scope parameter in the oauth2.0_token function lets you specify exactly what type of access you need. Scopes limit access for OAuth tokens. They do not grant any additional permission beyond that which the user already has. In this example we granted read/write access to code, commit statuses, collaborators, and deployment statuses for public repositories and organizations, which is also required for starring public repositories (public_repo). We also granted access to delete adminable repositories (delete_repo). More possible values of scope for a GitHub API OAuth token can be found in this table.
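The app_key and app_secret values passed to oauth_app() have to come from somewhere, and hard-coding them in a shared script would expose the application's Client Secret. Below is a minimal sketch, assuming the credentials are kept in environment variables. The variable names GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET and the helper read_github_credentials() are hypothetical and are not part of archivist.github or httr.

```r
# Hypothetical helper (not part of archivist.github or httr): read the GitHub
# application's credentials from environment variables.
read_github_credentials <- function(id_var = "GITHUB_CLIENT_ID",
                                    secret_var = "GITHUB_CLIENT_SECRET") {
  creds <- c(key = Sys.getenv(id_var), secret = Sys.getenv(secret_var))
  # Sys.getenv() returns "" for variables that are not set
  missing_vars <- c(id_var, secret_var)[creds == ""]
  if (length(missing_vars) > 0) {
    stop("Set these environment variables first: ",
         paste(missing_vars, collapse = ", "))
  }
  creds
}

# Placeholder values for illustration only; the real ones come from the
# application's settings page on GitHub.
Sys.setenv(GITHUB_CLIENT_ID = "my-client-id",
           GITHUB_CLIENT_SECRET = "my-client-secret")

creds <- read_github_credentials()
app_key <- creds[["key"]]
app_secret <- creds[["secret"]]
```

The resulting app_key and app_secret can then be passed to oauth_app() or to authoriseGitHub() in the same way as in the code above.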
When github_token
is created one can set it up as a global github_token
visible for most archivist
functions with
aoptions("github_token", github_token)
<Token>
<oauth_endpoint>
authorize: https://github.com/login/oauth/authorize
access: https://github.com/login/oauth/access_token
<oauth_app> github
key: 1fab1e77d27079c0717d
secret: <hidden>
<credentials> access_token, scope, token_type
---
aoptions("user", user) # user = 'MarcinKosinski'
[1] "MarcinKosinski"
invisible(aoptions("password", password))
What is more, as shown above, one can specify GitHub’s user.name and user.password globally, so that future integration with Git/GitHub (e.g. commits or pushes) can be performed (with the great power of the git2r package).
One can create GitHub’s repository consisting of the archivist
-like Repository
of artifacts with
createGitHubRepo(repo = "Gallery", default = TRUE)
[1] "MarcinKosinski"
# -> https://github.com/MarcinKosinski/Gallery
Repository
is a folder with an SQLite database stored in a file named backpack.db
and a subdirectory named gallery
with collection of objects saved as .rda
files. To learn more about it visit archivist's
Wiki or run in R
console ?archivist::Repository
.
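The folder layout described above can be inspected without touching GitHub. The following is a minimal sketch, assuming the archivist package is installed; it creates a throwaway Local Repository in a temporary directory (the name demoRepo is an arbitrary choice made for this illustration) and archives one small object in it.

```r
library(archivist)

# Create an empty Repository in a temporary directory
repo_dir <- file.path(tempdir(), "demoRepo")
createLocalRepo(repoDir = repo_dir)

# The Repository folder should now hold backpack.db and the gallery folder
list.files(repo_dir)

# Archive an object; saveToLocalRepo() returns its md5hash
hash <- saveToLocalRepo(head(iris), repoDir = repo_dir)

# The artifact is listed in the database and stored as a file in gallery
repo_content <- showLocalRepo(repoDir = repo_dir)
repo_content
list.files(file.path(repo_dir, "gallery"))
```

The same structure is what createGitHubRepo() and cloneGitHubRepo() keep in sync with the GitHub repository.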
The default = TRUE option sets the Repository under the link https://github.com/user.name/repo as the default Repository for future archivist.github functions. In this way one won’t need to pass additional parameters to every function call.
If one already has a GitHub repository with an archivist Repository, it can be cloned and set as default.
# eval = FALSE - this line is not evaluated in this tutorial
cloneGitHubRepo(repoURL = 'https://github.com/MarcinKosinski/Museum/',
repoDir = any_local_path_or_NULL,
default = TRUE)
When the option default = TRUE is used in the above functions, the global parameters repo and user become visible to the archivist.github functions. One can check whether a parameter is set globally for the archivist functions with
aoptions('repo')
[1] "Gallery"
aoptions('user')
[1] "MarcinKosinski"
The globally set Repository
stored on GitHub is synchronized with a Local Repository
that can be set (or is already set when default = TRUE
) in repoDir
parameter
aoptions('repoDir')
[1] "Gallery"
If a user did not specify the repoDir argument during the Repository creation with createGitHubRepo(), it is by default set to the same value as the repo parameter.
Now that the GitHub and archivist Repositories are created, one can archive and share artifacts (crucial objects) with the archive function, which also allows one to embed a hook to the artifact in the report. We are creating an .html report in rmarkdown, so the default markdown formatting for the hook is used. For \(\LaTeX\)-like hooks use format = "latex".
Let us prepare a linear model with iris
data
iris.lm.model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data= iris)
and let’s archive the desired model automatically to the created Repository
(https://github.com/MarcinKosinski/Gallery).
# results = 'asis' - so that a text can be understood as a URL
archive(iris.lm.model, alink = TRUE)
archivist::aread('MarcinKosinski/Gallery/caab8f8e72045c93f92091aebcb74f2d')
The archive function created a hook to the artifact. One can download this shared artifact by clicking the link or by copying and pasting its value into the R console. It is a great way to provide hooks to figures in posters or publications that are really R objects (such as objects of the ggplot class). The object named iris.lm.model was archived to the GitHub (and synchronized Local) Repository. This is the final implementation of the archive prototype we presented at the BI FORUM conference in Budapest in Oct 2015.
One can check that the artifact is really on GitHub with
showRemoteRepo(repo = 'Gallery', user = 'MarcinKosinski')
md5hash name createdDate
1 caab8f8e72045c93f92091aebcb74f2d iris.lm.model 2016-02-24 19:12:15
2 b0f870835c546853b70f921f57e100b4 b0f870835c546853b70f921f57e100b4 2016-02-24 19:12:15
3 7a761a2ae54f3d90060a9f6ca04b3506 7a761a2ae54f3d90060a9f6ca04b3506 2016-02-24 19:12:15
# showRemoteRepo() would also work since `user` and `repo` are set globally
The first row corresponds to the archived artifact, the second one to the archived session info and the third one to the archived data extracted from the artifact (see the session_info and relationWith Tags listed later in this tutorial). The md5hash column specifies which MD5 hash (see ?digest::digest) each artifact is identified by. Artifacts are archived along with a special attribute named md5hash. For each artifact, md5hash is a unique string of length 32 that is produced by the digest() function from the digest package, which uses the cryptographic MD5 hash algorithm. The md5hash of each artifact archived to the Repository is also saved in the Repository along with the artifact’s Tags (see Tags on the archivist WIKI). This makes it possible to distinguish objects in the Repository and facilitates searching for and loading them.
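The properties of the md5hash described above can be checked directly with the digest package. This is a small sketch on the iris data rather than on an archived artifact: identical content gives an identical 32-character hash, and any change to the content gives a different one.

```r
library(digest)

# digest() serializes the object and hashes it; MD5 is the default algorithm
hash_original <- digest(iris)
hash_same     <- digest(iris)

# Changing a single value yields a different hash
iris_changed <- iris
iris_changed$Sepal.Length[1] <- 0
hash_changed <- digest(iris_changed)

nchar(hash_original)                 # 32 hexadecimal characters
identical(hash_original, hash_same)  # TRUE: same content, same hash
hash_original != hash_changed        # TRUE: different content, different hash
```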
Each artifact can be archived with its unique Tags
which are attributes of an artifact. They can be the artifact’s name, class or archiving date. Furthermore, for various artifact’s classes different Tags
are available. Let us archive one more artifact to explain how various Tags
can be extracted during archiving and let us even specify our own userTags
.
# results = 'asis' - so that a text can be understood as a URL
iris.lm.model.smry <- summary(iris.lm.model)
archive(iris.lm.model.smry, alink = TRUE,
userTags = paste0("summaryOf:", digest::digest(iris.lm.model)))
archivist::aread('MarcinKosinski/Gallery/067896394acf47242b30b07ac300dc48')
One can check what Tags
have been extracted so far
Sys.sleep(300)
# After a commit GitHub sometimes does not react immediately,
# so we need to give it some time.
# Immediate access depends on github.com performance
showRemoteRepo(method = "tags")[, -3]
artifact tag
1 caab8f8e72045c93f92091aebcb74f2d format:rda
2 caab8f8e72045c93f92091aebcb74f2d name:iris.lm.model
3 caab8f8e72045c93f92091aebcb74f2d class:lm
4 caab8f8e72045c93f92091aebcb74f2d coefname:(Intercept)
5 caab8f8e72045c93f92091aebcb74f2d coefname:Sepal.Width
6 caab8f8e72045c93f92091aebcb74f2d coefname:Petal.Length
7 caab8f8e72045c93f92091aebcb74f2d coefname:Petal.Width
8 caab8f8e72045c93f92091aebcb74f2d rank:4
9 caab8f8e72045c93f92091aebcb74f2d df.residual:146
10 caab8f8e72045c93f92091aebcb74f2d date:2016-02-24 19:12:15
11 b0f870835c546853b70f921f57e100b4 format:rda
12 caab8f8e72045c93f92091aebcb74f2d session_info:b0f870835c546853b70f921f57e100b4
13 7a761a2ae54f3d90060a9f6ca04b3506 format:rda
14 7a761a2ae54f3d90060a9f6ca04b3506 format:txt
15 7a761a2ae54f3d90060a9f6ca04b3506 relationWith:caab8f8e72045c93f92091aebcb74f2d
16 caab8f8e72045c93f92091aebcb74f2d format:txt
17 067896394acf47242b30b07ac300dc48 format:rda
18 067896394acf47242b30b07ac300dc48 name:iris.lm.model.smry
19 067896394acf47242b30b07ac300dc48 class:summary.lm
20 067896394acf47242b30b07ac300dc48 sigma:0.3145
21 067896394acf47242b30b07ac300dc48 df:4
22 067896394acf47242b30b07ac300dc48 df:146
23 067896394acf47242b30b07ac300dc48 df:4
24 067896394acf47242b30b07ac300dc48 R^2:0.8586
25 067896394acf47242b30b07ac300dc48 adjusted R^2:0.8557
26 067896394acf47242b30b07ac300dc48 fstatistic:295.5
27 067896394acf47242b30b07ac300dc48 fstatistic.df:3
28 067896394acf47242b30b07ac300dc48 fstatistic.df:146
29 067896394acf47242b30b07ac300dc48 date:2016-02-24 19:12:48
30 067896394acf47242b30b07ac300dc48 summaryOf:caab8f8e72045c93f92091aebcb74f2d
31 dfb4355fdd4baa980368a13c4bf8ef3f format:rda
32 067896394acf47242b30b07ac300dc48 session_info:dfb4355fdd4baa980368a13c4bf8ef3f
or in a form that is more convenient for dplyr grouping and aggregation operations (thanks to @eliotmcintire for the suggestion and @wchodor for the implementation - issue).
splitTagsRemote(repo = 'Gallery', user = 'MarcinKosinski')[,-4]
artifact tagKey tagValue
1 caab8f8e72045c93f92091aebcb74f2d format rda
2 caab8f8e72045c93f92091aebcb74f2d name iris.lm.model
3 caab8f8e72045c93f92091aebcb74f2d class lm
4 caab8f8e72045c93f92091aebcb74f2d coefname (Intercept)
5 caab8f8e72045c93f92091aebcb74f2d coefname Sepal.Width
6 caab8f8e72045c93f92091aebcb74f2d coefname Petal.Length
7 caab8f8e72045c93f92091aebcb74f2d coefname Petal.Width
8 caab8f8e72045c93f92091aebcb74f2d rank 4
9 caab8f8e72045c93f92091aebcb74f2d df.residual 146
10 caab8f8e72045c93f92091aebcb74f2d date 2016-02-24 19:12:15
11 b0f870835c546853b70f921f57e100b4 format rda
12 caab8f8e72045c93f92091aebcb74f2d session_info b0f870835c546853b70f921f57e100b4
13 7a761a2ae54f3d90060a9f6ca04b3506 format rda
14 7a761a2ae54f3d90060a9f6ca04b3506 format txt
15 7a761a2ae54f3d90060a9f6ca04b3506 relationWith caab8f8e72045c93f92091aebcb74f2d
16 caab8f8e72045c93f92091aebcb74f2d format txt
17 067896394acf47242b30b07ac300dc48 format rda
18 067896394acf47242b30b07ac300dc48 name iris.lm.model.smry
19 067896394acf47242b30b07ac300dc48 class summary.lm
20 067896394acf47242b30b07ac300dc48 sigma 0.3145
21 067896394acf47242b30b07ac300dc48 df 4
22 067896394acf47242b30b07ac300dc48 df 146
23 067896394acf47242b30b07ac300dc48 df 4
24 067896394acf47242b30b07ac300dc48 R^2 0.8586
25 067896394acf47242b30b07ac300dc48 adjusted R^2 0.8557
26 067896394acf47242b30b07ac300dc48 fstatistic 295.5
27 067896394acf47242b30b07ac300dc48 fstatistic.df 3
28 067896394acf47242b30b07ac300dc48 fstatistic.df 146
29 067896394acf47242b30b07ac300dc48 date 2016-02-24 19:12:48
30 067896394acf47242b30b07ac300dc48 summaryOf caab8f8e72045c93f92091aebcb74f2d
31 dfb4355fdd4baa980368a13c4bf8ef3f format rda
32 067896394acf47242b30b07ac300dc48 session_info dfb4355fdd4baa980368a13c4bf8ef3f
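The key/value layout shown above can be reproduced in base R for any vector of Tags. The sketch below uses a hypothetical helper split_tags(), which is not archivist's implementation of splitTagsRemote(); it splits each Tag at the first colon only, because values such as dates contain colons themselves.

```r
# Hypothetical helper (not archivist's implementation): split "key:value"
# Tags at the first colon only, because values such as dates contain colons.
split_tags <- function(tags) {
  first_colon <- regexpr(":", tags, fixed = TRUE)
  data.frame(
    tagKey   = substr(tags, 1, first_colon - 1),
    tagValue = substr(tags, first_colon + 1, nchar(tags)),
    stringsAsFactors = FALSE
  )
}

# A few Tags copied from the listing above
tags <- c("format:rda", "name:iris.lm.model", "class:lm",
          "coefname:Sepal.Width", "coefname:Petal.Length",
          "date:2016-02-24 19:12:15")
tag_table <- split_tags(tags)
tag_table

# Count Tags per key, most frequent first
sort(table(tag_table$tagKey), decreasing = TRUE)
```

Once the key is a separate column, counting or filtering by Tag type becomes an ordinary grouping operation.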
library(dplyr)
splitTagsRemote(repo = 'graphGallery', user = 'pbiecek') %>%
group_by(tagKey) %>%
summarise(count = n()) %>%
arrange(desc(count))
Source: local data frame [13 x 2]
tagKey count
(chr) (int)
1 varname 1229
2 class 508
3 date 500
4 coefname 493
5 name 299
6 LHS 233
7 RHS 233
8 relationWith 35
9 format 15
10 labelx 7
11 labely 7
12 session_info 4
13 data 3
library(ggplot2)
splitTagsRemote(repo = 'graphGallery', user = 'pbiecek') %>%
group_by(tagKey) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ggplot(aes(reorder(tagKey, count, max), count)) +
geom_bar(stat = "identity") +
theme_minimal() + xlab('Tags') + ylab('Number of entries with this Tag') +
ggtitle('Barplot of counts of Tags\' types in pbiecek/graphGallery Repository')
Extracting Tags
for a specific artifact can be done with
getTagsRemote(md5hash = digest::digest(iris.lm.model.smry),
tag = "", user = 'MarcinKosinski', repo = 'Gallery' ) %>% data.frame()
.
1 format:rda
2 name:iris.lm.model.smry
3 class:summary.lm
4 sigma:0.3145
5 df:4
6 df:146
7 R^2:0.8586
8 adjusted R^2:0.8557
9 fstatistic:295.5
10 fstatistic.df:3
11 fstatistic.df:146
12 date:2016-02-24 19:12:48
13 summaryOf:caab8f8e72045c93f92091aebcb74f2d
14 session_info:dfb4355fdd4baa980368a13c4bf8ef3f
After the Repository is created and the crucial objects are archived, anyone can explore the public archivist Repository shared on GitHub. Knowing an object’s md5hash, one can download it with
loadFromRemoteRepo(md5hash = digest::digest(iris.lm.model.smry),
value = TRUE) -> ddl.iris.lm.model.smry
ddl.iris.lm.model.smry$sigma
[1] 0.3145491
or can explore Repository
for various objects using their Tags
.
When one is interested in all objects of class lm
that were created with Species
explanatory variable (with versicolor
factor level) in pbiecek/graphGallery
repository, then one might download them, extract R-squared statistics and coefficients, bind them in one data frame and sort rows by R-squared statistics to get the best model with the following code
mm <- asearch(patterns = c('class:lm',
'coefname:Speciesversicolor'),
repo = 'pbiecek/graphGallery')
mm %>%
lapply(function(x) {
c(r.squared = summary(x)$r.squared,
x$coef) %>%
# extract coeffs and R-squared statistic
t %>%
as.data.frame # transpose for binding
}) %>%
do.call(dplyr::bind_rows, .) %>% # apply bind_rows to all list elements
cbind(data.frame(md5hash = names(mm))) %>%
arrange(r.squared) %>% # arrange rows by r.squared
unique() %>%
select(-Sepal.Length) # to fit in the output html
r.squared (Intercept) Speciesversicolor Speciesvirginica md5hash
1 0.6187057 5.006000 0.930000 1.582000 0e213ac68a45b6cd454d06b91f991bc7
2 0.6187057 5.006000 0.930000 1.582000 e58d2f9d50b67ce4d397bf015ec1259c
3 0.9413717 1.462000 2.798000 4.090000 0a82efeb8250a47718cea9d7f64e5ae7
4 0.9413717 1.462000 2.798000 4.090000 378237103bb60c58600fe69bed6c7f11
5 0.9413717 1.462000 2.798000 4.090000 7f11e03539d48d35f7e7fe7780527ba7
6 0.9413717 1.462000 2.798000 4.090000 c1b1ef7bcddefb181f79176015bc3931
7 0.9604273 1.462000 2.507231 3.509429 990861c7c27812ee959f10e5f76fe2c3
8 0.9748944 -1.702342 2.210138 3.090002 2a6e492cb6982f230e48cf46023e2e4f
This might be used to extract the most valuable model (in terms of the R-squared statistic) within the Repository. Imagine a Repository with hundreds of classification models and their potential blends, each of them archived with additional Tags describing its performance. One gains a great tool for searching for the best classifier among all those models.
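The ranking step itself does not depend on where the models come from. As a minimal offline sketch, the list below is built from three models fitted locally on iris and stands in for the list that asearch() would return; the model names are made up for this illustration.

```r
# Stand-in for the list returned by asearch(): three models fitted locally
models <- list(
  width_only   = lm(Sepal.Length ~ Sepal.Width, data = iris),
  species_only = lm(Sepal.Length ~ Species, data = iris),
  full         = lm(Sepal.Length ~ ., data = iris)
)

# Extract the R-squared statistic from every model
r2 <- vapply(models, function(m) summary(m)$r.squared, numeric(1))

# Bind the results in one data frame and put the best model first
ranking <- data.frame(model = names(r2),
                      r.squared = unname(r2),
                      stringsAsFactors = FALSE)
ranking <- ranking[order(ranking$r.squared, decreasing = TRUE), ]
ranking
```

With a real Repository, names(models) would be md5hashes, so the first row would point directly to the artifact worth downloading.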
If one would like to extract only md5hashes
for given Tags
instead of whole artifacts, then searchInRemoteRepo()
function can be used
searchInRemoteRepo(pattern = 'name', fixed = FALSE, repo = 'graphGallery', user = 'pbiecek') %>% length
[1] 51
# return md5hashes of artifacts that have a tag containing a substring `name`
searchInRemoteRepo(pattern = c('class:ggplot', 'class:lm'),
repo = 'graphGallery', user = 'pbiecek',
intersect = FALSE) %>% length
[1] 15
Desired artifacts can even be copied to our Local Repository from GitHub
searchInRemoteRepo(pattern = c('class:ggplot', 'class:lm'),
repo = 'graphGallery', user = 'pbiecek',
intersect = FALSE)[1:3] %>%
copyRemoteRepo(md5hashes = ., user = 'pbiecek', repo = 'graphGallery',
repoTo = 'Gallery')
showLocalRepo(repoDir = 'Gallery')[, -3]
md5hash name
1 caab8f8e72045c93f92091aebcb74f2d iris.lm.model
2 b0f870835c546853b70f921f57e100b4 b0f870835c546853b70f921f57e100b4
3 7a761a2ae54f3d90060a9f6ca04b3506 7a761a2ae54f3d90060a9f6ca04b3506
4 067896394acf47242b30b07ac300dc48 iris.lm.model.smry
5 dfb4355fdd4baa980368a13c4bf8ef3f dfb4355fdd4baa980368a13c4bf8ef3f
6 b6f183dfd0efdc8c33fa89b9038716c5 pl
7 b6f183dfd0efdc8c33fa89b9038716c5 pl
8 fcd70d55b874201d2bece12f591a2ec4 pl
9 4cc15e46b6008f5867a92364fe36e835 qplot(time, arr_delay, data = per_hour)
Let’s have a look at a more advanced example. Suppose we would like to create a graph similar to this one but with slightly different input data.
aread('MarcinKosinski/coxphSGD/db03267b063709277e50bd4c0c1ddb04') -> survival.egfr.plot
class(survival.egfr.plot)
[1] "tableAndPlot" "list"
# this object is a result of survMisc:::autoplot.survfit function which mainly
# produces a list containing 2 ggplot objects
survival.egfr.plot$plot <- survival.egfr.plot$plot +
ggtitle('Survival vs mutation in EGFR gene') # the previous title was in Polish
survMisc::autoplot(survival.egfr.plot) # it will plot survival plot and risk set table
The plot presents the Kaplan-Meier estimates of the survival curves for patients suffering from cancer, divided into 2 groups: one with a mutation in EGFR gene (EGFR=1
) and the other without it (EGFR=0
). The mutation can be a deletion, an amplification, etc.
The data used to produce previous plot came from The Cancer Genome Atlas Study.
The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.
Downloading the data through R is possible with the RTCGA package, but most of the useful datasets are already converted and available in the RTCGA family of R data packages.
The main plot of this use case was made from the data of patients suffering from any of the cancer types available in the TCGA study (there are 38 cohort types available to download). Let us prepare such a plot only for patients suffering from Breast invasive carcinoma (Breast Cancer - BRCA), divided into groups related to the existence of a mutation in the PIK3CA gene and the expression of the TP53 gene (above and below the median in the BRCA cohort).
To do so one needs to perform a few data operations such as filtering observations, transforming columns and merging tables. A very convenient set of tools for munging data is dplyr together with the forward-pipe operator from the magrittr package. We have borrowed the %>% forward-pipe operator from magrittr version 1.0.1 and created our own archivist forward-pipe operator %a% (see ?`%a%`), which not only passes the result of the previous function to the next one, but also archives the inputs of all operations, so that one can create hooks to every partial result of a forward-pipe operation/analysis.
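To make the idea of a recording pipe concrete, here is a toy operator written for this illustration only. It is not archivist's %a% and it archives nothing: it merely pipes the left-hand side into the call on the right and keeps a table of the calls together with the MD5 hash of each intermediate result. The name %rec% and the history attribute are hypothetical, and the digest package is assumed to be installed.

```r
library(digest)

# Toy operator, NOT archivist's %a%: pipes lhs into the call on the right
# and records every call together with the MD5 hash of its result.
`%rec%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)
  # Build f(lhs, ...) from f(...)
  full_call <- as.call(c(list(rhs_call[[1]], quote(lhs)),
                         as.list(rhs_call)[-1]))
  result <- eval(full_call, envir = list(lhs = lhs), enclos = parent.frame())
  step <- data.frame(call = paste(deparse(rhs_call), collapse = " "),
                     md5hash = digest(result),
                     stringsAsFactors = FALSE)
  # Append this step to the history carried by the input (if any)
  attr(result, "history") <- rbind(attr(lhs, "history"), step)
  result
}

tall_flowers <- iris %rec%
  subset(Sepal.Length > 7) %rec%
  head(3)

# One row per step: the call and the hash of the result it produced
attr(tall_flowers, "history")
```

The real operator goes further: each intermediate object is saved to the Repository, so its hash can later be used as a hook to load that partial result.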
I am really keen on long forward-pipe operations, so I’ll try my best to create a good one, which will join information about the clinical state of a patient with information about the existence of a PIK3CA gene mutation (which is the most common for BRCA) and information about the expression of the TP53 gene (which is believed to be a guardian of the genome).
library(RTCGA.rnaseq); data(BRCA.rnaseq) # information about genes' expressions
library(RTCGA.mutations); data(BRCA.mutations) # information about genes' mutations
library(RTCGA.clinical); data(BRCA.clinical) # patients' clinical data
aoptions('silent', TRUE) # This sets `silent=TRUE` in saveToRepo which is used by %a%. There will be no warning printed about archiving the same artifact or its data twice.
[1] TRUE
BRCA.rnaseq %a%
select(`TP53|7157`, bcr_patient_barcode) %a%
# bcr_patient_barcode contains a key to merge patients between various datasets
rename(TP53 = `TP53|7157`) %a%
filter(substr(bcr_patient_barcode, 14, 15) == "01" ) %a%
# 01 at positions 14-15 means that this is a cancer sample
mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode),1,12)) ->
# in clinical info bcr_patient_barcode is only of length 12
BRCA.rnaseq.TP53
BRCA.mutations %a%
select(Hugo_Symbol, bcr_patient_barcode) %a%
# Hugo_Symbol tells to which gene the row corresponds.
# If a row exists for a gene, there was a mutation in this gene for this patient.
filter(nchar(bcr_patient_barcode)==15) %a%
# sometimes this code has an improper length
filter(substr(bcr_patient_barcode, 14, 15)=="01") %a%
# 01 at positions 14-15 means that this is a cancer sample
filter(Hugo_Symbol == 'PIK3CA') %a%
# we are interested only in the mutations of PIK3CA
unique() %a%
# sometimes there are a few mutations in the same gene
mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode),1,12)) ->
# in clinical info bcr_patient_barcode is only of length 12
BRCA.mutations.PIK3CA
BRCA.clinical %a%
select(patient.bcr_patient_barcode,
patient.vital_status, # information whether patient is still alive
patient.days_to_last_followup, # how many days has patient been observed if he is alive
patient.days_to_death) %a% # how many days has patient been observed if he has passed away
mutate(bcr_patient_barcode = toupper(as.character(patient.bcr_patient_barcode))) %a%
# in clinical datasets the key column is in lower case and with different name
mutate(status = ifelse(as.character(patient.vital_status) == "dead",1,0),
times = ifelse(
!is.na(patient.days_to_last_followup),
as.numeric(as.character(patient.days_to_last_followup)),
as.numeric(as.character(patient.days_to_death))
)) %a%
# if the patient does not have a days_to_last_followup time this means
# he has days_to_death time
filter(!is.na(times)) %a%
# sometimes a patient does not have any time
filter(times > 0) -> BRCA.clinical.survival
# sometimes, by mistake, patients have non-positive times (a few cases)
BRCA.rnaseq.TP53 %a%
left_join(y = BRCA.mutations.PIK3CA,
by = "bcr_patient_barcode") %a%
left_join(y = BRCA.clinical.survival,
by = "bcr_patient_barcode") %a%
mutate(TP53_HighExpr = ifelse(TP53 >= median(TP53), "1", "0")) %a%
mutate(PIK3CA_Mut = as.integer(!is.na(Hugo_Symbol))) %a%
select(times, status, TP53_HighExpr, PIK3CA_Mut) -> BRCA.2survfit
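The derivation of status and times in the clinical pipeline above can be checked on a toy data frame. The four rows below are invented for this illustration and do not come from TCGA; the column names are shortened versions of the ones used above.

```r
# Invented toy data (not from TCGA): values are stored as text, as in the
# clinical dataset, hence the conversion with as.numeric()
clinical <- data.frame(
  vital_status          = c("alive", "dead", "alive", "dead"),
  days_to_last_followup = c("120", NA, NA, NA),
  days_to_death         = c(NA, "300", NA, "0"),
  stringsAsFactors = FALSE
)

# status: 1 for patients who passed away, 0 otherwise
clinical$status <- ifelse(clinical$vital_status == "dead", 1, 0)

# times: follow-up time when available, otherwise time to death
clinical$times <- ifelse(!is.na(clinical$days_to_last_followup),
                         as.numeric(clinical$days_to_last_followup),
                         as.numeric(clinical$days_to_death))

# drop patients without any time and those with non-positive times
survival <- clinical[!is.na(clinical$times) & clinical$times > 0, ]
survival[, c("status", "times")]
```

Only the first two patients remain: the third has no time at all and the fourth has a time of zero.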
One can then print an artifact’s history (pedigree) with the ahistory() function, which works in 3 variants
ahistory(BRCA.rnaseq.TP53) # regular format
env[[nm]] [63678e012c5b7f40966c32eec91f828b]
-> select(`TP53|7157`, bcr_patient_barcode) [4a85ce61229dd743b911d7edab0310b3]
-> rename(TP53 = `TP53|7157`) [103f2b82c41956e9f6437b3a0cd68679]
-> filter(substr(bcr_patient_barcode, 14, 15) == "01") [1da5a026aae19e0d0467ba3773679e28]
-> mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode), 1, 12)) [2001f888c6262e30154688876c91cc50]
# additional chunk options: results='asis'
ahistory(BRCA.mutations.PIK3CA, format = "kable") # uses knitr::kable()
| | call | md5hash |
|---|---|---|
| 7 | env[[nm]] | b2c7c1b633de515d59c9af635805ed29 |
| 6 | select(Hugo_Symbol, bcr_patient_barcode) | 9e4936fa0de8ca0ebc50fa310e845473 |
| 5 | filter(nchar(bcr_patient_barcode) == 15) | fc93fb6885a6387d1495ce0a6456d4f6 |
| 4 | filter(substr(bcr_patient_barcode, 14, 15) == "01") | 2cee0b51527271264f70aa91ee4f33e5 |
| 3 | filter(Hugo_Symbol == "PIK3CA") | e9a920ae11aa23e0b842f5e64c446a9d |
| 2 | unique() | 194214bafe8287c73f44bd660b74e199 |
| 1 | mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode), 1, 12)) | c280e394539be64c73ea022b9ea9fa05 |
ahistory(BRCA.2survfit, format = "kable", alink = TRUE ) # give hooks to objects
| | call | md5hash |
|---|---|---|
| 10 | env[[nm]] | 63678e012c5b7f40966c32eec91f828b |
| 9 | select(`TP53\|7157`, bcr_patient_barcode) | 4a85ce61229dd743b911d7edab0310b3 |
| 8 | rename(TP53 = `TP53\|7157`) | 103f2b82c41956e9f6437b3a0cd68679 |
| 7 | filter(substr(bcr_patient_barcode, 14, 15) == "01") | 1da5a026aae19e0d0467ba3773679e28 |
| 6 | mutate(bcr_patient_barcode = substr(as.character(bcr_patient_barcode), 1, 12)) | 2001f888c6262e30154688876c91cc50 |
| 5 | left_join(y = BRCA.mutations.PIK3CA, by = "bcr_patient_barcode") | 8a615a0954e03d11a0ef2d705c59766e |
| 4 | left_join(y = BRCA.clinical.survival, by = "bcr_patient_barcode") | a8b097ee7010673b2078e8c4d47079df |
| 3 | mutate(TP53_HighExpr = ifelse(TP53 >= median(TP53), "1", "0")) | 73eef46c8dfe31a99beb47ee6edfb4f3 |
| 2 | mutate(PIK3CA_Mut = as.integer(!is.na(Hugo_Symbol))) | 7a4625c0521157f81a3db11ddd196779 |
| 1 | select(times, status, TP53_HighExpr, PIK3CA_Mut) | 684d08dfe5d3a6dff389ad210f69aa36 |
# Note that they are not yet available on GitHub,
# because the archiving was only to Local Repository.
print() function
The final data will be used to plot the Kaplan-Meier estimates of the survival curves with the use of the my.customized.km() function.
tail(BRCA.2survfit)
times status TP53_HighExpr PIK3CA_Mut
1088 1550 0 1 0
1089 791 0 1 0
1090 292 0 1 0
1091 278 0 0 0
1092 3042 0 1 0
1093 2800 0 0 0
BRCA.2survfit %>%
select(TP53_HighExpr, PIK3CA_Mut, status) %>%
table()
, , status = 0
PIK3CA_Mut
TP53_HighExpr 0 1
0 359 118
1 298 161
, , status = 1
PIK3CA_Mut
TP53_HighExpr 0 1
0 35 9
1 47 13
The my.customized.km() function is based on the survMisc package, which was suddenly moved to the CRAN archive. Its implementation/body is too long to write down here (and unnecessary), but it can be imported into R with
# Instead of the whole md5hash an abbreviation can be specified.
# If more than one md5hash matches this abbreviation, then every corresponding artifact is loaded.
aread('MarcinKosinski/Museum/30efaaedb') -> my.customized.km
# or loadFromGithubRepo(md5hash = '30efaaedb', user = 'MarcinKosinski', repo = 'Museum')
my.customized.km('times', 'status', c('TP53_HighExpr', 'PIK3CA_Mut'), BRCA.2survfit,
'Survival vs TP53 expression and mutations in PIK3CA') -> km_plot
class(km_plot)
[1] "tableAndPlot" "list"
One can even overload the print function for a specific class so that it first performs the archive operation, then extracts the hook in a way that is compatible with the format of the report and lastly prints the object. The original idea came from here and here.
print.tableAndPlot <- function(x, ...) {
saveToRepo(x) -> hash
cat(alink(hash)) # uses globally set `user` and `repo`
survMisc:::autoplot.tableAndPlot(x, ...)
}
# results = 'asis'
archive(print.tableAndPlot, commitMessage = 'print.tableAndPlot function', alink = TRUE)
archivist::aread('MarcinKosinski/Gallery/fcd394ab60e8c14545028ab8f68eb9bc')
# results = 'asis'
print(km_plot)
archivist::aread('MarcinKosinski/Gallery/41eb3f66d56dd9df86554c9ee6022b43')
An easier version of this functionality is described in this Use Case about addHookToPrint.
Repository to GitHub
One might have noticed that the %a% operator and the saveToLocalRepo() function saved objects only to the Local Repository that is synchronized with GitHub. This is different from the archive function, which archives to the Local and GitHub Repository simultaneously. It is possible to push (Git command) artifacts that are only present in the Local Repository to its GitHub equivalent with the pushGitHubRepo function
# number of artifacts before push
searchInRemoteRepo(pattern = "name", fixed = FALSE) %>% length
[1] 2
# one can check how many commits have been performed so far
length(jsonlite::fromJSON(rawToChar(GET('https://api.github.com/repos/MarcinKosinski/Gallery/commits')$content))$sha)
[1] 4
pushGitHubRepo() # uses globally set parameters when none are provided
# number of artifacts after push
Sys.sleep(300) # to be sure my request isn't faster than GitHub platform after push
searchInRemoteRepo(pattern = "name", fixed = FALSE) %>% length
[1] 30
# one can check how many commits have been performed so far
length(jsonlite::fromJSON(rawToChar(GET('https://api.github.com/repos/MarcinKosinski/Gallery/commits')$content))$sha)
[1] 5
This operation might be troublesome when another collaborator has pushed changes to the remote GitHub Repository. Sometimes it’s better to first pull (Git command) changes (new artifacts) from the synchronized GitHub Repository. Working with many collaborators may result in Git conflicts in the backpack.db file. If you have any ideas or suggestions on how this can be handled, please write in this issue. pullGitHubRepo() and pushGitHubRepo() both have an additional parameter ... that enables passing more sophisticated options to git2r::pull and git2r::push when more complex conflicts occur.
Repository and gallery summaries
It is also possible to create a summary of the gallery folder, in other words a summary of each artifact stored in the Repository. The special createMDGallery() function creates an .md file with hooks to artifacts, their lists of Tags and, if possible, their miniatures in the form of .png files. So far only plots such as lattice and ggplot objects are also saved with a .png miniature, and only for those objects can a miniature be added to the gallery summary.
In the below example we extract gallery
summary to the README.md
file which will be appended with additional lines of the summary.
createMDGallery(output = 'Gallery/README.md', addTags = TRUE, addMiniature = TRUE)
The output of this function can be seen in the README of the Gallery Repository (https://github.com/MarcinKosinski/Gallery), in which we are working in this tutorial, after it is pushed to GitHub with
pushGitHubRepo(files = 'README.md')
The result is here, and a few interesting gallery summaries are here and here.
The archivist package also provides a Repository summary with the summaryRemoteRepo function. This is rather old but still relevant.
summaryRemoteRepo()
Number of archived artifacts in Repository: 30
Number of archived datasets in Repository: 1
Number of various classes archived in Repository:
Number
lm 1
summary.lm 1
ggplot 3
data.frame 23
function 1
tableAndPlot 1
list 1
Saves per day in Repository:
Saves
2016-02-24 32
2014-08-21 2
2014-09-03 1
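The counts reported above can be thought of as simple tabulations over the Repository's Tags. How the summary function computes them internally is not shown here; the sketch below uses only base R and a made-up Tag table. The column names (artifact, tag, createdDate) and the "class:" prefix are assumptions for the sake of illustration.

```r
# Made-up Tag table; the column names are an assumption about what a
# Repository's Tag listing looks like.
tags <- data.frame(
  artifact = c("a1", "a1", "a2", "a2", "a3", "a3"),
  tag = c("class:lm", "name:model", "class:ggplot", "name:plot1",
          "class:ggplot", "name:plot2"),
  createdDate = c("2016-02-24 10:00:00", "2016-02-24 10:00:00",
                  "2016-02-24 10:05:00", "2016-02-24 10:05:00",
                  "2014-08-21 09:00:00", "2014-08-21 09:00:00"),
  stringsAsFactors = FALSE
)

# Number of artifacts per class: keep "class:" Tags and strip the prefix
class_tags <- tags[grepl("^class:", tags$tag), ]
class_counts <- table(sub("^class:", "", class_tags$tag))

# Saves per day: one row per artifact, then count by calendar date
per_artifact <- tags[!duplicated(tags$artifact), ]
saves_per_day <- table(substr(per_artifact$createdDate, 1, 10))

print(class_counts)
print(saves_per_day)
```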
One can easily delete an existing Local or GitHub Repository. For GitHub, by default one can delete only the archivist-like Repository (the gallery folder and the backpack.db file), or the whole GitHub repository with deleteRoot = TRUE. (Yes, I have used those many times while writing this Use Case :)).
# eval = FALSE
deleteLocalRepo(repoDir = 'Gallery', deleteRoot = TRUE)
deleteGitHubRepo('Gallery', deleteRoot = TRUE)
So far archivist extracts extra Tags only for the following object classes:
# http://stackoverflow.com/a/11005886/3857701
methodsTable <- ls(.__S3MethodsTable__.,
envir = asNamespace("archivist"),
all.names = TRUE)
grep('extractTags', methodsTable, value = TRUE) %>%
data.frame(extractTags_methods = .) %>%
pander::pandoc.table()
extractTags_methods |
---|
extractTags |
extractTags.data.frame |
extractTags.default |
extractTags.ggplot |
extractTags.glmnet |
extractTags.htest |
extractTags.lda |
extractTags.lm |
extractTags.partition |
extractTags.qda |
extractTags.summary.lm |
extractTags.survfit |
extractTags.trellis |
extractTags.twins |
If you would like to create wrappers for other classes, please open an issue. Other suggestions, comments or user requests are also welcome, such as integration with Bitbucket, support for other languages, support for json/csv files, or summary plots of the Repository.
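Each entry in the table of extractTags methods is an S3 method, so supporting a new class amounts to adding one more method for it. The sketch below illustrates the mechanism with a toy generic. demo_tags and its methods are hypothetical names, and the Tags they return are invented for illustration; they do not reproduce what archivist actually extracts.

```r
# Toy generic mimicking the dispatch pattern; all names are hypothetical
demo_tags <- function(object, ...) UseMethod("demo_tags")

# Fallback used for any class without a dedicated method
demo_tags.default <- function(object, ...) {
  paste0("class:", class(object)[1])
}

# Dedicated method for linear models: also record the coefficient names
demo_tags.lm <- function(object, ...) {
  c(paste0("class:", class(object)[1]),
    paste0("coefname:", names(coef(object))))
}

fit <- lm(dist ~ speed, data = cars)
print(demo_tags(fit))   # dispatches to demo_tags.lm
print(demo_tags(1:3))   # falls back to demo_tags.default
```

A class without its own method still gets a result from the default method, which is why a missing wrapper degrades to fewer Tags rather than an error in this sketch.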
devtools::session_info()
setting value
version R version 3.2.2 (2015-08-14)
system x86_64, linux-gnu
ui X11
language English
collate pl_PL.UTF-8
tz <NA>
date 2016-02-24
package * version date source
acepack 1.3-3.3 2014-11-24 CRAN (R 3.2.0)
archivist * 2.0.1 2016-02-21 CRAN (R 3.2.2)
archivist.github * 0.1 2016-02-24 CRAN (R 3.2.2)
assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
bitops 1.0-6 2013-08-17 CRAN (R 3.2.0)
chron 2.3-47 2015-06-24 CRAN (R 3.2.1)
cluster 2.0.3 2015-07-21 CRAN (R 3.2.1)
codetools 0.2-14 2015-07-15 CRAN (R 3.2.1)
colorspace 1.2-6 2015-03-11 CRAN (R 3.2.2)
combinat 0.0-8 2012-10-29 CRAN (R 3.2.0)
curl 0.9.6 2016-02-17 CRAN (R 3.2.2)
data.table 1.9.7 2015-12-29 Github (Rdatatable/data.table@405f115)
DBI 0.3.1 2014-09-24 CRAN (R 3.2.2)
devtools 1.10.0 2016-01-23 CRAN (R 3.2.2)
digest 0.6.9 2016-01-08 CRAN (R 3.2.2)
dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.2)
evaluate 0.8 2015-09-18 CRAN (R 3.2.1)
foreach 1.4.3 2015-10-13 CRAN (R 3.2.2)
foreign 0.8-66 2015-08-19 CRAN (R 3.2.1)
formatR 1.2.1 2015-09-18 CRAN (R 3.2.1)
Formula 1.2-1 2015-04-07 CRAN (R 3.2.0)
gam 1.12 2015-05-13 CRAN (R 3.2.0)
ggbiplot 0.55 2015-09-23 Github (vqv/ggbiplot@7325e88)
ggplot2 * 2.0.0 2015-12-18 CRAN (R 3.2.2)
ggthemes 3.0.1 2016-01-10 CRAN (R 3.2.2)
git2r 0.13.1 2015-12-10 CRAN (R 3.2.2)
gridExtra 2.0.0 2015-07-14 CRAN (R 3.2.1)
gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
highr 0.5.1 2015-09-18 CRAN (R 3.2.1)
Hmisc 3.17-1 2015-12-18 CRAN (R 3.2.2)
htmltools 0.3 2015-12-29 CRAN (R 3.2.2)
httr * 1.1.0 2016-01-28 CRAN (R 3.2.2)
iterators 1.0.8 2015-10-13 CRAN (R 3.2.1)
jsonlite 0.9.19 2015-11-28 CRAN (R 3.2.2)
km.ci 0.5-2 2009-08-30 CRAN (R 3.2.0)
KMsurv 0.1-5 2012-12-03 CRAN (R 3.2.0)
knitr * 1.12.3 2016-01-22 CRAN (R 3.2.2)
labeling 0.3 2014-08-23 CRAN (R 3.2.0)
lattice 0.20-33 2015-07-14 CRAN (R 3.2.1)
latticeExtra 0.6-26 2013-08-15 CRAN (R 3.2.0)
lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.2)
lubridate 1.5.0 2015-12-03 CRAN (R 3.2.2)
magrittr * 1.5 2014-11-22 CRAN (R 3.2.0)
memoise 1.0.0 2016-01-29 CRAN (R 3.2.2)
munsell 0.4.3 2016-02-13 CRAN (R 3.2.2)
nnet 7.3-11 2015-08-30 CRAN (R 3.2.1)
pander 0.6.0 2015-11-23 CRAN (R 3.2.2)
plyr 1.8.3 2015-06-12 CRAN (R 3.2.1)
purrr 0.2.1 2016-02-13 CRAN (R 3.2.2)
R6 2.1.2 2016-01-26 CRAN (R 3.2.2)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.0)
Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.2)
RCurl 1.95-4.7 2015-06-30 CRAN (R 3.2.1)
rmarkdown 0.9.5 2016-02-22 CRAN (R 3.2.2)
rpart 4.1-10 2015-06-29 CRAN (R 3.2.1)
RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.1)
rstudioapi 0.5 2016-01-24 CRAN (R 3.2.2)
RTCGA * 1.1.14 2016-02-24 Github (RTCGA/RTCGA@7d6d667)
RTCGA.clinical * 20151101.1.0 2016-02-22 Github (RTCGA/RTCGA.clinical@0239210)
RTCGA.mutations * 20151101.0.0 2016-02-22 Github (RTCGA/RTCGA.mutations@3c3b83b)
RTCGA.rnaseq * 20151101.0.0 2016-02-22 Github (RTCGA/RTCGA.rnaseq@196d7d2)
rvest 0.3.1 2015-11-11 CRAN (R 3.2.2)
scales 0.3.0 2015-08-25 CRAN (R 3.2.1)
stringi 1.0-1 2015-10-22 CRAN (R 3.2.1)
stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
survival * 2.38-3 2015-07-02 CRAN (R 3.2.1)
survminer 0.2.0.9001 2016-02-24 local
survMisc * 0.4.6 2015-04-15 CRAN (R 3.2.0)
XML 3.98-1.3 2015-06-30 CRAN (R 3.2.1)
xml2 0.1.2 2015-09-01 CRAN (R 3.2.1)
yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
zoo 1.7-12 2015-03-16 CRAN (R 3.2.0)