BiocPkgTools 1.2.0
Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.
Functionality includes access to build reports, download statistics, package details, and package dependency graphs.
The Bioconductor build reports are available online as HTML pages.
However, they are not easy to process programmatically.
The biocBuildReport function does some heroic parsing of the HTML
to produce a tidy data.frame for further processing in R.
library(BiocPkgTools)
head(biocBuildReport())
## # A tibble: 6 x 9
## pkg version author commit last_changed_date node stage result
## <chr> <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 a4 1.31.0 Tobia… a53c… 2018-10-30 00:00:00 malb… inst… OK
## 2 a4 1.31.0 Tobia… a53c… 2018-10-30 00:00:00 malb… buil… OK
## 3 a4 1.31.0 Tobia… a53c… 2018-10-30 00:00:00 malb… chec… OK
## 4 a4 1.31.0 Tobia… a53c… 2018-10-30 00:00:00 toka… inst… OK
## 5 a4 1.31.0 Tobia… a53c… 2018-10-30 00:00:00 toka… buil… OK
## 6 a4 1.31.0 Tobia… a53c… 2018-10-30 00:00:00 toka… chec… OK
## # … with 1 more variable: bioc_version <chr>
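Because each row of the build report describes one package, node, and stage combination, ordinary data frame operations are enough to summarize it. The following is a minimal sketch on a hand-made data frame that borrows the column names shown above; the package names, stage labels, and results are invented for illustration so that the code runs without network access.

```r
# Toy stand-in for a few rows of biocBuildReport(); all values are invented.
report <- data.frame(
  pkg    = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgB"),
  node   = rep("node1", 6),
  stage  = rep(c("install", "build", "check"), 2),  # hypothetical stage labels
  result = c("OK", "OK", "WARNING", "OK", "ERROR", "skipped"),
  stringsAsFactors = FALSE
)

# Cross-tabulate results by build stage.
result_counts <- table(report$stage, report$result)

# Keep only the rows whose result is not "OK".
not_ok <- report[report$result != "OK", ]
not_ok_pkgs <- unique(not_ok$pkg)
```

The same pattern should carry over to the real tibble, for example by replacing `report` with the value returned by biocBuildReport().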
Because developers may be interested in a quick view of their own
packages, there is a simple function, problemPage, to produce an HTML report of
the build status of packages matching a given author regex. The default is
to report only “problem” build statuses (ERROR, WARNING).
problemPage()
When run in an interactive environment, the problemPage function
will open a browser window for user interaction. Note that if you want
to include all your package results, not just the broken ones, simply
specify includeOK = TRUE.
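The filtering described above (an author regex plus a restriction to problem statuses) can be sketched with base R. This is not the implementation of problemPage, only an illustration of the selection logic on invented data; the author names and the `author_pattern` regex are hypothetical.

```r
# Toy build results; authors, packages, and results are invented.
report <- data.frame(
  pkg    = c("pkgA", "pkgB", "pkgC", "pkgD"),
  author = c("Jane Doe", "John Smith", "Jane Doe, John Smith", "Alex Roe"),
  result = c("OK", "ERROR", "WARNING", "ERROR"),
  stringsAsFactors = FALSE
)

author_pattern  <- "J.*Doe"               # hypothetical author regex
problem_results <- c("ERROR", "WARNING")  # the "problem" statuses named above

# Step 1: keep packages whose author field matches the regex.
by_author <- report[grepl(author_pattern, report$author), ]

# Step 2: keep only problem statuses (skipping this step keeps the OK rows too).
problems <- by_author[by_author$result %in% problem_results, ]
```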
Bioconductor supplies download stats for all packages. The biocDownloadStats
function retrieves all available download stats for every Software,
Annotation Data, and Experiment Data package. The results
are returned as a tidy data.frame for further analysis.
head(biocDownloadStats())
## # A tibble: 6 x 7
## Package Year Month Nb_of_distinct_IPs Nb_of_downloads repo Date
## <chr> <int> <chr> <int> <int> <chr> <date>
## 1 ABarray 2019 Jan 104 210 Software 2019-01-01
## 2 ABarray 2019 Feb 80 164 Software 2019-02-01
## 3 ABarray 2019 Mar 144 192 Software 2019-03-01
## 4 ABarray 2019 Apr 140 259 Software 2019-04-01
## 5 ABarray 2019 May 0 0 Software 2019-05-01
## 6 ABarray 2019 Jun 0 0 Software 2019-06-01
The download statistics reported are for all available versions of a package. There are no separate, publicly available statistics broken down by version.
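As a small illustration of further analysis, the sketch below re-enters the six ABarray rows printed above as a plain data frame and aggregates them. Only the columns needed for the calculation are reproduced; the approach itself (base R `aggregate`) is one option among several.

```r
# The six rows shown in the output above, typed in by hand.
stats <- data.frame(
  Package         = "ABarray",
  Year            = 2019L,
  Month           = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
  Nb_of_downloads = c(210L, 164L, 192L, 259L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Total downloads per package and year.
yearly <- aggregate(Nb_of_downloads ~ Package + Year, data = stats, FUN = sum)

# Mean monthly downloads, ignoring months with no recorded downloads.
nonzero <- stats$Nb_of_downloads[stats$Nb_of_downloads > 0]
monthly_mean <- mean(nonzero)
```

Note that distinct IP counts should not be summed across months in the same way, since the same IP address may appear in more than one month.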
The R DESCRIPTION file contains a plethora of information regarding package
authors, dependencies, versions, etc. In a repository such as Bioconductor, these
details are available in bulk for all included packages. The biocPkgList function
returns a data.frame with a row for each package. A great deal of information is
available, as evidenced by the column names of the results.
bpi = biocPkgList()
colnames(bpi)
## [1] "Package" "Version"
## [3] "Depends" "Suggests"
## [5] "License" "MD5sum"
## [7] "NeedsCompilation" "Title"
## [9] "Description" "biocViews"
## [11] "Author" "Maintainer"
## [13] "git_url" "git_branch"
## [15] "git_last_commit" "git_last_commit_date"
## [17] "Date/Publication" "source.ver"
## [19] "win.binary.ver" "mac.binary.el-capitan.ver"
## [21] "vignettes" "vignetteTitles"
## [23] "hasREADME" "hasNEWS"
## [25] "hasINSTALL" "hasLICENSE"
## [27] "Rfiles" "Enhances"
## [29] "dependsOnMe" "Imports"
## [31] "importsMe" "suggestsMe"
## [33] "LinkingTo" "Archs"
## [35] "VignetteBuilder" "URL"
## [37] "SystemRequirements" "BugReports"
## [39] "Video" "linksToMe"
## [41] "OS_type" "License_restricts_use"
## [43] "PackageStatus" "License_is_FOSS"
## [45] "organism"
Some of the variables are parsed to produce list columns.
head(bpi)
## # A tibble: 6 x 45
## Package Version Depends Suggests License MD5sum NeedsCompilation Title
## <chr> <chr> <list> <list> <chr> <chr> <chr> <chr>
## 1 a4 1.31.0 <chr [… <chr [4… GPL-3 31072… no Auto…
## 2 a4Base 1.31.0 <chr [… <chr [2… GPL-3 2dec7… no Auto…
## 3 a4Clas… 1.31.0 <chr [… <chr [1… GPL-3 4bbcd… no Auto…
## 4 a4Core 1.31.0 <chr [… <chr [1… GPL-3 a2c0c… no Auto…
## 5 a4Prep… 1.31.0 <chr [… <chr [2… GPL-3 087b7… no Auto…
## 6 a4Repo… 1.31.0 <chr [… <chr [1… GPL-3 1635a… no Auto…
## # … with 37 more variables: Description <chr>, biocViews <list>,
## # Author <list>, Maintainer <list>, git_url <chr>, git_branch <chr>,
## # git_last_commit <chr>, git_last_commit_date <chr>,
## # `Date/Publication` <chr>, source.ver <chr>, win.binary.ver <chr>,
## # `mac.binary.el-capitan.ver` <chr>, vignettes <list>,
## # vignetteTitles <list>, hasREADME <chr>, hasNEWS <chr>, hasINSTALL <chr>,
## # hasLICENSE <chr>, Rfiles <list>, Enhances <list>, dependsOnMe <list>,
## # Imports <list>, importsMe <list>, suggestsMe <list>, LinkingTo <list>,
## # Archs <list>, VignetteBuilder <chr>, URL <chr>,
## # SystemRequirements <chr>, BugReports <chr>, Video <chr>,
## # linksToMe <list>, OS_type <chr>, License_restricts_use <chr>,
## # PackageStatus <chr>, License_is_FOSS <chr>, organism <chr>
As a simple example of how these columns can be used, the code below extracts
the importsMe column to find the packages that import the
GEOquery package.
library(dplyr)
bpi = biocPkgList()
bpi %>%
    filter(Package == "GEOquery") %>%
    pull(importsMe) %>%
    unlist()
## [1] "bigmelon" "ChIPXpress" "coexnet" "crossmeta"
## [5] "EGAD" "GAPGOM" "GSEABenchmarkeR" "MACPET"
## [9] "minfi" "MoonlightR" "phantasus" "recount"
## [13] "SRAdb"
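List columns can also be queried in the other direction. The sketch below builds a tiny data frame with an Imports list column (the package names and their imports are invented) and recovers "who imports pkgC" from it, which is the kind of information the importsMe column provides ready-made.

```r
# Toy package table with a list column; contents are invented.
pkgs <- data.frame(Package = c("pkgA", "pkgB", "pkgC"),
                   stringsAsFactors = FALSE)
pkgs$Imports <- list("pkgC", c("pkgA", "pkgC"), character(0))

# For each row, test whether "pkgC" appears among the imports.
imports_target <- vapply(pkgs$Imports, function(x) "pkgC" %in% x, logical(1))
importers <- pkgs$Package[imports_target]

# Number of imports per package, computed directly from the list column.
n_imports <- lengths(pkgs$Imports)
```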
For the end user of Bioconductor, an analysis often starts with finding a
package or set of packages that perform required tasks or are tailored
to a specific operation or data type. The biocExplore() function
implements an interactive bubble visualization with filtering based on
biocViews terms. Bubbles are sized based on download statistics. Tooltip
and detail-on-click capabilities are included. To start a local session:
biocExplore()
The Bioconductor ecosystem is built around the concept of interoperability
and dependencies. These interdependencies are available as part of the
biocPkgList() output. The BiocPkgTools package provides some convenience
functions to convert package dependencies to R graphs. A modular approach leads
to the following workflow.
1. Build a data.frame of dependencies using buildPkgDependencyDataFrame.
2. Create an igraph object from the dependency data frame using buildPkgDependencyIgraph.
3. Use igraph functionality to perform arbitrary network operations. Convenience functions, inducedSubgraphByPkgs and subgraphByDegree, are available.

A dependency graph for all of Bioconductor is a starting place.
library(BiocPkgTools)
dep_df = buildPkgDependencyDataFrame()
g = buildPkgDependencyIgraph(dep_df)
g
## IGRAPH a244fce DN-- 3113 25939 --
## + attr: name (v/c), edgetype (e/c)
## + edges from a244fce (vertex names):
## [1] a4 ->a4Base a4 ->a4Preproc
## [3] a4 ->a4Classif a4 ->a4Core
## [5] a4 ->a4Reporting a4Base ->methods
## [7] a4Base ->graphics a4Base ->grid
## [9] a4Base ->Biobase a4Base ->AnnotationDbi
## [11] a4Base ->annaffy a4Base ->mpm
## [13] a4Base ->genefilter a4Base ->limma
## [15] a4Base ->multtest a4Base ->glmnet
## + ... omitted several edges
library(igraph)
head(V(g))
## + 6/3113 vertices, named, from a244fce:
## [1] a4 a4Base a4Classif a4Core a4Preproc a4Reporting
head(E(g))
## + 6/25939 edges from a244fce (vertex names):
## [1] a4 ->a4Base a4 ->a4Preproc a4 ->a4Classif
## [4] a4 ->a4Core a4 ->a4Reporting a4Base->methods
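To make the direction of the edges concrete, here is a minimal sketch that builds a small igraph object from a hand-made dependency data frame and counts incoming edges. The column names `Package` and `dependency` and all rows are assumptions chosen for illustration; only the `edgetype` attribute name is taken from the output above.

```r
library(igraph)

# Toy dependency table: one row per (package, dependency) pair; values invented.
dep_df <- data.frame(
  Package    = c("pkgA", "pkgA", "pkgB", "pkgC"),
  dependency = c("pkgB", "pkgC", "pkgC", "methods"),
  edgetype   = c("Imports", "Depends", "Imports", "Depends"),
  stringsAsFactors = FALSE
)

# The first two columns become edges (Package -> dependency);
# remaining columns become edge attributes.
g_toy <- graph_from_data_frame(dep_df, directed = TRUE)

# In-degree counts how many packages point at each vertex,
# i.e. how many packages depend on it in this toy graph.
n_dependents  <- degree(g_toy, mode = "in")
most_depended <- names(which.max(n_dependents))
```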
See inducedSubgraphByPkgs and subgraphByDegree to produce
subgraphs based on a subset of packages.
See the igraph documentation for more detail on graph analytics, setting vertex and edge attributes, and advanced subsetting.
The visNetwork package is a nice interactive visualization tool that implements graph plotting in a browser. It can be integrated into shiny applications. Interactive graphs can also be included in R Markdown documents (see vignette).
igraph_network = buildPkgDependencyIgraph(buildPkgDependencyDataFrame())
The full dependency graph is really not that informative to look at, though doing so is possible. A common use case is to visualize the graph of dependencies “centered” on a package of interest. In this case, I will focus on the GEOquery package.
igraph_geoquery_network = subgraphByDegree(igraph_network, "GEOquery")
The subgraphByDegree() function returns all nodes and connections within
degree of the named package; the default degree is 1.
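The idea of "all nodes and connections within a given degree" can be illustrated with plain igraph calls. The sketch below is not the source of subgraphByDegree; it is an assumed, roughly comparable construction using `ego` and `induced_subgraph` on an invented four-edge graph.

```r
library(igraph)

# Toy directed graph: a -> b -> c -> d, plus e -> b. Vertex names are invented.
edges <- data.frame(from = c("a", "b", "c", "e"),
                    to   = c("b", "c", "d", "b"),
                    stringsAsFactors = FALSE)
g_toy <- graph_from_data_frame(edges, directed = TRUE)

# Vertices within one step of "b", following edges in either direction.
nbhd <- ego(g_toy, order = 1, nodes = "b", mode = "all")[[1]]

# Keep those vertices and every edge that runs between them.
sub <- induced_subgraph(g_toy, nbhd)
```

Here "d" is two steps away from "b" and is therefore dropped; raising `order` to 2 would bring it back in.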
The visNetwork package can plot igraph objects directly, but more flexibility
is offered by first converting the graph to visNetwork form.
library(visNetwork)
data <- toVisNetworkData(igraph_geoquery_network)
The next few code chunks highlight just a few examples of the visNetwork capabilities, starting with a basic plot.
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")
For fun, we can watch the graph stabilize during drawing, best viewed interactively.
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visPhysics(stabilization = FALSE)
Add arrows and colors to better capture dependencies.
data$edges$color = 'lightblue'
data$edges[data$edges$edgetype == 'Imports', 'color'] = 'red'
data$edges[data$edges$edgetype == 'Depends', 'color'] = 'green'
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows = 'from')
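When there are more than a couple of edge types, a named lookup vector can replace the repeated assignments. The sketch below applies the same color scheme to an invented edge table; the `edge_colors` name and the rows are assumptions for illustration.

```r
# Toy edge table with an edgetype column; rows are invented.
edges <- data.frame(
  from     = c("pkgA", "pkgA", "pkgB", "pkgC"),
  to       = c("pkgB", "pkgC", "pkgC", "pkgD"),
  edgetype = c("Depends", "Imports", "Suggests", "Imports"),
  stringsAsFactors = FALSE
)

# Named vector mapping edge type to color; other types fall back to lightblue.
edge_colors <- c(Depends = "green", Imports = "red")
edges$color <- unname(edge_colors[edges$edgetype])
edges$color[is.na(edges$color)] <- "lightblue"
```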
Add a legend.
ledges <- data.frame(color = c("green", "lightblue", "red"),
                     label = c("Depends", "Suggests", "Imports"),
                     arrows = c("from", "from", "from"))
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows = 'from') %>%
    visLegend(addEdges = ledges)
[Work in progress]
The biocViews package is a small ontology of terms describing Bioconductor packages. This is a work-in-progress section, but here is a small example of plotting the biocViews graph.
library(biocViews)
data(biocViewsVocab)
biocViewsVocab
## A graphNEL graph with directed edges
## Number of Nodes = 476
## Number of Edges = 475
library(igraph)
g = igraph.from.graphNEL(biocViewsVocab)
library(visNetwork)
gv = toVisNetworkData(g)
visNetwork(gv$nodes, gv$edges, width = "100%") %>%
    visIgraphLayout(layout = "layout_as_tree", circular = TRUE) %>%
    visNodes(size = 20) %>%
    visPhysics(stabilization = FALSE)
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biocViews_1.52.0 visNetwork_2.0.6 igraph_1.2.4.1
## [4] dplyr_0.8.0.1 BiocPkgTools_1.2.0 htmlwidgets_1.3
## [7] knitr_1.22 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 compiler_3.6.0 pillar_1.3.1
## [4] BiocManager_1.30.4 bitops_1.0-6 tools_3.6.0
## [7] digest_0.6.18 jsonlite_1.6 evaluate_0.13
## [10] tibble_2.1.1 pkgconfig_2.0.2 rlang_0.3.4
## [13] graph_1.62.0 rex_1.1.2 cli_1.1.0
## [16] curl_3.3 yaml_2.2.0 parallel_3.6.0
## [19] xfun_0.6 httr_1.4.0 stringr_1.4.0
## [22] xml2_1.2.0 hms_0.4.2 stats4_3.6.0
## [25] DT_0.5 tidyselect_0.2.5 glue_1.3.1
## [28] Biobase_2.44.0 R6_2.4.0 gh_1.0.1
## [31] fansi_0.4.0 XML_3.98-1.19 RBGL_1.60.0
## [34] rmarkdown_1.12 bookdown_0.9 tidyr_0.8.3
## [37] readr_1.3.1 purrr_0.3.2 magrittr_1.5
## [40] htmltools_0.3.6 BiocGenerics_0.30.0 rvest_0.3.3
## [43] RUnit_0.4.32 assertthat_0.2.1 utf8_1.1.4
## [46] stringi_1.4.3 lazyeval_0.2.2 RCurl_1.95-4.12
## [49] crayon_1.3.4