SQLDataFrame 1.0.0
date: “last edit: 9/30/2019”
Load packages into R session. It will automatically load the package
of dplyr
and dbplyr
.
library(SQLDataFrame)
library(DBI)
dbfile <- system.file("extdata/test.db", package = "SQLDataFrame")
conn <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile)
obj <- SQLDataFrame(
conn = conn, dbtable = "state", dbkey = "state")
obj
#> SQLDataFrame with 50 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Alabama | East South Central South 3615 medium
#> Alaska | Pacific West 365 small
#> Arizona | Mountain West 2280 medium
#> Arkansas | West South Central South 2110 medium
#> California | Pacific West 21198 large
#> ... . ... ... ... ...
#> Virginia | South Atlantic South 4981 medium
#> Washington | Pacific West 3559 medium
#> West Virginia | South Atlantic South 1799 medium
#> Wisconsin | East North Central North Central 4589 medium
#> Wyoming | Mountain West 376 small
To make the SQLDataFrame
object as light and compact as possible,
there are only 5 slots contained in the object: tblData
, dbkey
,
dbnrows
, dbconcatKey
, indexes
. Metadata information could be
returned through these 5 slots using slot accessors or other utility
functions.
slotNames(obj)
#> [1] "dbkey" "dbnrows" "tblData" "indexes" "dbconcatKey"
dbtable(obj)
#> [1] "state"
dbkey(obj)
#> [1] "state"
tblData
slotThe tblData
slot saves the dbplyr::tbl_dbi
version of the database
table, which is a light-weight representation of the database table in
R. Of note is that this lazy tbl only contains unique rows. It could
also be sorted by the dbkey(obj)
if the SQLDataFrame
object was
generated from union
or rbind
. So when the saveSQLDataFrame()
function was called, a database table will be written into a physical
disk space and have the unique records.
Accessor function is made avaible for this slot:
tblData(obj)
#> # Source: table<state> [?? x 5]
#> # Database: sqlite 3.29.0
#> # [/tmp/RtmpQ6wsJj/Rinst5f171d59c7fb/SQLDataFrame/extdata/test.db]
#> division region state population size
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 East South Central South Alabama 3615 medium
#> 2 Pacific West Alaska 365 small
#> 3 Mountain West Arizona 2280 medium
#> 4 West South Central South Arkansas 2110 medium
#> 5 Pacific West California 21198 large
#> 6 Mountain West Colorado 2541 medium
#> 7 New England Northeast Connecticut 3100 medium
#> 8 South Atlantic South Delaware 579 small
#> 9 South Atlantic South Florida 8277 large
#> 10 South Atlantic South Georgia 4931 medium
#> # … with more rows
dbnrows
and dbconcatKey
The dbnrows
slot saves the number of rows corresponding to the
tblData
, and dbconcatKey
saves the realized (concatenated if
multiple) key columns corresponding to the tblData
. Accessor
functions are also available for these 2 slots:
dbnrows(obj)
#> [1] 50
dbconcatKey(obj)
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas"
#> [5] "California" "Colorado" "Connecticut" "Delaware"
#> [9] "Florida" "Georgia" "Hawaii" "Idaho"
#> [13] "Illinois" "Indiana" "Iowa" "Kansas"
#> [17] "Kentucky" "Louisiana" "Maine" "Maryland"
#> [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
#> [25] "Missouri" "Montana" "Nebraska" "Nevada"
#> [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
#> [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
#> [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
#> [41] "South Dakota" "Tennessee" "Texas" "Utah"
#> [45] "Vermont" "Virginia" "Washington" "West Virginia"
#> [49] "Wisconsin" "Wyoming"
indexes
slotThe indexes
slots is an unnamed list saving the row and column
indexes respectively corresponding to the tblData
slot, so that the
SQLDataFrame
could possibly have duplicate rows or only a subset of
data records from the tblData
, while the tblData
slot doesn’t need
to be changed. To be consistent, the slots of dbnrows
and
dbconcatKey
will also remain unchanged.
obj@indexes
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
obj_sub <- obj[sample(5, 3, replace = TRUE), 2:3]
obj_sub
#> SQLDataFrame with 3 rows and 2 columns
#> state | region population
#> <character> | <character> <numeric>
#> Arkansas | South 2110
#> California | West 21198
#> Alaska | West 365
obj_sub@indexes
#> [[1]]
#> [1] 4 5 2
#>
#> [[2]]
#> [1] 2 3
identical(tblData(obj), tblData(obj_sub))
#> [1] TRUE
With a filter
or select
function (which is similar to [i, ]
subsetting), only the indexes
slot will be updated for the row or
column index pointing to the tblData
.
obj_filter <- obj %>% filter(division == "South Atlantic" & size == "medium")
obj_filter@indexes
#> [[1]]
#> [1] 10 20 40 46 48
#>
#> [[2]]
#> NULL
identical(tblData(obj), tblData(obj_filter))
#> [1] TRUE
obj_select <- obj %>% select(division, size)
obj_select@indexes
#> [[1]]
#> NULL
#>
#> [[2]]
#> [1] 1 4
identical(tblData(obj), tblData(obj_select))
#> [1] TRUE
The ROWNAMES,SQLDataFrame
method was defined to return the
(concatenated if multiple) key column(s) value, so that the row
subsetting with character vector works for the SQLDataFrame
objects.
rnms <- ROWNAMES(obj)
obj[sample(rnms, 3), ]
#> SQLDataFrame with 3 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Minnesota | West North Central North Central 3921 medium
#> Wisconsin | East North Central North Central 4589 medium
#> Idaho | Mountain West 813 small
For SQLDataFrame
object with composite keys:
obj1 <- SQLDataFrame(conn = conn, dbtable = "state",
dbkey = c("region", "population"))
ROWNAMES(obj1[1:10,])
#> [1] "South:3615.0" "West:365.0" "West:2280.0"
#> [4] "South:2110.0" "West:21198.0" "West:2541.0"
#> [7] "Northeast:3100.0" "South:579.0" "South:8277.0"
#> [10] "South:4931.0"
obj1[c("South:3615.0", "West:365.0"), ]
#> SQLDataFrame with 2 rows and 3 columns
#> region population | division state size
#> <character> <numeric> | <character> <character> <character>
#> South 3615 | East South Central Alabama medium
#> West 365 | Pacific Alaska small
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] DBI_1.0.0 SQLDataFrame_1.0.0 S4Vectors_0.24.0
#> [4] BiocGenerics_0.32.0 dbplyr_1.4.2 dplyr_0.8.3
#> [7] BiocStyle_2.14.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.2 pillar_1.4.2 compiler_3.6.1
#> [4] BiocManager_1.30.9 tools_3.6.1 zeallot_0.1.0
#> [7] digest_0.6.22 bit_1.1-14 memoise_1.1.0
#> [10] evaluate_0.14 RSQLite_2.1.2 tibble_2.1.3
#> [13] pkgconfig_2.0.3 rlang_0.4.1 cli_1.1.0
#> [16] yaml_2.2.0 xfun_0.10 stringr_1.4.0
#> [19] knitr_1.25 vctrs_0.2.0 bit64_0.9-7
#> [22] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
#> [25] fansi_0.4.0 rmarkdown_1.16 bookdown_0.14
#> [28] purrr_0.3.3 blob_1.2.0 magrittr_1.5
#> [31] backports_1.1.5 htmltools_0.4.0 assertthat_0.2.1
#> [34] utf8_1.1.4 stringi_1.4.3 lazyeval_0.2.2
#> [37] crayon_1.3.4