There are numerous packages that already interface with the AWS S3
protocol for object storage. Most rely directly on calls to the
low-level S3 REST
API
through R packages such as curl
or httr
, which requires significant
amounts of code to provide high-level functionality (e.g. handling
authentication, paging over results, parsing returned XML), and is thus
prone to inefficiency and bugs. Many also implicitly assume that Amazon
is the underlying provider, making it difficult or impossible to work
with a substantial and growing number of object stores now conform to
the AWS S3 standard. These include NSF’s
OpenStorageNetwork,
Jetstream2
(both based on open source Redhat
CEPH),
NCAR’s Stratus (based on
Western Digital S3), and MinIO Servers (another open
source implementation popular with companies and developers), as well as
Google Cloud Storage’s S3 compatibility mode.
In contrast, the MinIO Client, an open-source, AGPL-v3 software developed in the Go language by the MinIO team, provides a high-performance utility with intuitive design for working across multiple cloud-based object stores as well as local filesystems. This package provides a thin R wrapper around that client – maximizing performance and minimizing potential for maintenance and bugs. A helper utility provides a convenient way to install and update the golang binary across operating systems and architectures. The client supports parallel threads by default, intuitive handling of bucket permissions such as granting or revoking anonymous access, and persistent configurations across multiple clouds. After struggling against the limitations of many different R wrappers for S3 object stores, this is now my go-to.
You can install the development version of minioclient
from
GitHub with:
# install.packages("devtools")
devtools::install_github("cboettig/minioclient")
At first use, all operations will attempt to install the client (after
prompting) if not already installed. Users can also install latest
version of the minio client can be installed using install_mc
.
library(minioclient)
install_mc()
The MinIO client is designed to support multiple endpoints for cloud storage, including AWS, Google Cloud Storage (via S3-compatibility), and other S3 compatible clients such as open source MinIO or Redhat CEPH storage systems. MinIO uses a syntax based around aliases to allow access across multiple platforms. Aliases can be configured using access key pairs to allow authenticated access.
By default, the client comes pre-configured with credentials for the
MinIO play
platform, designed for public experimental storage and
examples. We can use mc_alias_ls()
to see all clients, specify the
client we want:
mc_alias_ls("play")
Some S3 object storage systems allow access without credentials.
Confusingly, attempting to access public data with invalid credentials
will still fail, so we need to specify an anonymous endpoint with no
credentials. By default, mc_alias_set
will seek to use
AWS_S3_ENDPOINT
, AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
in
your environment, if set. This allows minioclient
to be used in
scripts with authentication keys passed in securely as environmental
variables. To set an anonymous access, simply indicate empty
credentials, like so:
mc_alias_set("anon", "s3.amazonaws.com", access_key = "", secret_key = "")
Configuration of aliases is stored in a persistent configuration file,
so aliases need be created only once on a given machine. All mc
functions specify which cloud provider using a filepath notation,
<ALIAS>/<BUCKET>/<PATH>
. For instance, we can list all objects found
in the bucket gbif-open-data-us-east-1
, which is a public
bucket included in the AWS Open
Data Registry:
mc_ls("anon/gbif-open-data-us-east-1")
#> [1] "index.html" "occurrence/"
All mc
functions can also understand local filesystem paths. Any
absolute path (path starting with /
), or any relative path not
recognized as a registered alias (Note: be careful not to have local
folders using the same name as remote aliases!) will be interpreted as a
local path. For instance, we can list the contents of the local R/
directory:
mc_ls("R")
#> [1] "install_mc.R" "mc.R" "mc_alias.R" "mc_anonymous.R"
#> [5] "mc_cat.R" "mc_config_set.R" "mc_cp.R" "mc_diff.R"
#> [9] "mc_du.R" "mc_head.R" "mc_ls.R" "mc_mb.R"
#> [13] "mc_mirror.R" "mc_mv.R" "mc_rb.R" "mc_rm.R"
#> [17] "mc_sql.R" "mc_stat.R"
This notation makes it easy to move data between local and remote
systems, or even between two remote systems. Let’s copy the index.html
file from GBIF to our local file system.
mc_cp("anon/gbif-open-data-us-east-1/index.html", "gbif.html")
Just to prove this is indeed a local copy, we can list local directory:
fs::file_info("gbif.html")
#> # A tibble: 1 × 18
#> path type size permissions modification_time user group device_id
#> <fs::path> <fct> <fs::b> <fs::perms> <dttm> <chr> <chr> <dbl>
#> 1 gbif.html file 31.6K rw-r--r-- 2023-11-05 22:54:15 cboe… cboe… 66307
#> # ℹ 10 more variables: hard_links <dbl>, special_device_id <dbl>, inode <dbl>,
#> # block_size <dbl>, blocks <dbl>, flags <int>, generation <dbl>,
#> # access_time <dttm>, change_time <dttm>, birth_time <dttm>
For any object store where we have adequate permissions, we can create new buckets:
random_name <- paste0(sample(letters, 12, replace = TRUE), collapse = "")
play_bucket <- paste0("play/play-", random_name)
mc_mb(play_bucket)
#> Bucket created successfully `play/play-hmdzuvevfzdi`.
We can copy files or directories to the remote bucket:
mc_cp("anon/gbif-open-data-us-east-1/index.html", play_bucket)
mc_cp("R/", play_bucket, recursive = TRUE, verbose = TRUE)
#> `/home/cboettig/cboettig/minioclient/R/mc.R` -> `play/play-hmdzuvevfzdi/mc.R`
#> `/home/cboettig/cboettig/minioclient/R/install_mc.R` -> `play/play-hmdzuvevfzdi/install_mc.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_alias.R` -> `play/play-hmdzuvevfzdi/mc_alias.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_anonymous.R` -> `play/play-hmdzuvevfzdi/mc_anonymous.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_config_set.R` -> `play/play-hmdzuvevfzdi/mc_config_set.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_cat.R` -> `play/play-hmdzuvevfzdi/mc_cat.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_cp.R` -> `play/play-hmdzuvevfzdi/mc_cp.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_diff.R` -> `play/play-hmdzuvevfzdi/mc_diff.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_du.R` -> `play/play-hmdzuvevfzdi/mc_du.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_head.R` -> `play/play-hmdzuvevfzdi/mc_head.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_ls.R` -> `play/play-hmdzuvevfzdi/mc_ls.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mb.R` -> `play/play-hmdzuvevfzdi/mc_mb.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mirror.R` -> `play/play-hmdzuvevfzdi/mc_mirror.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mv.R` -> `play/play-hmdzuvevfzdi/mc_mv.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_rb.R` -> `play/play-hmdzuvevfzdi/mc_rb.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_rm.R` -> `play/play-hmdzuvevfzdi/mc_rm.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_sql.R` -> `play/play-hmdzuvevfzdi/mc_sql.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_stat.R` -> `play/play-hmdzuvevfzdi/mc_stat.R`
#> Total: 0 B, Transferred: 22.00 KiB, Speed: 314.03 KiB/s
Note the use of recursive = TRUE
to transfer all objects matching the
pattern. In S3 object stores, file paths are really just prefixes, thus
this query includes not only everything in the R
folder, but also
README.md
, since it also matches the prefix. (Had we used the prefix
R/
, README.md
would not be matched and the R scripts would go
directly into play_bucket
root instead of an R/
sub-path.)
We can examine disk usage of remote objects or directories:
mc_du(play_bucket)
We can also adjust permissions for anonymous access:
mc_anonymous_set(play_bucket, "download")
Public objects can be accessed directly over HTTPS connection using the endpoint URL, bucket name and path:
bucket <- basename(play_bucket) # strip alias from path
# use full domain name as prefix instead:
public_url <- paste0("https://play.min.io/", bucket, "/index.html")
download.file(public_url, "index.html", quiet = TRUE)
Any command supported by the minio client can be accessed using the
function mc()
. This function can be used in place of any of the above
methods, or to access additional methods where no wrapper exists, see
mc("-h")
for complete list. R functions such as mc_ls()
are merely
helpful wrappers around the more generic mc()
utility,
e.g. mc("ls play")
is equivalent to mc_ls("play")
. Providing helper
methods allows tab-completion discovery of functions, R-based
documentation, and improved handling of display behavior
(e.g. verbose=FALSE
by default on certain commands.) See official mc
client
docs
for details.
In addition to usual R documentation, users can display full help
information for any method using the argument "-h"
. This includes
details on optional flags and further examples.
mc_du("-h")
We can now use arbitrary mc
commands (see
quickstart).
For example, examine file information to confirm that eTags (md5sums
here) match for these objects:
mc(paste("stat", "anon/gbif-open-data-us-east-1/index.html", paste0(play_bucket, "/index.html")))