lmutils.r

Installation

lmutils is built with Rust, so install the Rust toolchain first (for example, with rustup):

# select option 1, default installation
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then install the package using the following command:

install.packages("https://github.com/GMELab/lmutils.r/archive/refs/heads/master.tar.gz", repos=NULL) # use .zip for Windows
# OR
devtools::install_github("mrvillage/lmutils.r")

Important Information

Terms

Matrix convertable object - a data frame, matrix, file name (to read from), a numeric column vector, or a Mat object.
List of matrix convertable objects - a list of matrix convertable objects, a character vector of file names (to read from), or a single matrix convertable object.
Standard output file - a character vector of file names matching the length of the inputs, or NULL to return the output. If a single input, not in a list, was provided, the output will not be in a list.

File Types

csv (requires column headers)
tsv (requires column headers)
txt (requires column headers)
json
cbor
rkyv
rdata (NOTE: these files can only be processed sequentially, not in parallel like the rest)

All files can optionally be compressed with gzip. rdata files are assumed to be compressed, whether or not they have a .gz file extension.
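As a rough illustration of the header and gzip requirements above, the following base R sketch writes a small gzip-compressed CSV file with a header row. It does not use lmutils at all; the file name and the columns eid and x are made up for the example.

```r
# Small table with a header row, since csv/tsv/txt inputs require column headers.
df <- data.frame(eid = c(1, 2, 3), x = c(0.5, 1.5, 2.5))

# Hypothetical file name; the .gz suffix marks the file as gzip-compressed.
path <- file.path(tempdir(), "matrix1.csv.gz")

# gzfile() opens a gzip-compressing connection; write.csv() emits the header row.
con <- gzfile(path, "w")
write.csv(df, con, row.names = FALSE)
close(con)

# Read the compressed file back through a gzip connection to check the round trip.
back <- read.csv(gzfile(path))
```

A file written this way should be usable wherever a file name is accepted as a matrix convertable object, assuming the extension is what identifies the file type.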

Introduction

lmutils is an R package that provides utilities for working with matrices and data frames. It is built on top of the Rust programming language for performance and safety. The package provides a way to store matrices in memory and perform operations on them, as well as functions for working with data frames.

lmutils is built primarily around the Mat object. These are designed to be used to perform operations on matrices without loading them into memory until necessary. This can be useful for working with lots of large matrices, like hundreds of gene blocks.

To get started with your first Mat object, you can use the following code:

mat <- lmutils::Mat$new("matrix1.csv")

This will create a new Mat object from a file. You can then perform operations on this object, like combining it with other matrices, removing columns, or standardizing the columns. If you want this matrix to be loaded into R, you can use the r method:

mat$combine_columns("matrix2.csv")
mat$remove_columns(c(1, 2, 3))
mat$standardize_columns()
m <- mat$r()

You can also pass the object directly into functions that accept a matrix convertable object. It will then be loaded automatically (with all the stored operations applied) only when needed.

lmutils::calculate_r2(
    mat,
    "outcomes1.RData"
)

Example

outcomes <- lmutils::Mat$new("outcomes.RData")
geneBlocks <- lapply(c(
    "geneBlock1.csv",
    "geneBlock2.csv",
    "geneBlock3.csv",
    "geneBlock4.csv",
    "geneBlock5.csv"
), function(mat) {
    mat <- lmutils::Mat$new(mat)
    mat$match_to_by_name(outcomes$col("eid"), "IID", 0)
    mat$remove_column("IID")
    mat$min_column_sum(2)
    mat$na_to_column_mean()
    mat$standardize_columns()
    mat
})
outcomes$remove_column("eid")
results <- lmutils::calculate_r2(geneBlocks, outcomes)

`Mat` Objects

lmutils::Mat objects are a way to store matrices in memory and perform operations on them. They can be used to store operations or chain operations together for later execution. This can be useful if, for example, you wish to load a hundred large matrices from files and standardize them all before using lmutils::calculate_r2. Using Mat objects, you can store the operations you wish to perform, and Mat will execute them only when the matrix is loaded.
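The idea of storing operations and running them only on load can be sketched in plain R with closures. This is only a conceptual illustration, not how Mat is implemented (Mat is backed by Rust); lazy_mat, add_op, and the loader function are hypothetical names invented for this example.

```r
# Hypothetical helper: records operations and applies them only when r() is called.
lazy_mat <- function(loader) {
  ops <- list()
  list(
    # Queue an operation (a function from matrix to matrix) without running it.
    add_op = function(f) {
      ops[[length(ops) + 1]] <<- f
      invisible(NULL)
    },
    # Load the matrix, then apply every queued operation in order.
    r = function() {
      m <- loader()
      for (f in ops) m <- f(m)
      m
    }
  )
}

# Count how often the (pretend) file is actually loaded.
loads <- 0
lazy <- lazy_mat(function() {
  loads <<- loads + 1
  matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
})

lazy$add_op(function(m) m[, -1, drop = FALSE]) # drop the first column
lazy$add_op(function(m) scale(m))              # standardize the remaining column

loads_before_r <- loads # still 0: nothing has been loaded yet
m <- lazy$r()           # the load and both operations happen here
```

Queuing operations costs nothing until the matrix is requested, which is the property that makes this pattern attractive when many large files are involved.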

Passing the same Mat object multiple times in a single function call may cause undefined behavior. For example, the following code may not work as expected:

mat <- lmutils::Mat$new("matrix1.csv")
lmutils::calculate_r2(list(mat, mat), mat)

`lmutils::Mat$new`

Creates a new Mat object.

data is a matrix convertable object.

mat <- lmutils::Mat$new("matrix1.csv")

`lmutils::Mat$r`

Loads the matrix from the Mat object.

m <- mat$r()

`lmutils::Mat$col`

Get a column by name or index.

col <- mat$col("eid")
col <- mat$col(1)

`lmutils::Mat$save`

Saves the matrix to a file.

file is the file name to write to.

mat$save("matrix1.rkyv.gz")

`lmutils::Mat$combine_columns`

Combines this matrix with other matrices by columns. (cbind)

data is a list of matrix convertable objects.

mat$combine_columns("matrix2.csv")

`lmutils::Mat$combine_rows`

Combines this matrix with other matrices by rows. (rbind)

data is a list of matrix convertable objects.

mat$combine_rows("matrix2.csv")

`lmutils::Mat$remove_columns`

Removes columns from the matrix.

columns is a vector of column indices (1-based) to remove.

mat$remove_columns(c(1, 2, 3))

`lmutils::Mat$remove_column`

Removes a column from the matrix by name.

column is the column name to remove.

mat$remove_column("eid")

`lmutils::Mat$remove_column_if_exists`

Removes a column from the matrix by name if it exists.

column is the column name to remove.

mat$remove_column_if_exists("eid")

`lmutils::Mat$remove_rows`

Removes rows from the matrix.

rows is a vector of row indices (1-based) to remove.

mat$remove_rows(c(1, 2, 3))

`lmutils::Mat$transpose`

Transposes the matrix.

mat$transpose()

`lmutils::Mat$sort`

Sort by the column at the given index.

by is the column index (1-based) to sort by.

mat$sort(1)

`lmutils::Mat$sort_by_name`

Sort by the column with the given name.

by is the column name to sort by.

mat$sort_by_name("eid")

`lmutils::Mat$sort_by_order`

Sort by the given order of rows.

order is a vector of row indices (1-based) to sort by.

mat$sort_by_order(c(3, 2, 1))

`lmutils::Mat$dedup`

Deduplicate the matrix by a column.

by is the column index (1-based) to deduplicate by.

mat$dedup(1)
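For intuition, deduplicating by a column can be written in base R as below. The sketch assumes the first occurrence of each value is kept, which is what the lmutils::dedup description states; the matrix and its column names are made up.

```r
# Example matrix with repeated values in the key column "eid".
m <- cbind(eid = c(1, 2, 2, 3, 1), x = c(10, 20, 21, 30, 11))

# duplicated() flags every repeat after the first occurrence of a value,
# so negating it keeps exactly the first row for each eid.
deduped <- m[!duplicated(m[, "eid"]), , drop = FALSE]
```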

`lmutils::Mat$dedup_by_name`

Deduplicate the matrix by a column name.

by is the column name to deduplicate by.

mat$dedup_by_name("eid")

`lmutils::Mat$match_to`

Match the rows of the matrix to the values in a vector by a column.

with is a numeric vector to match the rows to.
by is the column index (1-based) to match the rows by.
join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.

mat$match_to(c(1, 2, 3), 1, 0)

`lmutils::Mat$match_to_by_name`

Match the rows of the matrix to the values in a vector by a column name.

with is a numeric vector to match the rows to.
by is the column name to match the rows by.
join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.

mat$match_to_by_name(c(1, 2, 3), "eid", 0)

`lmutils::Mat$join`

Join the matrix with another matrix by a column.

other is a matrix convertable object.
by is the column index (1-based) to join by.
join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.

mat$join("matrix2.csv", 1, 0)
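The four join codes map naturally onto the arguments of base R's merge(). The sketch below is an analogy only: join_by_code is a hypothetical helper, the data frames are made up, and merge() fills unmatched rows with NA where the description above says a left or right join raises an error.

```r
left  <- data.frame(eid = c(1, 2, 3), a = c(10, 20, 30))
right <- data.frame(eid = c(2, 3, 4), b = c(200, 300, 400))

# Hypothetical helper: translate the documented join codes into merge() calls.
join_by_code <- function(x, y, by, join) {
  switch(as.character(join),
    "0" = merge(x, y, by = by),               # inner: keys present in both
    "1" = merge(x, y, by = by, all.x = TRUE), # left: every row of x
    "2" = merge(x, y, by = by, all.y = TRUE), # right: every row of y
    "3" = merge(x, y, by = by, all = TRUE),   # full: every row of either
    stop("join must be 0, 1, 2, or 3")
  )
}

inner      <- join_by_code(left, right, "eid", 0)
left_join  <- join_by_code(left, right, "eid", 1)
right_join <- join_by_code(left, right, "eid", 2)
full       <- join_by_code(left, right, "eid", 3)
```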

`lmutils::Mat$join_by_name`

Join the matrix with another matrix by a column name.

other is a matrix convertable object.
by is the column name to join by.
join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.

mat$join_by_name("matrix2.csv", "eid", 0)

`lmutils::Mat$standardize_columns`

Standardize the columns of the matrix to have a mean of 0 and a standard deviation of 1.

mat$standardize_columns()
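A base R equivalent of column standardization is scale(). Note the assumption: scale() divides by the sample standard deviation (n - 1 denominator), and this document does not say which denominator lmutils uses, so results may differ by a constant factor.

```r
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), nrow = 4)

# Subtract each column's mean and divide by its sample standard deviation.
standardized <- scale(m, center = TRUE, scale = TRUE)

# After standardizing, each column has mean 0 and standard deviation 1.
col_means <- colMeans(standardized)
col_sds <- apply(standardized, 2, sd)
```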

`lmutils::Mat$standardize_rows`

Standardize the rows of the matrix to have a mean of 0 and a standard deviation of 1.

mat$standardize_rows()

`lmutils::Mat$remove_na_rows`

Remove rows with any NA values.

mat$remove_na_rows()

`lmutils::Mat$remove_na_columns`

Remove columns with any NA values.

mat$remove_na_columns()

`lmutils::Mat$na_to_value`

Replace all NA values with a given value.

mat$na_to_value(0)

`lmutils::Mat$na_to_column_mean`

Replace all NA values with the mean of the column.

mat$na_to_column_mean()
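Column-mean imputation can be sketched in base R as follows. This is an illustration of the operation described above, assuming the mean is taken over the non-NA entries of each column; the matrix is made up.

```r
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)

# For each column, compute the mean of the observed values and
# write it into the positions that are NA.
for (j in seq_len(ncol(m))) {
  col_mean <- mean(m[, j], na.rm = TRUE)
  m[is.na(m[, j]), j] <- col_mean
}
```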

`lmutils::Mat$na_to_row_mean`

Replace all NA values with the mean of the row.

mat$na_to_row_mean()

`lmutils::Mat$min_column_sum`

Remove columns with a sum less than a given value.

mat$min_column_sum(10)
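In base R terms, this filter keeps the columns whose sum is at least the threshold. The sketch below uses a made-up matrix and assumes that a column whose sum equals the threshold is kept, since only sums strictly less than the value are described as removed.

```r
m <- cbind(a = c(1, 2, 3), b = c(5, 5, 5), c = c(4, 3, 3))

# Column sums are a = 6, b = 15, c = 10; with a threshold of 10,
# column a is dropped and column c (sum exactly 10) is kept.
kept <- m[, colSums(m) >= 10, drop = FALSE]
```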

`lmutils::Mat$max_column_sum`

Remove columns with a sum greater than a given value.

mat$max_column_sum(10)

`lmutils::Mat$min_row_sum`

Remove rows with a sum less than a given value.

mat$min_row_sum(10)

`lmutils::Mat$max_row_sum`

Remove rows with a sum greater than a given value.

mat$max_row_sum(10)

`lmutils::Mat$rename_column`

Rename a column by name.

mat$rename_column("IID", "eid")

`lmutils::Mat$rename_column_if_exists`

Rename a column by name if it exists.

mat$rename_column_if_exists("IID", "eid")

`lmutils::Mat$remove_duplicate_columns`

Remove columns that are duplicates of other columns. The first column is kept.

mat$remove_duplicate_columns()

`lmutils::Mat$remove_identical_columns`

Remove columns with all identical entries.

mat$remove_identical_columns()
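The difference between remove_duplicate_columns and remove_identical_columns is easy to mix up, so here is a base R sketch of both on a made-up matrix. It assumes "duplicate" means equal to an earlier column and "identical entries" means every value in the column is the same.

```r
m <- cbind(a = c(1, 2, 3), b = c(1, 2, 3), c = c(7, 7, 7), d = c(3, 1, 2))

# Duplicate columns: b repeats a, so b is dropped and the first copy (a) is kept.
no_dupes <- m[, !duplicated(t(m)), drop = FALSE]

# Columns with all identical entries: c is constant, so c is dropped.
varies <- apply(m, 2, function(col) length(unique(col)) > 1)
no_constant <- m[, varies, drop = FALSE]
```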

Matrix Functions

`lmutils::save`

Saves a list of matrix convertable objects to files.

from is a list of matrix convertable objects.
to is a character vector of file names to write to.

lmutils::save(
    list("file1.csv", matrix(1:9, nrow=3), 1:3, data.frame(a=1:3, b=4:6)),
    c("file1.json", "file2.rkyv.gz", "file3.csv", "file4.rdata")
)

`lmutils::save_dir`

Recursively converts a directory of files to the selected file type.

from is a string directory name to read the files from.
to is a string directory name to write the files to or NULL to write to from.
file_type is a string file extension to write the files as.

lmutils::save_dir(
    "data",
    "converted_data", # or NULL
    "rkyv.gz"
)

`lmutils::calculate_r2`

Calculates the R^2 and adjusted R^2 values for blocks and outcomes.

data is a list of matrix convertable objects.
outcomes is a single matrix convertable object.

Returns a data frame with columns r2, adj_r2, data, outcome, n, m, and predicted.

results <- lmutils::calculate_r2(
    c("block1.csv", "block2.rkyv.gz"),
    "outcomes1.RData"
)
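To make the r2 and adj_r2 columns concrete, the following base R sketch computes both for one block and one outcome with lm(). It rests on assumptions not stated here: that the model includes an intercept, and that n and m stand for the number of rows and the number of predictor columns. The data are simulated.

```r
set.seed(1)
n <- 50
block <- matrix(rnorm(n * 3), nrow = n) # one block with m = 3 predictor columns
outcome <- 2 * block[, 1] + rnorm(n)    # simulated outcome

# Ordinary least squares with an intercept.
fit <- summary(lm(outcome ~ block))
r2 <- fit$r.squared
adj_r2 <- fit$adj.r.squared

# Adjusted R^2 by hand: penalize R^2 for the number of predictors m.
m <- ncol(block)
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - m - 1)
```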

`lmutils::column_p_values`

Compute the p value of a linear regression between each pair of columns in data and outcomes.

data is a list of matrix convertable objects.
outcomes is a single matrix convertable object.

Returns a data frame with columns p_value, data, data_column, and outcome.

results <- lmutils::column_p_values(
    c("block1.csv", "block2.rkyv.gz"),
    "outcomes1.RData"
)
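For a single pair of columns, the p value of a simple linear regression can be obtained in base R as sketched below. This assumes a model with an intercept and one predictor, which may or may not match what lmutils does internally; the data are simulated. For such a model the slope's p value coincides with the Pearson correlation test, which the sketch uses as a cross-check.

```r
set.seed(42)
x <- rnorm(30)           # one column from data
y <- 0.5 * x + rnorm(30) # one column from outcomes

# p value of the slope coefficient in y ~ x.
fit <- summary(lm(y ~ x))
p_value <- fit$coefficients["x", "Pr(>|t|)"]

# Cross-check: the Pearson correlation test gives the same p value.
p_cor <- cor.test(x, y)$p.value
```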

`lmutils::combine_vectors`

Combine a list of double vectors into a single matrix using the vectors as columns.

data is a list of double vectors.
out is an output file name or NULL to return the matrix.

lmutils::combine_vectors(
    list(1:3, 4:6),
    "combined_matrix.csv"
)

`lmutils::remove_rows`

Removes rows from a matrix.

data is a list of matrix convertable objects.
rows is a vector of row indices (1-based) to remove.
out is a standard output file.

lmutils::remove_rows(
    "matrix1.csv",
    c(1, 2, 3),
    "matrix1_removed_rows.csv"
)

`lmutils::crossprod`

Calculates the cross product of a matrix with itself. Equivalent to t(data) %*% data.

data is a list of matrix convertable objects.
out is a standard output file.

lmutils::crossprod(
    "matrix1.csv",
    "crossprod_matrix1.csv"
)
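The equivalence to t(data) %*% data can be checked in base R on a small made-up matrix. The sketch calls base::crossprod explicitly so that it does not depend on whether lmutils is attached.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)

# Explicit form: transpose, then matrix-multiply.
explicit <- t(m) %*% m

# Base R's built-in cross product gives the same 2 x 2 result.
builtin <- base::crossprod(m)
```

The result has one row and one column per column of the input, so its size depends only on the number of columns, not the number of rows.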

`lmutils::mul`

Multiplies two matrices. Equivalent to a %*% b.

a is a list of matrix convertable objects.
b is a list of matrix convertable objects.
out is a standard output file.

lmutils::mul(
    "matrix1.csv",
    "matrix2.rkyv.gz",
    "mul_matrix1_matrix2.csv"
)

`lmutils::load`

Loads a matrix convertable object into R.

obj is a list of matrix convertable objects. If a single object is provided, the function returns the matrix directly; otherwise, it returns a list of matrices.

lmutils::load("matrix1.csv")

`lmutils::match_rows`

Matches rows of a matrix by the values of a vector.

data is a list of matrix convertable objects.
with is a numeric vector.
by is the column name to match the rows by.
out is a standard output file.

lmutils::match_rows(
    "matrix1.csv",
    c(1, 2, 3),
    "eid",
    "matched_matrix1.csv"
)

`lmutils::match_rows_dir`

Matches rows of all matrices in a directory to the values in a vector by a column.

from is a string directory name to read the files from.
to is a string directory name to write the files to or NULL to write to from.
with is a numeric vector to match the rows to.
by is the column name to match the rows by.

lmutils::match_rows_dir(
    "matrices",
    "matched_matrices",
    c(1, 2, 3),
    "eid"
)

`lmutils::dedup`

Deduplicate a matrix by a column. The first occurrence of each value is kept.

data is a list of matrix convertable objects.
by is the column name to deduplicate by.
out is a standard output file.

lmutils::dedup(
    "matrix1.csv",
    "eid",
    "matrix1_dedup.csv"
)

Data Frame Functions

`lmutils::new_column_from_regex`

Compute a new column for a data frame from a Rust-flavored regex and an existing column.

df is a data frame.
column is the column name to match.
regex is the regex to match. The first capture group is used.
new_column is the new column name.

lmutils::new_column_from_regex(
    data.frame(a=c("a1", "b2", "c3")),
    "a",
    "([a-z])",
    "b"
)
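A rough base R counterpart of taking the first capture group is sketched below, using the same data frame and pattern as the example above. Base R's regex engine is not the Rust one, so patterns that rely on Rust-specific syntax may behave differently; returning NA for rows without a match is also an assumption of this sketch.

```r
df <- data.frame(a = c("a1", "b2", "c3"), stringsAsFactors = FALSE)

# regexec() reports the full match plus each capture group;
# regmatches() extracts them as character vectors, one per row.
matches <- regmatches(df$a, regexec("([a-z])", df$a))

# Element 1 is the full match, element 2 is the first capture group.
df$b <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
```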

`lmutils::map_from_pairs`

Converts two character vectors into a named list, where the first vector is the names and the second vector is the values. Only the first occurrence of each name is used, essentially creating a map.

names is a character vector of names.
values is a character vector of values.

lmutils::map_from_pairs(
    c("a", "b", "c"),
    c("1", "2", "3")
)
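The "first occurrence wins" behaviour can be mimicked in base R as follows. This is a sketch of the described semantics, not the package's code; the vectors are made up, with a repeated name to show which value survives.

```r
map_names <- c("a", "b", "a")
map_values <- c("1", "2", "3")

# Keep only the first occurrence of each name, then build a named list.
keep <- !duplicated(map_names)
map <- setNames(as.list(map_values[keep]), map_names[keep])
```

Here the second "a" (value "3") is discarded, so looking up map[["a"]] yields the value paired with the first "a".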

`lmutils::new_column_from_map`

Compute a new column for a data frame from a list of values and an existing column, matching by the names of the values.

df is a data frame.
column is the column name to match.
values is a named list of values.
new_column is the new column name.

lmutils::new_column_from_map(
    data.frame(a=c("a", "b", "c")),
    "a",
    lmutils::map_from_pairs(
        c("a", "b", "c"),
        c("1", "2", "3")
    ),
    "b"
)

`lmutils::new_column_from_map_pairs`

Compute a new column for a data frame from two character vectors of names and values, matching by the names.

df is a data frame.
column is the column name to match.
names is a character vector of names.
values is a character vector of values.
new_column is the new column name.

lmutils::new_column_from_map_pairs(
    data.frame(a=c("a", "b", "c")),
    "a",
    c("a", "b", "c"),
    c("1", "2", "3"),
    "b"
)

`lmutils::df_sort_asc`

Mutably sorts a data frame in ascending order by multiple columns. All columns must be numeric (double or integer), character, or logical vectors.

df is a data frame.
columns is a character vector of column names to sort by.

df <- data.frame(a=c(3, 3, 2, 2, 1, 1), b=c("b", "a", "b", "a", "b", "a"))
lmutils::df_sort_asc(
    df,
    c("a", "b")
)
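For comparison, the same multi-column ascending sort in base R looks like the sketch below. Unlike df_sort_asc, which is described as sorting the data frame in place, base R returns a sorted copy; the assumption here is that earlier columns take priority and later ones break ties.

```r
df <- data.frame(
  a = c(3, 3, 2, 2, 1, 1),
  b = c("b", "a", "b", "a", "b", "a"),
  stringsAsFactors = FALSE
)
columns <- c("a", "b")

# order() sorts by the first key and breaks ties with the following keys.
ord <- do.call(order, unname(as.list(df[columns])))
sorted <- df[ord, , drop = FALSE]
rownames(sorted) <- NULL
```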

`lmutils::df_split`

Splits a data frame into multiple data frames by a column. This function will mutably sort the data frame by the column before splitting.

df is a data frame.
by is the column name to split by.

df <- data.frame(a=c(1, 2, 3), b=c("a", "b", "c"))
lmutils::df_split(
    df,
    "b"
)
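Base R's split() performs a comparable grouping and is shown here for orientation. The shape of the value returned by df_split is not specified in this document, so the named list below is only what base R produces; the data frame is made up.

```r
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# One data frame per distinct value of column b, named by that value.
parts <- split(df, df$b)
```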

Configuration

lmutils exposes four global config options that can be set using environment variables or the lmutils package functions:

LMUTILS_LOG/lmutils::set_log_level to set the log level (default: info). Available log levels in order of increasing verbosity are off, error, warn, info, debug, and trace.
LMUTILS_CORE_PARALLELISM/lmutils::set_core_parallelism to set the core parallelism (default: 16). This is the number of primary operations to run in parallel.
LMUTILS_NUM_WORKER_THREADS/lmutils::set_num_worker_threads to set the number of worker threads to use (default: num_cpus::get() / 2). This is the number of threads to use for parallel operations. Once an operation has been run, this value cannot be changed.
LMUTILS_DISABLE_PREDICTED/lmutils::disable_predicted/lmutils::enable_predicted to disable the calculation of the predicted values in lmutils::calculate_r2.
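The environment variables can be set from within R using Sys.setenv(), as sketched below. The values are examples only, and this document does not say when lmutils reads the variables, so setting them before the package is loaded or first used is a conservative assumption rather than a documented requirement.

```r
# Set two of the documented environment variables for the current R session.
# "debug" is one of the listed log levels; "8" is an arbitrary example value.
Sys.setenv(
  LMUTILS_LOG = "debug",
  LMUTILS_CORE_PARALLELISM = "8"
)

# Confirm what the session will expose to lmutils.
log_level <- Sys.getenv("LMUTILS_LOG")
core_parallelism <- Sys.getenv("LMUTILS_CORE_PARALLELISM")
```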

