Utilities for linear regression-based statistical computation, particularly in the same vein as methods like MonsterLM and RARity

GMELab/lmutils.r

lmutils.r

Installation

Requires the Rust programming language.

# select option 1, default installation
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then install the package using the following command:

install.packages("https://github.com/GMELab/lmutils.r/archive/refs/heads/master.tar.gz", repos=NULL) # use .zip for Windows
# OR
devtools::install_github("GMELab/lmutils.r")

Important Information

Terms

  • Matrix convertable object - a data frame, matrix, file name (to read from), a numeric column vector, or a Mat object.
  • List of matrix convertable objects - a list of matrix convertable objects, a character vector of file names (to read from), or a single matrix convertable object.
  • Standard output file - a character vector of file names matching the length of the inputs, or NULL to return the output. If a single input, not in a list, was provided, the output will not be in a list.

File Types

  • csv (requires column headers)
  • tsv (requires column headers)
  • txt (requires column headers)
  • json
  • cbor
  • rkyv
  • rdata (NOTE: these files can only be processed sequentially, not in parallel like the rest)

All files can optionally be compressed with gzip. rdata files are assumed to be compressed, whether or not they have a .gz file extension.
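As a rough illustration of preparing such an input from base R, a data frame can be written through a gzfile connection so that the required header row is kept. The file name and the data here are made up, and nothing in this sketch calls lmutils itself.

```r
# Hypothetical example data; the column names become the required header row.
df <- data.frame(eid = c(1, 2, 3), x = c(0.5, 1.5, 2.5))

# Write a gzip-compressed CSV with headers and without row names.
path <- file.path(tempdir(), "matrix1.csv.gz")
con <- gzfile(path, "w")
write.csv(df, con, row.names = FALSE)
close(con)

# Reading it back confirms the header and values survived compression.
back <- read.csv(gzfile(path))
```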

Introduction

lmutils is an R package that provides utilities for working with matrices and data frames. It is built on top of the Rust programming language for performance and safety. The package provides a way to store matrices in memory and perform operations on them, as well as functions for working with data frames.

lmutils is built primarily around the Mat object. These are designed to be used to perform operations on matrices without loading them into memory until necessary. This can be useful for working with lots of large matrices, like hundreds of gene blocks.

To get started with your first Mat object, you can use the following code:

mat <- lmutils::Mat$new("matrix1.csv")

This will create a new Mat object from a file. You can then perform operations on this object, like combining it with other matrices, removing columns, or standardizing the columns. If you want this matrix to be loaded into R, you can use the r method:

mat$combine_columns("matrix2.csv")
mat$remove_columns(c(1, 2, 3))
mat$standardize_columns()
m <- mat$r()

You can also pass the object directly into functions that accept a matrix convertable object. It is then loaded automatically (with all the stored operations applied) only when needed.

lmutils::calculate_r2(
    mat,
    "outcomes1.RData"
)

Example

outcomes <- lmutils::Mat$new("outcomes.RData")
geneBlocks <- lapply(c(
    "geneBlock1.csv",
    "geneBlock2.csv",
    "geneBlock3.csv",
    "geneBlock4.csv",
    "geneBlock5.csv"
), function(mat) {
    mat <- lmutils::Mat$new(mat)
    mat$match_to_by_name(outcomes$col("eid"), "IID", 0)
    mat$remove_column("IID")
    mat$min_column_sum(2)
    mat$na_to_column_mean()
    mat$standardize_columns()
    mat
})
outcomes$remove_column("eid")
results <- lmutils::calculate_r2(geneBlocks, outcomes)

Mat Objects

lmutils::Mat objects are a way to store matrices in memory and perform operations on them. They can be used to store operations or chain operations together for later execution. This can be useful if, for example, you wish to load a hundred large matrices from files and standardize them all before using lmutils::calculate_r2. Using Mat objects, you can store the operations you wish to perform and Mat will execute them only when the matrix is loaded.
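The idea of storing operations and running them only on load can be pictured with a small base R sketch. This is a toy stand-in, not how lmutils implements Mat; lazy_matrix, add, and the loader function are hypothetical names introduced only for illustration.

```r
# A toy stand-in for the idea behind Mat, NOT the lmutils implementation:
# keep a loader plus a queue of pending operations, and run them on demand.
lazy_matrix <- function(loader) {
  ops <- list()
  list(
    # Queue an operation (a function from matrix to matrix) for later.
    add = function(op) {
      ops[[length(ops) + 1]] <<- op
      invisible(NULL)
    },
    # Load the data and apply every queued operation in order.
    r = function() Reduce(function(m, op) op(m), ops, loader())
  )
}

loads <- 0
lazy <- lazy_matrix(function() {
  loads <<- loads + 1                        # count how often data is loaded
  matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
})
lazy$add(function(m) m[, -1, drop = FALSE])  # drop the first column
lazy$add(function(m) scale(m))               # standardize what is left
# Nothing has been loaded at this point; the loader only runs in r().
m <- lazy$r()
```

Queuing functions rather than results is what keeps memory use low: only the matrix currently being processed has to exist in full.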

Passing the same Mat object multiple times in a single function call may cause undefined behavior. For example, the following code may not work as expected:

mat <- lmutils::Mat$new("matrix1.csv")
lmutils::calculate_r2(list(mat, mat), mat)

lmutils::Mat$new

Creates a new Mat object.

  • data is a matrix convertable object.
mat <- lmutils::Mat$new("matrix1.csv")

lmutils::Mat$r

Loads the matrix from the Mat object.

m <- mat$r()

lmutils::Mat$col

Get a column by name or index.

col <- mat$col("eid")
col <- mat$col(1)

lmutils::Mat$save

Saves the matrix to a file.

  • file is the file name to write to.
mat$save("matrix1.rkyv.gz")

lmutils::Mat$combine_columns

Combines this matrix with other matrices by columns. (cbind)

  • data is a list of matrix convertable objects.
mat$combine_columns("matrix2.csv")

lmutils::Mat$combine_rows

Combines this matrix with other matrices by rows. (rbind)

  • data is a list of matrix convertable objects.
mat$combine_rows("matrix2.csv")

lmutils::Mat$remove_columns

Removes columns from the matrix.

  • columns is a vector of column indices (1-based) to remove.
mat$remove_columns(c(1, 2, 3))

lmutils::Mat$remove_column

Removes a column from the matrix by name.

  • column is the column name to remove.
mat$remove_column("eid")

lmutils::Mat$remove_column_if_exists

Removes a column from the matrix by name if it exists.

  • column is the column name to remove.
mat$remove_column_if_exists("eid")

lmutils::Mat$remove_rows

Removes rows from the matrix.

  • rows is a vector of row indices (1-based) to remove.
mat$remove_rows(c(1, 2, 3))

lmutils::Mat$transpose

Transposes the matrix.

mat$transpose()

lmutils::Mat$sort

Sort by the column at the given index.

  • by is the column index (1-based) to sort by.
mat$sort(1)

lmutils::Mat$sort_by_name

Sort by the column with the given name.

  • by is the column name to sort by.
mat$sort_by_name("eid")

lmutils::Mat$sort_by_order

Sort by the given order of rows.

  • order is a vector of row indices (1-based) to sort by.
mat$sort_by_order(c(3, 2, 1))

lmutils::Mat$dedup

Deduplicate the matrix by a column.

  • by is the column index (1-based) to deduplicate by.
mat$dedup(1)

lmutils::Mat$dedup_by_name

Deduplicate the matrix by a column name.

  • by is the column name to deduplicate by.
mat$dedup_by_name("eid")

lmutils::Mat$match_to

Match the rows of the matrix to the values in a vector by a column.

  • with is a numeric vector to match the rows to.
  • by is the column index (1-based) to match the rows by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$match_to(c(1, 2, 3), 1, 0)

lmutils::Mat$match_to_by_name

Match the rows of the matrix to the values in a vector by a column name.

  • with is a numeric vector to match the rows to.
  • by is the column name to match the rows by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$match_to_by_name(c(1, 2, 3), "eid", 0)
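One way to picture an inner match (join = 0) in base R is shown here, under the assumption that matched rows come back in the order of the supplied vector and that values without a row are dropped. The text above does not spell out the resulting row order, so treat this as an illustration only; m and ids are made-up data.

```r
# Made-up matrix with an identifier column and one data column.
m <- cbind(IID = c(30, 10, 20, 40), x = c(0.3, 0.1, 0.2, 0.4))
ids <- c(10, 20, 50)            # values to match against; 50 has no row

idx <- match(ids, m[, "IID"])   # row position of each id, NA if absent
# Assumed inner-join behavior: keep only ids that were found, in ids order.
matched <- m[idx[!is.na(idx)], , drop = FALSE]
```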

lmutils::Mat$join

Join the matrix with another matrix by a column.

  • other is a matrix convertable object.
  • by is the column index (1-based) to join by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$join("matrix2.csv", 1, 0)

lmutils::Mat$join_by_name

Join the matrix with another matrix by a column name.

  • other is a matrix convertable object.
  • by is the column name to join by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$join_by_name("matrix2.csv", "eid", 0)

lmutils::Mat$standardize_columns

Standardize the columns of the matrix to have a mean of 0 and a standard deviation of 1.

mat$standardize_columns()
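For comparison, the same transformation on an in-memory matrix can be written with base R's scale. Note that scale divides by the sample standard deviation (n - 1 denominator); whether lmutils uses the same denominator is not stated here, so small differences are possible.

```r
m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3)

# Center each column, then divide by its sample standard deviation.
z <- scale(m)

col_means <- colMeans(z)     # approximately 0 for every column
col_sds <- apply(z, 2, sd)   # 1 for every column
```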

lmutils::Mat$standardize_rows

Standardize the rows of the matrix to have a mean of 0 and a standard deviation of 1.

mat$standardize_rows()

lmutils::Mat$remove_na_rows

Remove rows with any NA values.

mat$remove_na_rows()

lmutils::Mat$remove_na_columns

Remove columns with any NA values.

mat$remove_na_columns()

lmutils::Mat$na_to_value

Replace all NA values with a given value.

mat$na_to_value(0)

lmutils::Mat$na_to_column_mean

Replace all NA values with the mean of the column.

mat$na_to_column_mean()
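Column mean imputation on a plain R matrix looks roughly like this. It is a base R sketch of the operation described above, with made-up data, not the lmutils code path.

```r
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)

# Column means computed from the observed (non-NA) entries only.
col_means <- colMeans(m, na.rm = TRUE)

# Locate every NA and fill it with the mean of its own column.
missing <- which(is.na(m), arr.ind = TRUE)
m[missing] <- col_means[missing[, "col"]]
```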

lmutils::Mat$na_to_row_mean

Replace all NA values with the mean of the row.

mat$na_to_row_mean()

lmutils::Mat$min_column_sum

Remove columns with a sum less than a given value.

mat$min_column_sum(10)
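In base R terms, a minimum column sum filter keeps the columns whose sum reaches the threshold. The sketch below uses a threshold of 2 on made-up data; how lmutils treats NA values in the sum is not covered here.

```r
m <- cbind(a = c(0, 1, 0), b = c(1, 1, 1), c = c(0, 0, 0))

# Drop every column whose sum is less than 2; only "b" (sum 3) remains.
kept <- m[, colSums(m) >= 2, drop = FALSE]
```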

lmutils::Mat$max_column_sum

Remove columns with a sum greater than a given value.

mat$max_column_sum(10)

lmutils::Mat$min_row_sum

Remove rows with a sum less than a given value.

mat$min_row_sum(10)

lmutils::Mat$max_row_sum

Remove rows with a sum greater than a given value.

mat$max_row_sum(10)

lmutils::Mat$rename_column

Rename a column by name.

mat$rename_column("IID", "eid")

lmutils::Mat$rename_column_if_exists

Rename a column by name if it exists.

mat$rename_column_if_exists("IID", "eid")

lmutils::Mat$remove_duplicate_columns

Remove columns that are duplicates of other columns. The first column is kept.

mat$remove_duplicate_columns()

lmutils::Mat$remove_identical_columns

Remove columns with all identical entries.

mat$remove_identical_columns()

Matrix Functions

lmutils::save

Saves a list of matrix convertable objects to files.

  • from is a list of matrix convertable objects.
  • to is a character vector of file names to write to.
lmutils::save(
    list("file1.csv", matrix(1:9, nrow=3), 1:3, data.frame(a=1:3, b=4:6)),
    c("file1.json", "file2.rkyv.gz", "file3.csv", "file4.rdata")
)

lmutils::save_dir

Recursively converts a directory of files to the selected file type.

  • from is a string directory name to read the files from.
  • to is a string directory name to write the files to or NULL to write to from.
  • file_type is a string file extension to write the files as.
lmutils::save_dir(
    "data",
    "converted_data", # or NULL
    "rkyv.gz"
)

lmutils::calculate_r2

Calculates the R^2 and adjusted R^2 values for blocks and outcomes.

  • data is a list of matrix convertable objects.
  • outcomes is a single matrix convertable object.

Returns a data frame with columns r2, adj_r2, data, outcome, n, m, and predicted.

results <- lmutils::calculate_r2(
    c("block1.csv", "block2.rkyv.gz"),
    "outcomes1.RData"
)
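For a single block and a single outcome, the two statistics can be reproduced in base R as a sanity check. This sketch assumes an ordinary least squares fit with an intercept, which the description above does not state, and it uses simulated data; n and m here stand for the number of rows and the number of predictors.

```r
set.seed(1)
n <- 50
block <- matrix(rnorm(n * 3), nrow = n)   # n samples, 3 predictors
# Simulated outcome as a plain numeric vector (not a one-column matrix).
outcome <- as.numeric(block %*% c(0.5, -0.2, 0)) + rnorm(n)

fit <- lm(outcome ~ block)
r2 <- summary(fit)$r.squared

# Adjusted R^2 penalizes the number of predictors m.
m <- ncol(block)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - m - 1)
```

The hand-written adjusted value agrees with summary(fit)$adj.r.squared, which makes it a convenient cross-check on small inputs.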

lmutils::column_p_values

Compute the p value of a linear regression between each pair of columns in data and outcomes.

  • data is a list of matrix convertable objects.
  • outcomes is a single matrix convertable object.

Returns a data frame with columns p_value, data, data_column, and outcome.

results <- lmutils::column_p_values(
    c("block1.csv", "block2.rkyv.gz"),
    "outcomes1.RData"
)
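For one data column and one outcome column, a comparable p value can be obtained in base R from the slope of a simple linear regression. The sketch assumes a model with an intercept and uses simulated vectors x and y; it is meant as a reference point, not as the lmutils computation.

```r
set.seed(2)
x <- rnorm(40)              # one simulated data column
y <- 0.8 * x + rnorm(40)    # one simulated outcome column

# p value of the slope in the regression of y on x.
fit <- summary(lm(y ~ x))
p_value <- fit$coefficients["x", "Pr(>|t|)"]
```

With a single predictor this equals the p value of a Pearson correlation test, so cor.test(x, y)$p.value gives the same number.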

lmutils::combine_vectors

Combine a list of double vectors into a single matrix using the vectors as columns.

  • data is a list of double vectors.
  • out is an output file name or NULL to return the matrix.
lmutils::combine_vectors(
    list(c(1, 2, 3), c(4, 5, 6)),
    "combined_matrix.csv"
)

lmutils::remove_rows

Removes rows from a matrix.

  • data is a list of matrix convertable objects.
  • rows is a vector of row indices (1-based) to remove.
  • out is a standard output file.
lmutils::remove_rows(
    "matrix1.csv",
    c(1, 2, 3),
    "matrix1_removed_rows.csv"
)

lmutils::crossprod

Calculates the cross product of each matrix with itself. Equivalent to t(data) %*% data.

  • data is a list of matrix convertable objects.
  • out is a standard output file.
lmutils::crossprod(
    "matrix1.csv",
    "crossprod_matrix1.csv"
)
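The equivalence with t(data) %*% data can be checked on a small in-memory matrix. The crossprod used below is base R's function of the same name, not lmutils::crossprod.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)

a <- t(m) %*% m      # explicit transpose and multiply
b <- crossprod(m)    # base R shortcut for the same product
```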

lmutils::mul

Multiplies two matrices. Equivalent to a %*% b.

  • a is a list of matrix convertable objects.
  • b is a list of matrix convertable objects.
  • out is a standard output file.
lmutils::mul(
    "matrix1.csv",
    "matrix2.rkyv.gz",
    "mul_matrix1_matrix2.csv"
)

lmutils::load

Loads a matrix convertable object into R.

  • obj is a list of matrix convertable objects. If a single object is provided, the function will return the matrix directly, otherwise it will return a list of matrices.
lmutils::load("matrix1.csv")

lmutils::match_rows

Matches rows of a matrix by the values of a vector.

  • data is a list of matrix convertable objects.
  • with is a numeric vector.
  • by is the column name to match the rows by.
  • out is a standard output file.
lmutils::match_rows(
    "matrix1.csv",
    c(1, 2, 3),
    "eid",
    "matched_matrix1.csv"
)

lmutils::match_rows_dir

Matches rows of all matrices in a directory to the values in a vector by a column.

  • from is a string directory name to read the files from.
  • to is a string directory name to write the files to or NULL to write to from.
  • with is a numeric vector to match the rows to.
  • by is the column name to match the rows by.
lmutils::match_rows_dir(
    "matrices",
    "matched_matrices",
    c(1, 2, 3),
    "eid"
)

lmutils::dedup

Deduplicate a matrix by a column. The first occurrence of each value is kept.

  • data is a list of matrix convertable objects.
  • by is the column name to deduplicate by.
  • out is a standard output file.
lmutils::dedup(
    "matrix1.csv",
    "eid",
    "matrix1_dedup.csv"
)
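Keeping the first occurrence of each value corresponds to the following base R idiom on an in-memory matrix. The data are made up and the sketch does not go through lmutils.

```r
m <- cbind(eid = c(1, 2, 1, 3), x = c(10, 20, 30, 40))

# duplicated() flags the second and later occurrences, so negating it
# keeps the first row seen for each eid.
deduped <- m[!duplicated(m[, "eid"]), , drop = FALSE]
```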

Data Frame Functions

lmutils::new_column_from_regex

Compute a new column for a data frame from a Rust-flavored regex and an existing column.

  • df is a data frame.
  • column is the column name to match.
  • regex is the regex to match. The first capture group is used.
  • new_column is the new column name.
lmutils::new_column_from_regex(
    data.frame(a=c("a1", "b2", "c3")),
    "a",
    "([a-z])",
    "b"
)
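A base R approximation of the same idea is sketched here for comparison. R's regex engine differs from Rust's in places, and returning NA for rows that do not match is an assumption of this sketch, since the behavior for non-matching rows is not described above.

```r
df <- data.frame(a = c("a1", "b2", "c3"), stringsAsFactors = FALSE)

# regexec returns the full match followed by each capture group.
hits <- regmatches(df$a, regexec("([a-z])", df$a))

# Keep only the first capture group; NA when there is no match (assumed).
df$b <- vapply(
  hits,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1)
)
```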

lmutils::map_from_pairs

Converts two character vectors into a named list, where the first vector is the names and the second vector is the values. Only the first occurrence of each name is used, essentially creating a map.

  • names is a character vector of names.
  • values is a character vector of values.
lmutils::map_from_pairs(
    c("a", "b", "c"),
    c("1", "2", "3")
)
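The first-occurrence rule can be mimicked in base R as follows. map_sketch is a hypothetical helper written for this illustration; it only mirrors the behavior described above.

```r
# Build a named list from parallel vectors, keeping the first value
# seen for each name.
map_sketch <- function(names, values) {
  keep <- !duplicated(names)
  setNames(as.list(values[keep]), names[keep])
}

# "a" appears twice; only its first value, "1", is kept.
m <- map_sketch(c("a", "b", "a"), c("1", "2", "3"))
```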

lmutils::new_column_from_map

Compute a new column for a data frame from a list of values and an existing column, matching by the names of the values.

  • df is a data frame.
  • column is the column name to match.
  • values is a named list of values.
  • new_column is the new column name.
lmutils::new_column_from_map(
    data.frame(a=c("a", "b", "c")),
    "a",
    lmutils::map_from_pairs(
        c("a", "b", "c"),
        c("1", "2", "3")
    ),
    "b"
)

lmutils::new_column_from_map_pairs

Compute a new column for a data frame from two character vectors of names and values, matching by the names.

  • df is a data frame.
  • column is the column name to match.
  • names is a character vector of names.
  • values is a character vector of values.
  • new_column is the new column name.
lmutils::new_column_from_map_pairs(
    data.frame(a=c("a", "b", "c")),
    "a",
    c("a", "b", "c"),
    c("1", "2", "3"),
    "b"
)

lmutils::df_sort_asc

Mutably sorts a data frame in ascending order by multiple columns. All columns must be numeric (double or integer), character, or logical vectors.

  • df is a data frame.
  • columns is a character vector of column names to sort by.
df <- data.frame(a=c(3, 3, 2, 2, 1, 1), b=c("b", "a", "b", "a", "b", "a"))
lmutils::df_sort_asc(
    df,
    c("a", "b")
)

lmutils::df_split

Splits a data frame into multiple data frames by a column. This function will mutably sort the data frame by the column before splitting.

  • df is a data frame.
  • by is the column name to split by.
df <- data.frame(a=c(1, 2, 3), b=c("a", "b", "c"))
lmutils::df_split(
    df,
    "b"
)
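The sort-then-split behavior resembles the following base R steps. The exact shape of the value returned by lmutils::df_split is not described above, so the named list produced here is only an analogy, and unlike the lmutils function this sketch works on a copy rather than sorting in place.

```r
df <- data.frame(a = c(1, 2, 3, 4), b = c("y", "x", "y", "x"))

# Sort by the split column first, then split into one data frame per value.
df <- df[order(df$b), ]
parts <- split(df, df$b)
```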

Configuration

lmutils exposes four global config options that can be set using environment variables or the lmutils package functions:

  • LMUTILS_LOG/lmutils::set_log_level to set the log level (default: info). Available log levels in order of increasing verbosity are off, error, warn, info, debug, and trace.
  • LMUTILS_CORE_PARALLELISM/lmutils::set_core_parallelism to set the core parallelism (default: 16). This is the number of primary operations to run in parallel.
  • LMUTILS_NUM_WORKER_THREADS/lmutils::set_num_worker_threads to set the number of worker threads to use (default: half the number of logical CPUs, num_cpus::get() / 2). This is the number of threads to use for parallel operations. Once an operation has been run, this value cannot be changed.
  • LMUTILS_DISABLE_PREDICTED/lmutils::disable_predicted/lmutils::enable_predicted to disable the calculation of the predicted values in lmutils::calculate_r2.
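The environment variables can be set from within R before any lmutils operation runs. The values below are arbitrary examples, and setting them before the first operation is a cautious assumption, since the point at which lmutils reads them is not stated here.

```r
# Environment variables are strings, so numeric settings are quoted.
Sys.setenv(
  LMUTILS_LOG = "debug",
  LMUTILS_CORE_PARALLELISM = "4",
  LMUTILS_NUM_WORKER_THREADS = "8"
)

# Confirm the value that the current R process will expose.
log_level <- Sys.getenv("LMUTILS_LOG")
```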
