---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# viraldomain

<!-- badges: start -->
<!-- badges: end -->

The goal of viraldomain is to provide methods for assessing the applicability domain of models that predict viral load and CD4 (Cluster of Differentiation 4) lymphocyte counts. These methods help determine how far a prediction extrapolates beyond the data the model was trained on.

## Installation

You can install the development version of viraldomain from GitHub with:

```{r install, eval = FALSE}
# install.packages("devtools")
devtools::install_github("juanv66x/viraldomain")
```

## Data

### Predictive Modeling Data for Viral Load and CD4 Lymphocyte Counts

This data set serves as input for predictive modeling tasks related to HIV research. It contains numeric measurements of CD4 lymphocyte counts (`cd`) and viral load (`vl`) at three time points: 2019, 2021, and 2022 (for example, `cd_2022` and `vl_2022`). Both measurements are key indicators of HIV disease progression.

```{r viral}
library(viraldomain)

data(viral)
print(head(viral))
```

### Seropositive Data for Applicability Domain Testing

This data set is designed for testing applicability domain methods in HIV research. It is a tibble with 53 rows and 2 columns containing numeric measurements of CD4 lymphocyte counts (`cd_2022`) and viral load (`vl_2022`) for seropositive individuals in 2022.

```{r sero}
data(sero)
print(head(sero))
```
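Before computing any domain score, a quick range comparison can show whether the test data fall outside the values seen in training. The chunk below is only a rough sketch: it assumes that `viral` and `sero` both contain the `cd_2022` and `vl_2022` columns described above, and the helper names (`shared_cols`, `train_range`, `outside`, `n_outside`) are illustrative and not part of the package.

```{r range-check}
library(viraldomain)

data(viral)
data(sero)

# Columns assumed to be present in both data sets
shared_cols <- c("cd_2022", "vl_2022")

# Minimum (row 1) and maximum (row 2) of each column in the training data
train_range <- sapply(viral[shared_cols], range, na.rm = TRUE)

# For each column, flag test rows that fall outside the training range
outside <- sapply(shared_cols, function(col) {
  sero[[col]] < train_range[1, col] | sero[[col]] > train_range[2, col]
})

# Number of test rows outside the training range, per column
n_outside <- colSums(outside, na.rm = TRUE)
n_outside
```

A per-column range check ignores how the variables vary together, which is one reason to use a PCA-based distance instead.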

## Functions

### knn_domain_score

This function fits a k-nearest neighbors (KNN) model to the provided data and computes a domain applicability score based on PCA distances.

```{r knn}
# Example usage of knn_domain_score
domain_scores <- knn_domain_score(
  featured = "cd_2022",
  train_data = viral |> dplyr::select(cd_2022, vl_2022),
  knn_hyperparameters = list(neighbors = 5, weight_func = "optimal", dist_power = 0.33),
  test_data = sero,
  threshold_value = 0.99
)
print(domain_scores)
```
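To give an intuition for a score based on PCA distances, the chunk below sketches the general idea in base R with synthetic data. It is not the implementation used by `knn_domain_score`. The data values, the variable names, and the reading of the threshold as a percentile cutoff on training distances are all assumptions made for illustration.

```{r pca-distance-sketch}
set.seed(123)

# Synthetic stand-ins for training and new data (hypothetical values)
train <- data.frame(
  cd = rnorm(100, mean = 800, sd = 100),
  vl = rnorm(100, mean = 1000, sd = 200)
)
new <- data.frame(
  cd = c(810, 1500), # the second row lies far from the training cloud
  vl = c(950, 3000)
)

# Fit PCA on the centered and scaled training predictors
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Project both data sets onto the principal components
train_pcs <- predict(pca, train)
new_pcs <- predict(pca, new)

# Euclidean distance from the training centroid (the origin in PC space)
train_dist <- sqrt(rowSums(train_pcs^2))
new_dist <- sqrt(rowSums(new_pcs^2))

# Share of training distances that are no larger than each new distance
dist_pctl <- ecdf(train_dist)(new_dist)

# Assumed cutoff: flag points beyond the 99th percentile of training distances
threshold <- 0.99
scores <- data.frame(
  distance = new_dist,
  percentile = dist_pctl,
  outside = dist_pctl > threshold
)
scores
```

In this sketch, a new observation whose distance exceeds nearly all training distances is treated as outside the applicability domain, so a prediction for it would be an extrapolation.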

### simple_domain_plot

This function generates a domain plot for a simple model based on PCA distances of the provided data.

```{r simp}
# Example usage of simple_domain_plot
simple_domain_plot(
  featured_col = "cd_2022",
  train_data = viral |> dplyr::select(cd_2022, vl_2022),
  test_data = sero,
  treshold_value = 0.99
)
```