queryparser translates SQL queries into lists of unevaluated R expressions.
For an introduction to tidyquery and queryparser, watch the recording of the talk “Bridging the Gap between SQL and R” from rstudio::conf(2020).
Install the released version of queryparser from CRAN with:
install.packages("queryparser")
Or install the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("ianmcook/queryparser")
Call the function parse_query()
, passing a SELECT
statement enclosed
in quotes as the first argument:
library(queryparser)
parse_query("SELECT DISTINCT carrier FROM flights WHERE dest = "HNL"")
#> $select
#> $select[[1]]
#> carrier
#>
#> attr(,"distinct")
#> [1] TRUE
#>
#> $from
#> $from[[1]]
#> flights
#>
#>
#> $where
#> $where[[1]]
#> dest == "HNL"
Queries can include the clauses SELECT
, FROM
, WHERE
, GROUP BY
,
HAVING
, ORDER BY
, and LIMIT
:
parse_query(
" SELECT origin, dest,
COUNT(flight) AS num_flts,
round(SUM(seats)) AS num_seats,
round(AVG(arr_delay)) AS avg_delay
FROM flights f LEFT OUTER JOIN planes p
ON f.tailnum = p.tailnum
WHERE distance BETWEEN 200 AND 300
AND air_time IS NOT NULL
GROUP BY origin, dest
HAVING num_flts > 3000
ORDER BY num_seats DESC, avg_delay ASC
LIMIT 2;"
)
#> $select
#> $select[[1]]
#> origin
#>
#> $select[[2]]
#> dest
#>
#> $select$num_flts
#> sum(!is.na(flight), na.rm = TRUE)
#>
#> $select$num_seats
#> round(sum(seats, na.rm = TRUE))
#>
#> $select$avg_delay
#> round(mean(arr_delay, na.rm = TRUE))
#>
#> attr(,"aggregate")
#> num_flts num_seats avg_delay
#> FALSE FALSE TRUE TRUE TRUE
#>
#> $from
#> $from$f
#> flights
#>
#> $from$p
#> planes
#>
#> attr(,"join_types")
#> [1] "left outer join"
#> attr(,"join_conditions")
#> attr(,"join_conditions")[[1]]
#> f.tailnum == p.tailnum
#>
#>
#> $where
#> $where[[1]]
#> (distance >= 200 & distance <= 300) & !is.na(air_time)
#>
#>
#> $group_by
#> $group_by[[1]]
#> origin
#>
#> $group_by[[2]]
#> dest
#>
#>
#> $having
#> $having[[1]]
#> num_flts > 3000
#>
#>
#> $order_by
#> $order_by[[1]]
#> -xtfrm(num_seats)
#>
#> $order_by[[2]]
#> avg_delay
#>
#> attr(,"aggregate")
#> [1] FALSE FALSE
#>
#> $limit
#> $limit[[1]]
#> [1] 2
#>
#>
#> attr(,"aggregate")
#> [1] TRUE
Set the argument tidyverse
to TRUE
to use functions from
tidyverse packages including
dplyr,
stringr, and
lubridate in the R expressions:
parse_query("SELECT COUNT(*) AS n FROM t WHERE x BETWEEN y AND z ORDER BY n DESC", tidyverse = TRUE)
#> $select
#> $select$n
#> dplyr::n()
#>
#> attr(,"aggregate")
#> n
#> TRUE
#>
#> $from
#> $from[[1]]
#> t
#>
#>
#> $where
#> $where[[1]]
#> dplyr::between(x, y, z)
#>
#>
#> $order_by
#> $order_by[[1]]
#> dplyr::desc(n)
#>
#> attr(,"aggregate")
#> [1] FALSE
#>
#> attr(,"aggregate")
#> [1] TRUE
queryparser will translate only explicitly allowed functions and operators, preventing injection of malicious code:
parse_query("SELECT x FROM y WHERE system("rm -rf /")")
#> Error: Unrecognized function or operator: system
queryparser does not currently support:
- Subqueries
- Unions
- SQL-89-style (implicit) join notation
- The
WITH
clause (common table expressions) OVER
expressions (window or analytic functions)- Some SQL functions and operators
queryparser currently has the following known limitations:
- Some SQL expressions will translate only when
tidyverse
is set toTRUE
. An example of this isCOUNT(DISTINCT ...)
expressions with multiple arguments. - When logical operators (such as
IS NULL
) have unparenthesized expressions as their operands, R will interpret the resulting code using a different order of operations than a SQL engine would. When using an expression as the operand to a logical operator, always enclose the expression in parentheses. - The error messages that occur when attempting to parse invalid or unrecognized SQL are often non-informative.
queryparser is not intended to:
- Translate other types of SQL statements (such as
INSERT
orUPDATE
) - Customize translations for specific SQL dialects
- Fully validate the syntax of the
SELECT
statements passed to it - Efficiently process large batches of queries
- Facilitate the analysis of queries (for example, to identify patterns)
The sqlparseR package (CRAN) provides a wrapper around the Python module sqlparse.