- Feature enhancement: support YAML multi-pattern input @issue 76
- Accuracy enhancement: fix some wild-query bugs
- Cross-platform: supports Windows, macOS and Linux
I hope to keep enhancing weggli, and this project will remain active for a long time to come.
weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify interesting functionality in large codebases.
weggli performs pattern matching on Abstract Syntax Trees based on user-provided queries. Its query language resembles C and C++ code, making it easy to turn interesting code patterns into queries.
weggli is inspired by great tools like Semgrep, Coccinelle, joern and CodeQL, but makes some different design decisions:
- Minimal setup: weggli should work out of the box against most software you will encounter. weggli does not require the ability to build the software and can work with incomplete sources or missing dependencies.
- Interactive: weggli is designed for interactive usage and fast query performance. Most of the time, a weggli query will be faster than a grep search. The goal is to enable an interactive workflow where quick switching between code review and query creation/improvement is possible.
- Greedy: weggli's pattern matching is designed to find as many (useful) matches as possible for a specific query. While this increases the risk of false positives, it simplifies query creation. For example, the query `$x = 10;` will match both assignment expressions (`foo = 10;`) and declarations (`int bar = 10;`).
- C++ support (temporarily not supported due to cross-platform reasons): weggli has first-class support for modern C++ constructs, such as lambda expressions, range-based for loops and constexprs.
Use -h for short descriptions and --help for more details.
Homepage: https://github.com/LordCasser/weggli
USAGE: weggli-enhance [OPTIONS] <RULES> <PATH>
ARGS:
<RULES>
A weggli search pattern. weggli's query language closely resembles
C and C++ with a small number of extra features.
For example, the pattern '{_ $buf[_]; memcpy($buf,_,_);}' will
find all calls to memcpy that directly write into a stack buffer.
Besides normal C and C++ constructs, weggli's query language
supports the following features:
_ Wildcard. Will match on any AST node.
$var Variables. Can be used to write queries that are independent
of identifiers. Variables match on identifiers, types,
field names or namespaces. The --unique option
optionally enforces that $x != $y != $z. The --regex option can
enforce that the variable has to match (or not match) a
regular expression.
_(..) Subexpressions. The _(..) wildcard matches on arbitrary
sub expressions. This can be helpful if you are looking for some
operation involving a variable, but don't know more about it.
For example, _(test) will match on expressions like test+10,
buf[test->size] or f(g(&test));
not: Negative sub queries. Only show results that do not match the
following sub query. For example, '{not: $fv==NULL; not: $fv!=NULL *$v;}'
would find pointer dereferences that are not preceded by a NULL check.
strict: Enable stricter matching. This turns off statement unwrapping and greedy
function name matching. For example 'strict: func();' will not match
on 'if (func() == 1)..' or 'a->func()' anymore.
weggli automatically unwraps expression statements in the query source
to search for the inner expression instead. This means that the query `{func($x);}`
will match on `func(a);`, but also on `if (func(a)) {..}` or `return func(a)`.
Matching on `func(a)` will also match on `func(a,b,c)` or `func(z,a)`.
Similarly, `void func($t $param)` will also match function definitions
with multiple parameters.
Additional patterns can be specified using the --pattern (-p) option. This makes
it possible to search across functions or type definitions.
<PATH>
Input directory or file to search. By default, weggli will search inside
.c and .h files for the default C mode or .cc, .cpp, .cxx, .h and .hpp files when
executing in C++ mode (using the --cpp option).
Alternative file endings can be specified using the --extensions=h,c (-e) option.
When combining weggli with other tools or preprocessing steps,
files can also be specified via STDIN by setting the directory to '-'
and piping a list of filenames.
OPTIONS:
-A, --after <after>
Lines to print after a match. Default = 5.
-B, --before <before>
Lines to print before a match. Default = 5.
-C, --color
Force enable color output.
--exclude <exclude>...
Exclude files that match the given regex.
-e, --extensions <extensions>...
File extensions to include in the search.
-f, --force
Force a search even if the queries contains syntax errors.
-h, --help
Prints help information.
--include <include>...
Only search files that match the given regex.
-l, --limit
Only show the first match in each function.
-n, --line-numbers
Enable line numbers
-u, --unique
Enforce uniqueness of variable matches.
By default, two variables such as $a and $b can match on identical values.
For example, the query '$x=malloc($a); memcpy($x, _, $b);' would
match on both
void *buf = malloc(size);
memcpy(buf, src, size);
and
void *buf = malloc(some_constant);
memcpy(buf, src, size);
Using the unique flag would filter out the first match as $a==$b.
-v, --verbose
Sets the level of verbosity.
-V, --version
Prints version information.
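The effect of the --unique option described above can be sketched in a few lines of Rust. This is only an illustration of the idea, not weggli's implementation: the function name `bindings_are_unique` and the map-of-strings representation of variable bindings are assumptions made for this example.

```rust
use std::collections::{HashMap, HashSet};

/// Returns true when no two query variables are bound to the same text.
/// `bindings` maps a variable name (e.g. "$a") to the source text it matched.
/// Hypothetical helper for illustration; not part of weggli's API.
fn bindings_are_unique(bindings: &HashMap<&str, &str>) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false if the value was already present,
    // which makes `all` stop at the first duplicate.
    bindings.values().all(|value| seen.insert(*value))
}
```

For the query '$x=malloc($a); memcpy($x, _, $b);', a match that binds both $a and $b to size would be rejected by this check, while a match that binds $a to some_constant and $b to size would be kept.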
Calls to memcpy that write into a stack-buffer:
weggli '{
_ $buf[_];
memcpy($buf,_,_);
}' ./target/src
Calls to foo that don't check the return value:
weggli '{
strict: foo(_);
}' ./target/src
Potentially vulnerable snprintf() users:
weggli '{
$ret = snprintf($b,_,_);
$b[$ret] = _;
}' ./target/src
Potentially uninitialized pointers:
weggli '{ _* $p;
NOT: $p = _;
$func(&$p);
}' ./target/src
Potentially insecure WeakPtr usage:
weggli --cpp '{
$x = _.GetWeakPtr();
DCHECK($x);
$x->_;}' ./target/src
Debug only iterator validation:
weggli -X 'DCHECK(_!=_.end());' ./target/src
Functions that perform writes into a stack-buffer based on a function argument.
weggli '_ $fn(_ $limit) {
_ $buf[_];
for (_; $i<$limit; _) {
$buf[$i]=_;
}
}' ./target/src
Functions with the string decode in their name
weggli -R func=decode '_ $func(_) {_;}'
Encoding/Conversion functions
weggli '_ $func($t *$input, $t2 *$output) {
for (_($i);_;_) {
$input[$i]=_($output);
}
}' ./target/src
$ cargo install weggli
# optional: install rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
git clone https://github.com/googleprojectzero/weggli.git
cd weggli; cargo build --release
./target/release/weggli
Hacking Weggli (copied from @carstein)
The goal of this document is to give a high-level overview of weggli: how it works, what its basic building blocks are, and how to navigate the source code. We hope that it will help future developers quickly grasp the main concepts and allow them to either make significant changes to the code or implement new features.
I may adapt this document to fit weggli-enhance if I find enough time. (LordCasser)
The most important external library used in weggli is Tree-sitter, a parser generator that can both transform source code into an AST (Abstract Syntax Tree) and run complex queries against that AST. Knowing how to use this library is essential if you want to add new rules or support for new programming languages.
As already mentioned, the two most important elements of this library are a parser that turns source code into an AST and a query language that finds certain patterns inside that AST.
To see how they work, consider the example below.
{
  int x = 10;
  void *y = malloc(x);
}
If we transform those two C statements into an AST, we end up with the following:
translation_unit [0, 0] - [4, 0]
  compound_statement [0, 0] - [3, 1]
    declaration [1, 2] - [1, 13]
      type: primitive_type [1, 2] - [1, 5]
      declarator: init_declarator [1, 6] - [1, 12]
        declarator: identifier [1, 6] - [1, 7]
        value: number_literal [1, 10] - [1, 12]
    declaration [2, 2] - [2, 22]
      type: primitive_type [2, 2] - [2, 6]
      declarator: init_declarator [2, 7] - [2, 21]
        declarator: pointer_declarator [2, 7] - [2, 9]
          declarator: identifier [2, 8] - [2, 9]
        value: call_expression [2, 12] - [2, 21]
          function: identifier [2, 12] - [2, 18]
          arguments: argument_list [2, 18] - [2, 21]
            identifier [2, 19] - [2, 20]
Now, if we are interested in finding certain patterns in the code, we can write a query that looks like this:
(
(declaration (init_declarator value: (call_expression (identifier) @1)))
(#eq? @1 "malloc")
)
Applying this query to the aforementioned AST will find the `malloc()` call.
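To make the idea of matching a pattern against an AST more concrete, here is a minimal Rust sketch that uses a hand-rolled toy tree instead of tree-sitter. The `Node` struct and the functions `node`, `count_declared_calls` and `example_tree` are all hypothetical names invented for this example; the real work is done by tree-sitter's query engine, whose API is not shown here.

```rust
/// A toy AST node: a kind name, optional source text, and children.
/// This is a simplified stand-in, not tree-sitter's node type.
struct Node {
    kind: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

fn node(kind: &'static str, text: &'static str, children: Vec<Node>) -> Node {
    Node { kind, text, children }
}

/// Counts declarations whose initializer is a direct call to `callee`,
/// mirroring the shape declaration > init_declarator > call_expression > identifier.
fn count_declared_calls(root: &Node, callee: &str) -> usize {
    let mut hits = 0;
    if root.kind == "declaration" {
        for init in root.children.iter().filter(|c| c.kind == "init_declarator") {
            for call in init.children.iter().filter(|c| c.kind == "call_expression") {
                // Only a direct identifier child counts as the called function.
                if call.children.iter().any(|c| c.kind == "identifier" && c.text == callee) {
                    hits += 1;
                }
            }
        }
    }
    // Recurse so that declarations nested in blocks are found as well.
    hits + root.children.iter().map(|c| count_declared_calls(c, callee)).sum::<usize>()
}

/// Builds the toy tree for `{ int x = 10; void *y = malloc(x); }`.
fn example_tree() -> Node {
    node("translation_unit", "", vec![node("compound_statement", "", vec![
        node("declaration", "", vec![
            node("primitive_type", "int", vec![]),
            node("init_declarator", "", vec![
                node("identifier", "x", vec![]),
                node("number_literal", "10", vec![]),
            ]),
        ]),
        node("declaration", "", vec![
            node("primitive_type", "void", vec![]),
            node("init_declarator", "", vec![
                node("pointer_declarator", "", vec![node("identifier", "y", vec![])]),
                node("call_expression", "", vec![
                    node("identifier", "malloc", vec![]),
                    node("argument_list", "", vec![node("identifier", "x", vec![])]),
                ]),
            ]),
        ]),
    ])])
}
```

Calling `count_declared_calls(&example_tree(), "malloc")` finds the single declaration initialized by a malloc call; the `x` inside the argument list is not mistaken for the callee because it is not a direct child of the call expression.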
The life of a weggli query begins when the user provides a set of parameters to the executable. The most important ones are Pattern and Path. Pattern is an expression in the weggli query language, which closely resembles C/C++ with a small number of extra features. Path is the file or directory that is searched for the pattern. Parameter extraction happens in `cli::parse_argument()`, which stores all the results in an `Args` structure. Besides the parameters already mentioned, a number of supplementary ones are captured as well. You can find out more about them by reading the `src/cli.rs` file.
The user-provided pattern is first sent to the `parse_search_pattern()` function, where it is normalized (fixing a missing semicolon or missing curly braces) and validated. After normalization we end up with a tree-sitter AST of our pattern, represented by a `Tree` type. Validation happens inside the `validate_query()` function; its main objective is to verify that the pattern has no syntax errors and that it is rooted correctly. If there are no errors, the function returns a `TreeCursor` that points to the root node of the pattern's AST.
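The normalization step can be pictured with a small Rust sketch. It is a guess at the general idea based only on the description above (add a missing semicolon, add missing curly braces); the function name `normalize_pattern` and its exact rules are assumptions, and the real logic in parse_search_pattern() may well differ.

```rust
/// Hypothetical normalization: append a missing `;` and wrap a bare
/// statement in `{ }`. Not weggli's actual code.
fn normalize_pattern(pattern: &str) -> String {
    let mut p = pattern.trim().to_string();
    // A pattern that already ends a statement or a block is left alone.
    if !p.ends_with(';') && !p.ends_with('}') {
        p.push(';');
    }
    // Wrap bare statements so that a parser would see a compound statement.
    if !p.ends_with('}') {
        p = format!("{{{}}}", p);
    }
    p
}
```

Under these assumed rules, `memcpy(_,_,_)` becomes `{memcpy(_,_,_);}`, while a pattern that is already a block or a function definition is returned unchanged.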
In weggli, a correctly rooted expression has a single root of one of the expected types: a single compound statement, a function definition, or a valid `struct`, `enum`, `union` or `class`.
The real heavy lifting starts when we pass the cursor to `builder::build_query_tree()`, the function responsible for turning our AST into a tree-sitter search query. This query resides in a `QueryTree`, along with variables, captures and negations. The important part is that a single user pattern will usually result in a tree of sub-queries. The main reason is the tree-sitter query language's inability to search iteratively. A typical example is a nested function call like `int x = func_1(func_2(buf))`: searching for `func_1($buf)` would miss the nested calls.
When the `QueryTree` is ready, we put it into a `WorkItem` together with all the defined identifiers, such as function names, variables and types.
Captures are simply variables like `$var` that we have defined in our pattern. A negation is a negative query that is later used to filter out results that match that particular branch.
When our pattern has finally been transformed into a tree of tree-sitter queries and the set of files to be scanned is fixed, we are ready to start our workers.
We begin with `parse_files_worker()` as our first line of workers. Here a pool of threads processes the files we have defined as our target. Processing happens in two steps. In the first step we simply check whether the raw file contains any of the identifiers we are interested in. If it does not, the file is skipped; otherwise it is parsed into an AST, again using tree-sitter, and sent to the second line of workers via an mpsc channel.
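The two-step idea (a cheap text check first, then handing survivors to a second stage over a channel) can be sketched with the Rust standard library alone. This is a simplified illustration: it uses one producer thread rather than a pool, it does not parse anything, and the function name `prefilter_and_forward` is invented for this example. The "any identifier" condition follows the wording above; the condition in the actual source may be stricter.

```rust
use std::sync::mpsc;
use std::thread;

/// First-stage sketch: keep only sources that mention at least one identifier
/// and forward them over a channel to a second stage.
/// `files` holds (file name, raw source) pairs.
/// Returns the names of the files that reached the second stage, sorted.
fn prefilter_and_forward(files: Vec<(String, String)>, identifiers: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<(String, String)>();

    // Stage one: cheap substring check on the raw source text.
    let producer = thread::spawn(move || {
        for (name, source) in files {
            if identifiers.iter().any(|id| source.contains(id.as_str())) {
                // A real worker would parse `source` into an AST before sending.
                tx.send((name, source)).expect("receiver is alive");
            }
        }
        // `tx` is dropped when the closure ends, which closes the channel.
    });

    // Stage two: receive until the channel closes.
    let mut reached: Vec<String> = rx.iter().map(|(name, _)| name).collect();
    producer.join().expect("producer thread panicked");
    reached.sort();
    reached
}
```

The point of the pre-filter is that a substring search is far cheaper than parsing, so files that cannot possibly match never reach the parser.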
The `execute_queries_worker()` function starts the second line of workers. Their main task is to receive the ASTs of the target files and apply the set of queries from the `WorkItem` to them.
Running a query against a given AST also happens in multiple stages. The starting point is `QueryTree.match_internal()`; the stages that follow usually involve filtering out duplicates and enforcing some limit.
If we are running multiple queries, there is also a third line of workers, spawned by the `multi_query_worker()` function. The main job of this worker is to collect all independent results and filter them, checking whether the variable assignments are valid for all the queries. Regardless of whether we have gone through this last line of workers, we end up with an array of `QueryResult` objects that represent all our findings.
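The check that variable assignments are valid across several queries can be illustrated with a short Rust sketch. It assumes, purely for illustration, that each result carries a map from variable names to the text they matched; the type alias `Bindings` and the function `merge_bindings` are hypothetical and do not correspond to names in the weggli source.

```rust
use std::collections::HashMap;

/// Variable name (e.g. "$x") mapped to the source text it matched.
/// Hypothetical representation used only in this sketch.
type Bindings = HashMap<String, String>;

/// Merges the variable bindings of two query results.
/// Returns None when a variable shared by both is bound to different text,
/// meaning the two results cannot belong to the same combined match.
fn merge_bindings(a: &Bindings, b: &Bindings) -> Option<Bindings> {
    let mut merged = a.clone();
    for (var, value) in b {
        if let Some(existing) = merged.get(var) {
            if existing != value {
                return None;
            }
        }
        merged.insert(var.clone(), value.clone());
    }
    Some(merged)
}
```

Two results that both bind $x to buf merge cleanly, and any extra variables from either side are carried over; if one result binds $x to buf and the other binds it to something else, the pair is discarded.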
Each `QueryResult` object has a `display()` method that is responsible for printing the results. It always prints the found node and the surrounding lines of code to the console, at least for now. The function also tries to merge multiple findings into one where applicable (for example, if there are two findings in the same function).
Weggli is built on top of the tree-sitter parsing library and its C and C++ grammars.
Search queries are first parsed using an extended version of the corresponding grammar, and the resulting AST is transformed into a set of tree-sitter queries in `builder.rs`.
The actual query matching is implemented in `query.rs`, which is a relatively small wrapper around tree-sitter's query engine that adds weggli-specific features.
Apache 2.0 for weggli-rs code; see LICENSE for details.
Special Terms and Conditions for weggli-enhance code.