The Beast is an experimental, flexible, declarative toolkit that reads machine-readable data from various sources and transforms it into FollowTheMoney entities.
Do not rely on it until it is out of alpha. Everything is very volatile.
The FTM proposal: alephdata/followthemoney#717
A sample mapping with plenty of comments to help you understand the idea better (beware, it's just an example and the format is subject to change): https://github.com/dchaplinsky/thebeast/blob/main/thebeast/tests/sample/mappings/ukrainian_mps.yaml
Validator for the mappings in JSON Schema format (again, a work in progress with plenty of comments): https://github.com/dchaplinsky/thebeast/blob/main/thebeast/conf/mapping_validator.json
First proposal of the mapping (obsolete, but it can give you a better idea): https://gist.github.com/dchaplinsky/8021b530ea7e44c9443afcc3318042fd
To do
- Ingest from databases (MongoDB, PostgreSQL) using SQLAlchemy or Peewee
- Tests for the databases ingest
- Basic CLI
- Signals on exceptions and a policy for incorrectly parsed entity values (drop, drop all, drop entity, reraise)
- Tests for the signals
- Stats collector (number of signals of each type, number of invalid entities, etc)
- Packaging (partially done in the packaging_and_spark_integration branch)
- Documentation (@legless, your notes will be very valuable)
- Advanced ingest routines: regex validation to discard values that do not pass the test?
- Tests for the resolver wrappers
Done
- Basic ingest for json/jsonlines/csv, both local and remote, compressed or not, singular or multiple files
- Tests for the basic ingest
- Mapping reader
- Tests for mapping reader
- Basic digest routines
- Tests for basic digest routines
- Advanced ingest routines: constant entities (think Country or Organization)
- Advanced ingest routines: backreferencing (think talking from subcollections to parent items)
- Advanced ingest routines: nested collections (think parsing involved JSON)
- Advanced ingest routines: templates (think combining fields when setting the entity field)
- Advanced ingest routines: multiple values for the entity property
- Advanced ingest routines: split string into multiple values
- Advanced ingest routines: full entity validation and red/green sorting
- Advanced ingest routines: augmentations/transformations
- Advanced ingest routines: records transformations
- Tests for records transformations
- Tests for the individual resolvers
- Tests for digest routines
- Advanced digest routines: multiprocessing
- Tests for advanced digest routines
- Basic dump routines (stdout/files)
- Basic dump routines: statements
- Tests for basic dump routines
- Tests for basic dump routines: statements
- Remove inflate/deflate and pass dicts rather than entities between digest and dump
- Python 3.11 support (https://github.com/dchaplinsky/thebeast/actions/runs/3802499820/jobs/6468041810, ICRAR/ijson#80)
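
The "basic ingest" item above (JSON lines, compressed or not) can be illustrated with a small standard-library sketch. This is not the toolkit's own code: the function name read_jsonlines is hypothetical, and detecting gzip by its magic bytes is an assumption made for this example only.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def read_jsonlines(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON lines file.

    Hypothetical helper for illustration; gzip input is detected
    by its two-byte magic number rather than by file extension.
    """
    p = Path(path)
    with p.open("rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"

    # Both openers accept text mode and an explicit encoding.
    opener = gzip.open if is_gzip else open
    with opener(p, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, records are produced one at a time and a large file never has to be loaded into memory as a whole.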
pip install -r requirements.txt
python -m pytest
The /bin/ directory contains scripts to run Beast inside a Docker container.
Use /bin/run data/mapping.yaml to run Beast with the selected mapping.
Note: the mapping and source file(s) must be in the Beast root directory or one of its subdirectories, e.g. ./data/mapping.yaml. You can't point Beast to a file outside its root directory.
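
To check this constraint before invoking the script, something like the following sketch may help. The helper is_inside_root is hypothetical and not part of Beast; it only tests whether a path resolves to a location under a given root directory.

```python
from pathlib import Path


def is_inside_root(candidate: str, root: str) -> bool:
    """Return True if candidate resolves to a location within root.

    Hypothetical helper for illustration. Relative candidates are
    interpreted relative to root; ".." segments are resolved first.
    """
    root_path = Path(root).resolve()
    candidate_path = (root_path / candidate).resolve()
    try:
        # relative_to raises ValueError when the path is not under root
        candidate_path.relative_to(root_path)
    except ValueError:
        return False
    return True
```

With a root of /srv/beast (an example path), data/mapping.yaml passes the check, while ../outside/mapping.yaml and /etc/passwd do not.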
Use /bin/tests to run the tests.
Use /bin/black to format the source files with black before contributing a pull request.