Scripts to update the community Avro flat files as described at data.pyafscgap.org.
Due to API limitations that prevent filtering joined data prior to downloading locally, community flat files in Avro format offer pre-joined data with indices which pyafscgap can use to avoid downloading all catch data or specifying individual hauls. This directory contains the scripts used to update those resources, which are available at data.pyafscgap.org.
The updater can be executed with individual scripts or in its entirety through bash. Note that some of these steps use environment variables specified in local setup.
These community files are used by default when interacting with the pyafscgap library. See pyafscgap.org for instructions. The client requests and iterates over these Avro files without the user needing to understand the underlying file format. Only the pyafscgap interface is intended to be maintained across major versions for backwards compatibility.
Prebuilt Avro files are available via HTTPS through data.pyafscgap.org. There are two subdirectories of files.
First, index contains "index data files" which indicate where catch data can be found. These indices list filenames that can be found in joined. Each index file maps from a value for the variable named in its filename to the set of joined flat files where those data can be found. Each value refers to a specific haul, and floating point values are rounded to two decimal places. Note that, due to this rounding, more precise filters will have to further sub-filter after collecting relevant data from the joined subdirectory.
Second, joined includes all catch data joined against the species list and hauls table to create a single "flat" file which fully describes all information available for each catch. Each record is a single catch and each file is a single haul where a haul takes place within a specific year and survey.
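The interaction between rounded index values and precise filters can be sketched as follows. This is a minimal illustration and not the pyafscgap implementation: the in-memory dictionary standing in for an index file, the file names, the field name depth_m, and both function names are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in for one index file: rounded value -> joined file names.
Index = Dict[float, Set[str]]


def candidate_files(index: Index, low: float, high: float) -> Set[str]:
    """Collect joined files whose rounded index value may fall in [low, high]."""
    # Round the bounds the same way the index values are described as rounded
    # (two decimal places) so that no matching haul is skipped.
    low_key = round(low, 2)
    high_key = round(high, 2)
    files: Set[str] = set()
    for key, names in index.items():
        if low_key <= key <= high_key:
            files.update(names)
    return files


def sub_filter(
    records: Iterable[dict], field: str, low: float, high: float
) -> List[dict]:
    """Apply the precise filter to records loaded from the candidate files."""
    return [record for record in records if low <= record[field] <= high]


# Hypothetical index contents and file names, for illustration only.
example_index: Index = {
    10.12: {"joined_a.avro"},
    10.13: {"joined_b.avro"},
    11.00: {"joined_c.avro"},
}
example_candidates = candidate_files(example_index, 10.124, 10.126)
```

The index lookup deliberately over-selects: any file that might contain a match is returned, and the exact bounds are only enforced by sub_filter once the records have been read.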
Note that, while provided as a service to the community, these Avro files and their directory structure may change in the future. These files exist to serve pyafscgap functionality as the NOAA APIs change over time. Therefore, for a long-term stable interface with documentation and further type annotation, please consider using the pyafscgap library instead.
To build the Avro files yourself by requesting, joining, and indexing original upstream API data, execute bash execute_all.sh after local setup. This will build the files on S3, and they can then be easily deployed to an SFTP server.
Local environment setup varies depending on how these files are used.
Simply install pyafscgap normally to have the library automatically use the flat files for queries.
These files may be used by any programming language or environment supporting Avro. For more information, see the official Avro documentation. For Python, fastavro is recommended.
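As a rough sketch of direct use from Python, the following reads records from an Avro file with fastavro. The function names and the idea of trimming each record to a few fields are assumptions for illustration. The field names present in the joined files are not listed here, so any names passed in are the caller's responsibility.

```python
from typing import IO, Iterable, Iterator, Sequence


def iter_records(fo: IO[bytes]) -> Iterator[dict]:
    """Yield each record from an open binary Avro file."""
    # Imported lazily so this module loads even without fastavro installed.
    import fastavro  # third-party: pip install fastavro

    yield from fastavro.reader(fo)


def select_fields(records: Iterable[dict], fields: Sequence[str]) -> Iterator[dict]:
    """Keep only the requested fields, using None when a field is absent."""
    for record in records:
        yield {name: record.get(name) for name in fields}


def read_selected(path: str, fields: Sequence[str]) -> list:
    """Read a local Avro file and return its records trimmed to fields."""
    with open(path, "rb") as fo:
        return list(select_fields(iter_records(fo), fields))
```

A call such as read_selected("haul.avro", ["some_field"]) would return a list of dictionaries, where both the path and the field name are placeholders.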
For manual execution, these scripts expect to use AWS S3 prior to deployment to a simple SFTP server. To use these scripts, install dependencies (optionally within a virtual environment) via pip install -r requirements.txt and then set the following environment variables:
- AWS_ACCESS_KEY: This is the access key used to upload completed payloads to AWS S3 or to request those data as part of distributed indexing and processing.
- AWS_ACCESS_SECRET: This is the secret associated with the access key used to upload completed payloads to AWS S3 or to request those data as part of distributed indexing and processing.
- BUCKET_NAME: This is the name of the S3 bucket to which completed payloads are uploaded and from which they are requested.
These may be set within .bashrc files or similar through export commands. Finally, these scripts expect Coiled to perform distributed tasks.
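Before running the scripts, it may help to confirm that the three variables above are present. The check below is a small sketch rather than part of the updater: the function name and the constant are hypothetical, and only the variable names come from the list above.

```python
import os
from typing import List, Mapping

# Variable names as listed above.
REQUIRED_VARIABLES = ("AWS_ACCESS_KEY", "AWS_ACCESS_SECRET", "BUCKET_NAME")


def missing_variables(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]
```

Calling missing_variables() with no arguments inspects the current process environment, while passing a dictionary allows the check to be exercised without touching real credentials.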
Unit tests can be executed by running nose2 within the snapshot directory.
Files generated in S3 can be trivially deployed to an SFTP server or accessed directly from AWS.
These scripts follow the same development guidelines as the overall pyafscgap project. Note that style and type checks are enforced through CI / CD systems. See the contributors documentation.
The snapshots updater uses the following open source packages:
- bokeh from Bokeh Contributors and NumFOCUS under the BSD License.
- boto3 under the Apache v2 License.
- dask from Anaconda and Contributors under the BSD License.
- fastavro by Miki Tebeka and Contributors under the MIT License.
- requests which is available under the Apache v2 License from Kenneth Reitz and other contributors.
- toolz under a BSD License.
We thank these projects for their contribution. Note that we also use coiled.
Code to generate these flat files is released alongside the rest of the pyafscgap project under the BSD License. See data.pyafscgap.org for further license details regarding prebuilt files.