sda-sync is an integration created to solve the data-syncing requirement in BigPicture. In this project, every data submission that takes place in one country's node must be mirrored in the other country's node. This means that both the submitted data files and their corresponding database identifiers (e.g. accession and dataset IDs) in one country's node should be replicated in the other country's node. This is achieved by syncing the data between the two nodes and then providing the relevant identifiers for use during ingestion.
Specifically, the files (and datasets) that are uploaded in one node are synced/backed up to the other node, while making sure that the ingestion process run on both sides results in accession and dataset IDs that are the same in both nodes. To achieve this, the sync integration utilizes the sensitive-data-archive sync and sync-api services.
In brief, syncing is triggered whenever a local dataset submission, i.e. a dataset that is created on the local node, is detected. The detection is based on the dataset prefix, which is unique to each country's node. In such a case, the sync service copies the dataset files to the inbox of the other node after re-encrypting them with the recipient node's crypt4gh public key. It also posts JSON messages to the receiving node's sync-api service, from which sync-api creates and sends the RabbitMQ messages needed to orchestrate the ingestion cycle on the receiving side.
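For orientation, the JSON posted by the sync service looks roughly like the sketch below. This is purely illustrative: the field names and structure are assumptions about the sensitive-data-archive sync-api and may differ between versions.
# illustrative shape of a sync message (all field names are assumptions)
cat <<'EOF'
{
  "dataset_id": "EGA-dataset-001",
  "dataset_files": [
    { "filepath": "test_dummy.org/file.test.c4gh", "file_id": "..." }
  ],
  "user": "test_dummy.org"
}
EOF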
The sync-api service is triggered whenever it receives a JSON message from the sync service of the other node. It then creates the RabbitMQ messages needed to trigger the ingestion cycle on the receiving node.
The example integration in the docker-compose.yml assumes that both nodes run S3-based inboxes, but POSIX and SFTP inbox types are also supported and may be configured by uncommenting the noted code segments of the compose file.
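Before starting anything, it can help to confirm that the compose file (including any segments you have uncommented) still parses; this uses only standard Docker Compose functionality:
# validate the compose file without starting any services
docker compose config --quiet && echo "compose file OK"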
The services for the two nodes can be started in parallel.
Start the services in Sweden:
docker compose --profile sda-sweden up
Start the services in Finland:
docker compose --profile sda-finland up
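If you prefer not to keep two terminals attached, the same profiles can be run detached and inspected through the logs. The service name below is an assumption about how the compose file names its services, so substitute the names it actually defines:
docker compose --profile sda-sweden up -d
# follow the logs of one service, e.g. the sync-api (name is an assumption)
docker compose --profile sda-sweden logs -f sync-api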
Note: the version of the sensitive-data-archive services for the whole stack can be changed by editing the .env file that is located in the root of the repo.
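To see which version the stack is currently pinned to, inspect the file before editing it; this is plain shell and makes no assumptions about the variable names used inside:
# print the non-comment lines of the version file
grep -v '^#' .env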
There is a script under the dev_utils folder that cleans up the files folder and exports the keys for encryption. Running the script should download the crypt4gh keys that allow encrypting the file.test included in the dev_utils/tools folder:
./prepare.sh
Note: This script should be run every time the services are re-run.
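After the script finishes, the downloaded key material should be visible under the keys folder. The repo.pub.pem key is used for encryption later in this guide; any other file names the listing shows depend on what the script fetches:
# list the exported crypt4gh keys (run from the dev_utils folder)
ls keys/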
The next step is to encrypt and upload the file to the s3Inbox of the Swedish node, using the sda-cli tool, which can be downloaded from its GitHub releases page. For example, to get the Linux release, run the following commands from the dev_utils/tools folder:
wget -q https://github.com/NBISweden/sda-cli/releases/download/v0.1.0/sda-cli_.0.1.0_Linux_x86_64.tar.gz
tar -xf sda-cli_.0.1.0_Linux_x86_64.tar.gz sda-cli && rm sda-cli_.0.1.0_Linux_x86_64.tar.gz
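To confirm the binary unpacked correctly, run it with no arguments; it should print a usage message listing the available commands (the exact output depends on the sda-cli version):
./sda-cli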
From the dev_utils/tools folder, run:
./sda-cli upload --config ../s3cfg -encrypt-with-key ../keys/repo.pub.pem file.test
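Assuming sda-cli behaves as in other SDA setups, the encryption step leaves a crypt4gh-encrypted copy of the file locally, and that encrypted file is what gets uploaded; the check below simply confirms it was produced:
# the encrypted file should now exist alongside the original
ls file.test.c4gh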
Now that the file is uploaded to the S3 backend (which can be checked by logging into MinIO via the browser at localhost:9500 and making sure the file is in the inbox bucket), the ingestion process needs to be initiated.
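As a command-line alternative to the browser check, any client that understands the s3cfg format can list the inbox. The example below assumes s3cmd is installed and that the bucket is named inbox; both are assumptions, so adjust them to your setup:
# list the uploaded objects using the same credentials file as sda-cli
s3cmd -c ../s3cfg ls -r s3://inbox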
Initiating the ingestion can be achieved using the sda-admin tool located at dev_utils/tools. The script has detailed documentation; however, here are the main commands needed to ingest this specific file. First, ingest the file by running the following command:
./sda-admin --mq-queue-prefix sda --user test_dummy.org ingest file.test.c4gh
where test_dummy.org is the <USER-ELIXIR-ID> taken from the s3cfg file.
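If you are unsure which user to pass, the value can usually be read from the credentials file; the assumption here is that, as in other SDA development setups, the access_key entry of s3cfg carries the user identity:
# the user name is typically encoded in the access_key entry
grep access_key ../s3cfg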
To check that the file has been ingested, run:
./sda-admin --mq-queue-prefix sda --user test_dummy.org accession
You should be able to see the file in the list, similar to:
file.test.c4gh
To give an accession ID to this file, run the following command, replacing <ACCESSION-ID>:
./sda-admin --mq-queue-prefix sda --user test_dummy.org accession <ACCESSION-ID> file.test.c4gh
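As a concrete illustration, in this development setup the accession ID is simply whatever unique identifier you choose; the value below is made up purely for the example:
./sda-admin --mq-queue-prefix sda --user test_dummy.org accession EGAF00000000001 file.test.c4gh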
Finally, to create a dataset including this file, run the following command, replacing <DATASET-ID>.
NOTE: The <DATASET-ID> should include the centerPrefix defined under config.yaml and it should be at least 11 characters long:
./sda-admin --mq-queue-prefix sda --user test_dummy.org dataset <DATASET-ID> file.test.c4gh
For example, if the centerPrefix value is EGA, the <DATASET-ID> should be of the format EGA-<SOME-ID>.
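Putting the rules together, a valid invocation with the EGA prefix could look like the following; the dataset ID itself is an arbitrary illustrative value that satisfies both the prefix and the length requirement:
# EGA-dataset-001 starts with the EGA prefix and is 15 characters long
./sda-admin --mq-queue-prefix sda --user test_dummy.org dataset EGA-dataset-001 file.test.c4gh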
Once the dataset is created, the Swedish sync service should send the required messages to the Finnish sync-api endpoint, which should start the ingestion on that side.
To make sure that everything worked, apart from the docker logs, you can check the bucket on the Finnish (receiver) side and make sure that the file exists in the archive. Also check that the file and the dataset exist in the database of the Finnish side.
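One way to run the database check from the host is through the Finnish database container. The container name, database user, database name, and table below are assumptions based on common sensitive-data-archive setups, so adapt them to what the compose file actually defines:
# list registered datasets on the Finnish side (all names are assumptions)
docker exec -it postgres-finland psql -U postgres -d sda -c 'SELECT stable_id FROM sda.datasets;'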