1 unstable release: 0.1.0 (Dec 15, 2024)
rrdnsd
Distributed monitoring for Round Robin DNS load balancing and high availability.
rrdnsd monitors the reachability of HTTP[S] services and updates DNS records accordingly. It is lightweight and easy to configure: it can run on a small single-board computer (SBC) and also scale up to hundreds of services. For increased reliability, multiple instances can run together using a quorum protocol.
This project is proudly supported by NLnet with sincere gratitude.
The main project website is https://rrdnsd.eu/
Downloads, codebase and bug tracker: https://codeberg.org/FedericoCeratto/rrdnsd
Description
When running Internet services it is common to configure multiple 'A' or 'AAAA' DNS records for the same FQDN in order to provide simple forms of load balancing and failover. Clients then connect to one of the IP addresses for that FQDN, chosen randomly or in round-robin fashion. If any of the IP addresses providing the service becomes unreachable, the related DNS record can be removed. Once the DNS update has propagated all the way to the clients, they stop using the unreachable IP address.
This method complements, rather than replaces, the "traditional" TCP/HTTP[S] load balancers that sit between clients and servers.
It provides resiliency against small and large network failures especially when used for geographically distributed services.
rrdnsd is not a DNS resolver. It creates and deletes records on authoritative DNS resolvers and cloud services.
.Features:
- Multiple DNS update methods:
  - the nsupdate protocol, used by popular DNS resolvers including:
    - BIND
    - PowerDNS
    - NSD
    - Unbound
    - various cloud/SaaS DNS services
  - the knsupdate protocol, used by Knot DNS
  - managed DNS APIs:
    - Dynu
  - custom scripts / plugins
- Minimalistic status web dashboard (demo)
- Statsd metrics
- Journald logging
- Security sandbox
Statsd metrics
rrdnsd emits StatsD metrics that external monitoring tools can use to raise alarms on service outages and to monitor rrdnsd itself.
.Generated metrics:
- fetch.cnt - internal
- probe.duration - internal
- probe.failure.cnt - Failed service probes count
- probe.load_factor - debugging
- probe.success.cnt - Successful service probes count
- probe_time_msec - Service probing elapsed time in milliseconds
- received_update.cnt - internal
- status_change.cnt - Endpoint status change count
Metrics described as debugging or internal are subject to change without notice.
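As a rough illustration of how such metrics look on the wire, the sketch below formats two of the metric names listed above in the plain-text StatsD line format (`name:value|type`). The helper names, the sample values, and the choice of the `c` (counter) and `ms` (timer) types are assumptions made for illustration; they are not taken from the rrdnsd source.

```rust
// Hypothetical helpers: render metric names from the list above as StatsD lines.
// The metric types (counter vs. timer) are assumed, not confirmed by rrdnsd.

/// Format a StatsD counter increment, e.g. "probe.success.cnt:1|c".
fn counter_line(name: &str, value: u64) -> String {
    format!("{name}:{value}|c")
}

/// Format a StatsD timer expressed in milliseconds, e.g. "probe_time_msec:42|ms".
fn timer_line(name: &str, millis: u64) -> String {
    format!("{name}:{millis}|ms")
}

fn main() {
    println!("{}", counter_line("probe.failure.cnt", 1));
    println!("{}", timer_line("probe_time_msec", 42));
}
```

A StatsD-compatible collector typically receives each such line as one UDP datagram.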
Glossary:
- Service: a public-facing service identified by a unique FQDN in the configuration file.
- Endpoint: an IP address providing a service. A service should have many endpoints. An endpoint might provide multiple services.
- Node: an instance of rrdnsd identified by a unique IP address/port pair.
Use case of a simple setup
A client (e.g. a browser) accesses a service running at two IP addresses (endpoints).
rrdnsd monitors the availability of the endpoints. When it detects that one endpoint has become unreachable, it updates the DNS resolver to delete the related A record. When the downtime ends, rrdnsd publishes the A record again.
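The record deletion and re-publication described above can be pictured as a short command script fed to `nsupdate`. The sketch below builds such a script; the function name and the sample values are hypothetical, and the exact commands rrdnsd sends may differ.

```rust
/// Build an nsupdate input script that adds or deletes a single A record.
/// Hypothetical helper for illustration only; rrdnsd's real updater may differ.
fn nsupdate_script(
    server: &str,
    port: u16,
    zone: &str,
    fqdn: &str,
    ttl: u32,
    ip: &str,
    add: bool,
) -> String {
    let update = if add {
        // Publish the record again once the endpoint is back.
        format!("update add {fqdn} {ttl} A {ip}")
    } else {
        // Remove only the record pointing at the unreachable endpoint.
        format!("update delete {fqdn} A {ip}")
    };
    format!("server {server} {port}\nzone {zone}\n{update}\nsend\n")
}

fn main() {
    // Endpoint 127.0.0.2 went down: delete its A record.
    print!("{}", nsupdate_script("127.0.0.1", 5454, "rrdnsd.test", "rrdnsd.test", 5, "127.0.0.2", false));
    // Endpoint recovered: add the A record back.
    print!("{}", nsupdate_script("127.0.0.1", 5454, "rrdnsd.test", "rrdnsd.test", 5, "127.0.0.2", true));
}
```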
Use case of redundant nodes
Again the service is available at two endpoints. Three rrdnsd nodes are monitoring the endpoints from different locations on the Internet.
Scenario 1
The endpoint on the left becomes unreachable from 2 nodes due to a large network outage.
Also, the endpoint on the right is unreachable from one node due to a localized issue near the rightmost node.
The nodes vote by majority and decide to remove the A record of the left endpoint from DNS. They also decide that the endpoint on the right is likely to be in good shape and keep the related A record in DNS.
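The majority decision in this scenario can be sketched as follows. This is a simplified model, assuming each node contributes one reachable/unreachable observation per endpoint and that a strict majority of nodes is required to remove a record; the function name is hypothetical and the real quorum protocol may work differently.

```rust
/// Return true when a strict majority of nodes report the endpoint as unreachable.
/// `reports[i]` is true if node i could reach the endpoint.
fn should_remove_record(reports: &[bool]) -> bool {
    let unreachable = reports.iter().filter(|&&reachable| !reachable).count();
    unreachable * 2 > reports.len()
}

fn main() {
    // Left endpoint: unreachable from 2 of 3 nodes.
    let left = [false, false, true];
    // Right endpoint: unreachable from 1 of 3 nodes.
    let right = [true, true, false];
    println!("remove left record: {}", should_remove_record(&left));
    println!("remove right record: {}", should_remove_record(&right));
}
```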
Scenario 2
A DNS resolver or DNS API receiving updates is shown on the left.
The leftmost node is not running (e.g. due to a power failure). The two live nodes are still able to probe the endpoints, communicate with each other and react to changes.
Additionally, the node in the center is unable to reach the resolver/API (e.g. due to network congestion). Yet, rrdnsd is still able to work as the rightmost node can reach the resolver/API.
Usage
Building natively on Debian Bookworm or Sid
Generates a native .deb package:
[source,console]
sudo apt-get install rust
dpkg-buildpackage -us -uc -b
Building on Debian Bookworm using Podman
This generates a minimalistic container, localhost/rrdnsd, to run rrdnsd:
[source,console]
make podman-build-bookworm
# You can optionally extract the binary:
podman run --rm --entrypoint cat localhost/rrdnsd:bookworm /app/rrdnsd > ./rrdnsd_bookworm
Deployment
Install the Debian package locally, configure /etc/rrdnsd.json, then:
[source,console]
sudo apt install ./rrdnsd.deb
sudo systemctl start rrdnsd.service
sudo journalctl -f
Security
Do not expose the rrdnsd API port directly on the Internet. Place it behind a reverse proxy like Nginx and enable TLS.
You can further restrict the API entry points:
- POST /quorum/v1/*: Required, used only between rrdnsd nodes
- GET /dash: Web dashboard for administrators, optional
- GET /health: Basic internal healthcheck, optional
- POST /api/v1/*: Administration API, optional
Configuration
A simple setup running on one node:
update_method can be:
- nsupdate: Use /usr/bin/nsupdate to send an update to one or more DNS resolvers.
- knsupdate: Similarly to nsupdate, use /usr/bin/knsupdate for Knot DNS.
- dynu: Uses the Dynu API. Configure update_credentials with the Dynu token.
Example configuration file:
[source,json]
{
    "_comment": "Keys starting with underscore are ignored.",
    "conf_version": 1,
    "_local_node": "IP address and port of this rrdnsd node.",
    "local_node": "127.0.0.1:3333",
    "_nodes": "IP address and port of this node and its peers, if any.",
    "nodes": [
        "127.0.0.1:3333"
    ],
    "nodes_protocol": "http",
    "services": [
        {
            "fqdn": "rrdnsd.test",
            "_healthcheck": "Protocol, port and HTTP path to test. {} is replaced with the endpoint IP address.",
            "healthcheck": "http://{}:8778/",
            "ipaddrs": [
                "127.0.0.2",
                "127.0.0.3"
            ],
            "ttl": 5,
            "zone": "rrdnsd.test"
        }
    ],
    "probe_interval_ms": 1000,
    "update_method": "nsupdate",
    "_update_resolvers": "List of DNS resolvers and port for nsupdate or knsupdate.",
    "_update_credentials": "Token, password or username:password to authenticate updates",
    "update_credentials": "",
    "update_resolvers": [
        "127.0.0.1:5454"
    ]
}
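To make the healthcheck template concrete, the sketch below expands the `{}` placeholder once for each address in ipaddrs. The function name is hypothetical, and the plain string substitution shown here is an assumption about how the template is interpreted.

```rust
/// Expand a healthcheck template by substituting each endpoint IP address for "{}".
/// Hypothetical helper; only the first "{}" in the template is replaced.
fn probe_urls(template: &str, ipaddrs: &[&str]) -> Vec<String> {
    ipaddrs
        .iter()
        .map(|ip| template.replacen("{}", ip, 1))
        .collect()
}

fn main() {
    // Values taken from the example configuration above.
    for url in probe_urls("http://{}:8778/", &["127.0.0.2", "127.0.0.3"]) {
        println!("{url}");
    }
}
```

With the example configuration, each probe round would therefore target one URL per endpoint of the service.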
local_node can be overridden using the environment variable LOCAL_NODE.
The default config path is /etc/rrdnsd.json and can be overridden using the environment variable CONF.
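The override rules can be summarized in a few lines of Rust. This is a minimal sketch, assuming that a set environment variable always takes precedence; the function names are hypothetical and this is not the actual rrdnsd startup code.

```rust
use std::env;

/// Default configuration path, as stated above.
const DEFAULT_CONF: &str = "/etc/rrdnsd.json";

/// Pick the config path: the CONF value wins when present, else the default.
fn conf_path(conf_env: Option<String>) -> String {
    conf_env.unwrap_or_else(|| DEFAULT_CONF.to_string())
}

/// Pick the local node: the LOCAL_NODE value wins over the value from the config file.
fn local_node(local_node_env: Option<String>, from_config: &str) -> String {
    local_node_env.unwrap_or_else(|| from_config.to_string())
}

fn main() {
    let path = conf_path(env::var("CONF").ok());
    // "127.0.0.1:3333" stands in for the local_node value read from the config file.
    let node = local_node(env::var("LOCAL_NODE").ok(), "127.0.0.1:3333");
    println!("config: {path}, local node: {node}");
}
```

For example, running with CONF=integ/rrdnsd.json and LOCAL_NODE=127.0.0.1:8000 would select that file and that node address instead of the defaults.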
Journald logging
Logs are sent to the local journald instance and are often prefixed by tags like [main], [load_conf], [updater] to identify where they come from.
.Log example:
[source]
Dec 05 10:57:19 tux rrdnsd[2432198]: [main] config overridden by envvar CONF: integ/rrdnsd.json
Dec 05 10:57:19 tux rrdnsd[2432198]: [load_conf] reading integ/rrdnsd.json
Dec 05 10:57:19 tux rrdnsd[2432198]: [main] config overridden by envvar LOCAL_NODE: 127.0.0.1:3333
Dec 05 10:57:19 tux rrdnsd[2432198]: [updater] started
Additionally, the file, module and line of the originating source code are included in each log entry.
Use journalctl -o verbose to display them.
.Example:
[source,console]
CODE_FILE=src/main.rs
CODE_MODULE=rrdnsd
CODE_LINE=823
To follow logs from the systemd service:
[source,console]
sudo journalctl -u rrdnsd -f
To follow logs from rrdnsd when it is run in userspace, e.g. during development:
[source,console]
sudo journalctl --identifier rrdnsd -f
Development
rrdnsd is under development; contributions and testing are welcome.
Codebase documentation is published at https://rrdnsd.eu/codebase_doc/rrdnsd/
Bugtracker: https://codeberg.org/FedericoCeratto/rrdnsd/issues
.Roadmap:
- Optimize connection reuse and lifetime
- Log a warning if a whole service goes down
- Add end-to-end healthchecks e.g. /health
- Fail-open: if more than 50% of endpoints are down keep them in DNS
- Support active-standby pattern
- Support IPv6 (AAAA records)
- Exit gracefully
- Support running custom scripts
- Support calling webhooks
- Add an API to:
  - fetch status
  - fetch events
  - add/remove services and endpoints
  - add/remove services nodes
  - manual failover
  - drain endpoints
  - flag services as under maintenance
  - "refresh" an FQDN by adding again all live endpoints and then deleting the unreachable ones
- Add TCP testing
- Service-specific probing interval
- Support DNS APIs
  - Dynu
  - Digital Ocean
Testing
Run basic unit tests using:
[source,sh]
cargo test
Run a test instance:
[source,sh]
RUST_BACKTRACE=1 CONF=integ/rrdnsd.json LOCAL_NODE=127.0.0.1:8000 target/debug/rrdnsd
Run full integration tests using the following command.
Warning: this starts/stops knotd, runs sudo tc ...
[source,sh]
cargo test --test integration_test