In a recent blog post, I discussed a few of the reasons why in 2021, I began working on a new Fleet Configuration Management tool to replace SaltStack. However, I did not go into much detail about what the replacement is, how it works, or the roadmap into the future.
As mentioned in the prior post, the first inspiration to build a new tool arose from requirements around constrained-memory environments. In our typical install, salt-minion at idle required between 140 MB and 300 MB of memory. (This wide variance results from several factors, including the method of measurement, the use case, the system and service uptime, and the versions of salt-minion and Python being run. Regardless, this is simply too much memory for Salt to be usable in many situations.)
For widely-deployed, 1 GB IoT systems, sacrificing 10-30% of all available memory for a background task is not ideal. But in a world where even wristwatches ship with 2 GB of RAM, and IoT components are getting cheaper by the month, maybe 300 MB is not that big of a deal?
For the virtual machine use case, I’d argue it’s still very important. Linode (now Akamai) offers VPS plans starting at 1 GB of memory, and Amazon Lightsail’s smallest offering comes with only half a gigabyte. Imagine scaling horizontally across many nodes, only for half of the memory you paid for to be eaten up by a system management tool running in the background.
Without repeating too much from the prior blog post, I challenge the reader to pause for a moment and consider the hurdles with their own configuration management software, memory requirements aside: how easy was it to get started? What kind of maintenance is required to keep things running and secure? Does it break? How often? What is the DX (Developer eXperience) like? Is the current solution easy (and safe!) to use at 2am for an emergency rollback from your parents’ house on holiday?1
When I set out to build this tool from scratch, I saw it as a golden opportunity to not only solve existing problems but also to shape the tool according to my vision. The goal was clear: to develop a tool that functions seamlessly, aligning to the needs of the landscape, including the creature comforts missing from other options. It should be low overhead, offer familiarity with existing tools, be secure by default, easily scalable, and resilient to OS upgrades (read: runtime dependency-free).
Introducing grlx - Effective Fleet Configuration Management
I’ve mentioned this one a lot already, but it bears repeating: the primary reason we pursued building grlx was the unavoidable memory footprint of interpreted language ecosystems. While our first instinct was to simply take an existing tool and fork it, paring down the featureset and optimizing memory, we quickly realized this would be an exercise in futility: we might see slight improvement, but nothing on the order of magnitude we were hoping for. The CPU usage of grlx should also be lower for the same reason: no JIT compilation or JVM (looking at you, Clojure) taking up intermediate cycles.
Dependency-free by default
A tool written in Python or Ruby depends on the environment’s Python or Ruby packages for its own stability.
When a primary goal of a system and configuration management tool is to keep the system up to date, and updating the system involves updating system libraries such as Python or Ruby libraries, there is always a small but non-zero risk that a breakage in the ecosystem APIs will crash the management tool and prevent it from restarting, leaving the node (or worse, the whole fleet) stranded and unresponsive.
Python 3.7, for example, introduced new reserved words to the language (`async`), which subsequently broke many packages that hadn’t been updated in time.
Typically, these ecosystems move slowly enough for maintainers to update in time, but it’s never a good idea to be stuck choosing between waiting for a new software release and leaving your ecosystem out-of-date and vulnerable just to prevent breakage.
Scalable and Fault-tolerant
Built using NATS.io, grlx can take advantage of NATS’s clustering and fault-tolerance features. Using only a single farmer system, you can join several more NATS Server instances to the cluster and have them all share the load of incoming messages. Read more about NATS clustering here.
Secure by default
Did you know that it’s a violation of SaltStack best practices (specifically the linked Hardening Rules page) to use Salt over the internet without taking precautionary encryption measures? It is commonly recommended to use a tool like stunnel or WireGuard® to add an extra layer between minions and the master.
With grlx, there’s no need for a third-party encryption suite to stay secure. All communications from CLI to farmer, and from farmer to sprout, are encrypted using self-signed TLS certificates and NATS.io NKey encryption. These certificates are pinned to the clients on first connection as an extra security precaution.
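To make the pinning idea concrete, here is a minimal sketch of trust-on-first-use fingerprinting in Go. This illustrates the general technique only; the function names and the stand-in certificate bytes are mine, not grlx’s actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes a certificate's DER bytes into a stable,
// comparable identifier.
func fingerprint(certDER []byte) string {
	sum := sha256.Sum256(certDER)
	return hex.EncodeToString(sum[:])
}

// verifyPinned rejects any certificate whose fingerprint differs from
// the one stored on first connection (trust-on-first-use).
func verifyPinned(certDER []byte, pinned string) error {
	if got := fingerprint(certDER); got != pinned {
		return fmt.Errorf("certificate fingerprint %s does not match pinned %s", got, pinned)
	}
	return nil
}

func main() {
	cert := []byte("example-cert-der-bytes") // stand-in for real DER bytes
	pin := fingerprint(cert)                 // stored on first connection

	fmt.Println(verifyPinned(cert, pin) == nil)            // same cert on reconnect: accepted
	fmt.Println(verifyPinned([]byte("other"), pin) != nil) // swapped cert: rejected
}
```

Pinning on first connection means a later certificate swap, legitimate or not, is loudly rejected rather than silently accepted.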
Easy to set up
Easy to automate. Need it be said again? Only one binary (and one configuration file) is required to provision a sprout (node).
Easy to Understand
Go is often touted as a simple language. All grlx code is written in Go (except for the Web UI), and we explain how both the overall system and the individual components work in the documentation. Auditing the grlx codebase is easy, and most of the source code can be read through in a day or so.
Our non-stdlib build dependencies are selective, kept to as short a list as reasonably possible, so there’s little happening “elsewhere.” We follow best practices, strive for high test coverage, and keep an eye on our Go report card.
It would be crazy to install a tool on a fleet of systems and give it root access without the ability to do some introspection on what’s happening under the hood. All the source is available on GitHub, where I encourage you to give it a star!
Not only is grlx open and free, it’s also permissively licensed as 0BSD and is therefore compatible with nearly any enterprise company’s policy on OSS software.
Keep in mind, the logos, the mascot (also referred to as “Clove”), the name “grlx” itself, and the overall brand are copyrighted, all rights reserved, etc. Do not try to pass off grlx as your own–but forking and renaming is allowed!
Supported with Corporate Backing and Buy-in
grlx has entered into an agreement with ADAtomic, Inc., to offer official, commercial support. If this is something that would be valuable to your company or organization, please contact ADAtomic here.
Additionally, grlx is already seeing production use from several companies, listed in the README. As mentioned above, grlx development is being driven by a real need at our organization, and is seeing heavy usage in our own lineup of products.
Extensible Plugin System
To state the obvious, not every possible feature is required by every customer. To keep grlx light and fast, all endpoints and features take advantage of dependency injection, and there is a well-defined interface already in place for loading plugins at runtime. These plugins can offer features such as supporting obscure package managers or downloading files from uncommon endpoints. In short order, two example file endpoint provider plugins will be released: one to download files from IPFS, and another to download from a BitTorrent magnet link. Hopefully, these examples indicate both the reason for and the capability of the plugin system. Runtime-loaded plugins are also a boon for development and testing, as mainline code destined for grlx can start as a plugin and get merged over time.
At first, the plugin interface will only support Go plugins, but support for WASM plugins is on the roadmap, allowing for plugin development in nearly any compiled language.
To be clear, grlx is dependency-free by default, but we recognize there are some use cases where optional plugins may offer increased functionality without adding significant hooks into environmental state. Both Go and WASM plugins can be distributed as single files, dropped into a directory and hot-reloaded automatically at runtime.
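As a rough sketch of what a file endpoint provider plugin might look like in Go (the interface name and methods here are hypothetical; the real plugin interface is defined in the grlx source):

```go
package main

import "fmt"

// FileProvider is a hypothetical sketch of a grlx file-endpoint plugin
// interface; the actual interface in the grlx source may differ.
type FileProvider interface {
	// Scheme reports the URL scheme this provider handles, e.g. "ipfs".
	Scheme() string
	// Fetch retrieves the file at src and returns its contents.
	Fetch(src string) ([]byte, error)
}

// memProvider is a toy in-memory implementation used for illustration.
type memProvider struct {
	files map[string][]byte
}

func (m memProvider) Scheme() string { return "mem" }

func (m memProvider) Fetch(src string) ([]byte, error) {
	data, ok := m.files[src]
	if !ok {
		return nil, fmt.Errorf("mem: no such file %q", src)
	}
	return data, nil
}

func main() {
	// The host would discover providers at runtime; here we just wire
	// one up directly through the interface.
	var p FileProvider = memProvider{files: map[string][]byte{
		"motd": []byte("hello from a plugin"),
	}}
	data, err := p.Fetch("motd")
	fmt.Println(p.Scheme(), string(data), err)
}
```

Because callers only see the interface, a provider compiled as a Go plugin (and later, WASM) can be swapped in without touching mainline code.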
Configurable Message Delivery Rules
Some tools only allow deploys to go out the door to nodes that are online at the time of the push. If a node isn’t online right now, you have to either watch and wait for it to come online, or write scripts to programmatically check for the next time it does and push the release then. In a datacenter, this is not a big deal, as you should have extremely high (near 100%) uptime and connectivity between the CNC server and the nodes. For an IoT deployment, on the other hand, connectivity can be tricky, and the aforementioned scripts must come into use.
However, once you start writing them, you might realize that there’s a lot more involved regarding error checking, reporting, logging, etc., than the simplest `while ! online; do sleep 60; done && push` script can handle.
You end up writing a whole deployment framework around trying to catch your node when it’s online and push to it at that exact moment, or maybe you add a cron job to the node itself to trigger an agentless or standalone mode periodically.
This whole mess could have been avoided if the configuration management tool handled it for you, storing deploy jobs in persistent queues to ensure delivery on the next reconnection. NATS.io’s concept of Durable Queues allows for exactly this behavior.
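To illustrate the semantics (not the implementation; NATS handles the real persistence and delivery), here is a toy model in Go of a durable deploy queue:

```go
package main

import "fmt"

// deployQueue is a toy model of the durable-queue semantics described
// above: jobs pushed while a sprout is offline are retained and handed
// over, in order, when it reconnects. This sketch only illustrates the
// behavior; it is not how grlx or NATS implements it.
type deployQueue struct {
	pending []string
}

// Push records a job even if the target is currently offline.
func (q *deployQueue) Push(job string) {
	q.pending = append(q.pending, job)
}

// Drain delivers all retained jobs on reconnection, oldest first.
func (q *deployQueue) Drain() []string {
	jobs := q.pending
	q.pending = nil
	return jobs
}

func main() {
	var q deployQueue
	q.Push("cook nginx") // sprout offline: job is retained
	q.Push("cook certs") // still offline: queued behind the first

	// Sprout reconnects: every missed job arrives, in order.
	for _, job := range q.Drain() {
		fmt.Println("delivering:", job)
	}
}
```

The point is that the retry loop lives in the messaging layer, not in ad-hoc scripts around your deploy tooling.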
grlx takes a Role-Based Access Control (RBAC) approach for designating who is allowed to deploy what–and where.
Secured service worker accounts or admins might be allowed root access to all machines.
Your coworker John, who manages some `nginx` configurations, can `cook` (push out) the `nginx` recipe to all sprouts, and Celine has access to `cook` all recipes but only on select development/canary nodes.
Future work will allow these roles to come from an LDAP endpoint or other external service provider plugin.
More information about the RBAC system can be found in the documentation.
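A minimal Go sketch of the kind of check such a policy implies. The rule shape and the user names are hypothetical; see the documentation for grlx’s actual policy format.

```go
package main

import "fmt"

// rule is a hypothetical per-user policy: which recipes a user may
// cook, and which nodes they may target. "*" means "all".
type rule struct {
	recipes map[string]bool
	nodes   map[string]bool
}

// allows reports whether this user may cook the given recipe on the
// given node: both the recipe and the node must be permitted.
func (r rule) allows(recipe, node string) bool {
	return (r.recipes["*"] || r.recipes[recipe]) &&
		(r.nodes["*"] || r.nodes[node])
}

func main() {
	policy := map[string]rule{
		// John: only the nginx recipe, but on any node.
		"john": {recipes: map[string]bool{"nginx": true}, nodes: map[string]bool{"*": true}},
		// Celine: any recipe, but only on a canary node.
		"celine": {recipes: map[string]bool{"*": true}, nodes: map[string]bool{"canary-1": true}},
	}
	fmt.Println(policy["john"].allows("nginx", "web-3"))         // true
	fmt.Println(policy["john"].allows("postgres", "web-3"))      // false
	fmt.Println(policy["celine"].allows("postgres", "canary-1")) // true
}
```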
Usability Improvement Features
Below are some of the small tweaks and differences that make using grlx easier than competing tools:
Advanced Developer Tooling
Our development team is hard at work creating an LSP server and Linter for grlx. The roadmap includes support for jumping to import definitions, full syntax highlighting (even in recipe files containing templates!), and property checking. Our plugin interface also includes definitions to generate custom LSP inputs so that the tools you might need to build in-house are treated as first-class citizens in your editor.
Official GUI and TUI built-in
Many home-grown WebUIs have been built to support existing solutions that don’t come with official ones. The grlx command line utility will soon support spinning up a local WebUI, served over localhost, which can be used to access most functionality. Communications to the farmer are proxied through the CLI, so there’s no need to open up ports on the farmer to access the WebUI.
A TUI (Terminal UI), built with BubbleTea is on the roadmap.
Optional DropShell Support
While grlx already supports arbitrary command execution, sometimes you really just need a shell to do some in-depth troubleshooting. For SaltStack, I developed TunnelBunny, a process for setting up port forwarding and reverse shells to minions, which you can read about here. grlx will support a similar feature, called DropShell, which will let you drop into a shell on a single sprout from your local machine without having to open up any ports on the sprout, configure SSH, or drop firewall rules. This feature is still in development, but will become available once RBAC is fully implemented, as it might be a security concern for some organizations.
High test coverage
We strive for 100% test coverage on the parts that matter. Specifically, all ingredients are thoroughly unit tested, and the core functionality of the platform runs through a suite of integration tests inside a docker-compose environment.
No Server Access Required
The command line utility is separate from the farmer, and can be run from any machine with network access to the farmer.
All communication is encrypted using TLS and NKey encryption, so there’s no need to provide a VPN or SSH tunnel to the farmer.
Depending on company policy or developer preference, the CLI can be run directly on developer machines, or from a dedicated bastion server.
Authentication and authorization are handled on a per-user basis (to support RBAC), so there’s no need to share credentials or SSH keys, or to configure `sudo` access to the main CNC server.
The command line utility’s `cook` command will soon support a `--git` flag, which will tell the farmer to check out a particular branch, tag, or commit of a git repository and use the files in that checkout as the source of truth for the recipes.
Additionally, the currently checked-out commit, tag, and branch are made available to the recipe files as property variables, so you can use them in your templates should the need arise.
There is also a command to update the recipe’s git repository to the latest commit, tag, or branch, and a command to list all available branches, tags, and commits.
Synchronous Job Status reporting
One of the most frustrating things about using a configuration management tool is the lack of visibility into what’s happening as it’s happening. When submitting a job, you might get a job ID back for polling, or wait synchronously for the job to finish and see a summary of results, but most tools won’t let you watch the states finish as they happen. Most of the time, this is fine, but the few times you need live introspection into why a deploy is stuck, you’re out of luck.
The farmer exposes a RESTful API for all of its functionality, so you can build your own tools around it.
If you don’t like the built-in WebUI, you have the option to build your own!
If you need to build automations around your deployment from a third-party tool, you can do that too! Everything from cooking to command execution is available via the API.
More information about the API can be found in the documentation.
Lifecycle Hooks for Outgoing Webhooks
Do you need to track one particular step of a deploy? Maybe you need to know whenever a specific file is created or modified? Recipe files have a field for providing an upstream webhook URL and the JSON body you need sent. Hooks can also be configured to fire when a deploy starts, fails, or finishes. Additionally, you can configure a hook to fire when a new sprout is added to the fleet, or when a sprout’s online status changes.
Introspection and visibility have never been this easy!
Meet like-minded DevOps professionals and swap ideas about the best way to get something deployed, or embroil yourself in a flame war over Cloud vs. On-Prem; as long as you keep things respectful, it’s up to you!
His name is Clove, isn’t he great?
In summary, I hope this article has presented the litany of reasons for a new tool in the configuration and system management space. Our roadmap is clear, and we are excited to announce that version 1.0.x is available for download now. Please see ADAtomic.com for more information!
Updating or rolling back production from your parents’ house is not a recommended DevSecOps or GitOps practice, nor am I endorsing it in any way. Ideally, you’ve got canaries and auto-rollbacks, split traffic, staging environments… This statement is meant to be humorous, don’t take it literally. ↩︎