This post has been a long time in the making. Anyone who has followed me and my work for any period of time probably knows by now that I’ve been working on grlx (pronounced ‘garlic’) for some time (since mid-2021). But why? What’s wrong with SaltStack? More specifically, what problem am I trying to solve that hasn’t already been solved many times over?
A Note to the Salt Project
First of all, I want to make something clear: this post is not an attack on Salt, Thomas Hatch, or any of the people involved in the Salt Project. I’ve been a long-time user of SaltStack, going back to the Boron release in 2016. SaltStack has gotten me really far, and I owe much of my career’s success to Salt, which I have carried with me from job to job.
I also want to make a call-out here: my use case for Salt is not necessarily the target use case, and it’s really unfair to criticize a screwdriver for being a bad hammer, especially when the screwdriver is free.
At my first role as a Salt user, I was not the decision maker. I was a junior developer, working with the tools selected by peers and higher-ups, and Salt was chosen to be part of our stack. Without going into too much detail, here was our problem statement:
- We are building IoT, embedded Linux boards
- We need a way to update the software and packages remotely
- We are using field-prototypes, which don’t have fixed or public IPs and therefore need to be the Client in a Client-Server model (agentless solutions won’t work)
- We are not just updating application code; our environments are going through rapid updates (emphasis on field prototypes)
- Our distribution is a custom Yocto build, without a package manager, so we need a way to declaratively replace system-level packages transactionally
- The ability to remotely run shell commands and return the output is a plus (we might need to trigger fswebcam and snap a photo as a one-off)
- Rolling out instant updates to targeted devices (where a ’target’ could be a small sampling, the whole fleet, or a single device) is a huge boon
- Some devices will connect over 3G cellular, as not all of our field installations have WiFi available
As this platform was a product of its time, here’s some additional information to keep in mind:
- The year is 2016, and the hardware development went back to 2012 (so pre-Raspberry Pi GA)
- Google IoT Core doesn’t exist (yet, and also, RIP)
- Amazon GreenGrass is not generally available (until 2017)
- Ubuntu Core IoT was prohibitively expensive and didn’t support our custom hardware
SaltStack, in theory, checked all of our boxes, and allowed us to ship units to the field for testing, with remote, OTA updates.
- Full Linux support? Check.
- Remote updates using a client / server model to get around the IP constraints? Check.
- Ability to upgrade system files in addition to application files? Check.
- Remote command execution? Check.
- Specific device targeting? Check.
And so, our small team set out to create a configuration management system built on top of SaltStack as the center pillar. Yes, we even managed the SaltStack installation with SaltStack, by replacing the Python libraries and .egg/wheel files with Salt states. We were careful in how we did this, and actually added a cron-based backdoor ‘rescue’ service in case the snake swallowed its own tail, but it was a calculated risk. In the end, the deployment worked, but it wasn’t without headaches.
Issues and Complaints
There were several runtime and compile-time bugs related to the dependencies. Anybody who’s been around SaltStack long enough will shudder at the mention of Tornado or PyCrypto. We were especially subject to the issues regarding pycrypto compilation, as our Yocto project build ran on ARMv5 (oh whoops, did I not mention that requirement earlier? That sucks, but we’re all engineers here and you just have to learn to roll with it…) and there weren’t precompiled wheels for our architecture–we had to build them ourselves. Memory was always running low, as we had less than 1GB of physical memory, and most of that needed to go to the application for some on-device image processing.
There was a memory leak, somewhere, in either salt-minion itself or in a dependency. It was difficult to isolate, and I’m not sure my team ever actually found it. It was likely patched in a later version of salt-minion (or, again, in a bumped dependency). Our solution was to limit the unit’s memory through systemd and tell it to crash and restart if it went over. This was not ideal. Systemd had no visibility into how we were using salt-minion: it could not tell whether it was killing the process due to a real memory leak or due to a spike in usage associated with an active software update. It’s reasonable to allocate more memory to salt-minion and away from our application during a software update, but not at the expense of allowing run-away memory usage during regular runtime.
Allowing systemd to indiscriminately OOM salt-minion becomes even more dangerous if we are currently applying a state to update the system or Python libs themselves: a crash mid-update for core and critical Python libraries might prevent salt-minion itself from ever coming back up. In the end, we had to isolate the states for the system by themselves and run those states very selectively, not in the top/high state, which somewhat defeats the purpose of the ‘statefulness’ Salt brings you. That’s not Salt’s fault, though, as it certainly wasn’t designed for that. Hammer, screwdriver, all that. Really the problem here was the memory leak itself, as that’s the cascading issue.
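To make the visibility problem concrete, here is a minimal Go sketch of the decision systemd could not make for us: choosing a memory budget based on whether an update is in flight. It is not what we ran in 2016, and it is not how Salt or grlx behaves. The names (`decide`, `Action`) and both byte limits are hypothetical, picked only for illustration.

```go
package main

import "fmt"

// Both ceilings are made-up numbers for illustration only.
const (
	idleLimitBytes   = 64 << 20  // assumed budget during regular runtime (64 MiB)
	updateLimitBytes = 256 << 20 // assumed budget while an update is applied (256 MiB)
)

// Action is what a supervisor should do with the agent process.
type Action int

const (
	Keep    Action = iota // usage is within the current budget
	Restart               // over the idle budget with nothing in flight: safe to restart
	Defer                 // over budget mid-update: flag it, restart once the update ends
)

func (a Action) String() string {
	return [...]string{"keep", "restart", "defer"}[a]
}

// decide picks an action from the agent's resident memory and whether
// an update is currently being applied.
func decide(rssBytes uint64, updating bool) Action {
	switch {
	case updating && rssBytes > updateLimitBytes:
		return Defer
	case updating:
		return Keep
	case rssBytes > idleLimitBytes:
		return Restart
	default:
		return Keep
	}
}

func main() {
	fmt.Println(decide(100<<20, false)) // idle and over the idle budget
	fmt.Println(decide(100<<20, true))  // same usage, but an update is running
	fmt.Println(decide(300<<20, true))  // over even the update budget
}
```

The point of the third state is that killing the agent mid-update is the one outcome to avoid, so the sketch defers the restart rather than forcing it. Whether deferring is acceptable depends on how much headroom the rest of the system has.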
Memory leaks and crashing aside, we had a lot of trouble keeping the ZeroMQ connection active. Now, we did have many systems connected over 3G cellular, but the problems weren’t isolated to those units. Our WiFi-connected units also had issues maintaining connectivity. The “Minion did not return. [Not connected]” message was a commonplace error. More disturbing, early versions of Salt didn’t allow you to easily configure the retry logic. Sometimes, the connection would break, and the client wouldn’t ever try to reconnect, so it would stay disconnected until the memory leak triggered an OOM, which would in turn restart salt-minion. (This has since been addressed.)
Whoever needed to send the updates out to the units in the field needed to be root (or have sudo access to run salt) on the Salt master, which means someone needed shell access to the CNC server. From a security perspective, this is an ick.
I loved SaltStack
Yes, I’ve had my fair share of problems with SaltStack, but my teams and I made things work. When working with software of this scale, some tips and tricks tend to reveal themselves over time (like using get_template to pre-render Salt states before applying). Sometimes, you just need the tribal knowledge gained through experience. Several issues were fixed by the Salt Project team, others have gone away simply through a revision to requirements on my end, and still others I’ve learned to live with or work around.
I’ve brought SaltStack with me to 4 other career roles since my introduction to it, like some kind of DevOps Evangelist (I’ll likely write about these projects at a later date). I didn’t have enough good things to say about Salt. I became very involved in the community Slack channels. I answered a question or two on Stack Overflow regarding SaltStack. I’ve had a minor PR merged into salt-bootstrap. I’ve been featured on the Salt Community Open Hour podcast. The project called me out in an advisory bulletin to thank me for my help on CVE-2020-11651 and CVE-2020-11652.
In October, 2020, VMware acquired SaltStack, and the branding was changed to Salt Project. Large portions of the team were made redundant (read: laid off) and the focus of the project seems to have shifted to cloud-native and VMware-related projects. The name change alone causes a bit of a headache. The Salt Project is already a(n entirely unrelated) thing. Changing from “SaltStack” to “Salt” makes finding security issues difficult when SALT exists. The overall “Googleability” of the project has gone downhill, which really isn’t great when some of the page redirects are broken for all pre-hydrogen releases.
The number of unresolved issues in the project has skyrocketed (even despite the stale bot doing numbers on these issues). It seems the reduced team size and growing list of other priorities have prevented development from keeping up with security and bugfixes in some dependencies, and the ‘solution’ is to freeze the updates and vendor the dependencies. That’s right, until recently, SaltStack has been vendoring a copy of the Tornado library, despite publicly-known CVEs.
Salt has also recently changed their packaging format to Onedir, effectively vendoring all dependencies inside of a single folder (including python itself). While this sounds great on the surface, keep in mind it will prevent non-breaking, API-compliant hotfix changes from making it down to the user through their system package managers, so users are now completely at the mercy of the Salt Project’s timelines and priorities, and their responsiveness record there is…not perfect in my opinion.
Where To Go From Here
Perhaps now you see why I’ve started grlx. I need to fill a SaltStack-sized hole in my heart: build a tool that solves all the same problems Salt did for me, but also addresses my own issues with the tool. grlx will never have 1:1 feature parity with SaltStack, especially as the Salt Project focus shifts to VMware first-party products and integrations, but for people like me, that’s a pro, not a con. With grlx, our goal is to support all the things that make SaltStack great (the extensibility, human-readable configuration as code, lightning fast remote execution) and do it with a small footprint, secure by default.
It’s my feeling that SaltStack probably isn’t to blame for its memory usage either: it is written in Python, and that’s just the price paid for an interpreted language. For this reason, among others, grlx is written in Go, a compiled language. Similarly, the connectivity issues are likely not part of the Salt codebase, and are probably part of the ZeroMQ dependency. With grlx, we’re using NATS.io instead, and we’ve even submitted code upstream in an attempt to be good Open Source stewards.
- We have a Twitter/X account here.
- We have a community discord!
- You can also star our project on GitHub.
I’ll also be releasing several posts in the coming weeks going into more detail about the features and roadmap for grlx, so hit that RSS button at the top of the page to make sure you don’t miss anything!