# general
l
Hi all, I've been wondering about some basic organizational workflows around dependency and deployment, and am struggling to find detailed resources and stories. If you can point me to articles/slack threads/blog posts I will go there to read. I'll describe below the kind of issue I want to learn more about. The monorepo approach increases the size of the solution space, imo, but each point in that larger solution space still has tradeoffs, and I am wondering what folks' first-hand experience has been.

I lived in a monorepo before, but it was small enough, and the org was small enough, that we got by (pantsless) without doing anything efficient: every CI run would test and lint everything, and every merge to main would build and deploy everything. So if service S1 and service S2 depend on a shared library L, any change to L means you deploy S1 and S2 (in fact, any isolated change to S1 would cause S2 to get rebuilt, republished, and redeployed too - eek). But the org was small enough that everybody had a kind of feeling of shared ownership, and/or you'd defer to some central authority.

I'm also familiar with the poly/multi-repo workflow at a large place. You have a kind of aspirational eventual consistency. The team supporting service S1 needs something in library L, improves it, bumps the published version, consumes it, and redeploys S1. The team supporting service S2 may never pick up the L update, despite gentle suggestions or automatic version-bumping PRs created by some dependabot - at least not until something changes in the ecosystem and they have to jump to a newer L, at which point it is probably harder.

What does it actually look like on the ground when you update L in a larger monorepo shop using modern tooling like pants?
• Do you seek approval from the teams behind all affected services before updating L, knowing all dependents will get redeployed? (How do you find out who they all are? `pants dependents`? Rough sketch after this list.)
• Do teams decouple integration into main from the trigger for deployment of their services? So that S1 can get deployed with the new L, but S2's team is not forced into an immediate deployment, and is instead blocked on their next feature until they absorb the new L? (A kind of optimistic-merging situation.)
• Do the teams behind S1 and S2 deploy off of non-main branches of the monorepo but periodically merge to and pull from main? (Back to eventual consistency, no authoritative source of truth.)
• ...
Thank you for any references you can provide around these kinds of questions.
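For concreteness, on that first bullet, this is roughly what I imagine the "who is affected" question looking like - the paths are invented and I may be holding the flags wrong:
```
# Everything (direct and transitive) that depends on the shared library L.
# The target path is hypothetical; adjust to your repo layout.
pants dependents --transitive src/python/libs/l::

# Or, driven by the current diff: everything affected by what changed
# relative to main.
pants --changed-since=origin/main --changed-dependents=transitive list
```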
👀 1
h
I think we do a mix of what you described in the first bullet. For something really disruptive, we work on scoping out and communicating the changes and then partner with other teams to develop the new features. For simpler things, we'll just take it on ourselves. Ideally, swapping out shared libraries shouldn't change a public API, so you can still rely on broader unit and integration tests to validate without that team's help.

We usually rely on internal knowledge during our planning efforts to know what will be affected by our proposed changes. There are clear misses in this from time to time, but being communicative with teams helps. I'd love to see us use dependency trees to find that, but that's pretty information-dense in my opinion. It's difficult to go from "what are the dependencies" to "who are the affected teams", especially when 1) transitive trees can be quite large, 2) the current API lets you only scan the dependencies of one thing at a time (I believe), and 3) you'd have to use something custom in BUILD files to say who owns code to keep a separate lookup out of the loop.
We have the benefit of being a small company where 1) the set of software teams is generally still small i.e. <75 and 2) we all generally work to the same planning cycles of identifying work and communicating to other teams what we intend to do. At a very large organization, that strategy doesn’t necessarily break down, but figuring out who is/will be affected by our intended changes certainly gets harder.
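If we ever did automate that lookup, I imagine it would look something like this - completely untested, and it assumes a root-level CODEOWNERS file with made-up lines like "src/python/services/s1/ @org/team-s1" plus an invented library path:
```
# Map the transitive dependents of a shared library to owning teams.
# Hypothetical paths; the CODEOWNERS matching here is a naive prefix match,
# not the full gitignore-style semantics GitHub actually uses.
pants dependents --transitive src/python/libs/l:: |
  sed 's/:.*//' |      # keep only the path portion of each target address
  sort -u |
  while read -r path; do
    awk -v p="$path" '
      /^[^#]/ {
        pat = $1; sub(/\/$/, "", pat)   # drop a trailing slash from the pattern
        if (index(p, pat) == 1)         # naive: dependent path starts with the pattern
          for (i = 2; i <= NF; i++) print $i
      }' CODEOWNERS
  done | sort -u
```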
g
Our Pants repo is our smallest monorepo, but we have a company-wide backend/infra monorepo using Bazel. Our primary tool for communicating and scoping changes is codeowners. Every service (or at least all critical ones) has one or more users as codeowners. This applies to both infra (TF) and applications (k8s). Heavy use of CI/CD/review bots to make changes visible where necessary - TF state delta, DB table delta, etc. If a change in common code has no effect on downstream users, the maintainers or a random selection of reviewers approves, and the change is rolled out.

All application state is live-at-head through ArgoCD, more or less. Some things roll out with canaries, but non-critical things we can just revert. Terraform is applied manually. Deploying from branches is somewhat common for Terraform changes (when setting up a new resource you probably want to test before merging, for example), but deploying code from branches has special workflows using feature flags or other routing.

With some work on consistent builds we don't do a lot of unnecessary builds or deploys - Bazel has tooling similar to `pants --changed-since` etc., which we use to prune heavily.
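On the Pants side that pruning looks roughly like this - hedged, since exact flag names depend on your Pants version, and the goals are just examples:
```
# Only lint/test what the current diff touches, plus everything that
# transitively depends on it.
pants --changed-since=origin/main --changed-dependents=transitive lint test

# Same idea for building deployable artifacts.
pants --changed-since=origin/main --changed-dependents=transitive package
```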
l
@high-yak-85899 thanks for the details, and it appears that "who are the affected teams" could be a feature, if it can deal with the issues you mention. (Also, is 75 software teams a small number?!) If I can repeat what you wrote, just to check that I understand it: you do bullet 1 - develop the feature, looping in affected teams as deemed necessary, and get to the point where everything is ready to merge into main. At that point, when you merge into main, will that merge end up deploying all services that depend on the code being changed? Does it ever happen that a team is surprised that their service is getting redeployed?
@gorgeous-winter-99296 Thank you for the details too - it shows a different strategy. How do you set up codeowners to interact with dependencies? Normally it will only trigger a review if code in the owned subtree is changed. Like, Team 1 has to review if the code for service S1 changes. But if the library L it depends on changes, that won't change anything in S1's code, yet S1's behavior might change. Say L defines some patterns of User Agents to blacklist and S1 consumes it. Will Team 1 end up automatically requested to review, or is it understood that it is ok for their service to get redeployed because of an upstream change to L? (Related to Nathaneal's point about it being difficult to figure out who the affected teams are.)

About:
> non-critical things we can just revert
In this case you mean to revert the deploy, but not the merge, right? Like if service S2 is misbehaving after it got redeployed due to L, just roll back the deployment to the previous, say, container image.
e
One thing to keep in mind if you have a large number of interconnected teams and/or services is canarying. Now, you describe some sort of auto-deploy system, which seems antithetical to canarying, but maybe you only auto-deploy to some sub-prod env that gets promoted only if OK - which is at least a form of canarying. The idea here is that no matter what you do, you will mess up, probably more often than you'd guess. A good canarying system hopefully makes the impact of those mess-ups negligible most of the time.
g
> Like Team 1 has to review if the code for service S1 changes. But if the library L it depends on changes, that won't change anything in S1's code, yet S1's behavior might change. Say L defines some patterns of User Agents to blacklist and S1 consumes it.

Most likely it just gets merged without S1's maintainers being notified. For larger-scale changes that wouldn't otherwise notify many teams, it'll be posted as a quick RFC/FYI before being merged. There's a shared understanding that you do not merge things with a large blast radius in the evenings, etc. If a team notices that there's a lot of "innocuous" changes from L that break them, they can add themselves as code-owners and ensure that they get pinged (rough sketch below). I'm a code-owner on a lot of things as an "FYI" and will just remove myself from the reviewers list if a change looks benign.

> In this case you mean to revert the deploy, but not the merge, right? Like if service S2 is misbehaving after it got redeployed due to L, just roll back the deployment to the previous, say, container image.

Either-or. If the code itself is bad and multiple services are affected, we revert the merge and it goes through CI/CD to roll out. Repo admins and some users can bypass CI checks for reverts if needed. We can also revert on the infra level if it's critical, and then re-enable sync once we've stabilized on the code side. That's always second choice after a code revert - if we forget to re-enable sync, you have issues later instead. But our policy is no fixing forward, so revert, then fix properly before merging again, is the default.

And as John says - canaries are super useful, and all critical services run them for us. I think that's using Flagger, but I'm not sure. As I understand it, it essentially deploys a canary next to our prod environment, which does self-validation and health checks. Once it's deemed OK we promote to live and spin down the old deployment. I work on things that aren't a great fit for that workflow (K8s controllers) and are much older than any such fancy tools, so I just have a beta environment. 😛
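The code-owner opt-in is just an entry in the root CODEOWNERS - team handles and paths here are made up, and note that on GitHub the last matching entry wins, so the library's own maintainers stay on the same line:
```
# Hypothetical: S1's team opts in to pings for the shared library,
# alongside the library's own maintainers (last matching entry wins on GitHub).
cat >> CODEOWNERS <<'EOF'
src/python/libs/l/ @org/team-lib @org/team-s1
EOF
```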
l
Interesting, so for critical systems the live deploy will go through a possibly manual check (on the canary).
g
No manual checks, no! It's all automated. But we have feature flagging, so you can do "dev deploys" as well and get E2E testing that way.
🙂 1
FWIW we're a fully K8s shop, so our workflows are very tailored to that.
There's some pubsub and serverless, but it's almost exclusively just shimming before handing off to either K8s services or managed services (like BQ, etc).
l
So you deploy to canary and do self-validation / health checks, but the canary does serve production traffic, correct?
g
So the way it goes is something like this:
• Developer creates a PR and goes through review + initial CI, tests, ...
• Merge to main
• ArgoCD detects the new commit, generates new manifests for the cluster, and tags them as "canary"
• Flagger observes the canary and waits for readiness and health probes on the deployment to go green
• At that point, it can either do a direct promote, or it can start redirecting traffic for "incremental" promotion (as I understand it - as noted, I don't work with this kind of service)
• If the canary crashes at any point, promotion is blocked. There are some timers, backoff rules, and retries to avoid thundering-herd-style issues here.
• Once promoted, or once all traffic is reaching the canary, the old deployment is removed
I believe this is the mechanic we use when we do staggered promotion: https://docs.flagger.app/tutorials/istio-progressive-delivery
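If you want to poke at one of these while it's promoting, the Canary custom resources Flagger drives are visible with plain kubectl - the namespace and name here are made up:
```
# Watch a Flagger canary move through its analysis/promotion phases.
kubectl -n prod get canary s1 -w

# Events, checks, and the current traffic weight show up in describe output.
kubectl -n prod describe canary s1
```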
I guess in one view, our "dev deploys" are also canaries, but they can never be promoted to live. They only serve tagged traffic.
l
Thanks for all the details, and I did not know about this tool. I should get to know it 🙂
> Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. It reduces the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring metrics and running conformance tests. (from docs.flagger.app)
g
https://argoproj.github.io/rollouts/ is a similar tool, which might make more sense if you also want to use Argo for CD. We adopted Flagger well before Argo, so we're stuck with it, but Argo Rollouts is more "native" when you're already using Argo.
👍 1
l
I'm going to wait and see what others have to say. It feels like some or all of what Nate and you have said could be worth opening up in some more public place: managing operational and social risk when making changes in a monorepo setting. I'm def getting a more concrete picture of the possibilities, and reassurance from the details.