Why and What
As with any other company, we at willhaben want to enable software teams to produce quality results fast and with low cost. This is in the interest of both the company and IT staff — who wants to produce low-quality software slowly?
This goal is a never-ending process of improvements, and we’re on the journey, not at the end. But we’re doing pretty well. And we rely on not just one tool, but a big box of tools. One of those tools is STOSA — Single Team Oriented Service Architecture.
STOSA is a pattern that is mostly about project management and team structure, but it does also have consequences for software technical details. It is related to the Inverse Conway Maneuver — restructure your teams, and your architecture will follow. We applied this successfully (and really should write a blog article about our experience!).
STOSA is not particularly complicated; the referenced article can be read in 10 minutes. And it is not particularly surprising; many of its recommendations are “mainstream” ideas. However, taken together, these ideas can boost productivity.
What this means specifically for willhaben is:
- grouping the majority of our IT staff into tribes — cross-functional teams who develop, maintain, deploy, and run specific features
- building software systems as distributed services whose code-bases are each small enough and specific enough to belong to a single tribe
Note that willhaben is a company that produces software for its own use, driving its online services. We can release software whenever we wish (and do so frequently). This is quite different from companies who provide software for others. And (occasional) errors can be tolerated; we don’t like user-visible problems in our services but producing software fast with occasional mistakes is still better than producing perfect software slowly and expensively. Nobody’s life depends on our services. Together these aspects mean that a distributed system works well for us, without excessive costs.
This approach is similar to what is called the “Spotify Model”, and Spotify themselves point out that they use this pattern to “optimise for innovation”. This matters to them because they are in a very competitive market: their competitors keep expanding the services offered to customers, so they must too. Rapid development (innovation) is therefore valued over system reliability or cost-effectiveness. As willhaben happens to be in a similar situation (at a smaller scale), this approach works well for us too. For companies with limited budgets, high reliability requirements, a slower-moving business environment, or several of these factors together, it might not apply so well. A company building software for nuclear reactors will have other values and may choose a different approach.
Cross-functional means each tribe contains business-domain experts, mobile-app devs, web devs, back-end devs, SREs (sysadmins), and QA. For many change-requests the entire task can be done by a single tribe, from requirements to rollout, without needing any meetings with other tribes or teams. This gets rid of a lot of time-consuming tasks — meeting agendas, scheduling problems, priority discussions, waiting for someone else to do their part — or worse, waiting for approval for some step. Of course discussions are needed, but they are primarily internal to the tribe — and the tribe is a small group of people who know each other well and have the same priorities. Answers to a question can often be obtained without leaving the tribe’s workroom — and often without needing to get out of a chair.
Where possible, tribes are responsible for user-facing functionality from UI to database, ie vertical slices of behaviour rather than horizontal layers. This reduces the number of interactions needed between teams; any time teams have an upstream/downstream relationship, additional coupling occurs. This is effectively assigning “use cases” to tribes and is sometimes called “value-stream-aligned teams” or “customer-journey-oriented teams”. In some ways, each tribe resembles a separate startup company with its own customer market and customer-facing product.
There are of course projects that cross tribe boundaries, and require more formal coordination, including specifications and timelines. However making these the exception rather than the normal case helps productivity significantly.
As Rebecca Parsons pointed out (paraphrased by me), “horizontally partitioned systems protect against technical change, vertically partitioned systems protect against business change”.
In addition to tribes, there is a set of “supporting teams” who take care of things that are not related to any specific feature — eg providing the infrastructure for our production and development environments, providing test-automation frameworks and advice, supporting agile development practices, or dealing with cross-tribe issues such as security and overall architecture. However the number of people in these roles is kept as small as possible — it’s the tribes who do the work that really earns money!
Domain Driven Design (DDD)
Much of our work is influenced by the well-known and widely-used ideas of DDD. In particular, each tribe owns one or more domains¹ — coherent sets of functionality. This is important for achieving a good hit-rate for the goal “change requests don’t cross tribe boundaries”. We’ve identified 14 primary domains, and currently have 4 tribes each of which is responsible for several domains. Having more than one domain per tribe allows responsibility to potentially be moved between tribes if the workload becomes unbalanced for any tribe. Or to create an additional tribe if workload justifies it, without needing to split a domain.
Ownership and Autonomy
Making it possible for a tribe to minimise external coordination requires software to be “owned” by a single tribe. Each deployable unit of software (“component”) ideally has an independent code-base (a separate Git repo in our case) and its own deployment pipeline, so that the owning tribe can change code and release the changes into production without coordination. We call this combination of code-ownership and the authority to change and deploy code tribe autonomy.
Autonomy isn’t an absolute law; the end goal is to increase company profits via high software productivity, and ownership/autonomy are just tools to achieve that. However, ownership is a core concept of STOSA and brings multiple benefits:
- better maintainability — developers who feel they “own” code take more care with it
- better stability — a team who have to deal with production issues will build better diagnostics and test-suites
- better bug-resolution — when an issue arises, it is clearer who is the expert and who will feel responsible
To balance ownership and openness to other tribes, all source-code is visible to all developers and pull-requests from outside the owning group are encouraged. Making this process easy requires support from the owning team (good docs, etc) — something we call “internal open source”. Doing this right is a continual process, however, and we’re only just starting. Maybe we can write an article on our experiences in the near future.
For front-end software, we haven’t yet found full solutions for tribe autonomy with regards to software releases. We do structure app and web codebases in a modular way, with modules being aligned with domains i.e. tribe responsibilities. However releases still involve “the whole cross-tribe codebase” and therefore (sometimes) require cross-tribe coordination.
Obviously, mobile apps are released as monoliths by their very nature. We are currently looking into “server-driven UI” concepts, in which server-side code (which we can deploy at any time) serves up templates, validation rules, and other similar data which is then processed by some kind of interpreter built into the client application, allowing new functionality to be provided to users without updating the client. There are of course limits to this (a web browser is the ultimate form of this, and server-driven UI shouldn’t lead to a bad reimplementation of one) but something in this direction may be useful to support the concepts of STOSA.
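To make the server-driven UI idea concrete, here is a minimal sketch of a client-side interpreter applying a server-supplied form specification. The field names, rule types, and spec shape are all invented for illustration; they are not willhaben’s actual schema.

```python
# Hypothetical server response: a form spec with validation rules that the
# owning tribe can change and redeploy without a client-app release.
SERVER_FORM_SPEC = {
    "fields": [
        {"name": "price", "type": "number", "rules": [{"min": 1}]},
        {"name": "title", "type": "text", "rules": [{"max_length": 80}]},
    ]
}

def validate(form_spec: dict, submission: dict) -> list[str]:
    """Client-side interpreter: apply the server-supplied rules to user input."""
    errors = []
    for field in form_spec["fields"]:
        value = submission.get(field["name"])
        for rule in field["rules"]:
            if "min" in rule and value is not None and value < rule["min"]:
                errors.append(f"{field['name']} must be >= {rule['min']}")
            if "max_length" in rule and value is not None and len(value) > rule["max_length"]:
                errors.append(f"{field['name']} is too long")
    return errors
```

The point is that the rules live on the server: tightening the minimum price requires only a server deployment, not an app-store release.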
There are various proposals for web micro-front-ends (built from multiple independently-deployed processes while still looking like one site to the user). For our primary web front-end we are currently using the “fragment integration style” which preserves ownership but deploys as a single artifact. While perhaps not 100% “STOSA compliant” we do manage multiple releases per day of this application.
STOSA’s technical impact is (at least for us) most applicable to “back end components”, ie deployable artifacts that offer APIs rather than user interfaces, and which own some set of data. The remainder of this chapter talks specifically about these.
For back-end components, the above requirement for independent code-base and independent deployment leads naturally to a micro-service architecture. However at willhaben we focus more on the service than the micro. As long as a component is small enough to belong to a single tribe, that’s sufficient. And in practice, the “back end code” for each domain is usually implemented as a single component. Truly fine-grained components do have some advantages, but also have a lot of issues associated with them, and so we are avoiding these (at least for now). This cautious approach to microservices also appears to be the mainstream consensus at the current time.
Testing and API Compatibility
Tests for each component need to be as independent of other components as possible; requiring execution of a full system integration test suite before release of any component would be a major hurdle to rapid releases and cost-effective software. As noted above, nobody’s life depends on us, and so testing should be good but doesn’t need to be perfect.
However as components interact, API compatibility needs to be managed carefully; each component must provide backwards-compatible interfaces until all interacting components have been updated. And as we strongly discourage tribes forcing deadlines for work onto other tribes, such backwards-compatibility needs to be long-term.
This API compatibility management (given no full system integration test suite run before each release) is a tricky issue. And honestly, one with which we are still struggling. We use contract-tests with PACT, but not as extensively as we would like, and intend to investigate other options in the future. For now, relying on developers to “be careful” with API changes is our primary mechanism.
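To illustrate the consumer-driven-contract idea (this is a simplified sketch of the concept, not the real Pact API), the consumer records the response shape it relies on, and the provider’s test suite checks that its current output still satisfies that contract. All names and the contract format here are invented:

```python
# The consumer declares only the fields (and types) it actually depends on.
CONSUMER_CONTRACT = {
    "required_fields": {"id": int, "title": str, "price": float},
}

def provider_response() -> dict:
    # Stand-in for the provider's real handler output. Note the extra
    # "seller" field: additions don't break the contract (tolerant reader).
    return {"id": 42, "title": "Used bike", "price": 150.0, "seller": "anna"}

def verify_contract(contract: dict, response: dict) -> list[str]:
    """Return a list of violations; an empty list means the change is
    backwards-compatible for this consumer."""
    violations = []
    for field, expected_type in contract["required_fields"].items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations
```

Running such checks in the provider’s pipeline means a breaking API change fails the provider’s build, before any consumer is affected — which is exactly what a full integration test run would otherwise have to catch.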
Testing a distributed system (as compared to a monolith) is significantly more complicated (as we experienced). This alone justifies the growing consensus that there is nothing wrong with a monolithic IT system if it fits your requirements. We’ve outgrown that phase; our rate of software development, frequency of release, and plain number of people working on our code-base, is just too high to make a monolith practical. However the complexity (and therefore cost) of a distributed system (with testing being an important issue) is significant!
In short: in a microservice architecture with multiple releases per day (of various components), it is an illusion that it is possible to create a realistic replica of the production environment. We are therefore currently moving from our traditional testing approach to something more distributed-system-compatible.
STOSA recommends that each back-end component has its own database, and we follow this strictly, for multiple very good reasons:
- independent releases and schema changes
- data consistency
- security
- performance isolation
When one component reaches into another component’s database to read data, then this impacts the ability of each to make changes to the schema of that database. Meetings are required to agree on changes, and release-schedules potentially need to be synchronized. This brings back exactly the issues that STOSA tries to resolve.
Having one component write into another’s database is worse; enforcing data consistency rules then becomes very difficult.
Partitioning data across multiple databases increases security; a vulnerability in one component only exposes the data that that particular component has available — which is hopefully a small subset of the full dataset.
And separating databases means that performance issues in one component (eg missing indices on a relational table) cannot affect the performance of a component owned by a different tribe.
One problem that separation of databases brings is the need to access data owned by a different component/domain. There are two general solutions to this:
- use synchronous calls between domains to obtain data as needed
- replicate data asynchronously
Both solutions are widely used — though the first is probably more common than the second. And each approach has its advantages and disadvantages.
Systems based on synchronous calls between “microservices”:
- are straightforward to build (developers are familiar with such calls)
- don’t need to deal with eventual consistency issues (synchronous data read requests always return recent data)
- have efficient data storage (each data item is stored only once)
However such systems also:
- can have complex nets of synchronous calls which make it hard to understand dependencies
- need to put lots of effort into interface compatibility
- have complicated failure-modes
- are IO-intensive
- can have performance issues (“chatty APIs”)
- have more potential security issues (each API can be abused, simple firewalling cannot be used between services)
- are difficult to scale appropriately
- can have startup ordering problems
- can lead to development being “blocked” by missing APIs (one tribe’s work requires a new API from another tribe)
- really need distributed-tracing support
- are difficult to test (testing requires interactions with other systems)
The “replication” approach has additional up-front complexity but resolves many of the above problems. Given the (relatively high) volume of traffic that willhaben carries, our relatively small development team, and our intention to be in business long-term, we are currently applying the “replication” approach — called “Event-carried State Transfer” (also known internally here as distributed read models).
The most important concept related to data storage is: each piece of data has exactly ONE owner, and only the owner may change that data. That owner is the “source of truth” for that data; other systems may potentially have copies of it, but must never modify that copy.
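A minimal sketch of this read-model idea follows. The event names, fields, and in-memory storage are invented for illustration (a real implementation would consume from a message broker and persist the replica); the point is that the consumer only ever applies events from the owner and never modifies its copy independently.

```python
from dataclasses import dataclass, field

@dataclass
class UserReadModel:
    """A consuming component's local, read-only replica of user data
    owned by another domain."""
    users: dict = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        # Events come from the owning domain, the single source of truth.
        if event["type"] == "UserUpdated":
            # Replace the local copy with the owner's latest state.
            self.users[event["user_id"]] = event["state"]
        elif event["type"] == "UserDeleted":
            self.users.pop(event["user_id"], None)

replica = UserReadModel()
replica.apply({"type": "UserUpdated", "user_id": 7, "state": {"name": "Eva"}})
# Reads are now local: no synchronous call to the owning service is needed.
```

The replica is eventually consistent — it lags the owner by the event-delivery delay — but reads are fast, local, and keep working even when the owning service is down.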
We do of course have some legacy systems that don’t follow the STOSA principles, ie are too big to belong to a tribe and/or cross DDD domain boundaries. However we’re working on that, and expect that within the next few years all our software components will have just one owner, ie will be completely compatible with STOSA.
If you are looking at applying this yourself, there are a few prerequisites a company needs first:
- you need to be providing an online service; shipping “packaged software” isn’t a good match for this pattern
- good automated deployment pipelines
- a good monitoring/alerting system
- good API compatibility testing (ideally)
- an IT team of at least 30 people (below that, the overhead is not worth it)
The Negatives of STOSA
We’ve of course noticed a few problems as a result of adopting this pattern:
- ownership can lead to a “your problem, not mine” attitude
- ownership can lead to a “keep your hands off my code” attitude
- uneven workloads for tribes can occur
- inter-tribe communication can suffer
- knowledge-sharing with staff of the same “skillset” can suffer
- more frequent integration issues (as noted earlier)
We use the following strategies to deal with some of the above:
- “chapters” — regular meetings, presentations, and support-channels for specific skillsets (eg iOS devs, or QA)
- emphasise T-shaped skillsets (staff have a primary role, but also actively use/learn an additional role)
- emphasise consistent toolsets and frameworks (tech radar etc)
- tools for verifying api compatibility
- good devops practices (good monitoring, good support for rollbacks, etc)
- regular cross-tribe presentations
- whole-company community-building events
And STOSA requires building software as a “distributed system” which brings significant technical complexity. Below a certain scale, a monolithic application can well be a better solution — in which case STOSA does not apply.
Given that the STOSA overview is only a 10-minute read, it has quite an impact on how we develop software. Well, actually we discovered the STOSA definition long after we had made similar decisions, but the article brings the ideas together nicely.
As described above, it influences:
- Tribe and team structure
- Personal communication patterns
- Version control decisions
- CI/CD pipelines
- Responsibility for production issues (owning tribe)
- Modularity of mobile-app and web-ui code
- Size of “domains” in our DDD analysis of the company featureset
- Integration-testing approaches
- API-compatibility verification
- Component interaction patterns
- Data storage and replication
Getting all of this organised has been a lot of work, and it’s not done yet. However, given the size of our IT department, we’re producing some excellent DORA (aka Accelerate) metrics (deployment frequency, lead time for changes, change failure rate, and mean time to recovery) and this STOSA approach can take at least some of the credit.
And importantly, we’re generally happy with it. Developer satisfaction is high and we have multiple releases into production every day which everyone (business and tech) enjoys. There are of course always differing opinions, and it’s hard to know whether other approaches would be better or worse. It will be interesting to see 10 years from now how this looks, but for now it’s looking like a pretty good way to deliver IT value.
This article was written by myself while an employee of willhaben, and originally published on Medium (with link from the company website) in February 2023. Minor updates were made in June 2023.
¹ Possibly the word “subdomain” could be used here - but domain is just easier to write.