This previous month for Gnuxie & Draupnir, February 2024.
Table of Contents
Introduction
Drive to public testing
Since the beginning of the month I have been driving development towards getting the rework of Draupnir that builds on top of MPS merged into the main branch. Or, as a compromise, into a state where the branch could be tested publicly. As of writing, the latter looks a lot more likely. The reason why I am doing this is basically to force a shortcut for myself and stop MPS becoming a forever project. Not that there's a genuine risk of that, it's just that I think it's time to see how our work pays off.
A release would require focusing on getting the test harness that we inherited from Mjolnir working in the context of MPS and then deciding where to draw the line in terms of features or fixes to be included or followed up on. I don't think we're ready for that yet, so working towards a rolling test image for the branch makes more sense.
Reading this post
I've just been toiling along, and this is a much bigger update. A lot of things have happened, and the last month, and even this month, have felt somewhat emotional. If you can't be assed to read something, you should go to the contents to find something else, or skim through for what you find interesting. There's a bit of chronology but it'll be fine. And yes, I know the post is late 😛.
I pretty much failed on the whole "edit as I go along" thing, and this time I've really paid the toll because there is so much stuff to talk about. But now that editing is over, I accept how things have worked out.
Draupnir for all shoutout
Cat worked really hard this month to get the appservice mode of deployment into matrix-docker-ansible-deploy. Now, the appservice is still alpha quality and the system admin experience is quite awful at the moment. A lot of that is waiting on the Draupnir MPS work to conclude, since the appservice internals have seen a lot of improvements there. Big shoutout to Cat for this. I wasn't very receptive to his efforts because I worry about people having a bad time getting this to run, but I do now think it probably is the right thing to do. He went through the pain of reworking the documentation that had been in the repository from the "Mjolnir for all" days, which was rushed and really for Halfy's eyes only. So a big thank you to Cat for working it all out.
Developments
MPS: Room State Tracking
Throughout the previous update, we mentioned room state caching and tracking only in passing. But I want to talk about this more completely as it came up again. Both Draupnir and Mjolnir depend on the matrix-bot-sdk as their client library, which provides very little control over the `/sync` loop. This means that we cannot use the same `/sync` loop that we use to receive timeline events to receive state deltas from joined rooms. Even if we could, the state portion of `/sync` is primarily used to inform clients about state that changed within timeline gaps1, and isn't supposed to provide a complete set of deltas. More worryingly, the appservice API provides us no way to receive state deltas from the homeserver, and I don't want to have to manage syncing clients within the appservice code either. This has left Draupnir and Mjolnir in a situation where, in order to keep an accurate model of the room state, they invalidate their local copy and request an entirely new one every time they see a state event in the room timeline2. The reason being that it is not possible or safe to assume that a state event found in the timeline is representative of current state for that type+statekey pairing3.

As a result, both Mjolnir and Draupnir only cache room state for policy rooms, and for every other operation, such as synchronising policies with room members, the state (though usually just a call to `/joined_members`) is fetched anew. This works, but it is problematic because most protections also need this same information in one form or another. It is also the primary reason why the bots feel so slow.
As you already know, in MPS we bit the bullet and started tracking state in all protected rooms. But this presents a concerning problem: if we now fetch the entire room state every time we see a state event, then isn't a join wave attack going to really hammer us? Well, yes. So I wanted to find a solution to that before releasing. It turns out that there actually are situations where you can take a state event from the timeline and treat it the same way as a state delta. If you have an existing copy of the room state, and you notice a new combination for a type+statekey pair, then from your server's perspective it is impossible for that event to be stale state4.
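To make the shortcut concrete, here's a minimal sketch of the check (illustrative only; the event and state-map shapes are simplified stand-ins, not MPS's real revision code):

```typescript
// Illustrative sketch only: event and state-map shapes are simplified.
type StateEvent = {
  type: string;
  state_key: string;
  event_id: string;
};

function isSafeToTreatAsDelta(
  event: StateEvent,
  currentState: Map<string, StateEvent>
): boolean {
  // Key the current state by the type+statekey pair.
  const key = `${event.type}\u0000${event.state_key}`;
  // A combination we have never seen before cannot be stale state from
  // our server's perspective, so it can be applied directly instead of
  // refetching the entire room state with /state.
  return !currentState.has(key);
}
```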
I don't know why it has taken me so long to figure this shortcut out. I've told plenty of prominent matricians5 about this and none of them suggested it. However, I guess they just didn't have their head in such a tight spot, since not everyone needs to use state this way or resort to some crazy hack. Regardless, in order to take advantage of this, we'd have to rework how we calculate changes to room state.
All the way back in Mjolnir, we calculate changes to state by using a very simple enum called `ChangeType`. This has three variants: `Added`, `Modified` and `Removed`. The idea is that new combinations of type+statekey pairs are classified as `Added`, and everything else `Modified`. If a state event that is current state becomes redacted, or it is replaced with an event with empty content, then it is classified as `Removed`. If an event that was previously `Removed` is reintroduced, then it is classified as `Added` again6.
This was fine because we only used this simple change type to represent policies, not all state events. If I were making a generic Matrix library, neightrix, that might be annoying, since we don't really know how people are using their state events and what the changes mean. Besides, these rules don't tell us much about how state was changed, and we couldn't figure out whether a new type+statekey pair had been introduced or not. At least not without duplicating some of the checks we already did to derive the change type in the first place.
Well, I made the decision to just expand the enum to exhaustively describe every state change. Tell me whether that was a dumb move.
```typescript
// Copyright 2022 - 2024 Gnuxie <Gnuxie@protonmail.com>
// Copyright 2019 - 2021 The Matrix.org Foundation C.I.C.
//
// SPDX-License-Identifier: AFL-3.0 AND Apache-2.0
//
// SPDX-FileAttributionText: <text>
// This modified file incorporates work from mjolnir
// https://github.com/matrix-org/mjolnir
// </text>

export enum StateChangeType {
  /**
   * A state event that has content has been introduced where no previous
   * state type-key pair had been in the room's history. This also means
   * there are no previous redacted or blanked state events.
   */
  Introduced = 'Introduced',
  /**
   * A state event that has content has been reintroduced where a blank or
   * redacted state type-key pair had previously resided in the room state.
   * The distinction between introduced and reintroduced is important,
   * because an issuer can always treat introduced state in the timeline as
   * a delta, but not reintroduced, modified or removed state.
   */
  Reintroduced = 'Reintroduced',
  /**
   * This is a special case of introduced, where a state type-key pair has
   * been introduced for the first time, but with empty content.
   */
  IntroducedAsBlank = 'IntroducedAsBlank',
  /**
   * This is when a unique state event with empty content has been added
   * where there was previously a state event with empty or entirely
   * redacted content. Can alternatively be thought of as
   * "ReintroducedAsEmpty".
   */
  BlankedEmptyContent = 'BlankedEmptyContent',
  /**
   * A state event with empty content has been sent over a contentful event
   * with the same type-key pair.
   */
  BlankedContent = 'BlankedContent',
  /**
   * A redaction was sent for an existing state event that is being tracked
   * and has removed all content keys.
   */
  CompletelyRedacted = 'CompletelyRedacted',
  /**
   * A redaction was sent for an existing state event that is being tracked
   * and has removed all content keys that are not protected by
   * authorization rules. For example `membership` in a member event will
   * not be removed.
   */
  PartiallyRedacted = 'PartiallyRedacted',
  /**
   * There is an existing contentful state event for this type-key pair
   * that has been replaced with a new contentful state event.
   */
  SupersededContent = 'SupersededContent',
  /**
   * The events are the same, and the event is intact.
   */
  NoChange = 'NoChange',
}
```
Blanked is really a bit arbitrary. It's an interpretation of what a blank state event is supposed to mean, but usually consumers will use them as a more explicit way to remove state than a redaction.
I haven't written code to use the `Introduced` scenario as a shortcut yet, since that's less important to me at the moment. But I wonder if unsigned event content can also be abused to find out when we have a directly superseding event, and whether we can take a shortcut there too7. I think you are much more likely to run into the risk of getting out of sync with the canonical state of the room then. There's no reason why we couldn't preemptively use the events though, and still make a call to `/state` to compare. We could probably write some basic code to just make sure we're not frequently over-checking the same room. That's all stuff that needs figuring out in the future though, once I have Draupnir MPS up and running with a profiler.
MPS: ProtectionsConfig
The `ProtectionsConfig` is responsible for configuring and storing which protections are enabled within Draupnir. Just like Mjolnir, Draupnir stores the list of enabled protections in Matrix account data, under the account data key `org.matrix.mjolnir.enabled_protections`. Early on in Draupnir's development, we wanted to add new functionality as protections, and the first protection that we did this with was the `BanPropagationProtection`.
Of course, naively doing so would mean that when Draupnir gets upgraded, the new features would be disabled, which is not the behaviour that you would expect. Given that the `BanPropagationProtection` is such an important feature for the UX of Draupnir, we wanted it to be enabled by default. So even for existing deployments of Draupnir that are upgraded, or Mjolnir deployments that are migrated across, the protection will be enabled without their intervention. To do this seamlessly, what we needed was some kind of migration code.
Originally, this was done by embedding a special key into the event called the `SCHEMA_VERSION_KEY`, or `ge.applied-langua.ge.draupnir.schema_version`. The value associated with the version key would contain the version of the schema the event claimed to use. Draupnir's old `MatrixDataManager` would spot events that didn't have the key, or that conformed to an older schema, and update them for us.
The migrations were pretty straightforward to define8.
Ironically, the input data is only validated in an ad hoc way by type narrowing from `unknown`, and any TypeScript user knows that writing the code to do this is tiring9.
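For the unfamiliar, this is the kind of narrowing boilerplate I mean (an illustrative sketch, not Draupnir's actual validation code):

```typescript
// Illustrative: hand-written narrowing for an enabled-protections event.
function isEnabledProtectionsContent(
  value: unknown
): value is { enabled: string[] } {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const enabled = (value as Record<string, unknown>)['enabled'];
  return (
    Array.isArray(enabled) &&
    enabled.every((entry) => typeof entry === 'string')
  );
}
```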
In MPS we provide an interface for the protections config and a standard implementation that works with abstracted capabilities to load the account data. Since MPS is a generic library, it would be inappropriate to add Draupnir-specific migrations to the data. So a solution was needed whereby Draupnir could use the standard implementation and provide migrations for it to use, all without making the utility weirdly out of place.
This is done by introducing the concept of `SchemedData` and `SchemedDataMigration` to MPS, and then allowing a migration path to be given to the standard implementation of `ProtectionsConfig`.
This is a concept that has existed in Draupnir's code base for a while in various forms, and it still is not complete. I also worry that we are really abusing a concept here that is way too generic for a very specific purpose.
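To give the shape of the idea, here's a rough sketch of what a schemed migration could look like (the names are illustrative, not MPS's actual API):

```typescript
// Illustrative sketch: the entry at index N migrates data claiming schema
// version N up to version N + 1.
type UnknownConfig = Record<string, unknown>;
type SchemedDataMigration = (input: UnknownConfig) => UnknownConfig;

const migrations: SchemedDataMigration[] = [
  // 0 -> 1: enable the ban propagation protection by default on upgrade.
  (input) => ({
    ...input,
    enabled: [
      ...new Set([
        ...(Array.isArray(input.enabled) ? (input.enabled as string[]) : []),
        'BanPropagationProtection',
      ]),
    ],
  }),
];

function migrateToLatest(
  data: UnknownConfig,
  claimedVersion: number
): UnknownConfig {
  return migrations
    .slice(claimedVersion)
    .reduce((current, migration) => migration(current), data);
}
```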
MPS: ClientsInRoomMap
In January we described the `ClientsInRoomMap`, a utility that tracks all the appservice users present within a room and whether they should be informed about an incoming event. We used this map to also restrict which rooms appservice users can request room state information from. Technically, this isn't necessary, but I am paranoid about some obscure consumer of the API being abused down the line to leak room state, and thus room members, out of the appservice. If we naively implemented this restriction, then it would break conventional use of client objects within the code base. For example, to join a room you usually call `MatrixClient['joinRoom']`, where `MatrixClient` is from the matrix-bot-sdk.
Consider watching a policy room. The steps in Draupnir post-MPS go like this:

1. The room is given to the `PolicyListConfig`. The `PolicyListConfig` joins the room.
2. The `PolicyListConfig` requests an instance of a `PolicyRoomRevisionIssuer` from the `PolicyRoomManager` so that the policies of the room can be accessed.
3. The `PolicyRoomManager` checks that the client is present within the room by using the `ClientsInRoomMap`.
If during this time no timeline event has been received to inform the `ClientsInRoomMap` that our client has joined the room, then the `ClientsInRoomMap` will tell the `PolicyRoomManager` in step 3 that our client is not present within the policy room. This will, very confusingly, present an error to the end user that suggests they are not present within the policy room, when they will see on inspection that the client just joined it.
There are three ways this could be fixed:
1. Waiting for the timeline event to be received when we make a call to join the room within the `PolicyListConfig`.
2. Manually informing the `ClientsInRoomMap` of our join to the room.
3. Adding preemption to the client itself when the `joinRoom` method is called, which automatically and atomically updates the `ClientsInRoomMap`.
The first option, waiting for the corresponding timeline event which acknowledges the join, is terrible. Everywhere in the code where you join a room, you would be waiting indefinitely for something that may or may not happen, something that depends on connectivity to the homeserver. A whole bunch of code would then need to be added somewhere to manage the behaviour of waiting for the timeline event and continuing again. This would be a lot of dedicated engineering for what is really nasty code that only indirectly solves the problem. It would also force all code around room joins to not only be noisy but also latent at runtime, while we wait for the join event to be sent down to us from our homeserver.
The second option, manually informing the `ClientsInRoomMap`, is ok. This will likely only require a single line of code per site where we join a room, and the update can be performed instantly. However, it makes it quite easy for the programmer to make a mistake: they can accidentally tell the `ClientsInRoomMap` that we have joined a room when really there was an error of some kind and we never did join the room.
It also introduces an idiom that everyone who wants to write code for Draupnir or make protections must familiarise themselves with. Someone unfamiliar with the code base who is developing a protection might not be aware of this new idiom, and then get confusing error messages telling them that their client isn't present in the new room.
The third option, preemption in the client interface, requires significant engineering, since we now have to either modify all uses of the matrix-bot-sdk's `MatrixClient` or come up with our own alternative. However, it does mean that there is no manual tracking of the joined rooms, and we don't have to take extra care to inform new developers.
If you hadn't guessed yet, the third option of preemption is my preferred route. The way this will be implemented is by taking a step into the future and providing specific capabilities for client actions. One of those capabilities would be a `RoomJoiner`, and that's where we will implement our room join preemption for the `ClientsInRoomMap`.
Draupnir: testing commands
The Mjolnir integration test harness is not ideal, particularly in the context of commands. The way tests are currently written, we always assert outcomes by scanning the Matrix events that get rendered after the commands are run, regardless of whether commands get tested directly or indirectly.
This isn't too different to how end-to-end web tests are generally written in the industry; however, I have a very strong resentment towards the way these tests are written. I have a strong resentment towards tests that are written imperatively to begin with, and I'm yet to put my intuition into theory, but I should do at some point so that it can be evaluated objectively. What I believe it could boil down to is that tests have to change, and a concise declarative description of what to expect is easier to edit for new changes than an imperative set of steps that needs to be fully understood in order to modify just a portion. Creating a language where a declarative description can express a concept in the first place requires a lot of thought and a full understanding of all the interactions of the component we are describing. This being a prerequisite in itself should already increase the chances of getting better API design.
I guess what I have accidentally argued for, then, is for the same JSX templates I use to render the results of commands to also be used for testing the expected result of commands. This isn't really what I was going for, but it is true. Going back to the web test analogy, this would be like using your React components as the expectation part of your test assertions; you tell me if that would be bad.
So in order to figure this out, we should ask: what are we really testing for when we test commands?
- The side effects of the command.
- That the command works end-to-end.
- The result of the command renders correctly.
Side effects ideally should be covered by unit tests. You'd expect most application logic to be implemented somewhere distinct from the user interface. You'd also then expect application logic to be testable without invoking procedures indirectly through the user interface. This is something that is already near universally acclaimed.
If we're testing the side effects of a command, and thus the implementation logic of commands, then testing "that the command works end-to-end" in reality means testing that the "integration glue" around the command is working. Integration glue is the code that sticks user interface to application logic. If we take this further, then the integration glue doesn't need to be tested for each command. The glue just needs to be covered so that we can be confident that the glue will work for every command. To give some specific examples of what this glue would be, here's what I see as integration glue in Draupnir:
- Extracting the command context and parameters from a matrix event and then calling command-specific implementation code. Note, this isn't the same as parameter parsing; we're talking about code that invokes the parser, takes the parse result, and calls the appropriate command designated by the user's message.
- Taking the result of command logic and calling the appropriate renderer, to generate a render result.
- Taking the render result, and sending it to Matrix.
This is all glue that should be common to commands, and if it isn't, then you need to write a framework that makes it so first11. This means that we should only need to cover this glue code in a dedicated test, rather than covering glue code through brute force by end-to-end testing every single command to make sure that we hit all our glue.
If we can then break the dependencies of our rendering code and our command code from this same glue (and again, this should already be the case), then we can write unit tests specific to those and reduce the complexity of our tests.
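As a sketch of what that could look like for command logic (mocha-style, with made-up helpers; this is not the real Draupnir test suite):

```typescript
// Illustrative only: testing the ban command's logic directly, with no
// Matrix events, renderers or glue involved. makeFakeProtectedRoomsSet
// and banUserFromRoomSet are hypothetical helpers.
import { strict as assert } from 'node:assert';

it('bans the target user in every protected room', async function () {
  const protectedRooms = makeFakeProtectedRoomsSet([
    '!a:example.com',
    '!b:example.com',
  ]);
  const result = await banUserFromRoomSet(
    protectedRooms,
    '@spam:example.com',
    'spam'
  );
  assert.ok(result.isOkay);
  assert.deepEqual(protectedRooms.roomsBanning('@spam:example.com'), [
    '!a:example.com',
    '!b:example.com',
  ]);
});
```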
The only part of this that I'm hesitant about is renderers, since you might want to manually inspect the output of the renderers. I don't think that blocks the idea though, since it's really a tooling issue. The tests for renderers just need to be run with the option to export the results (to Matrix) for review.
Having written this analysis, I'm now curious about what the value of end-to-end web tests is. Because from this point of view, it seems that the only reason to use them is when it's impossible to break free from glue, or when it's impossible to find a concise way to test the renderers without including the entire web stack. This should quite obviously mean that there has been a failure to write modular programs12.
MPS: Capability providers for protections
We have established that we need to break the dependencies of command-specific code in order to make good tests, and also modify the Matrix client interface so that we can inform the `ClientsInRoomMap` of a room join. We now need a way to break the dependency on a matrix-bot-sdk `MatrixClient` (and Matrix itself), which is by far the biggest and most complicated stateful dependency.
There is already a precedent for fine-grained capabilities being broken down from the different responsibilities of Matrix clients, which is the widget API. The widget API doesn't provide the same capabilities that a client has; for example, there's no capability for joining matrix rooms. The widget API is also oriented towards declaring the degree of attenuation, so for example restricting access to room state to a specific event type from a specific set of rooms. Which is good, but we don't need an API like that immediately, and we would also still need some conceptualisation of a client to build that API upon.
This is way too much work to attempt all at once, and it would be unwise to jump
in and change everything without having some way to evaluate the new API.
So to start, I decided to focus on implementing the capability providers for protections with the new client interface. Previously in MPS, we hadn't had time to think about how to implement capability providers, so there existed just one capability provider, the `BasicConsequenceProvider`.
This was a monolithic interface containing everything that the basic protections would need, and not a very good one either.
```typescript
/**
 * This has to be provided to all protections, they can't configure it themselves.
 */
export interface BasicConsequenceProvider {
  consequenceForUserInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    user: StringUserID,
    reason: string
  ): Promise<ActionResult<void>>;
  renderConsequenceForUserInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    user: StringUserID,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForUsersInRevision(
    protectionDescription: DescriptionMeta,
    membershipSet: SetMembership,
    revision: PolicyListRevision
  ): Promise<ActionResult<void>>;
  consequenceForServerInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    serverName: string,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForEvent(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    eventID: StringEventID,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForServerACL(
    protectionDescription: DescriptionMeta,
    content: ServerACLContent
  ): Promise<ActionResult<void>>;
  consequenceForServerACLInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    content: ServerACLContent
  ): Promise<ActionResult<void>>;
  unbanUserFromRoomsInSet(
    protectionDescription: DescriptionMeta,
    userID: StringUserID,
    set: ProtectedRoomsSet
  ): Promise<ActionResult<void>>;
}
```
The way this has now been split up is as follows. A capability interface is described with a schema like so:
```typescript
export interface UserConsequences extends Capability {
  consequenceForUserInRoom(
    roomID: StringRoomID,
    user: StringUserID,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForUserInRoomSet(
    revision: PolicyListRevision
  ): Promise<ActionResult<ResultForUserInSetMap>>;
  unbanUserFromRoomSet(
    userID: StringUserID,
    reason: string
  ): Promise<ActionResult<ResultForUserInSetMap>>;
}

describeCapabilityInterface({
  name: 'UserConsequences',
  description: 'Capabilities for taking consequences against a user',
  schema: UserConsequences,
});
```
This is the interface that protections that need to take actions against users will write themselves around, in order to make the functionality pluggable. This includes allowing the protection to run with a complete stub, so that it won't actually take action against users13.
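For illustration, a no-op stub for this interface could look something like the following (a sketch only; MPS's real stub capabilities carry more metadata):

```typescript
// A sketch only: a stub satisfying the UserConsequences methods that
// reports success without acting. Assumes ResultForUserInSetMap is a
// Map-like type and that Ok() wraps a successful ActionResult, as used
// elsewhere in MPS.
const StubUserConsequences = {
  consequenceForUserInRoom: async () => Ok(undefined),
  consequenceForUserInRoomSet: async () => Ok(new Map()),
  unbanUserFromRoomSet: async () => Ok(new Map()),
};
```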
In order to implement the interface, we describe a capability provider using the API below:
```typescript
describeCapabilityProvider({
  name: 'StandardUserConsequences',
  description: 'Bans users and unbans users.',
  interface: 'UserConsequences',
  factory(_description, context: StandardUserConsequencesContext) {
    return new StandardUserConsequences(
      context.roomBanner,
      context.roomUnbanner,
      context.setMembership
    );
  },
});
```
This provides a consistent way to instantiate capabilities by using a factory: as long as the capability provider's factory returns a capability that matches the named interface, everything will work. The context object is what provides us with this consistency, and the factory acts like glue code to pull dependencies from the context and set up the capability for us.
As you can see, the context object has individual capabilities for the client responsibilities of banning and unbanning a user from a room. We also give the `SetMembership` for the `ProtectedRoomsSet` so that the `StandardUserConsequences` can figure out who is joined to the room.
The context object
We should probably explain what the context object is. Basically, the protection suite can't be made aware of the context of its use. For example, we couldn't include the interface of Draupnir in MPS, because that would break modularity and the library wouldn't be useful to other software. But we still need protections that are defined by the library consumer to be able to depend on the context of their use. For example, the ban propagation protection currently needs access to Draupnir-specific code to make prompts. The way we fix this is by allowing the library consumer to provide an arbitrary context object to the `ProtectedRoomsSet`, which their protection descriptions can either use as a dependency or destructure dependencies from. In Draupnir MPS, we use the Draupnir instance itself for this.
Capability context glue
Now that we have decided to change the client interface, and capability providers can be defined to destructure dependencies from a context object, it is possible to define a standard implementation for the capabilities required by the `MemberBanSynchronisationProtection` and the `ServerBanSynchronisationProtection`. Consider the description for the `StandardUserConsequences` capability provider from earlier.
```typescript
export type StandardUserConsequencesContext = {
  roomBanner: RoomBanner;
  roomUnbanner: RoomUnbanner;
  setMembership: SetMembership;
};
```
When we come to use this capability provider in Draupnir, unless Draupnir has these properties directly accessible, the factory method for the `StandardUserConsequences` capability won't work. We therefore need some kind of glue code that creates this context for us from the actual consumer-specific context, which the consumer defines ahead of trying to call the factory. This is what that looks like for the `StandardUserConsequencesContext`:
```typescript
describeCapabilityContextGlue<Draupnir, StandardUserConsequencesContext>({
  name: "StandardUserConsequences",
  glueMethod: function (protectionDescription, draupnir, capabilityProvider) {
    return capabilityProvider.factory(protectionDescription, {
      roomBanner: draupnir.clientPlatform.toRoomBanner(),
      roomUnbanner: draupnir.clientPlatform.toRoomUnbanner(),
      setMembership: draupnir.protectedRoomsSet.setMembership,
    });
  },
});
```
This little piece of glue code will get called instead of the factory for the capability provider, setting up the call to the factory with the correct context.
Capability renderers
Next we need to consider capability renderers, which are code that helps the user log and keep track of when a capability is used and by which protection. This code will get called around the individual capability methods when they are called by the associated protection, so that the side effects can be rendered, in Draupnir's case, to the management room. This could equally be some audit log in an application that isn't Draupnir. It's at this point you really wish you had some of that Kiczales aspect-oriented goodness so this wasn't some kind of second-class concept, but that's a digression.
Here's the file for the capability renderer for `StandardUserConsequences`. There's quite a lot of noise for what is really glue code. I'm not happy with it yet, but I can't complain too much. There isn't much code here, and it's not exported anywhere either, so it should be ok.
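The rough shape of a capability renderer is a wrapper like this (a simplified sketch, not the actual file):

```typescript
// Sketch only: wrapping a capability so each use is reported to the
// management room (or any audit log) before delegating to the real
// implementation. The types are simplified stand-ins.
type UserConsequencesLike = {
  consequenceForUserInRoom(
    roomID: string,
    user: string,
    reason: string
  ): Promise<unknown>;
};

function withRenderer(
  inner: UserConsequencesLike,
  report: (message: string) => void
): UserConsequencesLike {
  return {
    async consequenceForUserInRoom(roomID, user, reason) {
      report(`Banning ${user} in ${roomID}: ${reason}`);
      return await inner.consequenceForUserInRoom(roomID, user, reason);
    },
  };
}
```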
The capability set
The final piece is how to declare the interfaces of the capabilities that are required for a protection, and how to give those dependencies to the factory.
```typescript
describeProtection<MemberBanSynchronisationProtectionCapabilities>({
  name: 'MemberBanSynchronisationProtection',
  description:
    'Synchronises `m.ban` events from watch policy lists with room level bans.',
  capabilityInterfaces: {
    userConsequences: 'UserConsequences',
  },
  defaultCapabilities: {
    userConsequences: 'StandardUserConsequences',
  },
  factory: (description, protectedRoomsSet, _settings, capabilitySet) =>
    Ok(
      new MemberBanSynchronisationProtection(
        description,
        capabilitySet,
        protectedRoomsSet
      )
    ),
});
```
Annoyingly, this is where TypeScript starts to fall down a little, since we have to both give the interface to the type parameter for the capabilitySet argument AND give a description object of that interface, so that the code that calls the factory can create the capability set for us from the context. If it isn't clear, the `describeProtection` utility will take the `capabilityInterfaces` description, use the value for each property as the interface name, and go and find that for us. The `ProtectionsConfig` can later be made to allow each matching capability to be configured, and if no capability provider is named, then the associated default named in `defaultCapabilities` on the protection description can be used instead.
Copyright and meta discussion on commercial open source software
MPS: Becoming reuse compliant
A lot of the files in the Matrix Protection Suite are incorporated from Draupnir and Mjolnir. Almost all of them aren't direct incorporations, but a remixing of ideas and code. From a strict IP perspective, the remixing of ideas alone probably doesn't come under copyright, but I don't really think that's a reason not to treat the file the same way and give attribution. If I read a file, and I gain a lot of understanding from it, and then go on to change or develop new ideas to meet my new context, I feel it's important to give attribution to the work that came before me, no matter how much "original thought" or whatever you might think is being given. I take this stance because in my experience what seem like the most "simple" and "obvious" concepts or "ways to do something" are actually the parts of a code base that took the most work to derive. There is usually an entire ancient epic that the authors had to go through to get there, seas littered with the roaming souls of past developers, and you're just some bozo coming along getting the privilege of hindsight.

I do not think that attribution is unreasonable or a reinforcement of IP. It's not saying that "these ideas are theirs"; they might not be, often won't be. It's just the most basic respect and gratitude for what came before you. And that's important to me, because I don't think there's much that would piss me off more than if someone didn't do that for me.
This sort of attribution isn't normal in the industry. Even the most basic answers on Stack Overflow are incorporated by developers on a daily basis without attribution; I've done this throughout my entire career too14. What's worse is that even though weak licenses have clauses that require attribution and the preservation of attribution notices, those notices are buried within the source or a NOTICE file, which will likely never be seen by an end user or the customers of an enterprise. Some licenses, such as the AGPL, remedy this by requiring these notices to be prominent within the UI. But the majority of open source software that private enterprise plunders comes in the form of libraries with permissive licenses such as MIT or Apache-2.0.
Draupnir itself makes extensive attribution notices to Mjolnir, since Draupnir started as a fork, and it was clear that it wouldn't always be a fork. These notices are completely ad hoc, and it's not clear to anyone who comes across a file what needs to be preserved and copied. So for MPS, I wanted to see what was already out there that could help. It turns out that the SPDX specification does do this job, and here I was thinking it was just a specification for license identifiers. No, there's a lot of stuff here, including a system for annotating copyright, license and attribution information within files. It also turns out that there is a tool called reuse.software to check that your project has complete copyright and license information. I essentially just want it to be as unambiguous and explicit as possible to developers who stumble across the project what they need to do to copy something.
Wait Gnuxie why do you not just use the AGPL?
Idk, long story, but remixing and incorporation are important to me, and I'm worried about the AGPL hindering that. And the AGPL is something that enforces intellectual property, so don't come at me from both sides when you critique what I said in the previous section and also this; pick one.
Enterprise use of a community-maintained library is exploitative and opportunistic. Despite this, in the right circumstances the relationship can also be used by the community like a lasso, especially in a context where companies and community members have shared responsibility over an ecosystem with a common specification. In theory, the community can use its producer control over the library to force negotiation on issues: for the commercial entity to push through its aims, it would either need your support or be faced with reimplementing your work itself. In reality of course, the software is open source, so they can just fork you out of the picture and make you irrelevant.
When we look at the plundering of open source software by enterprise, we all have the urge to argue that developers should just stop doing labour for corporations. This is however difficult when they will rip your code and ideas out from under your feet regardless of whether you "do labour for them". Quite often we're still doing huge amounts of work for them, practically grovelling at their boots. And this is expressed in subtle, indirect ways, as we all help create and maintain ecosystems ready for exploitation.
For example, in the Matrix ecosystem it's important to remember that there was a time when Element viewed expansion of the ecosystem and Matrix community as critical to their success. By encouraging the growth of a community as an enterprise, you get so many powerful benefits. I've listed the things that have come to mind below; this is pretty low quality for the blog, but if you're interested in exploring these ideas and dynamics, then you should keep reading this section.
- Advocacy.
  - Infiltration: people who already use Matrix will go to their day jobs, and when their company needs software like Matrix, there will be an advocate already within the company.
  - Advertisement: a lot more people are going to know about your product if the people using it have good things to say about it.
- Hiring pool.
  - The community trains itself on how to develop for Matrix. There are automatically more heads in the hiring pool, which reduces the costs and stress of finding specialists.
  - New joiners can be pointed to community-maintained resources in order to get familiar with the technology; this includes joining support rooms that are actually run by non-affiliated community members.
- Software Development.
  - The community can organise and fill gaps in your product line-up, saving resources that would otherwise have to be committed by the enterprise.
  - The community can produce or maintain libraries that can be used in future development, again saving huge amounts of resources that would otherwise need to be committed by the enterprise.
  - With sufficient time investment into the community, they will contribute to your projects directly, for free. This might be with code, this might be with design or architecture, this could even be triaging issues, or documentation and translations. All of these things are roles that would otherwise have to be filled or left absent if the software was proprietary.
- Issue Reporting.
  - The community can rapidly give feedback on development builds and betas before you embarrass yourself to customers.
  - Community feedback can come from highly technical people whom you would otherwise have to pay consultancy fees to have access to.
Some of the drawbacks would be:

- Shared market.
  - You will have to share your hosting market with other vendors, since, as the product is open source, anyone can offer hosting or provision of your products. I would argue that this is an overstated drawback, especially for Matrix. If your product is open source, you probably don't have the resources to go it alone, otherwise you wouldn't have made it open source. This means that your product is very likely poorer quality than the proprietary competition. The very reason why people use your product in the first place is then also probably because it is open source, not because it is an objectively better product. Element has tried capitalising on this point, without really thinking that the reason people use their stuff is because there is little alternative, and Element is only good enough.
  - Even though the market is shared, your founding company very likely has a monopoly not only on developers and people with technical knowledge, but also a monopoly on connections to the community. This is especially true for companies that have a duality with a non-profit front. This means that you will also have a monopoly, or at least an uneven share, over the market for feature maintenance and development, since you can push through large or fundamental changes to the entire community. The shared market as a drawback argument is very uncompelling to me, since with all that domination over people, resources, influence and technology, you will have a much bigger share of contracts than anyone else, almost certainly combined. Even if in reality there is a much smaller market overall. And again, a market that only exists because your product is open source.
- Poaching.
  - It is easier for a bigger fish to come along and poach your staff and/or pick up the pieces if you fall down. This isn't really a drawback for the community, but for founders needing to maintain central control. This is because most of the work is public, and also who is responsible for it, how they work, who they talk to. There's little risk in recruiting people already showing competency in the ecosystem.
- Fostering community requires upfront and continuous investment to get benefits.
  - If you don't put a lot of work into fostering your community, then your corporate mindset will force you to see initial poor-quality contributions as a waste of time and a burden. Unless you make an attempt to seriously engage with contributors like this, you are sending a signal to the rest of your community that their contribution is not worthy either, which can easily cause a negative feedback loop. It takes time to guide people, and some people WILL waste your time. In my experience it is worth it; there's so much value and potential in collaborating with other people. There isn't a single first-time contribution I've received that hasn't taught me something or given me an insight.
- You're not going to make as much money with an open source product as you would with a proprietary product, unless you hit the big time through being malicious.
  - Why does this matter? Are you a greedy techno vulture or something? Get out of my sight.
Gnuxie: Bloviations about the state of affairs
There's been a lot of frustrated discussion about trust and safety recently in Matrix rooms adjacent to the foundation. None of it really matters too much: who does the discussion matter to? Who is the audience? Well, it's us, the amateur developers and "community" figures that might be holding things up, but also probably aren't. We're probably irrelevant, and you better hope we are. I'm pretty sure that most of the "ecosystem"15 is completely unaware of the foundation room. So whatever is said in there, whatever we also say in there, is irrelevant to them.
You see, the thing is, what holds Matrix up isn't us, it's Element. That's how it's always been, that's probably how it has to be. The technology itself is designed around entitlement to specialist labour that any business can expect to have access to, and to be able to throw at problems until the problems run out of money.
By contrast, I admire the people who hold the fort for IRC. Both the technology, the software and the networks are held up by volunteers or very small businesses. I pretty strongly believe that liberatory technology, as initially also promised by Matrix, isn't going to come from venture capital. It has to be built for and by people who work to help each other. This isn't a stand-up analysis of venture capital, it's just some bullshit I'm writing, ok? But cooking up some software that allows communities to organise and do lovely things, and then trying to apply it to the needs of a completely different market and force feed all the architectural decisions that come with that back to us, is shit. It's backwards. I know you might think it can be done, but it can also be done without, and it should be.
Gnuxie: The awakening of a Marewolf
It's extremely edgy, I know, but that's cool, and cringe is dead16.
Soft failure and spam
Since the start of the year, spam attacks have been a popular choice on the menu. We had an issue with this at the end of January, and the Draupnir room was intentionally targeted because whoever was upset probably didn't understand who we were. Regardless, the topic of soft failure came up in our Matrix room.
By mid February the climate had changed, and there were waves of attacks in the ecosystem where abusers would spam CSAM to public rooms for whatever reason. Now obviously this is bad, but Matrix is somewhat unequipped to prevent the knock-on effects. For example, it is in attacks like these, where a user joins, sends some messages, and then gets banned, that we see something called "soft failure". What happens is that even though a moderator can redact events sent by the abuser, there are some events the abuser sent that the moderator can't see in order to redact them in the first place. This is because if your spammer is on a remote server, and you ban them, then there is a gap in the DAG between when the admin issues the ban and when the remote server receives the ban and stops the user sending messages. These are valid events; Matrix is supposed to, and used to, work such that the admin would see these events, because network latency and connectivity issues cause conflicts like this. Matrix is designed to be resistant against this, and is.

The problem with this arrangement is that if you have a fancy homeserver, you can create any event at any point in the DAG retroactively, by referencing "stale state". Because to all the participating servers in the room, the sudden existence of these events referring to prior state is indistinguishable from old events they just haven't heard about yet. So if you are a malicious server, as long as you were once joined to a Matrix room, you can append anything you want to the DAG, so long as you can get a network connection to another server in the room17. To stop this from happening, you need to fundamentally redesign event authorisation within Matrix, which is something that Matrix's leadership has avoided, because it is a lot of hard work. Instead, a bodge was introduced, which as you can guess is soft failure. Now, whenever you receive events that refer to authorisation events that are not a part of the current state, the homeserver will "soft fail" them. The receiving server accepts the event but won't show the event to clients via `/sync`. Additionally, the receiving server will try to pretend that it never saw the event in order to stop its propagation to other servers. This is pretty futile though, because if they received the event in the first place, then it's very likely that the event has already been accepted without soft failure by some other server in the room, and will be referred to as a forward extremity by that server.
In the specific case where soft failure hits us after banning a spammer, the spammer's server (which in the current climate is almost always acting complicitly rather than maliciously) usually has time to send messages to other servers before those servers also receive the ban. Meaning that when the admin tries to redact the spam, a lot of it gets left behind from the perspective of many in the room. This is an issue that has been documented by many, notably heftig in https://github.com/matrix-org/synapse/issues/9329 and subsequently https://github.com/element-hq/synapse/issues/9329.
Draupnir's doctrine
When we discussed this within the Draupnir room on January 31st, we came to the conclusion that the best thing for us would be having a way to send soft failed events down `/sync`, and that this should be something accessible to all clients, so that room admins would be able to find soft failed spam and redact it. Obviously the implication is that Draupnir or other tools would be able to do this for them. We agreed that implementing this as some kind of admin API would leave a bunch of room admins out. I also wrote this up in the issue, explicitly making reference to Mjolnir:
> For some additional context, the reason why this is a problem for Mjolnir is because as a room administrator, NOT necessarily a homeserver administrator (but it is also a problem for server admins), mjolnir can't see the soft failed events. A solution that would work would be allowing a client to access soft failed events from both `/sync` and `/messages`. It will be ok for these events to be given in a redacted form, provided it is very & immediately obvious to a client whether these events have an associated `m.redaction` event or not. So that Mjolnir can then see and issue redactions that will be seen by other servers. This should ideally not be a Synapse administrator API, since public Matrix homeservers will have room admins who are not admins of the homeserver.
In private, Cat talked with me about how the appservice API would need to push these events too for the system to work with draupnir4all. I thought that this discussion had happened in public around the time, but I can't find evidence of that.
MSC4104: Soft-failure-be-gone!
Subsequently, on February 19th, I proposed to remove soft failure entirely and introduce a new mechanism that could be used to canonicalise the room history. You should read the introduction here, and if you like that, you can read the full pull request here.
This would eliminate the need for any special hacks to get a consistent view of soft-failed and non-soft-failed events, but it would require fundamental changes to event authorisation, which, again, are almost always avoided in Matrix. I don't really care though; it's still important to give these ideas life just in case, and also to develop and inspire other ideas.
Even ideas about new and better protocols.
Irrelevancy, Confusion, Megalomania
A few days later, Cat brought up the soft fail issue in the context of a complaint about extensible events. Essentially he was asking if linked media could be introduced before extensible events, as extensible events will create a nightmare for reactive tooling, since now there are many new ways to inject media into events that we need to account for. Cat mentioned that soft failure complicates an already bad situation.
Without the context that we have, Travis states a preference for sending events to moderation tools via the appservice. It later turns out that he's somewhat blindsided by how prevalent the problem is. At the same time, he brings up a problem with syncing clients whereby there can be gaps in the timeline. He theorises that this would lead to Mjolnir and Draupnir being unable to redact events, similar to soft failure. And he quite strongly uses this to advise against our favoured solution of providing an option to show soft failed events to both clients and appservices. I push him on this, and tell him that this does not happen, because Mjolnir has always called `/messages` to fetch events to redact. He then drops that line of argument and switches to a problem that does exist, but is much less severe: that when Draupnir or Mjolnir has been down, it won't process commands sent while the bot was down. However, this is also a problem that is common to both the appservice and bot deployments of Draupnir, as we don't have the capability in the appservice to save events while a managed Draupnir is down or crashed. Only an appservice that can store events, or that crashes in its entirety when a client experiences a problem, can do this. And they shouldn't really be crashing often in the first place18. The following morning, Travis would publish an MSC to show soft failed events only to appservices: MSC4109: Appservices & soft-failed events.
While this is all ongoing (within the same discussion), Travis slowly starts depending on another argument, and relies on it more as we go forward. For some context, we acknowledge deeply that Draupnir does not have the ideal interface. It's never been the intention to force moderators to use commands to carry out the most basic tasks (banning and unbanning from a set of rooms). Additionally, Draupnir still does not offer any kind of support for spaces, and it would greatly ease onboarding if it did. It's also always been the intention that Draupnir4all could be turned into an appservice that closely tracks a space, given that we can overcome the same problems that have led to the development of MPS. We want there to be as little conscious manual intervention from users as possible.
Basically, Draupnir doesn't align with the foundation's vision for the future, for whatever reason. Not in this form, not in MPS. Possibly, they want to try to roll back to a view where "the homeserver does it", but we know that homeserver code bases are hot garbage and that this would be a major compromise to modularity19. We don't even see how they arrive at their decisions, because it happens in private. Even now, with the announcement that the SCT will be focusing on T&S work (albeit because of changes to legislation in the UK and elsewhere), we're assured that these things will be worked out with a series of working groups20, but time drags on and we have no idea yet. Of course, I have to give them a chance, and I don't think I have here yet. The worry is that they will continue to predetermine the design and direction in private and use the working groups to compromise on them. Which is backwards and a very exhausting approach; the meta discussion needs to be public too.
When I reiterate these ideas for Draupnir, I'm told that they don't align with the vision that the foundation has. I'm again told that Mjolnir was stop-gap tooling. All of these things are a giant omen to give up; my attempts don't feel respected, and sometimes it even feels as though they are being undermined.
I'd like to know what you all think about this, because after the discussion on the 23rd of February, I was really upset. Obviously this stuff feels more personal to me since I'm the one wasting my time21, and that's my problem.
NOTE: Just want to say that I'm alright. I felt that way at the time and when I was writing this a couple of weeks later, but right now I can't say I feel like I care enough anymore, at least not right now. There's enough discussion throughout this post about what the real problems are and how they can be fixed, so focus on that rather than some platitude about Draupnir.
Mjolnir
I want you to read the following exchange as a joke between two friends, as it captures the sentiment quite well.
<TravisR> Mjolnir was actually meant to be a proof of concept and nothing more but then we deployed it to 3 servers and forgot to tell people not to run it 😛
<Nico> Well, you actively told people "We have mod tools! Look! Mjolnir!"
<TravisR> this was after we realized it was too late.
What kind of sucks about this is that in this alternate reality, which is apparently preferable to the foundation, Mjolnir would never have been released and the community would have been left with nothing at all22.
These are some harsh words, but this is also why I feel like the foundation has a policy of developing for an internal context first: they don't want to again be left with the legacy left by Mjolnir. Saving their own drowning community is too inconvenient, because there will be an extra body on the ship with agency of their own to consider. Which would seem horrific if you didn't acknowledge that the reason why is that the foundation's own concerns are too much to handle. Though I'm not sure that this argument even makes sense, since Mjolnir is now, and was with the exception of its revival in 2021-2022, given the bare minimum attention to just ensure that it still functions. Is it easier or harder to ignore the voices that have been and gone if Mjolnir does or doesn't exist?
Closing
Footnotes:
A timeline gap is where a large number of events occurred between the previous and current call to `/sync`, and only the most recent events have been returned: https://spec.matrix.org/v1.9/client-server-api/#syncing. This happens most frequently when starting a client up after being away for the day.
This is not quite as it seems though: it is true for policy lists, but room state isn't used from a cache everywhere in the code base. That's because the infrastructure does not exist for us to cache room state like that, as mentioned in our previous update where we introduced revisions and the clients in room map.
This is because they can be stale events sent from a server that was knocked offline, or a result of any number of DAG oddities.
Well, this isn't actually true, but in terms of your model of the state it will be. For example, it is possible for this event to also be something stale provided by a previously disconnected server. But we'd have to wait for that server to refer to or send us any superseding event anyway for our server to be able to pull it in. So it's a pretty safe thing to do, considering that there won't be any conflicts.
A prominent blue matrician has since felt challenged by this statement and we had a nice discussion about this, where he informed me of how the state portion of sync interacts with timeline gaps.
Events with fields that are protected from redaction will be classified as `Modified` if superseding events are found.
Not all homeservers append the previous state to `unsigned`, though.
Technically we're being a bit cheeky here by defining the schema for an event we don't control, since it is Mjolnir's event. However, the version is scoped to Draupnir, and so long as we only use it to disable/enable protections and maintain compatibility with Draupnir, we're free to do this without causing issues.
Which is why there are lots of competing validation frameworks, which I honestly am not sure how to feel about. We've been using TypeBox in MPS, which has been designed close to JSON Schema. That is important to us, since this is how Matrix events are defined within the specification. However, you will be aware of the phrase "don't validate, parse", which exists because not parsing will force you to use validation checks like sprinkles all over your code base, because you never set up an entry point where you parse data and then that is it. The problem with keeping Matrix events in their true JSON structure (or at least doing so without wrapping them in some other object and then leaving the JSON structure as accessors10) is that I have to either duplicate the checks or lie to the type checker whenever I do property access. The way I have TypeBox set up, the type system isn't smart enough to recognise that 'm.room.message' should have the room message type. And it can't, because that would be unsafe. In MPS we have to allow for invalid events so that moderators can redact them if their homeserver is silly enough not to soft fail them. Therefore I refer you to the solution in this footnote10, and adding a special wrapper type for an invalid event.
Though I'm not sure this would work, since your messages will clash with possibly valid property names on the event. The real way to do it is to wrap the event in a specialized wrapper for the event type, and then offer the raw JSON or raw content in addition to specialized accessors for that event type. This is complicated though by extensible events, which can annotate any event with media, and make message scanning a lot harder.
I don't need to seriously explain to you why, do I? If you don't, you are going to waste your time writing a bunch of tests for something that could be tested just once in one file. It's exactly the same problem we're trying to fix here.
And I don't want you to blame that on the web, the web is good and doesn't stop people from writing the tests that I'm describing. What stops them is a lack of investment into libraries and tools that would let them write the tests that I am describing.
And this is also a much safer way to maintain the "no-op" feature from Mjolnir, since all capability providers that take consequences can just be replaced with stubs. If you've ever taken a look at the code base, you'll see a lot of special casing around calls to ban or redact etc. by wrapping them in a check for `config.noop`, which developers of Mjolnir and protections have frequently forgotten about entirely.
Though, I have started copying the share link. This still violates the license though. I think reuse will let you do this in a compliant way.
People keep pretending they "know" what this is, I don't believe anyone has a real idea, it's just their idea. They're probably all simultaneously true.
And forever may it remain dead. Vylet Pony is amazing, you should really go listen to some tunes. Can Opener Fish Whisperer is dead easy, but anything since Mystic Acoustics is absolute game.
Conventionally this is prevented by server ACLs, but that can be bypassed. While it might appear that the propagation of soft failed events is prevented in the specification by recommending that servers exclude them from forward extremities, replication is unavoidable if just one participating server refers to a soft failed event just once. And this can happen either intentionally or unintentionally (I somewhat doubt all homeservers enforce this recommendation, and of course the malicious route is still open).
This could also be solved by working out an API for bot interactions. Draupnir would then only have to fetch the recent interactions in the management room and respond to all the interactions that haven't been completed yet. But I don't trust interactions to be proposed as an idea in a generic enough way to allow for much freedom. For example, the command syntax in Draupnir is deliberately designed such that the client could be aware of the commands, their documentation and their arguments, even the presentation types that each individual argument accepts. To the extent that a command to ban a user could highlight all users on the screen when invoked, and allow them to be selected with the mouse. This is copied from presentation-style interfaces from a bygone Lisp machine era. But there's no way anyone wants to be as imaginative as that because it's bloatful etc., what a pity.
I'm editing this on the 16th of March, and this came up yesterday in the foundation room. Thib explained his thoughts on this, which are probably shared within the foundation. Essentially, the concern is about consistency vs freedom and modularity. If the homeserver is required to manage spaces, synchronise policies etc., then it is more likely that client UX will be consistent. Which is a fair argument, but it does restrict the freedom that we have. I did try to remind the room that so far homeservers have failed to implement many features consistently.
Honestly, the announcement of these groups alone is probably because of the engagement Emma, Cat and jjj have been giving in the rooms around the foundation in the context of T&S. Though I do worry about how much time we as a little group (and I'm not speaking on behalf of the others, I don't know how they feel about this yet) will be able to commit to this process. The loose, ad hoc association is both a strength and a weakness. We'll see though.
Though clearly, I do this stuff regardless of what spokespeople, community figures, or whoever have to say. They don't have some silver bullet that's gonna make all the work I've been doing look like complete shit. It would take some serious organisation and coordination for something like that to materialise, and also the fundamental reworks that are usually avoided. All stuff that I actually want to happen. The worry is that they won't happen, and we'll get more damaging hacks like soft failure and server ACLs instead. Or specification changes built to empower tooling that doesn't exist yet, that disadvantage existing tooling or workflows without their consideration.
I have been hesitant to post acknowledgements because I don't want them to be seen as an endorsement of what I have to say, but that feels wrong. You can find the sponsors within the links though.