The previous month for Gnuxie & Draupnir: February 2024.

Introduction

Drive to public testing

Since the beginning of the month I have been driving development towards getting the rework of Draupnir that builds on top of MPS merged into the main branch, or, as a compromise, into a state where the branch can be tested publicly. As of writing, the latter looks a lot more likely. The reason I am doing this is basically to shortcut myself and stop MPS from becoming a forever project. Not that there's genuine risk of that; I just think it's time to see how our work pays off.

A release would require focusing on getting the test harness that we inherited from Mjolnir working in the context of MPS and then deciding where to draw the line in terms of features or fixes to be included or followed up on. I don't think we're ready for that yet, so working towards a rolling test image for the branch makes more sense.

Reading this post

I've just been toiling along, and this is a much bigger update. A lot of things have happened, and the last month, and even this month, have felt somewhat emotional. If you can't be assed to read something, go to the contents to find something else, or skim through for what you find interesting. There's a bit of chronology but it'll be fine. And yes, I know the post is late 😛.

I pretty much failed on the whole "edit as I go along" thing, and this time I've really paid the toll, because there is so much stuff to talk about. But now that editing is over, I accept how things have worked out.

Draupnir for all shoutout

Cat worked really hard this month to get the appservice mode of deployment into matrix-docker-ansible-deploy. Now, the appservice is still alpha quality, and the system admin experience is quite awful at the moment. A lot of that is waiting on the Draupnir MPS work to conclude, since the appservice internals have seen a lot of improvements there. Big shoutout to Cat for this. I wasn't very receptive to his efforts, because I worry about people having a bad time getting this to run, but I do now think it probably is the right thing to do. He went through the pain of reworking the documentation that had been in the repository from the "Mjolnir for all" days, which was rushed and really for Halfy's eyes only. So a big thank you to Cat for working it all out.

Developments

MPS: Room State Tracking

Throughout the previous update, we mentioned room state caching and tracking only in passing. But I want to talk about this more completely, as it has come up again. Both Draupnir and Mjolnir depend on the matrix-bot-sdk as their client library, which provides very little control over the /sync loop. This means that we cannot use the same /sync loop that we use to receive timeline events to receive state deltas from joined rooms. Even if we could, the state portion of /sync is primarily used to inform clients about state that changed within timeline gaps1, and isn't supposed to provide a complete set of deltas. More worryingly, the appservice API provides us no way to receive state deltas from the homeserver, and I don't want to have to manage syncing clients within the appservice code either. This has left Draupnir and Mjolnir in a situation where, in order to keep an accurate model of the room state, they invalidate their local copy and request an entirely new one every time they see a state event in the room timeline2. The reason is that it is not possible or safe to assume that a state event found in the timeline is representative of current state for that type+statekey pairing3.

As a result, both Mjolnir and Draupnir only cache room state for policy rooms; for every other operation, such as synchronising policies with room members, the state (though usually just a call to /joined_members) is fetched anew. This works, but it is problematic, because most protections also need this same information in one form or another. It is also the primary reason why the bots feel so slow.

As you already know, in MPS we bit the bullet and started tracking state in all protected rooms. But this does present a concerning problem: if we now fetch the entire room state every time we see a state event, isn't that going to really complicate a join wave attack? Well, yes. So I wanted to find a solution to that before releasing. It turns out that there are situations where you can take a state event from the timeline and treat it the same way as a state delta. If you have an existing copy of the room state and you notice a new combination for a type+statekey pair, then from your server's perspective it is impossible for that event to be stale state4.
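To illustrate, here's a minimal sketch of that check. The types and map layout are illustrative, not MPS's actual implementation:

type StateEvent = {
  type: string;
  state_key: string;
  event_id: string;
};

// Current room state keyed by the type+statekey pair, as clients commonly do.
type StateMap = Map<string, StateEvent>;

function pairKey(event: StateEvent): string {
  return `${event.type}\u0000${event.state_key}`;
}

// If we have never seen this type+statekey pair before, then from our
// server's perspective the timeline event cannot be stale state, so it is
// safe to apply it directly as a delta.
function onTimelineStateEvent(
  current: StateMap,
  event: StateEvent
): 'applied as delta' | 'refetch /state' {
  if (!current.has(pairKey(event))) {
    current.set(pairKey(event), event);
    return 'applied as delta';
  }
  // An existing pair means the event could be stale, so fall back to
  // invalidating and refetching the full room state as before.
  return 'refetch /state';
}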

I don't know why it has taken so long for me to figure this shortcut out. I've told plenty of prominent matricians5 about this and none of them suggested it. I guess they just didn't have their head in such a tight spot, since not everyone needs to use state this way or resort to some crazy hack. Regardless, in order to take advantage of this, we'd have to rework how we calculate changes to room state.

All the way back in Mjolnir, we calculate changes to state by using a very simple enum called ChangeType. This has three variants: Added, Modified and Removed. The idea is that new combinations of type+statekey pairs are classified as Added, and everything else as Modified. If a state event that is current state becomes redacted, or it is replaced with an event with empty content, then it is classified as Removed. If an event that was previously Removed is reintroduced, then it is classified as Added again6.
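For reference, the old enum looked roughly like this (a sketch from memory, not a verbatim copy of Mjolnir's code):

export enum ChangeType {
  Added = 'ADDED',       // a new type+statekey pair, or one returning from Removed
  Modified = 'MODIFIED', // an existing pair superseded with new content
  Removed = 'REMOVED',   // redacted, or replaced with an event with empty content
}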

This was fine because we only used this simple change type to represent policies, not all state events. If I were making a generic matrix library, neightrix, that would be annoying, since we don't really know how people are using their state events and what the changes mean. Besides, these rules don't tell us much about how state was changed, and we couldn't figure out whether a new type+statekey pair had been introduced or not. At least not without duplicating some of the checks we had already done to derive the change type in the first place.

Well, I made the decision to just expand the enum to exhaustively describe every state change; tell me whether that was a dumb move.

// Copyright 2022 - 2024 Gnuxie <Gnuxie@protonmail.com>
// Copyright 2019 - 2021 The Matrix.org Foundation C.I.C.
//
// SPDX-License-Identifier: AFL-3.0 AND Apache-2.0
//
// SPDX-FileAttributionText: <text>
// This modified file incorporates work from mjolnir
// https://github.com/matrix-org/mjolnir
// </text>

export enum StateChangeType {
  /**
   * A state event that has content has been introduced where no previous state type-key pair
   * existed in the room's history. This also means that there are no previous redacted
   * or blanked state events.
   */
  Introduced = 'Introduced',
  /**
   * A state event that has content has been reintroduced where a blank or redacted state type-key
   * pair had previously resided in the room state.
   * The distinction between introduced and reintroduced are important, because
   * an issuer can always treat introduced state in the timeline as a delta,
   * but not reintroduced, modified or removed state.
   */
  Reintroduced = 'Reintroduced',
  /**
   * This is a special case of introduced, where a state type-key pair has been
   * introduced for the first time, but with empty content.
   */
  IntroducedAsBlank = 'IntroducedAsBlank',
  /**
   * This is when a unique state event with empty content has been added
   * where there was previously a state event with empty or entirely redacted content.
   * Can alternatively be thought of as "ReintroducedAsEmpty".
   */
  BlankedEmptyContent = 'BlankedEmptyContent',
  /**
   * A state event with empty content has been sent over a contentful event
   * with the same type-key pair.
   */
  BlankedContent = 'BlankedContent',
  /**
   * A redaction was sent for an existing state event that is being tracked
   * and has removed all content keys.
   */
  CompletelyRedacted = 'CompletelyRedacted',
  /**
   * A redaction was sent for an existing state event that is being tracked
   * and has removed all content keys that are not protected by authorization rules.
   * For example `membership` in a member event will not be removed.
   */
  PartiallyRedacted = 'PartiallyRedacted',
  /**
   * There is an existing contentful state event for this type-key pair that has been replaced
   * with a new contentful state event.
   */
  SupersededContent = 'SupersededContent',
  /**
   * The events are the same, and the event is intact.
   */
  NoChange = 'NoChange',
}

Blanked is a bit arbitrary. It's really an interpretation of what a blank state event is supposed to mean, but usually consumers use them as a more explicit way to remove state than a redaction.

I haven't written code to use the Introduced scenario as a shortcut yet, since that's less important to me at the moment. But I wonder if unsigned event content can also be abused to find out when we have a directly superseding event, and whether we can take a shortcut there too7. I think you are much more likely to run the risk of getting out of sync with the canonical state of the room that way, though. There's no reason why we couldn't preemptively use the events anyway and still make a call to /state to compare. We could probably write some basic code just to make sure we're not frequently over-checking the same room. That's all stuff that needs figuring out in the future though, once I have Draupnir MPS up and running with a profiler.
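That rate limiting could be as simple as something like this sketch (entirely hypothetical names, none of this exists in MPS yet):

class RoomStateCheckThrottle {
  // roomID -> timestamp of the last comparison against /state.
  private readonly lastChecked = new Map<string, number>();

  public constructor(private readonly minIntervalMS: number) {}

  // Returns true when enough time has passed to compare against /state again.
  public shouldCheck(roomID: string, now = Date.now()): boolean {
    const last = this.lastChecked.get(roomID);
    if (last !== undefined && now - last < this.minIntervalMS) {
      return false;
    }
    this.lastChecked.set(roomID, now);
    return true;
  }
}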

MPS: ProtectionsConfig

The ProtectionsConfig is responsible for configuring and storing which protections are enabled within Draupnir. Just like Mjolnir, Draupnir stores the list of enabled protections in Matrix account data, under the account data key org.matrix.mjolnir.enabled_protections. Early on in Draupnir's development, we wanted to add new functionality as protections, and the first protection we did this with was the BanPropagationProtection. Of course, naively doing so would mean that when Draupnir gets upgraded, the new features would be disabled, which is not the behaviour that you would expect. Given that the BanPropagationProtection is such an important feature for the UX of Draupnir, we wanted it to be enabled by default, so that even for existing deployments of Draupnir that are upgraded, or Mjolnir deployments that are migrated across, the protection will be enabled without intervention. What we needed to do this seamlessly was some kind of migration code.

Originally, this was done by embedding a special key into the event called the SCHEMA_VERSION_KEY, or ge.applied-langua.ge.draupnir.schema_version. The value associated with the version key would contain the version of the schema the event claimed to use. Draupnir's old MatrixDataManager would spot events that didn't have the key or that conformed to an older schema, and update them for us. The migrations were pretty straightforward to define8. Ironically, the input data is only validated in an ad hoc way by type narrowing from unknown, and any TypeScript user knows that writing the code to do this is tiring9.

In MPS we provide an interface for the protections config, and a standard implementation that works with abstracted capabilities to load the account data. Since MPS is a generic library, it would be inappropriate to add Draupnir-specific migrations to the data. So a solution was needed that would let Draupnir use the standard implementation and provide migrations for it to use, all without making the utility weirdly out of place.

This is done by introducing the concept of SchemedData and SchemedDataMigration to MPS, and then allowing a migration path to be given to the standard implementation of ProtectionsConfig. This is a concept that has existed in Draupnir's code base for a while in various forms, and it still is not complete. I also worry that we are abusing a concept here that is way too generic for a very specific purpose.
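To give an idea of the concept, here's a minimal sketch of schemed data migrations under illustrative names. The real MPS interfaces differ in detail, and the shape of the enabled_protections content is assumed:

const SCHEMA_VERSION_KEY = 'ge.applied-langua.ge.draupnir.schema_version';

type SchemedData = { [SCHEMA_VERSION_KEY]?: number; [key: string]: unknown };

// A migration at index `n` takes data at version `n` to version `n + 1`.
type SchemedDataMigration = (input: SchemedData) => SchemedData;

function migrateToLatest(
  input: SchemedData,
  migrations: SchemedDataMigration[]
): SchemedData {
  let data = input;
  for (
    let version = data[SCHEMA_VERSION_KEY] ?? 0;
    version < migrations.length;
    version++
  ) {
    data = migrations[version](data);
  }
  return data;
}

// Example: version 0 -> 1 enables the BanPropagationProtection by default.
const enableBanPropagation: SchemedDataMigration = (input) => ({
  ...input,
  enabled: [
    ...new Set([
      ...((input.enabled as string[] | undefined) ?? []),
      'BanPropagationProtection',
    ]),
  ],
  [SCHEMA_VERSION_KEY]: 1,
});

So migrateToLatest(accountData, [enableBanPropagation]) brings old account data, including data without any version key at all, up to date.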

MPS: ClientsInRoomMap

In January we described the ClientsInRoomMap, a utility that tracks all the appservice users present within a room and whether they should be informed about an incoming event. We also used this map to restrict which rooms appservice users can request room state information from. Technically this isn't necessary, but I am paranoid about some obscure consumer of the API being abused down the line to leak room state, and thus room members, out of the appservice. If we naively implemented this restriction, it would break conventional use of client objects within the code base. For example, to join a room you usually call MatrixClient['joinRoom'], where MatrixClient is from the matrix-bot-sdk.

Consider watching a policy room, the steps in Draupnir post-MPS go like this:

  1. The room is given to the PolicyListConfig.
  2. PolicyListConfig joins the room.
  3. PolicyListConfig requests an instance of a PolicyRoomRevisionIssuer from the PolicyRoomManager so that the policies of the room can be accessed.
  4. The PolicyRoomManager checks that the client is present within the room by using the ClientsInRoomMap.

If during this time no timeline event has been received to inform the ClientsInRoomMap that our client has joined the room, then the ClientsInRoomMap will tell the PolicyRoomManager in step 4 that our client is not present within the policy room. This will, very confusingly, present an error to the end user that suggests they are not present within the policy room, when they will see on inspection that the client just joined it.

There are three ways this could be fixed:

  1. Waiting for the timeline event to be received where we make a call to join the room within the PolicyListConfig.
  2. Manually informing the ClientsInRoomMap of our join to the room.
  3. Adding preemption to the client itself when the joinRoom method is called that automatically and atomically updates the ClientsInRoomMap.

The first option, waiting for the corresponding timeline event which acknowledges the join, is terrible, since now everywhere in the code where you join a room, you are waiting indefinitely for something that may or may not happen: something that depends on connectivity to the homeserver. A whole bunch of code then needs to be added somewhere to manage the behaviour of waiting for the timeline event and continuing again. This would be a lot of dedicated engineering for what is really nasty code that only indirectly solves the problem. It would also force all code around room joins to not only be noisy but also slow at runtime, while we wait for the join event to be sent down to us from our homeserver.

The second option, manually informing the ClientsInRoomMap, is ok. It will likely only require a single line of code per site where we join a room, and the update can be performed instantly. However, it does allow the programmer to make a mistake quite easily: they can accidentally tell the ClientsInRoomMap that we have joined a room when really there was an error of some kind and we never did join it. It also introduces an idiom that everyone who wants to write code for Draupnir or make protections must be familiarised with. Someone unfamiliar with the code base who is developing a protection might not be aware of this new idiom, and then get confusing error messages telling them that their client isn't present in the new room.

The third option, preemption in the client interface, requires significant engineering, since we now have to either modify all uses of the matrix-bot-sdk's MatrixClient or come up with our own alternative. However, it does mean that there is no manual tracking of joined rooms, and we don't have to take extra care to inform new developers.

If you hadn't guessed yet, the third option of preemption is my preferred route. The way this would be implemented is by taking a step into the future and providing specific capabilities for client actions. One of those capabilities would be a RoomJoiner, and that's where we will implement our room join preemption for the ClientsInRoomMap.
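Here's a sketch of the preemption idea with hypothetical types; the real RoomJoiner in MPS will differ:

interface ClientRooms {
  // Inform the map that `clientUserID` is joined to `roomID`.
  addJoinedRoom(clientUserID: string, roomID: string): void;
}

interface MatrixClientLike {
  joinRoom(roomIDOrAlias: string): Promise<string>; // resolves to the room ID
}

class RoomJoiner {
  public constructor(
    private readonly client: MatrixClientLike,
    private readonly clientUserID: string,
    private readonly clientRooms: ClientRooms
  ) {}

  // Join the room and update the ClientsInRoomMap before returning, so that
  // code running immediately afterwards (such as the PolicyRoomManager's
  // presence check) already sees the join.
  public async joinRoom(roomIDOrAlias: string): Promise<string> {
    const roomID = await this.client.joinRoom(roomIDOrAlias);
    // Only update on success; a failed join must not mark us as present.
    this.clientRooms.addJoinedRoom(this.clientUserID, roomID);
    return roomID;
  }
}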

Draupnir: testing commands

The Mjolnir integration test harness is unideal, particularly in the context of commands. As tests are currently written, regardless of whether commands get tested directly or indirectly, we always assert outcomes by scanning the matrix events that get rendered after the commands are run.

This isn't too different from how end-to-end webtests are generally written in the industry; however, I have a very strong resentment towards the way these tests are written. I have a strong resentment towards tests that are written imperatively to begin with, and I'm yet to put my intuition into theory, but I should at some point so that it can be evaluated objectively. What I believe it could boil down to is that tests have to change, and a concise declarative description of what to expect is easier to edit for new changes than an imperative set of steps that needs to be fully understood in order to modify just a portion. Creating a language where a declarative description can express a concept in the first place requires a lot of thought and a full understanding of all the interactions of the component we are describing. This being a prerequisite in itself should already increase the chances of getting better API design.

I guess what I have accidentally argued for, then, is for the same JSX templates I use to render the results of commands to also be used for testing the expected result of commands. This isn't really what I was going for, but it is true. Going back to the webtest analogy, this would be like using your react components as the expectation part of your test assertions; you tell me if that would be bad.
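To make that concrete, here's a toy sketch of the idea. The renderer and event shape are entirely hypothetical stand-ins for Draupnir's JSX templates:

import { deepStrictEqual } from 'node:assert';

type RenderedEvent = { body: string };

// A toy template standing in for the JSX templates used to render
// command results.
function renderBanResult(userID: string, roomCount: number): RenderedEvent {
  return { body: `Banned ${userID} from ${roomCount} rooms.` };
}

// The declarative assertion: whatever event the harness captured from the
// management room should equal a fresh render of the expected result.
function expectBanRendered(
  captured: RenderedEvent,
  userID: string,
  roomCount: number
): void {
  deepStrictEqual(captured, renderBanResult(userID, roomCount));
}

expectBanRendered(
  { body: 'Banned @spam:example.com from 3 rooms.' },
  '@spam:example.com',
  3
);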

So in order to figure this out, we should ask what we are really testing for when we test commands:

  1. The side effects of the command.
  2. That the command works end-to-end.
  3. The result of the command renders correctly.

Side effects ideally should be covered by unit tests. You'd expect most application logic to be implemented somewhere distinct from the user interface. You'd also then expect application logic to be testable without invoking procedures indirectly through the user interface. This is something that is already near-universally accepted.

If we're testing the side effects of a command, and thus the implementation logic of commands, then testing "that the command works end-to-end" in reality means testing that the "integration glue" around the command is working. Integration glue is the code that sticks user interface to application logic. If we take this further, then the integration glue doesn't need to be tested for each command. The glue just needs to be covered so that we can be confident that the glue will work for every command. To give some specific examples of what this glue would be, here's what I see as integration glue in Draupnir:

  • Extracting command context and parameters from a matrix event and then calling command specific implementation code. Note, this isn't the same as parameter parsing; we're talking about code that invokes the parser, takes the parse result, and calls the appropriate command designated by the user's message.
  • Taking the result of command logic and calling the appropriate renderer, to generate a render result.
  • Taking the render result, and sending it to Matrix.

This is all glue that should be common to commands, and if it isn't, then you need to write a framework that makes it so first11. This means that we should only need to cover this glue code in a dedicated test, rather than covering glue code through brute force by end-to-end testing every single command to make sure that we hit all our glue.

If we can then break the dependencies of our rendering code and our command code from this same glue (and again this should already be the case), then we can write unit tests specific to those and reduce the complexity of our tests.

The only part of this that I'm hesitant about is renderers, since you might want to manually inspect their output. I don't think that blocks the idea though, since it's really a tooling issue. The tests for renderers just need to be run with the option to export the results (to Matrix) for review.

Having written this analysis, I'm now curious about what the value of end-to-end web tests is. From this point of view, it seems that the only reason to use them is when it's impossible to break free from glue, or impossible to find a concise way to test the renderers without including the entire web stack. This should quite obviously mean that there has been a failure to write modular programs12.

MPS: Capability providers for protections

We have established that we need to break the dependencies of command-specific code in order to make good tests, and also modify the Matrix client interface so that we can inform the ClientsInRoomMap of a room join. We now need a way to break the dependency on a matrix-bot-sdk MatrixClient (and Matrix itself), which is by far the biggest and most complicated stateful dependency.

There is already a precedent for fine-grained capabilities being broken down from the different responsibilities of Matrix clients, which is the widget API. The widget API doesn't provide the same capabilities that a client has; for example, there's no capability for joining matrix rooms. The widget API is also oriented towards declaring the degree of attenuation, for example restricting access to room state to a specific event type from a specific set of rooms. Which is good, but we don't need an API like that immediately, and we would also still need some conceptualisation of a client to build that API upon.

This is way too much work to attempt all at once, and it would be unwise to jump in and change everything without having some way to evaluate the new API. So to start, I decided to focus on implementing the capability providers for protections with the new client interface. Previously in MPS, we hadn't had time to think about how to implement capability providers, so there existed just one capability provider, the BasicConsequenceProvider.

This was a monolithic interface containing everything that the basic protections would need, and not a very good one either.

/**
 * This has to be provided to all protections, they can't configure it themselves.
 */
export interface BasicConsequenceProvider {
  consequenceForUserInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    user: StringUserID,
    reason: string
  ): Promise<ActionResult<void>>;
  renderConsequenceForUserInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    user: StringUserID,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForUsersInRevision(
    protectionDescription: DescriptionMeta,
    membershipSet: SetMembership,
    revision: PolicyListRevision
  ): Promise<ActionResult<void>>;
  consequenceForServerInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    serverName: string,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForEvent(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    eventID: StringEventID,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForServerACL(
    protectionDescription: DescriptionMeta,
    content: ServerACLContent
  ): Promise<ActionResult<void>>;
  consequenceForServerACLInRoom(
    protectionDescription: DescriptionMeta,
    roomID: StringRoomID,
    content: ServerACLContent
  ): Promise<ActionResult<void>>;
  unbanUserFromRoomsInSet(
    protectionDescription: DescriptionMeta,
    userID: StringUserID,
    set: ProtectedRoomsSet
  ): Promise<ActionResult<void>>;
}

The way this has now been split up is as follows. A capability interface is described with a schema like so:

export interface UserConsequences extends Capability {
  consequenceForUserInRoom(
    roomID: StringRoomID,
    user: StringUserID,
    reason: string
  ): Promise<ActionResult<void>>;
  consequenceForUserInRoomSet(
    revision: PolicyListRevision
  ): Promise<ActionResult<ResultForUserInSetMap>>;
  unbanUserFromRoomSet(
    userID: StringUserID,
    reason: string
  ): Promise<ActionResult<ResultForUserInSetMap>>;
}

describeCapabilityInterface({
  name: 'UserConsequences',
  description: 'Capabilities for taking consequences against a user',
  schema: UserConsequences,
});

This is the interface that protections that need to take actions against users will write themselves around, in order to make the functionality pluggable. This includes allowing the protection to run with a complete stub, so that it won't actually take action against users13.
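For example, a stub might look something like this sketch, assuming the types from the snippet above are in scope, that an empty map is an acceptable ResultForUserInSetMap, and ignoring whatever metadata the Capability base interface requires:

const StubUserConsequences = {
  async consequenceForUserInRoom(
    _roomID: StringRoomID,
    _user: StringUserID,
    _reason: string
  ): Promise<ActionResult<void>> {
    return Ok(undefined); // succeed without contacting Matrix
  },
  async consequenceForUserInRoomSet(
    _revision: PolicyListRevision
  ): Promise<ActionResult<ResultForUserInSetMap>> {
    return Ok(new Map() as ResultForUserInSetMap);
  },
  async unbanUserFromRoomSet(
    _userID: StringUserID,
    _reason: string
  ): Promise<ActionResult<ResultForUserInSetMap>> {
    return Ok(new Map() as ResultForUserInSetMap);
  },
};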

In order to implement the interface, we describe a capability provider using the API below:

describeCapabilityProvider({
  name: 'StandardUserConsequences',
  description: 'Bans users and unbans users.',
  interface: 'UserConsequences',
  factory(_description, context: StandardUserConsequencesContext) {
    return new StandardUserConsequences(
      context.roomBanner,
      context.roomUnbanner,
      context.setMembership
    );
  },
});

This provides a consistent way to instantiate capabilities by using a factory: as long as the capability provider's factory returns a capability that matches the named interface, everything will work. The context object is what provides us with this consistency, and the factory acts like glue code to pull dependencies from the context and set up the capability for us.

As you can see, the context object has individual capabilities for the client responsibilities of banning and unbanning a user from a room. We also give the SetMembership for the ProtectedRoomsSet so that the StandardUserConsequences can figure out who is joined to the room.

The context object

We should probably explain what the context object is. Basically, the protection suite can't be made aware of the context of its use. For example, we couldn't include the interface of Draupnir in MPS, because that would break modularity and the library wouldn't be useful to other software. But we still need protections, defined by the library consumer, that do depend on the context of their use. For example, the ban propagation protection currently needs access to Draupnir-specific code to make prompts. The way we fix this is by allowing the library consumer to provide an arbitrary context object to the ProtectedRoomsSet, which their protection descriptions can either use as a dependency or destructure dependencies from. In Draupnir MPS, we use the Draupnir instance itself for this.

Capability context glue

Now that we have decided to change the client interface, and capability providers can be defined to destructure dependencies from a context object, it is possible to define a standard implementation for the capabilities required by the MemberBanSynchronisationProtection and the ServerBanSynchronisationProtection. Consider the description for the StandardUserConsequences capability provider earlier.

export type StandardUserConsequencesContext = {
  roomBanner: RoomBanner;
  roomUnbanner: RoomUnbanner;
  setMembership: SetMembership;
};

When we come to use this capability provider in Draupnir, unless Draupnir has these properties directly accessible, the factory method for the StandardUserConsequences capability won't work. We therefore need some kind of glue code that creates this context for us from the actual consumer-specific context, which the consumer defines ahead of trying to call the factory. This is what that looks like for the StandardUserConsequencesContext.

describeCapabilityContextGlue<Draupnir, StandardUserConsequencesContext>({
    name: "StandardUserConsequences",
    glueMethod: function (protectionDescription, draupnir, capabilityProvider) {
        return capabilityProvider.factory(
            protectionDescription,
            {
                roomBanner: draupnir.clientPlatform.toRoomBanner(),
                roomUnbanner: draupnir.clientPlatform.toRoomUnbanner(),
                setMembership: draupnir.protectedRoomsSet.setMembership
            }
        );
    }
});

This little piece of glue code will get called instead of the factory for the capability provider, setting up the call to the factory with the correct context.

Capability renderers

Next we need to consider capability renderers, which are code that helps the user log and keep track of when a capability is used and by which protection. This code gets called around the individual capability methods when they are called by the associated protection, so that the side effects can be rendered, in Draupnir's case, to the management room. This could equally be some audit log in an application that isn't Draupnir. It's at this point you really wish you had some of that Kiczales aspect-oriented goodness so this wasn't some kind of second class concept, but that's a digression.

Here's the file for the capability renderer for StandardUserConsequences. There's quite a lot of noise for what is really glue code; I'm not happy with it yet, but I can't complain too much. There isn't much code here, and it's not exported anywhere either, so it should be ok.
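Since the file is linked rather than inlined here, this is a rough sketch of the shape such a renderer takes: it wraps the real capability and reports each use to the management room. All the names are illustrative:

class RenderedUserConsequences {
  public constructor(
    private readonly inner: UserConsequences,
    private readonly report: (message: string) => void
  ) {}

  public async consequenceForUserInRoom(
    roomID: StringRoomID,
    user: StringUserID,
    reason: string
  ): Promise<ActionResult<void>> {
    this.report(`Applying a consequence for ${user} in ${roomID}: ${reason}`);
    return await this.inner.consequenceForUserInRoom(roomID, user, reason);
  }

  // ...the remaining UserConsequences methods are wrapped the same way.
}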

The capability set

The final piece is declaring the interfaces of the capabilities that are required for a protection, and giving those dependencies to the factory.

describeProtection<MemberBanSynchronisationProtectionCapabilities>({
  name: 'MemberBanSynchronisationProtection',
  description:
    'Synchronises `m.ban` events from watch policy lists with room level bans.',
  capabilityInterfaces: {
    userConsequences: 'UserConsequences',
  },
  defaultCapabilities: {
    userConsequences: 'StandardUserConsequences',
  },
  factory: (description, protectedRoomsSet, _settings, capabilitySet) =>
    Ok(
      new MemberBanSynchronisationProtection(
        description,
        capabilitySet,
        protectedRoomsSet
      )
    ),
});

Annoyingly, this is where TypeScript starts to fall down a little, since we have to both give the interface to the type parameter for the capabilitySet argument AND give a description object of that interface, so that the code that calls the factory can create the capability set for us from the context. If it isn't clear, the describeProtection utility will take the capabilityInterfaces description, use the value for each property as the interface name, and go find that for us. The ProtectionsConfig can later be made to allow each matching capability to be configured, and if no capability provider is named, then the associated default named in defaultCapabilities on the protection description can be used instead.
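As a sketch, the resolution step implied here might look like this (hypothetical names, not MPS's actual code):

type CapabilityProviderDescription = {
  interface: string;
  factory: (context: unknown) => unknown;
};

function resolveCapabilitySet(
  protection: {
    capabilityInterfaces: Record<string, string>;
    defaultCapabilities: Record<string, string>;
  },
  configuredProviders: Record<string, string | undefined>,
  findCapabilityProvider: (name: string) => CapabilityProviderDescription,
  context: unknown
): Record<string, unknown> {
  const capabilitySet: Record<string, unknown> = {};
  for (const [slot, interfaceName] of Object.entries(
    protection.capabilityInterfaces
  )) {
    // Prefer the configured provider, fall back to the protection's default.
    const providerName =
      configuredProviders[slot] ?? protection.defaultCapabilities[slot];
    const provider = findCapabilityProvider(providerName);
    if (provider.interface !== interfaceName) {
      throw new TypeError(`${providerName} does not provide ${interfaceName}`);
    }
    capabilitySet[slot] = provider.factory(context);
  }
  return capabilitySet;
}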

Gnuxie: The awakening of a Marewolf

It's extremely edgy, I know, but that's cool, and cringe is dead16.

Soft failure and spam

Since the start of the year, spam attacks have been a popular choice on the menu. We had an issue with this at the end of January, and the Draupnir room was intentionally targeted, because whoever was upset probably didn't understand who we were. Regardless, the topic of soft failure came up in our Matrix room.

By mid February the climate had changed, and there were waves of attacks in the ecosystem where abusers would spam CSAM to public rooms for whatever reason. Now obviously this is bad, but Matrix is somewhat unequipped to prevent knock-on effects. It is in attacks like these, where a user joins, sends some messages, and then gets banned, that we see something called "soft failure". What happens is that even though a moderator can redact events sent by the abuser, there are some events the abuser sent that the moderator can't see in order to redact them in the first place. This is because if your spammer is on a remote server and you ban them, there is a gap in the DAG between when the admin issues the ban and when the remote server receives the ban and stops the user sending messages. These are valid events; Matrix is supposed to, and used to, work such that the admin would see them, because network latency and connectivity issues cause conflicts like this all the time, and Matrix is designed to be resistant against them.

The problem with this arrangement is that if you have a fancy homeserver, you can create any event at any point in the DAG retroactively, by referencing "stale state". To all the participating servers in the room, the sudden existence of these events referring to prior state is indistinguishable from old events they just haven't heard about yet. So if you are a malicious server, as long as you were once joined to a Matrix room, you can append anything you want to the DAG, so long as you can get a network connection to another server in the room17. To stop this from happening you need to fundamentally redesign event authorisation within Matrix, which is something that Matrix's leadership has avoided, because it is a lot of hard work.

Instead, a bodge was introduced, which as you can guess is soft failure. Now, whenever a homeserver receives events that refer to authorisation events that are not a part of the current state, it will "soft fail" them. The receiving server accepts the event but won't show it to clients via /sync. Additionally, the receiving server will try to pretend that it never saw the event, in order to stop its propagation to other servers. This is pretty futile though, because if it received the event in the first place, then it's very likely that the event has already been accepted without soft failure by some other server in the room, and will be referred to as a forward extremity by that server.

In the specific case where soft failure hits us after banning a spammer, the spammer's server (which in the current climate is almost always acting complicitly rather than maliciously) usually has time to send messages to other servers before those servers also receive the ban. Meaning that when the admin tries to redact the spam, a lot of it gets left behind from the perspective of many in the room. This is an issue that has been documented by many, notably heftig in https://github.com/matrix-org/synapse/issues/9329 and subsequently https://github.com/element-hq/synapse/issues/9329.

Draupnir's doctrine

When we discussed this within the Draupnir room on January 31st, we came to the conclusion that the best thing for us would be having a way to send soft failed events down /sync, and that this should be something accessible to all clients, so that room admins would be able to find soft failed spam and redact it. Obviously the implication is that Draupnir or other tools would be able to do this for them. We agreed that implementing this as some kind of admin API would leave a bunch of room admins out. I also wrote this up in the issue, explicitly making reference to Mjolnir:

For some additional context, the reason why this is a problem for Mjolnir is because as a room administrator, NOT necessarily a homeserver administrator (but it is also a problem for server admins), mjolnir can't see the soft failed events. A solution that would work would be allowing a client to access soft failed events from both /sync and /messages. It will be ok for these events to be given in a redacted form, provided it is very & immediately obvious to a client whether these events have an associated m.redaction event or not. So that Mjolnir can then see and issue redactions that will be seen by other servers. This should ideally not be a Synapse administrator API, since public Matrix homeservers will have room admins who are not admins of the homeserver.

In private, Cat talked with me about how the appservice API would need to push these too for the system to work with draupnir4all. I thought that this discussion had happened in public around the time, but I can't find evidence of that.

MSC4104: Soft-failure-be-gone!

Subsequently, on February 19th I proposed to remove soft failure entirely and introduce a new mechanism that could be used to canonicalise the room history. You should read the introduction here, and if you like that, you can read the full pull request here.

This would eliminate the need for any special hacks to get a consistent view of soft-failed and non-soft-failed events, but it would require fundamental changes to event authorisation, which, again, are almost always avoided in Matrix. I don't really care though; it's still important to give these ideas life just in case, and also to develop and inspire other ideas.

Even ideas about new and better protocols.

Irrelevancy, Confusion, Megalomania

A few days later Cat brought up the soft fail issue in the context of a complaint about extensible events. Essentially he was asking if linked media could be introduced before extensible events, as extensible events will create a nightmare for reactive tooling, since there will be many new ways to inject media into events that we need to account for. Cat mentioned that soft failure complicates an already bad situation.

Without the context that we have, Travis states a preference for sending events to moderation tools via the appservice. It later turns out that he's somewhat blindsided by how prevalent the problem is. At the same time he brings up a problem with syncing clients whereby there can be gaps in the timeline. He theorises that this would lead to Mjolnir and Draupnir being unable to redact events, similar to soft failure. And he quite strongly uses this to advise against our favoured solution of providing an option to show soft failed events to both clients and appservices. I push him on this, and tell him that this does not happen, because Mjolnir has always called /messages to fetch events to redact. He then drops that line of argument and switches to a problem that does exist, but is much less severe: that when Draupnir or Mjolnir has been down, it won't process commands sent while the bot was down. However, this is also a problem that is common to both the appservice and bot deployments of Draupnir, as we don't have the capability in the appservice to save events while a managed Draupnir is down or crashed. Only an appservice that can store events, or crash in its entirety when a client experiences a problem, can do this. And they shouldn't really be crashing often in the first place18. The following morning Travis would publish an MSC to show soft failed events only to appservices: MSC4109: Appservices & soft-failed events.

While this is all ongoing (within the same discussion), Travis slowly starts depending on another argument, and relies on it more as we go forward. For some context, we acknowledge deeply that Draupnir does not have the ideal interface. It's never been the intention to force moderators to use commands to carry out the most basic tasks (banning and unbanning from a set of rooms). Additionally, Draupnir still does not offer any kind of support for spaces, and it would greatly ease onboarding if it did. It's also always been the intention that Draupnir4all could be turned into an appservice that closely tracks a space, given that we can overcome the same problems that have led to the development of MPS. We want there to be as little conscious manual intervention from users as possible.

Basically, Draupnir doesn't align with the foundation's vision for the future, for whatever reason. Not in this form, not in MPS. Possibly they want to try to roll back to a view where "the homeserver does it", but we know that homeserver code bases are hot garbage and that this would be a major compromise to modularity19. We don't even see how they arrive at their decisions, because it happens in private. Even now, with the announcement that the SCT will be focusing on T&S work (albeit because of changes to legislation in the UK and elsewhere), we're assured that these things will be worked out with a series of working groups20, but time drags on and we have no idea yet. Of course, I have to give them a chance, and I don't think I have here yet. The worry is that they will continue to predetermine the design and direction in private and use the working groups to compromise on them. Which is backwards and a very exhausting approach; the meta discussion needs to be public too.

When I reiterate these ideas for Draupnir, I'm told that they don't align with the vision that the foundation has. I'm again told that Mjolnir was stop-gap tooling. All of these things are a giant omen to give up; my attempts don't feel respected, and sometimes it even feels as though they are being undermined.

I'd like to know what you all think about this, because after the discussion on the 23rd of February, I was really upset. Obviously this stuff feels more personal to me since I'm the one wasting my time21, and that's my problem.

NOTE: Just want to say that I'm alright. I felt that way at the time and when I was writing this a couple of weeks later. But right now I can't say I feel like I care enough anymore, at least not right now. There's enough discussion throughout this post about what the real problems are and how they can be fixed, so focus on that rather than some platitude about Draupnir.

Mjolnir

I want you to read the following exchange as a joke between two friends. As it captures the sentiment quite well.

<TravisR> Mjolnir was actually meant to be a proof of concept and nothing more but then we deployed it to 3 servers and forgot to tell people not to run it 😛

<Nico> Well, you actively told people "We have mod tools! Look! Mjolnir!"

<TravisR> this was after we realized it was too late.

What kind of sucks about this is that in this alternate reality, which is apparently preferable to the foundation, Mjolnir would never have been released and the community would have been left with nothing at all22.

These are some harsh words, but this is also why I feel like the foundation has a policy of developing for an internal context first: they don't want to again be left with the legacy left by Mjolnir. Saving their own drowning community is too inconvenient, because there will be an extra body on the ship with agency of their own to consider. Which would seem horrific if you didn't acknowledge that the reason why is that the foundation's own concerns are too much to handle. Though I'm not sure that this argument even makes sense, since Mjolnir is now, and was, with the exception of its revival in 2021-2022, given the bare minimum of attention to just ensure that it still functions. Is it easier or harder to ignore the voices that have been and gone if Mjolnir does or doesn't exist?

Footnotes:

1

A timeline gap is where a large number of events occurred between the previous and current call to /sync, and only the most recent events have been returned https://spec.matrix.org/v1.9/client-server-api/#syncing. This happens most frequently when starting a client up after being away for the day.

2

This is not quite as it seems though: it is true for policy lists, but room state isn't used from a cache everywhere in the code base. That's because the infrastructure does not exist for us to cache room state like that, as mentioned in our previous update where we introduced revisions and the clients in room map.

3

This is because they can be stale events sent from a server that was knocked offline, or a result of any number of DAG oddities.

4

Well, this isn't actually true, but in terms of your model of the state it will be. For example, it is possible for this event to also be something stale provided by a previously disconnected server. But we'd have to wait for that server to refer to or send us any superseding event anyway for our server to be able to pull it in. So it's a pretty safe thing to do, considering that there won't be any conflicts.

5

A prominent blue matrician has since felt challenged by this statement and we had a nice discussion about this, where he informed me of how the state portion of sync interacts with timeline gaps.

6

Events with fields that are protected from redaction will be classified as Modified if superseding events are found.

7

Not all homeservers append the previous state to unsigned though.

8

Technically we're being a bit cheeky here by defining the schema for an event we don't control, since it is Mjolnir's event. However, the version is scoped to Draupnir, and so long as we only use it to disable/enable protections and maintain compatibility with Draupnir, we're free to do this without causing issues.

9

Which is why there are lots of competing validation frameworks, which I honestly am not sure how to feel about. We've been using TypeBox in MPS, which has been designed close to JSONSchema. That is important to us, since this is how Matrix events are defined within the specification. However, you will be aware of the phrase "don't validate, parse", which exists because not parsing will force you to use validation checks like sprinkles all over your code base, because you never set up an entry point where you parse data and then that is it. The problem with keeping Matrix events in their true JSON structure (or at least doing so without wrapping them in some other object and then leaving the JSON structure as accessors10) is that I have to either duplicate the checks or lie to the type checker whenever I do property access. The way I have TypeBox set up, the type system isn't smart enough to recognise that 'm.room.message' should have the room message type. And it can't, because that would be unsafe. In MPS we have to allow for invalid events so that moderators can redact them if their homeserver is silly enough not to soft fail them. Therefore I refer you to the solution in this footnote10, and adding a special wrapper type for an invalid event.

10

Though I'm not sure this would work, since your messages will clash with possibly valid property names on the event. The real way to do it is to wrap the event in a specialised wrapper for the event type, and then offer the raw JSON or raw content in addition to specialised accessors for that event type. This is complicated though by extensible events, which can annotate any event with media, and make message scanning a lot harder.

11

I don't need to seriously explain to you why, do I? If you don't, you are going to waste your time writing a bunch of tests for something that could be tested just once in one file. It's exactly the same problem we're trying to fix here.

12

And I don't want you to blame that on the web; the web is good and doesn't stop people from writing the tests that I'm describing. What stops them is a lack of investment in the libraries and tools that would let them write those tests.

13

And this is also a much safer way to maintain the "no-op" feature from Mjolnir, since all capability providers that take consequences can just be replaced with stubs. If you've ever taken a look at the code base, you'll see a lot of special casing around calls to ban or redact etc. by wrapping them in a check for config.noop. Which developers of Mjolnir and protections have frequently forgotten about entirely.

14

Though I have started copying the share link. This still violates the license though. I think reuse will let you do this in a compliant way.

15

People keep pretending they "know" what this is. I don't believe anyone has a real idea; it's just their idea. They're probably all simultaneously true.

16

And forever may it remain dead. Vylet Pony is amazing you should really go listen to some tunes. Can opener fish whisperer is dead easy, but anything since mystic acoustics is absolute game.

17

Conventionally this is prevented by server ACLs, but these can be bypassed. While it might appear that the propagation of soft failed events is prevented in the specification by recommending that servers exclude them from forward extremities, replication is unavoidable if just one participating server refers to a soft failed event just once. And this can happen either intentionally or unintentionally (I somewhat doubt all homeservers enforce this recommendation, and of course the malicious route is still open).

18

This could also be solved by working out an API for bot interactions. Draupnir would then only have to fetch the recent interactions in the management room and respond to all interactions that haven't been completed yet. But I don't trust interactions to be proposed as an idea in a generic enough way to allow for much freedom. For example, the command syntax in Draupnir is deliberately designed such that the client could be aware of the commands, their documentation and their arguments, even the presentation types that each individual argument accepts. To the extent that a command to ban a user could highlight all users on the screen when invoked, and allow them to be selected with the mouse. This is copied from presentation style interfaces from a bygone lisp machine era. But there's no way anyone wants to be as imaginative as that because it's bloatful etc., what a pity.

19

I'm editing this on the 16th of March, and this came up yesterday in the foundation room. Thib explained his thoughts on this, which are probably shared within the foundation. Essentially, the concern is about consistency vs freedom and modularity. If the homeserver is required to manage spaces, synchronise policies etc., then it is more likely that client UX will be consistent. Which is a fair argument, but it does restrict the freedom that we have. I did try to remind the room that so far homeservers have failed to implement many features consistently.

20

Honestly, the announcement of these groups alone is probably because of the engagement Emma, Cat and jjj have been giving in the rooms around the foundation in the context of T&S. Though I do worry about how much time we as a little group (and I'm not speaking on behalf of the others; I don't know how they feel about this yet) will be able to commit to this process. The loose ad hoc association is both a strength and a weakness. We'll see though.

21

Though clearly, I do this stuff regardless of what spokespeople, community figures, or what have you, say. They don't have some silver bullet that's gonna make all the work I've been doing look like complete shit. It would take some serious organisation and coordination for something like that to materialise, and also the fundamental reworks that are avoided. All stuff that I actually want to happen. The worry is that they won't happen, and we'll get more damaging hacks like soft failure and server ACLs. Or specification changes built to empower tooling that doesn't exist yet, that disadvantage existing tooling or workflows without their consideration.

23

I have been hesitant to post acknowledgements because I don't want them to be seen as an endorsement of what I have to say, but that feels wrong. You can find the sponsors within the links though.