Adventures in Mastoland

We strongly advise you to read up on the myriad of failed experiments in this space.

Great! Where can I find them?

…hm?

This is a retrospective post about my experiment Searchtodon, an attempt at building a privacy conscious personal timeline search tool for Mastodon. I’m intentionally vague about people and projects relevant to this story, to protect the innocent. Titles used are lighthearted pop-culture references and do not semantically reflect on the content or my opinion.

Last updated 2023-01-16.

Introduction

I’ve been online since ~1997, AOL Chat, Forums, AIM, jabber, IRC, the usual, and then Twitter since 2007, did a first Mastodon exploration in 2017 and finally got swept over fully in November 2022. By all accounts, I’m new here, but none of this is new to me.

A few years ago, I built a toy Twitter web client for myself for experimenting with getting more out of my timeline: e.g. don’t show RTs, but show my top 10 RTs for the last 12/24 hours. Something similar I hear Twitter Pro/Blue/whatever has got now, by way of acquiring nuzzle, with top links posted from your timeline.

I’m not a very good frontend web developer, so this didn’t go far, but one thing that this left me convinced with is that there is a distinct lack of things you can do with your timeline. I put ~15 years into meeting lovely folks around the globe and connecting with them over Twitter, but the default experience does not allow me to get the most out of it. E.g. say I follow more people than I can possibly read all posts from, it’d be great if there was a “slow” section, where people that rarely post are listed, so I don’t miss any of their posts while other folks are more busy, or just 12 hours worth of time zones away, without me having to put that together as a list manually.

As an aside, I know I could follow fewer people, but let’s be real, that’s not gonna happen. I have about a dozen of other ideas what useful things can be done with your carefully curated timeline and I believe there is an opportunity for a lot of fun and useful things that can be done with using someone’s social timeline as the data source and better connect people that way.

One of the things I’ve missed on Twitter for the longest time is being able to recall what I’ve seen before. While Twitter has/had search, to my knowledge there is no way to filter by tweets that have shown up in your timeline. This is honestly baffling to me, and I don’t see this getting fixed any time soon. With my custom client, I could have built that, but never got around to, mainly because I feared it not being worth it (the other work having been a nice learning exercise), because Twitter could take it out any time, as shown with the native clients’ debacle just this week. — I’m not interested in building anything for that platform for the time being.

Into the Mastoverse

With taking the plunge into moving most of my social online activity to Mastodon in late 2022, I am intrigued by the possibilities of an open platform. I believe the biggest long-term impact this has is finally getting the social graph into the open, that’s amazing. But with my small demo on Twitter I knew it was feasible to bring some of my ideas over here and I don’t have to worry about losing access because of the whims of the centralised owners of a commercial platform. This is very very appealing for a lot of tinkery-folk like me. If you are reading this, you are probably one of them.

Since Ivory timeline filtering is already mature enough to cover the needs of the client I wrote for Twitter (no boosts, no replies), I had no need to port that over, and instead thought I go after the next most pressing thing: recalling things I have seen in my timeline. In my case, I follow a lot of techies in very different communities (Web dev, JavaScript, Erlang, Python, Rust, macOS, FreeBSD, Databases etc.) and as a result I sweep up a lot of information that I’m generally interested in, but only occasionally dip into more deeply. The effect here is that when I know my macOS folks talk about certain issues for developers a new OS version brings, a while later, the JS community starts running into this issue, and now I can bridge that gap and can help out. But I can’t remember all the little details and references about everything, so I need a way to find things that I’ve seen before.

Yes, there are favs and bookmarks, but if I knew something was gonna be important, I’d have filed it away already otherwise, so that’s not really helping. Plus not all instances enable search on bookmarks.

I fully realise that many people have none of these problems, or are fine using bookmarks, and that’s great, but it doesn’t solve my problem, and I know now that I’m not the only one.

Act Two

I’ll spare you the details of technically getting to a point where I could search my timeline, but let’s just say a little service that runs as an OAuth app dumping plaintext files into a directory on my Mac and then using it’s built-in Spotlight search was done in about an hour or two.

However, it was clear that a “runs on your Mac” solution doesn’t work when your computer is asleep, and with the somewhat exasperating default 400 post timeline limit, not being able to catch up things properly would defeat the usefulness of this. — There is an argument to be made for: well if you are asleep and miss posts, you didn’t see them, so you can’t know to search for them. And while that reasoning is certainly correct, it misses that folks use multiple clients to access Mastodon at different times and that I might see a post in Ivory before going to bed, but after I closed my computer for the day. — I now concede however, that a “really only the posts I’ve seen” indexing would be preferable, but there is no cross-client standard for reporting that anywhere, so I’m not hopeful this will get anywhere. The only other option is: index your entire home timeline.

Attentive readers will point out that at the time of all this, I was on an instance who’s ToS state:

Content on this instance must not be archived or indexed wholesale by automated means by any user or service. Active users may export their data through the export functionality or the API.

When reviewing the feasibility of my project I read “wholesale” as ”everybody’s all the time” and “export their data through […] the API” as “keep a copy of my timeline” as covered as allowed. — I know now that I was mistaken in my, granted, optimistic reading of this, and that I should have reached out to the admins there to double check.

Back to November 2022: when first signing up for that instance I distinctly remember that one of the top configuration options for the profile after sign up was “Opt-out of search engine indexing” and it was checked by default for me (from here on out, I’ll refer to this feature as the noindex flag). I thought that it was really nice that this is such a prominent control feature empowering users and choosing a safe default.

This was my first mistake.

It turns out, not only do few people seem to know about this, almost nobody made a conscious choice here. And if I read things correctly, this flag does not federate, which seems like a tremendous oversight.

I reached out to a few folks that I saw have this enabled and discussed whether what I was building was considered “a search engine” (I didn’t think so), and I learned that a) there are folks that would continue to to use the noindex flag to not get indexed by Google, but they’d be fine being part of a “scoped to a user’s timeline and no one else” type of search (I too fell into this category at the time), and b) on properly considering this setting for the first time, and while agreeing that my thing wasn’t a Google-like public search engine, they’d expect the noindex flag to cause their posts to be excluded. — I did fear that this would diminish the usefulness of the search tool, but eventually came around to this point of view.

While talking to more folks, I was introduced to the #nobot hashtag that accounts use to indicate they don’t want to have anything to do with any bots, which I made to behave like the noindex flag, and I added #nosearch in case folks wanted to be more fine-grained with this.

Threats

Next, I needed to consider if I was adding new vectors of abuse to Mastodon (or strengthening old ones). My reasoning went thus:

if I was a bad actor, my 1-hour-index-to-spotlight experiment would give me all the benefits without anyone knowing about this. I am sadly convinced this is already happening, it is just so simple.
running a custom instance gives you admin rights to all posts that are being read on that instance and those posts live in a database that supports searching. And the ecosystem in general seems to fine with folks running their own instances.

I still don’t think that anything I did would make any of this easier for bad actors or worse for the community.

While considerng other prior experiences that could inform this, I thought of Twitter’s third party thread unrollers that rehost tweets with advertising next to them. They were genuinely useful before Twitter fixed its rendering of threads, but it was annoying to me. Eventually, the services added a feature where the thread-poster could block the unroll-account and unrolling would no longer work. While annoyed that I had to do that four times, I could live with the power balance here.

Mastodon isn’t Twitter, but I also thought this set enough of a precedent that honouring noindex, #nobot and #nosearch would be a decent enough equivalent.

This was my second mistake.

Prior Art

One of the biggest volume of feedback on this was that I should have looked at the search projects that came before. If those folks had taken the time to read the associated website, they’d have seen that I did, alas.

My calculus was:

noindex & friends are well established
no additional threat vector is introduced
indexing is limited to (mostly) what a human has seen
and searching is limited only ever to that one human
open to feedback & suggestions, especially critical
and promise to shut it down with enough negative signal

This is substantially different from the other “fediverse crawlers” (as later confirmed by some of the most fierce critics) that I saw, technically and in framing. I didn’t think this even qualifies as a crawler, but I did get some early feedback that folks would consider this a crawler, but I was optimistic that this would at least bring out valuable feedback for future iterations.

Finally, for the quick demo to validate the user experience and basic functionality to work, I put this together as an OAuth app that runs as web service. That way you can just sign into the UI and start using Mastodon as before, without having to run anything yourself. To give this an honest shot, I believed this needed to be easy to get started with.

A consequence of this is that, for the multiple hours this experiment ran, all indexing happened on one of my servers. Given the framing of this as an experiment with a direct goal to inform operational overhead for Mastodon operators (more on this below), and given that I generally know what I’m doing running web services, I thought that was a valuable trade-off to make for now.

That was my third mistake.

The Experiment

From the get-go, I framed Searchtodon as an experiment with three hypotheses to validate (or not). The quotes are from the Searchtodon website.

1. User experience: can private search for Mastodon be done in a functional way and will folks find this useful?

First, quantities.

The post announcing Searchtodon has received more interactions than I could have imagined:

Comments: 129
Boosts: 700
Favs: 938

The number of signups at the end of the experiment:

While not all boosts should be counted as support, as some folks are just interested in the experiment, this still proves to me that a feature that lets you search through your home timeline would be popular enough. I hope the folks at Mastodon see this as encouragement in this direction.

A lot of acceptance surely comes from using an excellent open source web client, as the “Home Search” addition done by me was minimal, and limited in functionality.

But numbers aren’t everything, what about the qualitative angle? — From the folks interested in the functionality, the feedback was predominantly positive. It didn’t work for a few folks, and their feedback was understanding of the early stage of the project.

A good number of people didn’t quite understand the difference between this and what Mastodon already supports. I could clearly have explained things a little better.

Finally, some folks wrote in that they have no need for this as they don’t have this problem, or the regularly provided tools are enough for them. This was entirely expected. It is good to keep in mind that this is not something “for everybody” when arguing for its general acceptance. But enough people liked it.

Hypothesis one: confirmed. Enough people find this useful.

2. Operational feasibility: index data costs storage, search costs compute, etc. Even if the Searchtodon stack is slightly different from Mastodon, it plays in the same ballpark. In case premise #1 comes back positive, we can learn what additional resources will instance operators need to provide private search.

I’m happy to report that things look entirely feasible.

The way things worked, I stored the full post JSON with inline account and reblog (with its own reblog.account) as individual entries in CouchDB. This made the indexing and retrieving side extremely simple, but duplicated quite a bit of data. If two users received the same post, that was saved twice in strictly separate database files, to make account deletions easy.

From that base storage, I created Lucene search indexes partitioned by user timeline.

There are three obvious ways to improve on this:

deduplicate data, all those account records could have been exfiltrated into their own representations.
deduplicate posts across all users.
or forego the additional JSON store by side-caring this to Mastodon directly and only add the search indexes (Lucene or Postgres should both work. Given everything else, this is the probably best way forward).

That said, here’s the data usage as implemented:

~ 1GB JSON + index storage per 100 users per 24 hours. The optimisations outlined above could reduce this need consevatively by 50% – 80%
used ZFS block level lz4 compression, with compressratio ranging between 3.5x – 4.3x . One could go with a higher compression level with, say gzip-5 (higher gzip settings are not worth the CPU/space trade-off). Additional space savings from experience with gzip-5 could be another ~4x.

Hypothesis two: Doable. I’ve since found some demo implementations using Postgres for search in Mastodon, but no serious take-up as far as I can tell. Worth watching though. — Running this as a free-forever not-part-of-Mastodon service would likely not be sustainable, and it remains unclear if folks would pay for storage costs over multiple years (or if the index should go back that far, for that matter).

Additional learnings: Some Mastodon admins are usually not computing resource constrained, but person-time constrained. They just run the bare minimum Mastodon setup and that’s already enough work for a team of two admins. If those admins encounter optional pieces of infrastructure, they will skip them, as each piece adds to the complexity of the system operated with regards to software updates and interoperability. The only way forward here would be having this feature available in the Mastodon core distribution, using the Postgres full text search feature.

3. Community (most important): is private search for Mastodon actually something that can be done in a way that gels with the community rather than against it? — The folks behind Searchtodon do not wanna fight anyone and if there is enough negative signal, we’ll shut it down. — Open questions include: how to handle reporting, defederation after the fact of instance-to-instance trust, are noindex, #nobot & #nosearch good enough account markers or will the Mastodon community have to invent more mechanics around delegating trust from a user’s timeline to tools operating on their timeline?

Hypothesis three: Nope. It is safe to conclude that as implemented, Searchtodon does not “gel with the community”. This warrants its own section:

The Feedback

As outlined above, the support was tremendous and encouraging. I genuinely wanted to help out folks with a problem and Searchtodon found an audience.

Among the supporters were folks that said they liked the idea, but wouldn’t want to sign up for an experimental service run by one person, which is more than fair, this is me most of the time these days.

The next biggest caveat mentioned here however was the implementation choice of making this a standalone service. As that has two distinct consequences:

data that is previously known and trusted to only exist in one instance now lives in a second place that needs to be trusted to keep that data safe.
folks’ posts that were stored and indexed didn’t opt into that (c.f. noindex & friends above).

It also highlighted a distinct lack of mechanics around this in the Mastodon ecosystem, more on that later.

From here on out it got a little more erratic, as folks commented and criticised things that didn’t match reality, or only confirmed their preconceived notions. Some folks have a reflex to react negatively when hearing “mastodon” and “search” in the same sentence without actually evaluating what is proposed. Names were called.

After the Shutdown

The post-shutdown feedback also came in a bunch of relevant flavours:

thanks for taking this down
thanks for listening, this is rare
thanks for paving the way, I’ll rethink my approach (good!)
thanks for trying this, I’ll find a better loophole (don’t do this)
thanks for trying this, I hope this problem will be solved eventually

Lessons Learned

Enough folks of the current Fediverse want opt-in experimenting rather than opt-out experimenting.
There is a difference between what a thing is vs. what people perceive that thing to be. For success, it is important for whoever makes the thing to respect both of these positions.
Profile hashtags for opt in/out are limited in usefulness:
- many experiments are not viable if there are not enough posts to operate and reaching a critical mass of opt-ins limits progress. Some folks think this is good.
- with account metadata limits, this can only go so far. Adding more and more hashtags is not a scalable solution. And forcing new users through a gauntlet of choices with hard-to-grasp consequences is not gonna make this place more popular.
- maybe the Mastodon onboarding experience could include a “allow my posts to participate in experiments” like operating systems ask to “send crash logs to the developer” on signup/upgrade. Then we could formulate a set of guidelines that experiments must adhere to. Maybe some more differentiation like CC-BY-NC where folks can say “you can do whatever you want”, “experimentation is okay, but no commercial exploitation (ads etc) of my posts”, or “nope, nothing” (the default), or anything in between.
Folks don’t trust a central service and/or closed source (and rightfully so). Ideally start as open source.
Folks value the trust relationship they have with their instance admins and the admins of their followers. Things like OAuth apps that take a delegation of that trust, especially without an opt-in process, are suspect under certain circumstances.
- I think this area benefits most from clarification in the larger ecosystem. Consider a web or native client that uses the Mastodon REST API and OAuth to authenticate a user.
  - They get a copy of all the data that the users have access to and usually present them in an ephemeral manner (old content gets eventually pushed out by new posts coming in). This seems to be a largely accepted use of posts, APIs and terms of service.
  - To offer a good application experience, the client app, in both the web and native scenario, keeps a local cache of all the data it is currently accessing, so restarting the app is instantaneous and no new data has to be downloaded from the internet. This still is largely a reality today and appears to be largely accepted.
  - Depending on the implementation, it could be that an app not merely caches the data, but stores it permanently, which makes it part of the device’s backup. These backups more often than not are Google or Apple cloud backups under control by, admittedly very capable, admins that the original poster has no trust relationship with, and no option to opt in or out. This too seems to be generally accepted.
  - Next is a client that just keeps all data stored it’s ever seen. I’m not aware of a public client that does this, but there are a few folks looking to experiment with this (not me). This would probably violate some instance’s ToS, but I haven’t seen this as grounds for defederation yet. I’m relatively certain, someone is doing this privately today. — I’ve seen responses go in either direction here, with folks saying that seems fair and absolutely not. For people in the second camp, Mastodon has no effective way to protect them, but I predict will need one before long.
    - The one exception of this is single-user instances, which are commonly accepted, it seems. Unless the person running one doesn’t fall out of anyone’s good graces, nobody will even know what they are doing with the data. A few people have privately reached out to me that they used this exact loophole to find things in their timeline in the past (before you look, none of them are in my follower/followee lists). I like the idea of single-user instances, and I’d prefer them not to get ostracised because some of them might be bad actors.
  - Next is a server-side project, like Searchtodon. We don’t have to rehash that this implementation didn’t meet the community’s standards, but changes could be made:
    - make this an open source tool that anyone can choose to run
      - there would have to be an easy way to blanket opt out of these. Maybe a good-faith list that an operator can add themselves to that dissenting instances or users then can defederate or block automatically. Bad actors wouldn’t do this of course, but that’s nothing new.
      - and/or there could be a “this user just signed up for service X” message going out to all followers and giving them a choice to opt-in. Services (especially new ones) sending messages on the user’s behalf has always been icky on Twitter, but maybe this is a good enough cause to establish this practice here.
  - And then: build this into Mastodon. I think for timeline search, this is the inevitable future, IF this feature gets demanded enough. This would solve any trust issues, consent could be handled on the regular instance-to-instance trust/defederation network, and operationally, the experiment (above) showed that the resource overhead is manageable.
    - there are details to be discussed about how long these indexes should go back. Realistically I think 3 – 6 months probably addresses ~80% – 90% of my needs, so maybe that’s an acceptable convention (same for permacaching clients).
- Finally, we can choose not do to any of these things and leave everything as is in Mastoland, but I’m certain the Fediverse at large will move on eventually. And from some of the responses I’ve got is that is what old-school folks here desperately wish for: get the new influx of users move on as fast as possible.

The Way Forward

I am still optimistic about Mastodon’s potential to become a lot better for a lot more people and I hope this write-up helps folks to not make the same mistakes again. But I’m also a bit disheartened by the response to this endeavour. I think Mastodon culture needs to become better at handling these kinds of things. Not for my sake, but for the culmination of these two factors:

If a big enough “bad actor” joins here, and we are on the brink of this with major platforms having announced upcoming ActivityPub interop, whatever they do will happen regardless of the consent given by folks who think they can federate content but restrict where it goes. I’m not suggesting this as a justification to just make away with people’s data, but realistically this is going to happen eventually and I’d rather Mastodon has tooling and conventions available to deal with this rather than “we always hoped this would never happen”. The only alternative to this is further hard splintering of communities that have more in common than they don’t and I really hope we can avoid that.
The Fediverse at large will only get more and more popular and mainstream. This means that these new people’s needs will want addressing, and I’d rather have a system in place where we can experiment safely to help those people rather than “pressuring the good actors out until only the bad ones are left” (paraphrasing a few responses I got). And while not claiming to be one or the other, I agree with this sentiment.

—

I regret my mistakes and I apologise to everyone harmed by this experiment.

Thank you for reading.