Adventures in Mastoland

We strongly advise you to read up on the myriad of failed experiments in this space.

Great! Where can I find them?

…hm?

This is a retrospective post about my experiment Searchtodon, an attempt at building a privacy-conscious personal timeline search tool for Mastodon. I’m intentionally vague about people and projects relevant to this story, to protect the innocent. Titles used are lighthearted pop-culture references and do not semantically reflect on the content or my opinion.

Last updated 2023-01-16.

Introduction

I’ve been online since ~1997: AOL chat, forums, AIM, Jabber, IRC, the usual; then Twitter since 2007. I did a first Mastodon exploration in 2017 and finally moved over fully in November 2022. By all accounts, I’m new here, but none of this is new to me.

A few years ago, I built a toy Twitter web client for myself to experiment with getting more out of my timeline: e.g. don’t show RTs, but show my top 10 RTs for the last 12/24 hours. I hear Twitter Pro/Blue/whatever offers something similar now, by way of acquiring Nuzzel: top links posted from your timeline.

I’m not a very good frontend web developer, so this didn’t go far, but it left me convinced of one thing: there is a distinct lack of things you can do with your timeline. I put ~15 years into meeting lovely folks around the globe and connecting with them over Twitter, but the default experience does not let me get the most out of it. Say I follow more people than I can possibly read all posts from: it’d be great if there was a “slow” section listing people who rarely post, so I don’t miss their posts while other folks are busier, or just 12 hours’ worth of time zones away, without me having to put that together as a list manually.

As an aside, I know I could follow fewer people, but let’s be real, that’s not gonna happen. I have about a dozen other ideas for useful things that could be done with a carefully curated timeline, and I believe there is an opportunity for a lot of fun and useful tools that use someone’s social timeline as a data source and better connect people that way.

One of the things I’ve missed on Twitter for the longest time is being able to recall what I’ve seen before. While Twitter has/had search, to my knowledge there is no way to filter for tweets that have shown up in your timeline. This is honestly baffling to me, and I don’t see it getting fixed any time soon. With my custom client, I could have built that, but I never got around to it, mainly because I feared it wouldn’t be worth it (the other work having been a nice learning exercise): Twitter could take it out any time, as shown with the native clients’ debacle just this week. I’m not interested in building anything for that platform for the time being.

Into the Mastoverse

Having taken the plunge and moved most of my social online activity to Mastodon in late 2022, I am intrigued by the possibilities of an open platform. I believe the biggest long-term impact is finally getting the social graph into the open, which is amazing. From my small demo on Twitter I knew it was feasible to bring some of my ideas over here, and I don’t have to worry about losing access to the whims of the centralised owners of a commercial platform. This is very, very appealing for a lot of tinkery-folk like me. If you are reading this, you are probably one of them.

Since Ivory’s timeline filtering is already mature enough to cover the needs of the client I wrote for Twitter (no boosts, no replies), I had no need to port that over, and instead thought I’d go after the next most pressing thing: recalling things I have seen in my timeline. I follow a lot of techies in very different communities (web dev, JavaScript, Erlang, Python, Rust, macOS, FreeBSD, databases etc.), and as a result I sweep up a lot of information that I’m generally interested in but only occasionally dip into more deeply. The effect is that when my macOS folks discuss issues that a new OS version brings for developers, and a while later the JS community starts running into the same issues, I can bridge that gap and help out. But I can’t remember all the little details and references about everything, so I need a way to find things that I’ve seen before.

Yes, there are favs and bookmarks, but if I had known something was gonna be important, I’d have filed it away already, so that doesn’t really help. Plus, not all instances enable search on bookmarks.

I fully realise that many people have none of these problems, or are fine using bookmarks, and that’s great, but it doesn’t solve my problem, and I know now that I’m not the only one.

Act Two

I’ll spare you the details of technically getting to a point where I could search my timeline; let’s just say a little service that runs as an OAuth app, dumps plaintext files into a directory on my Mac, and then uses its built-in Spotlight search was done in about an hour or two.
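The core of that first prototype is tiny. As an illustrative sketch (the function name and formatting are mine, not Searchtodon’s actual code; the JSON shape follows the Mastodon v1 status entity), flattening each post into plaintext that Spotlight can index might look like:

```python
import html
import re


def status_to_text(status: dict) -> str:
    """Flatten a Mastodon status into plain text for Spotlight to index.

    Field names follow the Mastodon v1 API status entity; the exact
    formatting here is illustrative, not the experiment's real code.
    """
    # Post bodies arrive as HTML: strip the tags, then unescape entities.
    body = re.sub(r"<[^>]+>", " ", status.get("content", ""))
    body = html.unescape(body)
    body = re.sub(r"\s+", " ", body).strip()
    return f"@{status['account']['acct']}: {body}"
```

Writing one such file per post id into a folder is enough for Spotlight (or `mdfind` on the command line) to pick them up.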

However, it was clear that a “runs on your Mac” solution doesn’t work when your computer is asleep, and with the somewhat exasperating default 400-post timeline limit, not being able to catch up on things properly would defeat the usefulness of this. There is an argument to be made here: if you are asleep and miss posts, you didn’t see them, so you can’t know to search for them. While that reasoning is certainly correct, it misses that folks use multiple clients to access Mastodon at different times: I might see a post in Ivory before going to bed, but after I closed my computer for the day. I now concede, however, that a “really only the posts I’ve seen” indexing would be preferable, but there is no cross-client standard for reporting that anywhere, so I’m not hopeful this will get anywhere. The only other option is: index your entire home timeline.
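For completeness, here is roughly what “index your entire home timeline” entails: polling until you are caught up. The sketch below assumes the `min_id` pagination parameter of `GET /api/v1/timelines/home` (pages are returned newest-first) and injects the page fetcher so the network part stays out of the picture; none of this is the experiment’s literal code.

```python
def catch_up(fetch_page, since_id=None, page_limit=40):
    """Collect every home-timeline post newer than `since_id`.

    `fetch_page(min_id, limit)` stands in for a call to
    GET /api/v1/timelines/home?min_id=...&limit=... and must return
    posts newest-first, as Mastodon does. Returns the collected posts
    oldest-first plus the newest id seen, to persist for the next run.
    """
    collected = []
    cursor = since_id
    while True:
        page = fetch_page(min_id=cursor, limit=page_limit)
        if not page:
            break
        # Page is newest-first: reverse so `collected` stays oldest-first.
        collected.extend(reversed(page))
        cursor = page[0]["id"]  # newest id on this page becomes the cursor
    newest = collected[-1]["id"] if collected else since_id
    return collected, newest
```

Persisting `newest` between runs is what turns this into a polling loop that survives the 400-post window, as long as the service itself never falls that far behind.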

Attentive readers will point out that at the time of all this, I was on an instance whose ToS states:

Content on this instance must not be archived or indexed wholesale by automated means by any user or service. Active users may export their data through the export functionality or the API.

When reviewing the feasibility of my project, I read “wholesale” as “everybody’s, all the time” and “export their data through […] the API” as allowing me to keep a copy of my timeline. I know now that I was mistaken in my, granted, optimistic reading of this, and that I should have reached out to the admins there to double-check.

Back to November 2022: when first signing up for that instance, I distinctly remember that one of the top profile configuration options after sign-up was “Opt-out of search engine indexing”, and it was checked by default for me (from here on out, I’ll refer to this feature as the noindex flag). I thought it was really nice that such a prominent control empowers users and chooses a safe default.

This was my first mistake.

It turns out, not only do few people seem to know about this, almost nobody made a conscious choice here. And if I read things correctly, this flag does not federate, which seems like a tremendous oversight.

I reached out to a few folks that I saw have this enabled and discussed whether what I was building was considered “a search engine” (I didn’t think so). I learned that a) there are folks that would continue to use the noindex flag to not get indexed by Google, but they’d be fine being part of a “scoped to a user’s timeline and no one else” type of search (I too fell into this category at the time), and b) on properly considering this setting for the first time, and while agreeing that my thing wasn’t a Google-like public search engine, they’d expect the noindex flag to cause their posts to be excluded. I did fear that this would diminish the usefulness of the search tool, but eventually came around to this point of view.

While talking to more folks, I was introduced to the #nobot hashtag that accounts use to indicate they don’t want anything to do with any bots. I made it behave like the noindex flag, and I added #nosearch in case folks wanted to be more fine-grained with this.
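In code, honouring those markers can be as small as one predicate. This is a hypothetical sketch: I’m assuming the noindex flag is exposed as a boolean field on the account entity (the real field name may differ by Mastodon version) and that #nobot / #nosearch appear in the profile bio (`note`):

```python
import re

# Hashtags treated as an indexing opt-out; #nosearch is an assumption
# introduced by the experiment, not a pre-existing convention.
OPT_OUT_TAGS = {"#nobot", "#nosearch"}


def may_index(account: dict) -> bool:
    """Return False if this account has opted out of timeline indexing.

    Assumes a boolean `noindex` field on the account entity and the
    opt-out hashtags appearing in the bio text.
    """
    if account.get("noindex"):
        return False
    tags = set(re.findall(r"#\w+", account.get("note", "").lower()))
    return OPT_OUT_TAGS.isdisjoint(tags)
```

Any post whose author fails this check would simply be skipped before indexing.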

Threats

Next, I needed to consider if I was adding new vectors of abuse to Mastodon (or strengthening old ones). My reasoning went thus:

  1. if I were a bad actor, my 1-hour index-to-Spotlight experiment would give me all the benefits without anyone knowing about it. I am sadly convinced this is already happening; it is just that simple.

  2. running a custom instance gives you admin rights to all posts that are read on that instance, and those posts live in a database that supports searching. And the ecosystem in general seems to be fine with folks running their own instances.

I still don’t think that anything I did would make any of this easier for bad actors or worse for the community.

While considering other prior experiences that could inform this, I thought of Twitter’s third-party thread unrollers that rehost tweets with advertising next to them. They were genuinely useful before Twitter fixed its rendering of threads, but the rehosting annoyed me. Eventually, the services added a feature where the thread poster could block the unroll account and unrolling would no longer work. While annoyed that I had to do that four times, I could live with the power balance here.

Mastodon isn’t Twitter, but I also thought this set enough of a precedent that honouring noindex, #nobot and #nosearch would be a decent enough equivalent.

This was my second mistake.

Prior Art

One of the biggest volumes of feedback was that I should have looked at the search projects that came before. If those folks had taken the time to read the associated website, they’d have seen that I did, alas.

My calculus was:

This is substantially different from the other “fediverse crawlers” that I saw, technically and in framing (as later confirmed by some of the most fierce critics). I didn’t think it even qualified as a crawler, though some early feedback suggested folks would consider it one, but I was optimistic that this would at least bring out valuable feedback for future iterations.

Finally, for the quick demo to validate the user experience and basic functionality, I put this together as an OAuth app that runs as a web service. That way you can just sign into the UI and keep using Mastodon as before, without having to run anything yourself. To give this an honest shot, I believed it needed to be easy to get started with.

A consequence of this is that, for the multiple hours this experiment ran, all indexing happened on one of my servers. Given the framing of this as an experiment with the direct goal of informing operational overhead for Mastodon operators (more on this below), and given that I generally know what I’m doing running web services, I thought that was a valuable trade-off to make for now.

That was my third mistake.

The Experiment

From the get-go, I framed Searchtodon as an experiment with three hypotheses to validate (or not). The quotes are from the Searchtodon website.

1. User experience: can private search for Mastodon be done in a functional way and will folks find this useful?

First, quantities.

The post announcing Searchtodon has received more interactions than I could have imagined:

The number of signups at the end of the experiment:

While not all boosts should be counted as support, as some folks are just interested in the experiment, this still proves to me that a feature that lets you search through your home timeline would be popular enough. I hope the folks at Mastodon see this as encouragement in this direction.

A lot of the acceptance surely comes from using an excellent open source web client, as the “Home Search” addition I made was minimal and limited in functionality.

But numbers aren’t everything, what about the qualitative angle? — From the folks interested in the functionality, the feedback was predominantly positive. It didn’t work for a few folks, and their feedback was understanding of the early stage of the project.

A good number of people didn’t quite understand the difference between this and what Mastodon already supports. I could clearly have explained things a little better.

Finally, some folks wrote in that they have no need for this as they don’t have this problem, or the regularly provided tools are enough for them. This was entirely expected. It is good to keep in mind that this is not something “for everybody” when arguing for its general acceptance. But enough people liked it.

Hypothesis one: confirmed. Enough people find this useful.

2. Operational feasibility: index data costs storage, search costs compute, etc. Even if the Searchtodon stack is slightly different from Mastodon’s, it plays in the same ballpark. In case premise #1 comes back positive, we can learn what additional resources instance operators would need to provide private search.

I’m happy to report that things look entirely feasible.

The way things worked, I stored the full post JSON, with inline account and reblog (with its own reblog.account), as individual entries in CouchDB. This made the indexing and retrieval side extremely simple, but duplicated quite a bit of data: if two users received the same post, it was saved twice, in strictly separate database files, to make account deletions easy.

From that base storage, I created Lucene search indexes partitioned by user timeline.

There are three obvious ways to improve on this:

  1. deduplicate account data: all those account records could have been extracted into their own documents.
  2. deduplicate posts across all users.
  3. or forego the additional JSON store by side-caring this onto Mastodon directly and only adding the search indexes (Lucene or Postgres should both work; given everything else, this is probably the best way forward).
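Option 2 (deduplicating posts across users) can be sketched as a reference-counted store: each post is kept once, keyed by its URI, while per-user timelines only hold references, so deleting an account stays as easy as in the duplicate-everything scheme. This is an illustrative in-memory model, not the CouchDB implementation:

```python
class DedupStore:
    """Store each post once; per-user timelines reference it by URI."""

    def __init__(self):
        self.posts = {}      # uri -> post JSON, stored exactly once
        self.refcount = {}   # uri -> number of timelines referencing it
        self.timelines = {}  # user -> list of uris, in arrival order

    def add(self, user: str, post: dict) -> None:
        uri = post["uri"]
        timeline = self.timelines.setdefault(user, [])
        if uri in timeline:
            return  # this user already has the post
        timeline.append(uri)
        self.posts.setdefault(uri, post)
        self.refcount[uri] = self.refcount.get(uri, 0) + 1

    def delete_user(self, user: str) -> None:
        # Drop the timeline, garbage-collect now-unreferenced posts.
        for uri in self.timelines.pop(user, []):
            self.refcount[uri] -= 1
            if self.refcount[uri] == 0:
                del self.posts[uri]
                del self.refcount[uri]
```

With this layout, storage grows with the number of distinct posts rather than the number of deliveries, at the cost of slightly more bookkeeping on deletion.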

That said, here’s the data usage as implemented:

Hypothesis two: Doable. I’ve since found some demo implementations using Postgres for search in Mastodon, but no serious take-up as far as I can tell. Worth watching though. — Running this as a free-forever not-part-of-Mastodon service would likely not be sustainable, and it remains unclear if folks would pay for storage costs over multiple years (or if the index should go back that far, for that matter).

Additional learnings: many Mastodon admins are not constrained by computing resources but by person-time. They run the bare minimum Mastodon setup, and that’s already enough work for a team of two admins. If those admins encounter optional pieces of infrastructure, they will skip them, as each piece adds to the complexity of the operated system with regard to software updates and interoperability. The only way forward here would be having this feature available in the Mastodon core distribution, using the Postgres full text search feature.

3. Community (most important): is private search for Mastodon actually something that can be done in a way that gels with the community rather than against it? — The folks behind Searchtodon do not wanna fight anyone, and if there is enough negative signal, we’ll shut it down. — Open questions include: how to handle reporting, how to handle after-the-fact defederation of instance-to-instance trust, and whether noindex, #nobot & #nosearch are good enough account markers or whether the Mastodon community will have to invent more mechanics around delegating trust from a user’s timeline to tools operating on their timeline.

Hypothesis three: Nope. It is safe to conclude that as implemented, Searchtodon does not “gel with the community”. This warrants its own section:

The Feedback

As outlined above, the support was tremendous and encouraging. I genuinely wanted to help out folks with a problem and Searchtodon found an audience.

Among the supporters were folks that said they liked the idea but wouldn’t want to sign up for an experimental service run by one person, which is more than fair; this is me most of the time these days.

The next biggest caveat mentioned, however, was the implementation choice of making this a standalone service, as that has two distinct consequences:

It also highlighted a distinct lack of mechanics around this in the Mastodon ecosystem, more on that later.

From here on out it got a little more erratic, as folks commented on and criticised things that didn’t match reality, or that only confirmed their preconceived notions. Some folks have a reflex to react negatively when hearing “Mastodon” and “search” in the same sentence, without actually evaluating what is proposed. Names were called.

After the Shutdown

The post-shutdown feedback also came in a bunch of relevant flavours:

Lessons Learned

The Way Forward

I am still optimistic about Mastodon’s potential to become a lot better for a lot more people, and I hope this write-up helps folks not make the same mistakes again. But I’m also a bit disheartened by the response to this endeavour. I think Mastodon culture needs to get better at handling these kinds of things. Not for my sake, but because of the combination of these two factors:

  1. If a big enough “bad actor” joins here (and we are on the brink of this, with major platforms having announced upcoming ActivityPub interop), whatever they do will happen regardless of the consent given by folks who think they can federate content but restrict where it goes. I’m not suggesting this as a justification to just make off with people’s data, but realistically this is going to happen eventually, and I’d rather Mastodon had tooling and conventions available to deal with it than “we always hoped this would never happen”. The only alternative is further hard splintering of communities that have more in common than not, and I really hope we can avoid that.

  2. The Fediverse at large will only get more popular and mainstream. This means these new people’s needs will want addressing, and I’d rather have a system in place where we can experiment safely to help them than “pressure the good actors out until only the bad ones are left” (paraphrasing a few responses I got). And while not claiming to be one or the other, I agree with this sentiment.

I regret my mistakes and I apologise to everyone harmed by this experiment.

Thank you for reading.