Blog post about remoteStorage as a local-first database (request for comments)

raucao · May 2, 2020, 9:49am

Hey @jaredly, thanks for the post and the feedback request!

I found quite a few issues in your post, so I took the time to run through the entire thing and address most points directly. Sorry for the overlong reply! Hopefully it is helpful to you, and possibly other people as well. Also, please feel free to ask follow-up questions about anything, of course!

remoteStorage is an offline-first solution that’s been around for quite some time, and stands out for having a formal spec, first drafted in 2012.

remoteStorage as a protocol is not offline-first per se. It is designed so that clients are able to implement offline-first sync. The current JavaScript reference client, remoteStorage.js, has been written as one such offline-first client. I think it’s important to make this distinction with every point in this article, as it is quite a big difference.

It would be really cool if it took off, but unfortunately there are only one or two commercial providers and the most-promising-looking open-source remoteStorage server has been unmaintained since early 2018.

I’m not sure why you think Mysteryshack is the “most-promising-looking” server, but there’s no arguing about language preferences, of course. However, I think it’s misleading to single out one server implementation of a draft protocol as unmaintained when it’s not the one that has had the most contributors in the past. The last PR for Armadietto was merged a couple of months ago. And the most stable implementation, php-remote-storage, usually gets timely responses from the author when someone finds a bug. As it is so stable, that just doesn’t really happen much anymore.

Nevertheless, I figured it would be interesting to see how it stacks up according to my local-first database criteria.

Nice list!

Correctness

How are conflicts handled?

Conflicts are, well, not handled. If two clients change the same document, whichever syncs to the server first wins.

This is, well, not the whole story.

Conflicts are not managed on the server side per se, but the protocol is designed in a way that clients are able to handle them (déjà vu). The way this works is that you tell the server the last version you know of when you want to PUT (or DELETE) a document, and if the server has a different version stored, it must tell you so by not overwriting the data and returning a 412 status code (see versioning in the spec).
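To make the versioning rule concrete, here is a minimal sketch of it as a toy in-memory server in Python. The class name `ToyStorageServer`, the counter-based version strings, and the choice to hand back the current version alongside the 412 are assumptions for illustration only, not taken from any real server implementation.

```python
class ToyStorageServer:
    """Toy in-memory stand-in for a storage server (hypothetical)."""

    def __init__(self):
        self._docs = {}      # path -> (version, body)
        self._counter = 0    # assumption: versions are simple counters

    def get(self, path):
        """Return (version, body) for a path, or None if nothing is stored."""
        return self._docs.get(path)

    def put(self, path, body, if_match=None):
        """Store body at path unless if_match names a stale version.

        Returns (status, version). On a mismatch nothing is overwritten and
        the status is 412, mirroring the conditional-write rule described above.
        """
        current = self._docs.get(path)
        current_version = current[0] if current else None
        if if_match is not None and if_match != current_version:
            return 412, current_version
        self._counter += 1
        version = f"v{self._counter}"
        self._docs[path] = (version, body)
        return (200 if current else 201), version


# Two clients start from the same version, then both try to write.
server = ToyStorageServer()
_, v1 = server.put("/notes/todo", "buy milk")
status_a, v2 = server.put("/notes/todo", "buy milk and eggs", if_match=v1)
status_b, seen = server.put("/notes/todo", "buy bread", if_match=v1)
```

In this sketch the second writer gets a 412 and the stored document still holds the first writer’s change, so nothing is lost silently. What happens next is up to the client.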

The reference client, remoteStorage.js, does throw conflict events, with both the local and remote data when an app would try to overwrite something. However, as handling a conflict is very much custom to the use case at hand, it’s on the app author to actually implement this. Depending on the case, they can either choose some automated-merge route, or even let the user handle it via UI. But the protocol cannot prescribe this for every use case, so it doesn’t.
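As one example of what an automated-merge route could look like, here is a minimal Python sketch of a field-wise merge for JSON-like documents. It is not remoteStorage.js API: the function `merge_documents` and its `choose` callback are hypothetical, and it assumes the app has already received both the local and the remote version of the document.

```python
def merge_documents(local, remote, choose):
    """Merge two dict-shaped documents field by field.

    Fields present on only one side, or equal on both, are taken as-is.
    Fields that differ are handed to `choose(key, local_value, remote_value)`,
    which could apply a rule or ask the user via UI.
    """
    merged = {}
    for key in sorted(local.keys() | remote.keys()):
        if key not in remote:
            merged[key] = local[key]
        elif key not in local:
            merged[key] = remote[key]
        elif local[key] == remote[key]:
            merged[key] = local[key]
        else:
            merged[key] = choose(key, local[key], remote[key])
    return merged


# Example policy (an assumption): prefer the remote value on a real clash.
prefer_remote = lambda key, local_value, remote_value: remote_value
result = merge_documents(
    {"title": "Groceries", "done": True},
    {"title": "Groceries (shared)", "tags": ["home"]},
    prefer_remote,
)
```

A two-way merge like this cannot tell a deleted field from one that was never there, so apps that need that distinction would also have to keep the last common version around. That is exactly the kind of use-case-specific decision the protocol leaves to the app.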

How “bullet proof” is it? How easy is it to get it into a broken state (e.g. where different clients continue to see inconsistent data dispite syncing)?

Just a typo: “despite”

With such a simple protocol you’d think it would be pretty robust, but in my short time integrating it into my example app I managed to get into a state where refreshing the page and logging out & in again failed to show me the right data. Only when I opened devtools and disabled the browser cache was I able to get the right information to load. That’s one hazard of relying so heavily on the browser cache I guess. It’s possible that the fault was in the server implementation that I used, who knows.

remoteStorage.js doesn’t rely on the “browser cache” at all. And again, the protocol does not prescribe how something should be cached, but merely makes it possible to implement caching based on (directory listing and document) ETAGs. rs.js uses localStorage for a little bit of data, but the actual user data is stored in IndexedDB by default. Clearing the browser cache would not have any influence on that.

I don’t know what happened there, and I think it would be interesting to look at why it failed exactly. My educated first guess would be that the app didn’t use the “ready” event from the library on startup (a.k.a. refreshing the page). In that case it would be a documentation issue, which we should fix asap.

(I’m using RS apps daily myself, and I’m writing new ones every now and then, and I usually have a hard time getting anything into a broken state these days to be honest. Not saying you’re doing something wrong, but that our documentation needs to be improved. Also, private tabs can be buggy still, due to broken browser behavior.)

Is there consistency verification built-in, to detect if you’re in a broken state?

Nope, the server is trusted to calculate etags correctly, and there’s no verification that the data loaded is consistent.

This is true. There’s no strong verification. However, I think it’s useful to know that ETAGs on servers are often calculated deterministically from the content, especially when using an object storage API as a back-end. In any case, consistency in this case is trusted to the server, as you say. I guess one possible solution for use cases that require strong proofs is to sign or encrypt the data client-side.
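For illustration, here is a minimal Python sketch of what deterministic, content-derived ETags can look like. The use of SHA-256 and the way the folder ETag is built from its children are assumptions chosen for the example. Real servers and object stores each use their own scheme.

```python
import hashlib


def content_etag(body: bytes) -> str:
    """ETag derived only from the document's bytes: same content, same ETag."""
    return hashlib.sha256(body).hexdigest()


def folder_etag(children: dict) -> str:
    """ETag for a folder, derived from its children's names and ETags.

    `children` maps item name -> ETag. Sorting makes the result independent
    of insertion order, and any changed child changes the folder ETag.
    """
    digest = hashlib.sha256()
    for name in sorted(children):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(children[name].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


notes = {
    "a": content_etag(b'{"text": "hello"}'),
    "b": content_etag(b'{"text": "world"}'),
}
before = folder_etag(notes)
notes["b"] = content_etag(b'{"text": "world!"}')
after = folder_etag(notes)
```

With a scheme like this, `before` and `after` differ because one child changed, but the client still has to trust that the server computed them honestly.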

How well does sync preserve intent? In what cases would a user’s work be “lost” unexpectedly?

If a user makes changes to a single “document” on two devices while offline, their changes on one of the devices (the one that syncs second) will be completely lost.

As explained above, this is just not true for the protocol or client per se. This is only true for apps that do not implement conflict handling. The server is required to reject an overwrite when the client gives it an ETAG of the older version it knows.

Cost

Storage

How much data does the client need to store to fully replicate?

It appears to grow somewhat nonlinearly with the number of documents. At 100 documents, it was 58x the size of a naive jsonified version of the data, at 300 documents it was 108x.

This statement makes no sense to me out of context. Also, it is impossible to calculate this correctly in the first place, as the numbers would be heavily impacted by the size of the documents, which vary extremely between use cases.

However, I’d be interested in how you arrived at these extremely high numbers, because storing metadata for a document usually doesn’t take 58 times the size of the document. So it looks to me like the data is incorrect as well. Could you publish the actual source data, and the method for deriving these numbers, so people can verify these claims?

How much data does the server need to store?

At 100 documents, 58x; at 200 documents, 36x; at 300 documents, 78x

Same as above. This makes no sense to me, and is impossible to calculate as such. It’s also highly unlikely to have this much overhead for any normal usage.

How complicated is the server logic?

It seems fairly simple conceptually. The server needs to calculate etags for each collection (folder), and stores each document as a plain json file.

The last statement there is false. Clients/apps can choose any MIME type they wish to store/sync. JSON is merely the default for structured data on the Web, but not a requirement for remoteStorage, which can accommodate your todo lists the same as your photo collection.

But the logic is indeed fairly simple! This is also intentional by design, because it should be easy to add RS support to existing servers (think e.g. Nextcloud and such), when the final protocol is published.

Code / implementation

remoteStorage.js

tests: looks like over 3000? All passing on master.

Sounds correct, yes. They’re also very helpful while we’re converting things to TypeScript.

coverage: not tracked (although there is a 6-year-old issue about it)

I’d guess coverage is at about 90-95% of the functionality. We still find the occasional missing or wrong test.

armadietto (the node.js server I evaluated)

community: no commits in the past 9 months.

Not sure where you got this timeframe, but it’s obviously false.

Other notes

When replicating, the client will make 1 request per document. Which, if you have several documents, ends up being a ton of network requests, and a long wait from cold start.

This is true for the initial sync. However, with HTTP/2 now being widely deployed, those requests are usually done within the same TCP connection, and that’s actually pretty fast. (After initial sync, the client only checks ETAGs on directories, until it finds the one with the actual updates, so that’s very quick and efficient in the vast majority of cases.)
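The directory check after initial sync can be sketched roughly as follows, in Python. This is a simplified model rather than the actual remoteStorage.js sync code: `fetch_folder` is a hypothetical callback standing in for one folder-listing request, and listings are assumed to map item names to ETags, with folder names ending in `/`.

```python
def changed_documents(fetch_folder, cached_etags, path="/"):
    """Return paths of documents whose ETag differs from the local cache.

    Only folders whose own ETag changed are fetched, so unchanged subtrees
    cost no additional requests.
    """
    changed = []
    listing = fetch_folder(path)  # one request per visited folder
    for name, etag in listing.items():
        child = path + name
        if cached_etags.get(child) == etag:
            continue  # unchanged document, or unchanged folder subtree
        if name.endswith("/"):
            changed.extend(changed_documents(fetch_folder, cached_etags, child))
        else:
            changed.append(child)
    return changed


# Toy remote state: only /notes/b changed since the last sync.
listings = {
    "/": {"notes/": "f2", "photos/": "p1"},
    "/notes/": {"a": "a1", "b": "b2"},
    "/photos/": {"x": "x1"},
}
cache = {"/notes/": "f1", "/photos/": "p1",
         "/notes/a": "a1", "/notes/b": "b1", "/photos/x": "x1"}
requested = []


def fetch(path):
    requested.append(path)
    return listings[path]


to_download = changed_documents(fetch, cache)
```

Here the walk visits the root and `/notes/` only and never asks for `/photos/`, which is why a routine sync stays cheap even when the account holds many documents.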

Also, synchronization is done via simple polling (every 10 seconds or so).

True for the remoteStorage.js default. You can also change the sync interval, but we’ve found it to be a good default compromise between speed, efficiency, and usability so far.

Flexibility

How does it react to schema changes? If you need to add an attribute to an object, can you?

With remoteStorage, the schema is defined by the client, and the server has no knowledge or opinions about data shape. So if you deploy a new version of the client with a new attribute, it can add that, but I haven’t seen any accounting for data migration.

All correct. Clients can store whatever they want. They just need to handle it themselves. Personally, I’d love for there to be a remoteStorage.js add-on/module for data migration!

Can it be used with an existing (server-side or client-side) database (sqlite, postgres, etc.) or do you have to use a whole new data storage solution?

It certainly could, but the whole point of remoteStorage is that the app developer has zero control (or knowledge) over the backend.

For server operators, this is a question of what the server implementation supports, of course. Armadietto supports pluggable storage back-ends for example, so it would be possible to add support for someone’s existing back-end, if they want that.

Can it sync with Google Drive, Dropbox, etc. such that each user manages their own backend storage?

Technically it can sync to Google Drive or Dropbox, but I’ve found the implementations to be extremely buggy. Certainly the whole point of remoteStorage is that each user manages their own backend storage, but the best-supported “backend” is the custom remoteStorage protocol, and the options for public remoteStorage providers are quite limited.

Yes, this is mostly due to the protocol still being a draft. However, both our GDrive and Dropbox implementations do work in general, and we are fixing more bugs as we find them over time. They are definitely not as stable as the RS support yet. If you could report bugs that you find, that would be immensely helpful for being able to fix them!

Does it require all data to live in memory, or can it work with mostly-persisted data? (such that large datasets are possible)

Large datasets aren’t advisable (imo) due to the simplicity of the syncing mechanism.

How so? The syncing mechanism is simple, but it can easily support large datasets. You can selectively sync only subdirectories for example. And you don’t have to keep everything in memory at all. remoteStorage.js uses IndexedDB by default, and it has been designed for and tested with many thousands of documents cached locally. Making this performant in an app requires a certain degree of expertise in Web technologies, but it is certainly possible, and it’s not arcane magic.

The one thing that I wouldn’t recommend as of now, is uploading large single documents, because we don’t have resumable uploads yet.

Does it support e2e encryption?

There’s nothing built-in for e2e encryption, but it could potentially be added (there’s nothing in the protocol that would prohibit it).

True. And some existing apps do store encrypted data already. Nothing is required for it from the sync/server side.

Is multi-user collaboration possible, where some users only have access to a subset of the data? (think firebase access rules)

No. remoteStorage is a very personal protocol, with no support for multi-user data-sharing situations.

This is not entirely true. It is indeed mostly meant for personal data, but you can share documents from your /public category/folder (including custom indexes if you implement that). It’s just that nobody can write data to your account by default. However, the protocol does not prohibit creative application of the technologies it is based on. For example there’s nothing preventing a server provider from implementing organization accounts, which could give out tokens for categories based on server-side custom ACLs.

Production-ready

Is it being used in production?

I think so? The mentions I’ve seen of it being used in production apps are a couple of years old, and I haven’t seen a centralized list of “here’s who’s using it”.

With a decentralized protocol, you cannot really see who’s using it. That’s kind of the entire point. But if I put my RS provider hat on, I can tell you with certainty that there are many production users. You can also see new apps integrating support fairly regularly, albeit most of them being side projects at the moment. For larger potential providers and apps, I think the draft status of the protocol is still a turn-off (and IETF actually discourages wide-scale production usage of draft-stage protocols).

How well does it handle offline behavior?

Quite well.

Glad to hear!

Does it correctly handle working on multiple tabs in the same browser session?

Not when offline. When online, each tab syncs via polling separately, and so the tabs synchronize within 10 seconds of each other.

Actually, there are supposed to be events for changes from other tabs, but this is an open bug in rs.js at the moment. It’s going to be one of the first things to be fixed/refactored once the TS conversion has been merged.

Does it bake in auth, or can you use an existing authentication setup?

It bakes in auth, and you wouldn’t use existing authentication because each user brings their own backend.

True. However it is possible to do auth differently, and give your users an access token for your RS API in some custom way. So you can basically use the entire protocol without the auth part, if you want. But this removes the user’s ability to choose their own storage account, of course. (I just think it’s worth mentioning, because it’s a thing that I’m doing for a customer’s app in production right now, and it’s still nicer to do private storage with an open protocol than to lock them into some proprietary API.)

Conclusion

remoteStorage occupies an interesting place in my mind. On the one hand, it’s an 8 year old project that still receives active maintenance, which is a pretty big achievement in and of itself. On the other hand, its simplicity means that it’s lacking a lot of features that people have come to expect from modern web applications. Overall, given that it’s still actively being developed, it could very well gain some of those features and become a strong solution for building modern local-first apps in the future.

Even though I’ve pointed out quite a few ~~falsehoods~~ misunderstandings in the post, I actually like this conclusion, and see nothing wrong with it. Hopefully, our community will manage to address the pain points you experienced and described! You’re wholeheartedly invited to join us in this effort, even if it’s “just” by reporting bugs when you see them, which is actually as important as the code required to fix them. But any and all help would be much appreciated, as all of this is a grassroots effort with no corporation money backing it.

The original spec author of remoteStorage is now working on SOLID, which seems like it might be a successor in many ways.

I don’t want to speak for @michielbdejong, but I think it’s worth pointing out that he also just came back to the RS community after a longer pause, and intends to work on RS again, help bring the spec onto the standards track, and contribute to rs.js again sometime soon. (There’s certainly a lot of overlap in what SOLID and RS address, but there’s also a considerable number of differences in what they do and how they work.)

Phew, that was a long one! I hope this is helpful for clearing up some misunderstandings. But I think some paragraphs need more detailed inquiries, especially the one about the sizes of stored data and such. And again, it would be extremely helpful to report actual bugs when you see them, but also maybe to get some feedback on your specific implementation before publicly calling something broken, where it may just be a documentation issue with the JS client library.