A Conflict-Free Replicated JSON Datatype

raucao · August 16, 2016, 3:05pm

This sounds rather interesting and highly relevant to what we do:

Many applications model their data in a general-purpose storage format such as JSON. This data structure is modified by the application as a result of user input. Such modifications are well understood if performed sequentially on a single copy of the data, but if the data is replicated and modified concurrently on multiple devices, it is unclear what the semantics should be. In this paper we present an algorithm and formal semantics for a JSON data structure that automatically resolves concurrent modifications such that no updates are lost, and such that all replicas converge towards the same state. It supports arbitrarily nested list and map types, which can be modified by insertion, deletion and assignment. The algorithm performs all merging client-side and does not depend on ordering guarantees from the network, making it suitable for deployment on mobile devices with poor network connectivity, in peer-to-peer networks, and in messaging systems with end-to-end encryption.

(I haven’t read it yet, but thought it looked interesting enough to share right away. Would love to know what you think of it and start a conversation, if it looks like something we want to adopt/use.)

untitaker · August 16, 2016, 4:27pm

The abstract does make pretty strong claims, but note that JSON doesn’t describe the full semantics of the contained data. The presented algorithm can only guarantee that valid JSON comes out, but possibly not in the format expected by the application.

An example is given on page 5:

The algorithm doesn’t understand that a task is only valid if it has both a "title" and a "done" field. When a task is deleted, it instead interprets this as something like “two keys got deleted”. When “done” is concurrently set to true on another replica, that key is reinserted, creating a bogus task item.
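To make the anomaly concrete, here is a deliberately simplified sketch. It is not the paper's algorithm: `applyOps` and the operation objects are made up for illustration, and the assignment is simply applied last to mimic an assignment surviving a concurrent delete.

```javascript
// Simplified illustration only; this is NOT the algorithm from the paper.
// Each replica's edit is modelled as a list of per-key operations.
function applyOps(task, ops) {
  const result = { ...task };
  for (const op of ops) {
    if (op.type === "delete") {
      delete result[op.key];
    } else if (op.type === "assign") {
      result[op.key] = op.value; // an assignment re-creates a deleted key
    }
  }
  return result;
}

const original = { title: "buy milk", done: false };

// Replica A deletes the whole task, which a key-level merge sees as
// "two keys got deleted".
const opsA = [
  { type: "delete", key: "title" },
  { type: "delete", key: "done" },
];

// Replica B concurrently marks the task as done.
const opsB = [{ type: "assign", key: "done", value: true }];

// Assumption: the assignment is applied after the deletions.
const merged = applyOps(applyOps(original, opsA), opsB);
console.log(JSON.stringify(merged)); // {"done":true}
```

The output is valid JSON, but it is no longer a valid task from the application's point of view.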

Of course you could then add a post-processing step which then discards invalid items, but do you really want that? Task items may get corrupted for entirely different reasons, and silently discarding them is the worst way to deal with them, even if there’s not much to recover from {"done": true}. Practically, this is not actually a “conflict-free” sync algorithm. It only limits you to one single conflict resolution (discarding), and you still have to add application-specific code for that.
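A post-processing step of that kind could look roughly like the sketch below. `isValidTask` and `partitionTasks` are hypothetical names, and the schema (a string "title" plus a boolean "done") is taken from the task example above. The sketch hands the invalid items back to the caller instead of dropping them, so the application still has to decide what to do with them.

```javascript
// Hypothetical validation step, written for illustration only.
function isValidTask(item) {
  return (
    item !== null &&
    typeof item === "object" &&
    typeof item.title === "string" &&
    typeof item.done === "boolean"
  );
}

// Split merged items into valid and invalid ones rather than
// silently discarding the invalid ones.
function partitionTasks(items) {
  const valid = [];
  const invalid = [];
  for (const item of items) {
    (isValidTask(item) ? valid : invalid).push(item);
  }
  return { valid, invalid };
}

const checked = partitionTasks([
  { title: "buy milk", done: false },
  { done: true }, // the bogus item left over from the merge
]);
console.log(checked.valid.length, checked.invalid.length); // 1 1
```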

Ideas for remoteStorage.js

IMO in practice you don’t really get around implementing an application-specific UI for when conflicts happen, if storing both versions of the conflicting data (like Dropbox does) is not an option. Unfortunately remoteStorage.js leaves this largely to the developer.

remoteStorage.js could offer a full ORM

An ORM would also allow remoteStorage.js to store the data in a proper database (no idea what options there are on the client side), and only (de-)serialize data when down-/uploading. Having a database means that you can automatically index your data, another burden taken off the application developer’s back.

remoteStorage.js can currently know at most three things when synchronizing:

the file’s state on side A
the file’s state on side B
the file’s state after the last sync (which was equal on A and B back then)

It can merely choose one of those states, and has no idea how to merge them. But with an ORM, remoteStorage.js can at least do this sort of conflict resolution per-field.
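A per-field merge over those three states could be sketched as follows. `mergeFields` is a hypothetical helper, not part of remoteStorage.js, and it assumes flat records whose field values are primitives.

```javascript
// Hypothetical helper, not part of remoteStorage.js.
// Assumes flat records whose field values are primitives (compared with ===).
function mergeFields(base, local, remote) {
  const merged = {};
  const conflicts = [];
  const keys = new Set([
    ...Object.keys(base),
    ...Object.keys(local),
    ...Object.keys(remote),
  ]);
  for (const key of keys) {
    const b = base[key];
    const l = local[key];
    const r = remote[key];
    let value;
    if (l === r) {
      value = l; // both sides agree (or both deleted the field)
    } else if (l === b) {
      value = r; // only the remote side changed this field
    } else if (r === b) {
      value = l; // only the local side changed this field
    } else {
      // Both sides changed the field in different ways: a real conflict.
      conflicts.push({ key, local: l, remote: r });
      value = r; // provisional choice; the app or user still has to decide
    }
    if (value !== undefined) merged[key] = value; // undefined means "deleted"
  }
  return { merged, conflicts };
}

const base = { title: "buy milk", done: false };
const local = { title: "buy oat milk", done: false };
const remote = { title: "buy milk", done: true };

const result = mergeFields(base, local, remote);
console.log(result.merged); // { title: 'buy oat milk', done: true }
console.log(result.conflicts); // []
```

Edits to different fields merge cleanly here. Only when both sides change the same field does something end up in `conflicts`, and that part still needs application-specific handling.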

At this point we’re basically reinventing databases. I think http://hood.ie is miles ahead of remoteStorage.js in that regard, but there’s not that much you can do with a simple filesystem/kv-store.

When syncing, remoteStorage.js would automatically serialize to JSON. This means that the data on the user storage is not compatible with any existing standard (just raw JSON instead of e.g. iCalendar files), in exchange for making applications with complex data schemas much easier to write.

But even that one disadvantage can be removed by allowing the application developer to define their own serialization and deserialization functions. A calendar app would by default create one JSON file per event in its own proprietary JSON-schema autogenerated by remoteStorage.js, but the developer could still go the extra mile and override this behavior with their own application-specific logic.
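As a rough sketch of what such an override could look like: `createModel`, `jsonCodec` and `lineCodec` below are invented names, not an existing remoteStorage.js API. The line-based codec only stands in for a real format such as iCalendar and is not a valid iCalendar implementation.

```javascript
// Hypothetical API sketch; none of these names exist in remoteStorage.js.

// Default behaviour: store each record as plain JSON.
const jsonCodec = {
  serialize: (record) => JSON.stringify(record),
  deserialize: (text) => JSON.parse(text),
};

// A model uses the default codec unless the developer passes their own.
function createModel(name, codec = jsonCodec) {
  return {
    name,
    toFile: (record) => codec.serialize(record),
    fromFile: (text) => codec.deserialize(text),
  };
}

// Developer-supplied override: simple "KEY:value" lines. This is a stand-in
// for a real format and handles string values only.
const lineCodec = {
  serialize: (record) =>
    Object.entries(record)
      .map(([key, value]) => `${key.toUpperCase()}:${value}`)
      .join("\r\n"),
  deserialize: (text) =>
    Object.fromEntries(
      text.split("\r\n").map((line) => {
        const i = line.indexOf(":");
        return [line.slice(0, i).toLowerCase(), line.slice(i + 1)];
      })
    ),
};

const event = { summary: "Team meeting", location: "Room 1" };

const defaultModel = createModel("event");
const customModel = createModel("event", lineCodec);

console.log(defaultModel.toFile(event)); // {"summary":"Team meeting","location":"Room 1"}
console.log(JSON.stringify(customModel.toFile(event))); // "SUMMARY:Team meeting\r\nLOCATION:Room 1"
```

The application code only ever sees the record object; which bytes end up on the user's storage is decided by the codec.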

remoteStorage.js could implement its own conflict resolution prompt

Since remoteStorage.js now knows so much about the data it stores, it can now show a simple dialog that shows the conflicting task items side by side and ask the user to pick one. This doesn’t necessarily look pretty since it will also show a lot of internal information (such as the table row’s ID).

At this point it might also make sense to integrate with UI frameworks such that the application developer can pass e.g. a React component that takes a task and represents it nicely when attached to the DOM.

michielbdejong · September 18, 2016, 2:19pm

Thanks for reading the article and pointing out why its claim to the impossible (resolving conflicts without user interaction and even without domain knowledge) is indeed unfounded. Domain knowledge helps a lot with conflict resolution (I think this is what you mean by ‘a full ORM’).

We once hoped to do exactly that in the remoteStorage modules - i.e. understand what each field in a JSON object means, and provide higher-level methods like changePhoneNumber() for a contact, which would “know” how to resolve a conflict with a changeProfilePhoto() update on the same object, but never really got it off the ground.

Apps should always implement a handler for change events, and if a change is caused by a sync conflict (server always wins), then that’s indicated in its origin, and the oldValue is also available in the event object.
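A handler along those lines might look like the sketch below. The `origin` and `oldValue` fields are the ones mentioned above; `newValue` and `path` are assumed to be present on the event as well, and `describeChange` is just an illustrative name.

```javascript
// Sketch of a change handler. `origin` and `oldValue` are described above;
// `newValue` and `path` are assumptions about the rest of the event object.
function describeChange(event) {
  if (event.origin === "conflict") {
    // The server version won; keep the overwritten local value around so
    // the app can show it or offer to restore it.
    return {
      type: "conflict",
      path: event.path,
      kept: event.newValue,
      overwritten: event.oldValue,
    };
  }
  return { type: "update", path: event.path, value: event.newValue };
}

// Hand-built event object, for illustration only.
const conflictEvent = {
  origin: "conflict",
  path: "/tasks/1",
  oldValue: { title: "buy oat milk", done: false },
  newValue: { title: "buy milk", done: true },
};

console.log(describeChange(conflictEvent).type); // conflict
```

In an app this function would be called from whatever callback is registered for the module's change events.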

raucao · September 23, 2016, 2:49pm

Finally got around to reading the whole thing and responding.

That’s true for this last example, but that one explicitly explains the limits of such an algorithm for automatic resolution. The other examples deal with other types of resolution. Consider this one for example:

The paper also mentions that this is not meant to resolve 100% of conflicts automatically, although it’s clearly a side comment and the goal is to automate as much as possible (also leaving open the implementation of a schema language in order to account for domain-specific data structures and resolution strategies).

An implementation may keep metadata about the provenance of each value (who made the change, on which device, at what time) to assist the developer in automatically resolving such conflicts in an application-specific way, or deferring to users for manual resolution.
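As an illustration of what resolving with provenance metadata could mean in practice, here is a minimal sketch. The record shape (`value`, `author`, `device`, `timestamp`), the function name and the tolerance are all assumptions, and it relies on device clocks being roughly in sync, which is not guaranteed.

```javascript
// Hypothetical sketch: each candidate value carries provenance metadata.
// Assumes timestamps are milliseconds from reasonably synchronized clocks.
function resolveWithProvenance(a, b, toleranceMs = 1000) {
  if (Math.abs(a.timestamp - b.timestamp) <= toleranceMs) {
    // Too close to call automatically: defer to the user.
    return { winner: null, candidates: [a, b] };
  }
  // Otherwise prefer the more recent change.
  const winner = a.timestamp > b.timestamp ? a : b;
  return { winner, candidates: [a, b] };
}

const fromPhone = { value: "buy oat milk", author: "alice", device: "phone", timestamp: 1474641000000 };
const fromLaptop = { value: "buy soy milk", author: "alice", device: "laptop", timestamp: 1474641060000 };

const decision = resolveWithProvenance(fromPhone, fromLaptop);
console.log(decision.winner.device); // laptop
```

When `winner` is `null`, the application would show both candidates to the user instead of picking one itself.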

That said, I do agree with the following statement:

But I disagree that having a “proper database” on the remote end solves this in a way that keeps the idea of remoteStorage being a simple key/value store. It would likely put us on a path towards considerably more complex server specs and implementations at some point.

Having some kind of optional ORM on the client-side however – in addition to simple object operations – makes a lot of sense to me, and adding APIs for resolving conflicts to that seems to naturally make sense as well then.

[quote="untitaker, post:2, topic:354"]
Since remoteStorage.js now knows so much about the data it stores, it can now show a simple dialog that shows the conflicting task items side by side and ask the user to pick one. This doesn’t necessarily look pretty since it will also show a lot of internal information (such as the table row’s ID).
[/quote]

I can imagine this as being an add-on for the (new) widget, with some kind of simple API to pass the conflict data as well as a schema and/or fields to be hidden for example. In addition to that, it should be easy to do it with existing UI libs, of course.