Adapting to more high-performance use cases

This week I tried to store 5000 email headers in an unhosted web app; about 8 MB of data when JSON-stringified. Compared to standard LAMP or native C/C++ development (where anything under a million rows is considered a small dataset), the experience was really quite slow and problematic.

We had also already found that doing many consecutive commits to IndexedDB is slow (importing an address book into remoteStorage takes several minutes), and that IndexedDB sometimes throws an error at moments where you really don't want it to.

Basically, I think we should move IndexedDB out of the critical path of how an app accesses its data.

I think the way forward is an approach more like memory paging. This would require a change in the BaseClient API (a rough sketch follows the list):

  • Keep subtrees of the data tree in memory, and allow synchronous access to them.
  • Provide functions to load and unload specific subtrees from IndexedDB, to keep memory usage down.
  • The official local copy of the data lives in memory; from there we push a changes feed both to IndexedDB and to the remote (if connected).
  • For efficiency, data in IndexedDB can be stored in bigger composite objects representing whole subtrees. We can also store incrementals and replay them on top of the latest snapshot when recovering from a browser crash or page refresh. These incrementals are then folded into the snapshot when the subtree is unloaded from memory.
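
To make that concrete, here is a rough sketch of what such a subtree cache could look like. None of these names are real remotestorage.js API; the two push callbacks stand in for whatever actually writes to IndexedDB and to the remote:

```js
// Hypothetical sketch of an in-memory subtree cache with a changes feed.
// None of these names are real remotestorage.js API.
function SubtreeCache(pushToIndexedDB, pushToRemote) {
  this.subtrees = {};                 // path prefix -> { key -> value }
  this.pushToIndexedDB = pushToIndexedDB;
  this.pushToRemote = pushToRemote;
}

// Load a subtree into memory (async, e.g. from IndexedDB); unload to free memory.
SubtreeCache.prototype.load = function (prefix, loadFromIndexedDB) {
  var self = this;
  return loadFromIndexedDB(prefix).then(function (items) {
    self.subtrees[prefix] = items;
  });
};
SubtreeCache.prototype.unload = function (prefix) {
  delete this.subtrees[prefix];
};

// Synchronous access, only valid for subtrees that are currently loaded:
SubtreeCache.prototype.get = function (prefix, key) {
  var subtree = this.subtrees[prefix];
  return subtree ? subtree[key] : undefined;
};
SubtreeCache.prototype.set = function (prefix, key, value) {
  if (!this.subtrees[prefix]) {
    throw new Error('subtree not loaded: ' + prefix);
  }
  this.subtrees[prefix][key] = value;            // memory holds the official copy
  var change = { path: prefix + key, value: value };
  this.pushToIndexedDB(change);                  // write-behind persistence
  this.pushToRemote(change);                     // queued for sync if connected
};
```

The point is that `get` and `set` are synchronous; persistence and sync ride on the changes feed in the background.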

There are also several data structures that are useful here:

One important one is an append-only history, where older parts are archived and only the newest part is kept loaded in memory.
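
As a sketch (the chunking scheme and names are made up), such a history could keep only the current chunk in memory and hand full chunks to an archive callback:

```js
// Sketch of an append-only history split into chunks (names made up).
// Only the newest chunk stays in memory; full chunks go to an archive callback.
function History(chunkSize, archiveChunk) {
  this.chunkSize = chunkSize;        // e.g. 1000 entries per chunk
  this.archiveChunk = archiveChunk;  // e.g. stores the chunk under /history/<n>
  this.chunkNo = 0;
  this.current = [];
}

History.prototype.append = function (entry) {
  this.current.push(entry);
  if (this.current.length >= this.chunkSize) {
    this.archiveChunk(this.chunkNo, this.current);  // persist the full chunk
    this.chunkNo++;
    this.current = [];                              // keep only the newest part in memory
  }
};

History.prototype.newest = function () {
  return this.current;  // synchronous access to the in-memory tail
};
```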

Another is the PrefixTree I use in https://github.com/michielbdejong/meute/blob/master/remotestorage-modules/module-utils.js#L1-L134, which automatically creates deeper subdirectories for longer prefixes as you keep adding items to it.
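
I won't copy the actual module-utils.js code here; the following toy version only illustrates the principle that the tree gets deeper (longer prefixes, more subdirectories) as items keep being added:

```js
// Toy illustration of the PrefixTree idea (not the actual module-utils.js code).
// Items live under a path derived from a prefix of their key; when the tree
// grows past a threshold, a longer prefix (one more directory level) is used.
function prefixPath(key, depth) {
  // e.g. prefixPath('a1b2c3', 2) -> 'a/1/a1b2c3'
  return key.slice(0, depth).split('').join('/') + '/' + key;
}

function PrefixTree(maxPerDir) {
  this.maxPerDir = maxPerDir;  // rough target for items per directory
  this.depth = 1;
  this.items = {};             // path -> value
  this.count = 0;
}

PrefixTree.prototype.add = function (key, value) {
  this.items[prefixPath(key, this.depth)] = value;
  this.count++;
  // assuming hex-ish keys, there are at most 16^depth leaf directories
  if (this.count > this.maxPerDir * Math.pow(16, this.depth)) {
    this.depth++;
    this.rehash();             // move everything under the deeper paths
  }
};

PrefixTree.prototype.rehash = function () {
  var old = this.items, self = this;
  this.items = {};
  Object.keys(old).forEach(function (path) {
    var key = path.split('/').pop();
    self.items[prefixPath(key, self.depth)] = old[path];
  });
};
```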

And sometimes it's useful to keep multiple indexes into the same data; for instance, I may want a pointer to the last message from each specific contact, plus a list of contacts who match a certain text-search string.
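
A sketch of what two such indexes over the same message store could look like (all names made up, plain in-memory JavaScript):

```js
// Sketch: two in-memory indexes maintained next to the message data itself.
var messages = {};         // messageId -> { from, date, subject, ... }
var lastMessageFrom = {};  // contact -> messageId of their newest message
var contactsByWord = {};   // search word -> set of contacts ({ contact: true })

function indexMessage(id, message) {
  messages[id] = message;

  // index 1: pointer to the last message per contact (date is a timestamp)
  var prev = lastMessageFrom[message.from];
  if (!prev || messages[prev].date < message.date) {
    lastMessageFrom[message.from] = id;
  }

  // index 2: which contacts match a given text-search word
  message.subject.toLowerCase().split(/\W+/).forEach(function (word) {
    if (!word) { return; }
    contactsByWord[word] = contactsByWord[word] || {};
    contactsByWord[word][message.from] = true;
  });
}

// e.g. contactsMatching('invoice') -> ['alice@example.com', ...]
function contactsMatching(word) {
  return Object.keys(contactsByWord[word.toLowerCase()] || {});
}
```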

Actually, maybe we want to keep all of this in the modules, as it is now. The BaseClient can expose the promises interface, with slow access to IndexedDB and all the errors it throws, while each module provides a fast synchronous interface to in-memory data and optimizes what it stores and how it loads and unloads data in memory.
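
For example, a module could hide the asynchronous client behind a synchronous in-memory API, roughly like this (sketch only; privateClient is the module's normal client, and the caching logic wrapped around it is hypothetical, not existing remotestorage.js code):

```js
// Sketch of a module wrapping the async client with a synchronous cache.
function makeCachedModule(privateClient) {
  var cache = {};

  return {
    // one-time async load of a subtree into memory
    load: function (prefix) {
      return privateClient.getAll(prefix).then(function (objects) {
        Object.keys(objects || {}).forEach(function (key) {
          cache[prefix + key] = objects[key];
        });
      });
    },

    // fast synchronous reads from memory
    get: function (path) {
      return cache[path];
    },

    // writes hit memory immediately and are persisted in the background
    set: function (path, object) {
      cache[path] = object;
      privateClient.storeFile('application/json', path, JSON.stringify(object))
        .then(null, function (err) {
          console.log('write-behind to storage failed', err);
        });
    }
  };
}
```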

IndexedDB is actually made for this, so maybe we should investigate why it’s slow with the current code first?

Yeah, I asked Dale Harvey from PouchDB about it and learned some more from links he pointed me to. Basically, the big penalty is committing writes; everything else is fast. I also think I have an idea why the AbortErrors were happening. I'll revisit caching-layer performance once I get sync-per-node working.
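
For the record, the commit penalty is easy to reproduce with plain IndexedDB, outside of remoteStorage: n writes in n transactions pay the commit cost n times, while n writes in one transaction pay it once. Something like this shows the difference in a browser console (exact numbers vary a lot between browsers and disks):

```js
// Rough micro-benchmark: n writes in n transactions vs. n writes in 1 transaction.
function openDb() {
  return new Promise(function (resolve, reject) {
    var req = indexedDB.open('commit-test', 1);
    req.onupgradeneeded = function () { req.result.createObjectStore('items'); };
    req.onsuccess = function () { resolve(req.result); };
    req.onerror = function () { reject(req.error); };
  });
}

function writeOnePerTransaction(db, n) {
  var p = Promise.resolve();
  for (var i = 0; i < n; i++) {
    (function (i) {
      p = p.then(function () {
        return new Promise(function (resolve) {
          var tx = db.transaction('items', 'readwrite');
          tx.objectStore('items').put({ i: i }, 'a' + i);
          tx.oncomplete = resolve;             // one commit per write
        });
      });
    })(i);
  }
  return p;
}

function writeAllInOneTransaction(db, n) {
  return new Promise(function (resolve) {
    var tx = db.transaction('items', 'readwrite');
    var store = tx.objectStore('items');
    for (var i = 0; i < n; i++) {
      store.put({ i: i }, 'b' + i);
    }
    tx.oncomplete = resolve;                   // one commit for all n writes
  });
}

openDb().then(function (db) {
  var t0 = Date.now();
  return writeOnePerTransaction(db, 500).then(function () {
    console.log('500 transactions:', Date.now() - t0, 'ms');
    var t1 = Date.now();
    return writeAllInOneTransaction(db, 500).then(function () {
      console.log('1 transaction:  ', Date.now() - t1, 'ms');
    });
  });
});
```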

… and so I did; things are starting to get properly high-tech now. Sync-per-node makes sure multiple threads plow away at the asynchronous synchronization in a way that's robust against page refreshes, request timeouts, and interruptions of connectivity. But the problem was still happening: all the IndexedDB writes were queueing up because they weren't batched. To fix this, I added a "commit cache" to our IndexedDB storage. I'm already using it in meute, and a PR for it is in preparation. See Add a commit cache to the IndexedDB storage · Issue #622 · remotestorage/remotestorage.js · GitHub for more details.
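
The idea behind a commit cache, in a nutshell (this is only a sketch of the concept, not the code from that issue or the PR): writes are collected in memory and flushed to IndexedDB in one batched transaction, so many puts share a single commit. The 'nodes' object store name below is an assumption:

```js
// Conceptual sketch of a commit cache (not the actual remotestorage.js code):
// writes are answered from memory immediately and flushed in one transaction.
function CommitCache(db, flushIntervalMs) {
  this.db = db;
  this.pending = {};   // path -> latest value; later writes overwrite earlier ones
  setInterval(this.flush.bind(this), flushIntervalMs || 500);
}

CommitCache.prototype.put = function (path, value) {
  this.pending[path] = value;   // no IndexedDB transaction yet
};

CommitCache.prototype.get = function (path) {
  // reads must check the cache first, so they see not-yet-flushed writes;
  // falling through to IndexedDB on a miss is omitted here
  return this.pending[path];
};

CommitCache.prototype.flush = function () {
  var paths = Object.keys(this.pending);
  if (paths.length === 0) { return; }
  var batch = this.pending;
  this.pending = {};
  var tx = this.db.transaction('nodes', 'readwrite');  // assumes a 'nodes' store
  var store = tx.objectStore('nodes');
  paths.forEach(function (path) {
    store.put(batch[path], path);
  });
  // one commit for the whole batch instead of one per write
  tx.onerror = function () { console.log('flush failed', tx.error); };
};
```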


Just as an aside: have you thought about using Web Workers for these tasks? It seems unreasonable to block the main thread with background work.

Good idea! I opened Consider moving sync to a Web Worker · Issue #623 · remotestorage/remotestorage.js · GitHub about this.
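
In case it helps the discussion there: purely as a sketch (the message format is made up), moving the sync work off the main thread could look roughly like this:

```js
// main thread: hand sync work to a worker so the UI thread stays responsive
var syncWorker = new Worker('sync-worker.js');
syncWorker.onmessage = function (event) {
  if (event.data.type === 'synced') {
    console.log('sync finished for', event.data.path);
  }
};
syncWorker.postMessage({ type: 'sync', path: '/messages/' });

// sync-worker.js: IndexedDB is available inside workers, so reads, writes,
// and network requests can all happen here without blocking the page
onmessage = function (event) {
  if (event.data.type === 'sync') {
    // ... open IndexedDB, diff against the remote, write changes ...
    postMessage({ type: 'synced', path: event.data.path });
  }
};
```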