Resumable file uploads

Hi all, resumable file uploads have been at the top of the list of features to add to the RS spec, specifically because they would enable us to deal with large files where a normal single HTTP POST would be impractical.

There’s been some discussion about it here:

General Approach
I’ve been toying with a proof of concept for this, and while it’s not yet complete, the general idea would be as follows:

  1. Select a file in browser
  2. Generate a checksum of the entire binary blob
  3. Chop it up into reasonably sized binary chunks
  4. Generate a checksum of the payload?
  5. Make a series of POSTs to the RS server, using custom headers to indicate the byte range/total and the checksums of the payload and final file?
  6. On the RS side, put the pieces back together and verify the checksum
  7. Send back header response based on validation of the checksum(s)

I’ve got steps 1-4 completed in a simple demo app, along with half of step 5 (I’m generating headers based on the data, but not submitting to the RS server yet):

Here are the custom headers I’m using at the moment. If anyone has any suggestions for existing headers we could use in place of any of these, feel free to chime in here so we can improve this as we go.

X-Content-ID: d41d8cd98f00b204e9800998ecf8427e
X-Content-Range-ID: f1ba78aaee2fce91983793c8b90a38a5
X-Content-Range: bytes 2621440-3145728/3472612
Content-Type: image/jpeg

So, in this case X-Content-ID is the checksum of the original data loaded into the browser, X-Content-Range-ID is the checksum of the payload in this POST, and X-Content-Range indicates the byte range (start-end) of this POST’s payload, followed by a slash and the total size of the data being sent. Finally, Content-Type is the only “standard” header so far, indicating the file type.
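
To make that concrete, here’s a minimal sketch of how a client might assemble those headers for one chunk and POST it, assuming SparkMD5 (mentioned further down) for the checksums; the URL is a placeholder, and Blob.arrayBuffer() could be swapped for FileReader.readAsArrayBuffer where it isn’t available:

// Sketch: build the headers above for one chunk and POST it (endpoint URL is a placeholder)
const CHUNK_SIZE = 512 * 1024;

async function uploadChunk(file, fileHash, start) {
  const end = Math.min(start + CHUNK_SIZE, file.size);
  const chunk = await file.slice(start, end).arrayBuffer();

  return fetch('https://storage.example/path/to/photo.jpg', {
    method: 'POST',
    headers: {
      'X-Content-ID': fileHash,                                // checksum of the whole file
      'X-Content-Range-ID': SparkMD5.ArrayBuffer.hash(chunk),  // checksum of this payload
      'X-Content-Range': `bytes ${start}-${end}/${file.size}`, // byte range / total size
      'Content-Type': file.type
    },
    body: chunk
  });
}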

  • Question What should the server responses look like?
  • Question What should the server do with each payload until it’s ready to re-constitute the file? Temp storage dir? User storage?

Optimization & Performance
In my experience dealing with large binary blobs in the browser (several gigs, for example), there are some less-than-ideal side effects.

  • Loading the entire file into the browser (memory usage skyrockets)
  • Processing large binary files causes performance issues, and when done in the main thread it can make the UI completely unresponsive.

To account for these issues, we’d probably need to implement one or more web-workers to handle different parts of the process. Since web-workers run in their own thread, we can keep as much work as possible off the main thread, so it doesn’t have too heavy an effect on the app itself. To reduce memory consumption, we can discard old payloads once they’ve been uploaded to the server. This means memory usage would likely peak upfront and then drop after each POST.
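
As a rough sketch (not the demo code) of what offloading the hashing to a worker could look like; the file names and message shapes here are made up:

// worker.js (hypothetical file name): hash chunks off the main thread
importScripts('spark-md5.min.js');

self.onmessage = (e) => {
  // e.data.buffer is an ArrayBuffer transferred (not copied) from the main thread
  const hash = SparkMD5.ArrayBuffer.hash(e.data.buffer);
  self.postMessage({ index: e.data.index, hash });
};

// main thread: transfer each chunk so it isn't copied, and drop the reference
// once its POST succeeds so the memory can be reclaimed
const hashWorker = new Worker('worker.js');
hashWorker.onmessage = (e) => { /* attach e.data.hash to the pending POST for chunk e.data.index */ };

function hashChunkInWorker(buffer, index) {
  hashWorker.postMessage({ buffer, index }, [buffer]); // transfer list: the worker now owns the buffer
}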

  • Question Memory leaks can be tricky. How best to handle cleanup? Can we discard the entire web-worker after each upload?

Your thoughts?
These are just my general thoughts on the topic, without having dug into the more difficult aspects of the implementation details yet. If anyone has any ideas or suggestions, feel free to comment here, and we can hopefully get something done, if not in the next spec draft, then in the following one in the spring.

Cheers
-Nick

The Streams API would be useful here. I’ve asked a contact at Mozilla if that’s what they are currently implementing. I think there’s a polyfill, but haven’t been able to locate it yet.

Even Shared Workers don’t outlive their associated pages. If the objective is fire-and-forget uploads, we need a Service Worker using Background Sync. Background Sync is currently available only in Chromium browsers, but in development for Firefox and Edge. If keeping activity off the foreground thread is important when Background Sync is unavailable, the upload can run in a Service Worker, which is available on all modern browsers. If it isn’t run on a regular fetch event, it can be run on a custom fetch event (i.e. use a path that doesn’t correspond to a path on the server).
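
For illustration, a minimal sketch of the Background Sync piece of that; the tag name is arbitrary and uploadPendingChunks() is a hypothetical function that would read queued chunks (e.g. from IndexedDB) and POST them:

// Page: after queueing the pending chunks, register a one-off sync (Chromium-only today)
navigator.serviceWorker.ready
  .then((reg) => reg.sync.register('rs-chunk-upload'));

// service-worker.js: retry the queued chunks whenever connectivity allows
self.addEventListener('sync', (event) => {
  if (event.tag === 'rs-chunk-upload') {
    event.waitUntil(uploadPendingChunks()); // hypothetical: read queued chunks and POST them
  }
});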

Thanks for getting the ball rolling on this!

While looking into how we could best implement uploads on our servers at 5apps, and as our backend object storage uses an S3-compatible API, we checked out S3 multi-part uploads. Their docs are nice and clear on how it all works.

Summarizing the S3 way (a rough sketch of the flow follows the list):

  • Explicitly create a specific upload for an object using a special POST request, upon which the server generates and returns its own upload ID, which is then used in a URL param in the PUT requests for the chunks.
  • The client also sends an MD5 hash of the object so the server can validate that it has received and assembled it correctly.
  • The PUT requests do not need to specify the content range in bytes, but only a sequential ordering number, so the server knows how to put the pieces together in the end.
  • Each PUT responds with an ETAG for the chunk, which the client has to collect until done. Its response status code will be 100 Continue until the last chunk, where it’s a 200 OK.
  • To finish the upload, the client provides the list of all successfully uploaded chunks with their ETAGs for the server to re-assemble into the final object.
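
Here’s roughly the shape of that flow as fetch calls; request signing is omitted and the two XML helpers are hypothetical, so this is a sketch of the sequence rather than working S3 code:

// Rough shape of the S3 multi-part flow described above (auth and XML handling omitted)
async function multipartUpload(parts) {
  // 1. Initiate: the server generates and returns an upload ID (in an XML response)
  const initiated = await fetch('https://bucket.example/my-object?uploads', { method: 'POST' });
  const uploadId = parseUploadId(await initiated.text()); // hypothetical XML-parsing helper

  // 2. Upload each part with a sequential part number; collect the ETag from each response
  const etags = [];
  for (let partNumber = 1; partNumber <= parts.length; partNumber++) {
    const res = await fetch(
      `https://bucket.example/my-object?partNumber=${partNumber}&uploadId=${uploadId}`,
      { method: 'PUT', body: parts[partNumber - 1] }
    );
    etags.push(res.headers.get('ETag'));
  }

  // 3. Complete: send the part-number/ETag list so the server assembles the final object
  return fetch(`https://bucket.example/my-object?uploadId=${uploadId}`, {
    method: 'POST',
    body: buildCompleteXml(etags) // hypothetical helper producing the completion XML body
  });
}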

My takeaways:

  • If the client chooses the upload ID, we’d probably want to have some rules for its format
  • We could and probably should use the same response codes (100/200) for the PUT responses. ETAGs could make sense as well
  • URL params vs. headers might actually be a good idea, because:
    1. URLs are usually logged by default, which could make debugging easier
    2. As its own valid resource, a chunk constitutes its own (part) object addressable by URL
  • The MD5 is a good idea imo, but it might make using Streams impossible, as we can’t hash the entire object when we only see parts of it at a time
  • Sequential part numbers instead of content ranges might be nice for RS, too. This way the server doesn’t have to keep track of the specific ranges during the upload.

I think where the server keeps the payloads until re-assembly should be up to the server, so implementers can choose whatever makes the most sense for their server architecture. Requiring that they be put in user storage would prevent the necessary flexibility imo.

I agree with @DougReeder that ideally we should use Streams in the first place. However, I don’t know if we can actually read files via streams in all browsers right now.

I was talking with Melvin Carvalho today about this topic, and he pointed me to https://uppy.io/, which is based on an open protocol for resumable uploads called tus that I hadn’t heard of before.
https://tus.io/protocols/resumable-upload.html#core-protocol

Perhaps adopting this would be a better option than implementing something ourselves? There is a JS client, but the official server is implemented in Go; I’m not sure if a node.js version exists.

I just reviewed the Tus protocol and noticed some things that could make it a bad fit for RS (and SOLID for that matter):

  1. It is rather complex, because the base protocol is optimized for streaming uploads, while multi-part uploads are an extension to the protocol that requires uploading the chunks under different filenames and then concatenating them on the server.
  2. It uses custom headers extensively, but file metadata is optional. Not even a file name is prescribed by default. Creating new files is also an extension.
  3. Content length is required for the final object. There’s an extension for delaying the announcement of it and adding it to a later PATCH request.
  4. The protocol has its own discovery mechanism via OPTIONS requests and custom response headers, but it’s only a SHOULD.
  5. There’s no apparent plan to publish this at the IETF or with any other standards body.

I think the main issue with just requiring the Tus protocol is that it is unnecessarily complex for RS server implementers (on the client side, at least, there are enough existing libraries one could re-use).

We would need to support both the normal resumable uploads that stream to a single resource and the concatenation feature, if we want parallel, chunked uploads of files (in addition to catching upload failures on single resources, then discovering how much the server retrieved, then continuing from there). In fact, I think the concept of multi-part uploads is nicer overall for implementing resumable uploads in the first place, because both sides only have to keep track of which chunks have failed and then retry those.

Furthermore, we’d have to use a mix of core features and required extensions, but also change core feature requirements, because e.g. we always need a filename/URL and a content type. So in a way, we’d have to abuse that protocol a little bit and bend it to our needs in a way that makes our use of it non-standard. And we’d have to describe all of that in the RS spec, instead of just saying “servers can support resumable uploads, just use Tus and announce it in Webfinger”.

That said, the authors of Tus have been very welcoming to changes and contributions, so maybe it’s possible to change Tus in a way that allows us to more easily integrate it. I’m just a bit skeptical that it would make things easier for implementers in the end, because the fundamental difference in use cases is probably not something we can meaningfully adjust.

Interesting:

  :point_right: GitHub - tus/tus-node-server: Node.js tus server, standalone or integrable in any framework, with disk, S3, and GCS stores.

I think it is possible to avoid loading the whole file into memory in any browser that supports Blob.slice. A File is a Blob. That makes it possible to write a polyfill for the Streams API by the time we need it, if no one else has written one.

So there’s a different way of reading files in a browser than FileReader? Because I don’t see how you could stop that one from reading the entire file, even if you can listen in on progress while it’s reading.

FileReader works on Blobs as well as Files, so you just use readAsArrayBuffer on one of the slices.
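
For reference, a minimal version of that slice-and-read loop, so that only one slice’s ArrayBuffer is alive at a time (the slice size is arbitrary):

// Read a large File slice by slice; only the current slice is held in memory
function readSlice(file, start, size) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result);  // ArrayBuffer of just this slice
    reader.onerror = () => reject(reader.error);
    reader.readAsArrayBuffer(file.slice(start, Math.min(start + size, file.size)));
  });
}

async function walkFile(file, sliceSize = 100000) { // e.g. 100,000-byte slices
  for (let start = 0; start < file.size; start += sliceSize) {
    const buffer = await readSlice(file, start, sliceSize);
    // hash/upload the buffer here, then let it go out of scope
  }
}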

When I run this Pen: https://codepen.io/DougReeder/pen/oaKBQY
I get these increases in memory:
Firefox:
  • 1.8MB file in 100,000-byte slices: 20.35MB → 20.87MB
  • 309MB file in 100,000-byte slices: 20.61MB → 27.35MB
Chrome:
  • 309MB file in 100,000-byte slices: 18.0MB → 18.3MB

If you want to tinker with this, I’ve put the code on GitHub:

I’m not sure I have the test quite right, though. Can someone check that for peak memory usage?

Very cool! :clap:

With a 735MB file in 100,000-byte slices, over the entire course of the read procedure, my Firefox stays right around 30MB, while Chromium tops out at under 19MB. And with 1,000,000-byte slices, I get 16MB in Chromium and 26MB in Firefox.

@silverbucket I guess that solves one of the main issues you outlined. No need to ever load an entire file into memory to begin with.

Awesome! I was hoping that would be the case with loading files, but hadn’t done any tests. When downloading a file in chunks (resumable downloads are what I had previously worked on for a specific project), you have the problem that you cannot combine the file until all chunks are received, meaning you need to hang on to every chunk until the end.

I guess since we’re posting the chunks to the server right away we can keep the memory usage very low, but the downside would be that we won’t ever know the full checksum of the file. Correct?

I would think so. But maybe @DougReeder has a solution for that, too :slight_smile:.

I guess that’s also why checksum verification is an optional extension in Tus.

In general, hashing algorithms accept chunks of bytes at a time - we just need to find some JavaScript code that accepts input in chunks (SubtleCrypto does not, alas). My first attempt used http://www.bichlmeier.info/sha256.html but that gives wrong results when you feed it more than one chunk. :frowning:

Probably one of these would work: https://www.npmjs.com/search?q=message%20digest

@DougReeder have you looked at the demo repo I linked to in the initial post? We’re using SparkMD5 to create checksums, and it’s fairly simple. But the question is about getting a checksum of the entire file: in the current example we’re reading the whole file to get its checksum, then splitting it up into chunks and getting checksums of each of those chunks as we generate the POST headers for each payload.

Ah! SparkMD5 does nicely. I’ve updated FileSlicer to calculate chunk and overall MD5s:

You can calculate an MD5 of the whole file in your demo-rs-chunked-upload by adding:

// create once, before reading any chunks
let md5whole = new SparkMD5.ArrayBuffer();

// append each chunk as it is read
md5whole.append(data.chunk);

// after the last chunk, get the hex digest of the whole file
let hash = md5whole.end();

My thought is that, initially it’s fine to have a fixed chunk size, but it should be possible for the uploader to vary the chunk size according to network conditions and performance. For example, if it takes more than five minutes to upload a chunk, it would make sense to make the next chunk smaller. If a chunk uploads in less than a second, the next chunk could probably be larger.

The initial chunk size could be determined based on the Network Information API, where it’s available.

Another possible feature would be to only upload large files over wi-fi or wired connections. However, that depends critically on the Network Information API, which suggests a low priority.
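
A rough sketch of both ideas; the thresholds mirror the five-minute/one-second examples above but are otherwise arbitrary, and navigator.connection only exists in some browsers:

// Pick an initial chunk size from the Network Information API where it exists,
// then grow or shrink it based on how long each upload took (all limits arbitrary)
let chunkSize = 512 * 1024;
if (navigator.connection && navigator.connection.effectiveType === '4g') {
  chunkSize = 4 * 1024 * 1024;
}

async function uploadChunkTimed(url, chunk) {
  const started = Date.now();
  const res = await fetch(url, { method: 'PUT', body: chunk });
  const seconds = (Date.now() - started) / 1000;

  if (seconds > 300) {                      // over five minutes: shrink the next chunk
    chunkSize = Math.max(chunkSize / 2, 64 * 1024);
  } else if (seconds < 1) {                 // under a second: the next chunk can be larger
    chunkSize = Math.min(chunkSize * 2, 16 * 1024 * 1024);
  }
  return res;
}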

Works like a charm! Also, I ran this locally now, and it seems like all the CodePen scripts took up 20MB of memory for themselves. With just the plain app running, it’s now around 10MB of memory usage in my Firefox for the same file I tried last time. :tada:

I had the exact same thoughts. That’s one of the reasons why I like the approach of just specifying part numbers instead of byte ranges. This way, the server doesn’t have to keep track of byte ranges, but also the client can change the range for failed parts dynamically for the next try. I think it’s both cleaner and more flexible.

As with streaming data, we can only know the MD5 after having read everything and already uploaded most of it. I guess that means we’d have to submit the hash last, right? So how do we feel about special requests for finishing (and possibly creating) the upload?

I also just compared how S3 uses MD5s again, and they actually use them to validate parts instead of the final object, to ensure nothing was lost during transmission of the chunk. In the completion request it then requires the returned part ETAGs to assemble the file. Validating parts also makes sense I think, because if the hash doesn’t match at the end of your 2GB upload, then you have to delete and re-upload the entire file, whereas with part validation, you only need to re-upload invalid parts.


For reference, and because these docs are a bit difficult to find from the overview:

md5whole.append(data.chunk);

Ah right, of course. I’ve used this feature many times in the past, but it’s been a few years since I’ve done anything like this, so I’m a bit rusty in my thinking. This is exactly how I implemented the download MD5 verification in a previous project. I wonder if I can find a copy of that code, as there are probably a few lessons learned we could apply to this. I had also implemented adjustable chunk sizes (as you mentioned) and worker threads to offload work from the main thread, and handled a lot of different edge cases.

I had the exact same thoughts. That’s one of the reasons why I like the approach of just specifying part numbers instead of byte ranges. This way, the server doesn’t have to keep track of byte ranges, but also the client can change the range for failed parts dynamically for the next try. I think it’s both cleaner and more flexible.

I’m not sure why numbers would be better; you’d just lose a bit of information about the size of the payload. There’s nothing that says each byte range in a sequence of payloads has to be the same size. So, while the first payload could be, for example, 500 bytes, the next could be 300 bytes, and the following one 800 bytes. Neither the client nor the server would need to do anything special because the sizes differ.

As with streaming data, we can only know the MD5 after having read everything and already uploaded most of it. I guess that means we’d have to submit the hash last, right? So how do we feel about special requests for finishing (and possibly creating) the upload?

Perhaps a stand-alone HTTP request (independent of whether or not an upload just finished) that compares a provided checksum against the checksum of a given location?
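
Something like this, perhaps; the header name and response codes are pure invention, just to illustrate the idea:

// Hypothetical stand-alone verification request; nothing here is specified yet
fetch('https://storage.example/path/to/photo.jpg', {
  method: 'HEAD',
  headers: { 'X-Verify-Checksum': 'd41d8cd98f00b204e9800998ecf8427e' } // invented header name
}).then((res) => {
  // e.g. the server could answer 200 if its stored checksum matches, 412 if it doesn't
  console.log('checksum matches:', res.ok);
});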

I think the server should automatically handle a finished upload the minute the last byte is received, so it shouldn’t be predicated on a special finished request.

I also just compared how S3 uses MD5s again, and they actually use them to validate parts instead of the final object, to ensure nothing was lost during transmission of the chunk. In the completion request it then requires the returned part ETAGs to assemble the file. Validating parts also makes sense I think, because if the hash doesn’t match at the end of your 2GB upload, then you have to delete and re-upload the entire file, whereas with part validation, you only need to re-upload invalid parts.

Absolutely, I agree we should validate each chunk; that’s why I implemented the checksum header for each chunk in the demo mock-up.