remoteStorage

Resumable file uploads


#27

There are two major problems with that:

  1. Content ranges on PUT are explicitly forbidden
  2. X headers are officially deprecated
    That’s why Tus has its own custom headers.

Actually this prompted me to do a bit more digging, and I came across some interesting exceptions that might be important for us. See this post:

Essentially, if we’re appending to a file, we can use Content-Range and we could start the transfer with a Content-Length. Haven’t thought it through entirely, but it’s something to consider.

Did you mean Content-Range is forbidden on POST?


#28

No, I mean PUT. The SO answer is actually wrong; Mark Nottingham (who should know) said so in the first comment, and the last comment on the answer explains why:

An origin server that allows PUT on a given target resource MUST send a 400 (Bad Request) response to a PUT request that contains a Content-Range header field (Section 4.2 of [RFC7233]), since the payload is likely to be partial content that has been mistakenly PUT as a full representation.

See https://tools.ietf.org/html/rfc7231
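To make the quoted rule concrete, here is a minimal sketch in Python of the check a server would perform. The function name is hypothetical, and the 200 is only a placeholder for "carry on with normal PUT handling"; the only behaviour taken from the RFC text is the 400 for a PUT carrying Content-Range.

```python
def status_for_put(headers: dict[str, str]) -> int:
    """Return the status for a PUT, looking only at the Content-Range rule."""
    # HTTP header names are case-insensitive, so normalize before checking.
    names = {name.lower() for name in headers}
    if "content-range" in names:
        # Likely partial content mistakenly PUT as a full representation.
        return 400
    # Placeholder: a real server would continue with its usual PUT handling.
    return 200
```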

Tus is actually using PATCH instead of PUT, for appending to the same resource.

Have you read through the entire Tus spec? I think it makes the different approaches I described much more clear.


#29

Thanks for clarifying, it makes sense as PUT is actually an update (overwrite) of a file, not an append, so Content-Range should be rejected. Actually it totally makes sense to use PATCH (I’d forgotten about it), as appending is otherwise not described within the HTTP nomenclature.

Given that, if we went with PATCH + Content-Range - addressing your previous concern, we could already give it the expected total file size up-front, and with every payload sent. What are your thoughts?


#30

I forgot to add - the alternative you were proposing would be to use a series of POST payloads each with a unique resource name, then indicating with a HEAD after the fact to compile the pieces?


#31

One scenario is when no acknowledgement is received after sending a chunk repeatedly (but the connection still appears to be up). One tack to take in such a case is to reduce the chunk size. However, that introduces the possibility of the receiver receiving a long chunk and a short chunk, starting at the same location. In that situation, byte ranges would be better than chunk index numbers.
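A small Python sketch of why byte ranges cope with this case: a long chunk and a short chunk starting at the same location simply overlap, and the receiver can still work out how much of the file it holds without gaps. The function name and the (start, end) tuple layout, with an exclusive end, are assumptions made for the illustration.

```python
def contiguous_length(ranges: list[tuple[int, int]]) -> int:
    """Return how many bytes, counted from offset 0, are covered without a gap.

    Each range is (start, end) with end exclusive. Overlapping or duplicate
    ranges are fine; they just do not add anything new.
    """
    covered = 0
    for start, end in sorted(ranges):
        if start > covered:
            break  # there is a hole before this range, stop counting
        covered = max(covered, end)
    return covered
```

With chunk index numbers, the two chunks of different size would be ambiguous; with ranges, `(1000, 2000)` and `(1000, 1250)` describe exactly what arrived.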


#32

Since TCP is already resending packets that fail, sending explicit chunks with checksums is somewhat redundant.

A different approach is to allow the client to inquire about an apparently-failed upload, have the server respond with the length of file successfully received, and have the client re-send the remainder of the file.

I’ve implemented the client side of the last part - sending only the remainder of the file. The HTTP request won’t succeed unless you happen to have a server that accepts cross-domain posts from anywhere, but you can examine the HTTP request in the Network tab of your browser debugger. I’ve added some custom headers containing the MD5 of the whole file, and the range of bytes being sent.
Pen: https://codepen.io/DougReeder/pen/GwRVxM
Repository: https://github.com/DougReeder/ResumableUpload
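For readers who don't want to open the pen, the idea can be sketched in a few lines of Python. This is not the code from the pen or the repository; the function name and the header names `Upload-Range` and `Upload-MD5` are made up for the illustration, and `received` stands for the length the server reported back.

```python
import hashlib


def remainder_request(data: bytes, received: int) -> tuple[dict[str, str], bytes]:
    """Build headers and body for re-sending only the bytes the server lacks."""
    if not 0 <= received < len(data):
        raise ValueError("nothing left to send, or the reported length is impossible")
    body = data[received:]
    headers = {
        # Hypothetical header names; the last byte position is inclusive.
        "Upload-Range": f"bytes {received}-{len(data) - 1}/{len(data)}",
        # Digest of the whole file, so the server can verify the final result.
        "Upload-MD5": hashlib.md5(data).hexdigest(),
        "Content-Length": str(len(body)),
    }
    return headers, body
```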


#33

PATCH cannot create a resource, which is why in Tus you create the upload via POST, and then PATCH the resource. Also, it is only linear as you cannot PATCH a single resource multiple times in parallel. So basically, if you use POST+PATCH, but you still want parallel uploads, then you need to implement multi-part/parallel uploads as an extra feature, as Tus does with the Concatenation extension. So you end up supporting two different concepts for resumable uploads, with the aggregated implementation complexity of both.

The alternative I meant was to do it similar to what S3 does with multi-part uploads, and because multi-part uploads are also resumable in a way, we could support only that approach, but have both performance and resumability reasonably covered.

The approach is to PUT parts to their own resource, and then tell the server when you’re done, so it can assemble the final resource. So for example you could do something like:

PUT /slvrbckt/videos/isle-de-goree.mpg?part=1 HTTP/1.1
Host: rs.example.com
Content-length: 100000
Content-MD5: ...

[a chunk]

S3 uses upload IDs in query params, too. So their final request is a POST containing a list of the parts (with ETags) to /slvrbckt/videos/isle-de-goree.mpg?uploadId=123. Some of the benefits are that there are no custom headers whatsoever and all requests are just plain standard HTTP, and that you don’t need a linear upload queue on the client side.
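A rough server-side sketch of that flow in Python, to show what "assemble the final resource" involves. It is an in-memory stand-in, not S3's implementation: the class and method names are hypothetical, and using the MD5 hex digest of a part as its ETag is an assumption made for the example.

```python
import hashlib


class MultipartUpload:
    """In-memory stand-in for a server collecting the parts of one upload."""

    def __init__(self) -> None:
        self.parts: dict[int, bytes] = {}

    def put_part(self, number: int, chunk: bytes) -> str:
        """Store a part (in any order; re-sending overwrites) and return its ETag."""
        self.parts[number] = chunk
        return hashlib.md5(chunk).hexdigest()

    def complete(self, manifest: list[tuple[int, str]]) -> bytes:
        """Assemble the final file from the client's (part number, ETag) list."""
        assembled = bytearray()
        for number, etag in sorted(manifest):
            chunk = self.parts.get(number)
            if chunk is None or hashlib.md5(chunk).hexdigest() != etag:
                raise ValueError(f"part {number} is missing or does not match its ETag")
            assembled += chunk
        return bytes(assembled)
```

Because parts are keyed by number, they can arrive out of order or in parallel, and a failed part can be re-sent on its own.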

With the Tus Concatenation extension, you upload to different resources and then send a special concatenation request to assemble the file:

POST /files HTTP/1.1
Upload-Concat: partial
Upload-Length: 5

POST /files HTTP/1.1
Upload-Concat: partial
Upload-Length: 6

PATCH /files/a HTTP/1.1
Upload-Offset: 0
Content-Length: 5

PATCH /files/b HTTP/1.1
Upload-Offset: 0
Content-Length: 6

POST /files HTTP/1.1
Upload-Concat: final;/files/a /files/b

HTTP/1.1 201 Created
Location: https://tus.example.org/files/ab
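The header of that last request is the only non-obvious part, so here is a tiny Python helper that builds it in the format shown above. The function name is hypothetical.

```python
def final_concat_header(partial_urls: list[str]) -> dict[str, str]:
    """Build the Upload-Concat header for the final concatenation request."""
    if not partial_urls:
        raise ValueError("at least one partial upload is required")
    # Format as in the example: "final;" followed by space-separated URLs.
    return {"Upload-Concat": "final;" + " ".join(partial_urls)}
```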

But this has drawbacks as well, of course. The main one being that you need to keep track of uploads (hence S3’s upload IDs and eventual pruning of unfinished uploads). However, when you have linear PATCH requests, your file is always complete and you just stop appending more data to it when you’re done.

Personally, I’m not trying to convince anyone that multi-part uploads are the way to go. I’m just trying to get a complete picture of all available options and their benefits and drawbacks.


#34

I’m wondering now, if doing nothing but Tus Core with zero extensions wouldn’t be the simplest solution for now. For example:

  • Create a file with PUT without adjusting the spec, but make it possible to append data by PATCHing it. Allow asking for offset as per Tus protocol (HEAD request)
  • Do not support parallel/multi-part uploads

Benefits:

  1. This would make us API-compatible with Tus, and people could choose to use existing code/libs for both client and server functionality.
  2. Minimal API surface changes, only adding a few things to RS for being able to resume failed uploads, as well as streaming uploads by PATCHing in client-controlled chunks.
  3. The more people agreeing on Tus, the more likely it is (hopefully) that it’s going to be submitted to IETF. In which case we would then already rely on the right standard.

Drawbacks:

  1. Have to use custom Tus headers that have nothing to do with RS
  2. We’re relying on a standard that did not go through standards body review and processes

Summary of additions, afaics:

  1. HEAD for retrieving offset after upload failure
  2. PATCH for appending content to an existing file (i.e. resume a failed upload)
  3. Add header to PUT for indicating resumable support
  4. Require server to keep partial files, if the client indicated Tus support in the PUT
  5. Add Tus as optional feature, explain it a bit, and ref/link their spec in an RFC-compatible way (find out if possible to rely on external spec if it’s not published by a standards body, but only on a random website). Alternatively, copy the relevant parts and define them in the RS spec.
  6. (optional) Tus protocol info in HEAD/PATCH/OPTIONS response headers (if we want to be officially compliant as Tus server, we need to add all their headers)
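As a rough illustration of items 1 and 2 in the summary above, here is a minimal in-memory sketch in Python. The class and method names are hypothetical and not taken from Tus or any RS server. The 409 on an offset mismatch and the 204 on a successful PATCH follow Tus core; the 400 for data beyond the declared length is an assumption made for this sketch.

```python
class ResumableStore:
    """In-memory model of PUT (create), HEAD (offset) and PATCH (append)."""

    def __init__(self) -> None:
        self.declared: dict[str, int] = {}        # path -> total length announced in the PUT
        self.received: dict[str, bytearray] = {}  # path -> bytes stored so far

    def put(self, path: str, total_length: int, body: bytes) -> int:
        """Start an upload; the body may be shorter than total_length."""
        self.declared[path] = total_length
        self.received[path] = bytearray(body)
        return 201

    def head(self, path: str) -> dict[str, str]:
        """Report how many bytes have arrived, using the Tus header names."""
        return {
            "Upload-Offset": str(len(self.received[path])),
            "Upload-Length": str(self.declared[path]),
        }

    def patch(self, path: str, offset: int, body: bytes) -> int:
        """Append body if offset matches what is stored, otherwise refuse."""
        current = self.received[path]
        if offset != len(current):
            return 409  # Conflict: client and server disagree on the offset
        if offset + len(body) > self.declared[path]:
            return 400  # assumption: reject data beyond the declared length
        current += body
        return 204
```

A client that lost its connection would call `head()` first, then `patch()` with the reported offset and the remaining bytes.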

What do you think?


#35

That

  1. does what we need
  2. can be efficiently implemented in browsers (using fetch(), but not XHR, I think)
  3. has been refined, over years of use

Something functionally equivalent to the Tus headers would need to be sent, anyway.

That has my support.

It might be of value to make a branch with full Tus compliance, allowing easy comparison of the weight.


#36

The Tus Core way sounds good to me, too.

One question that I have: What happens when the client doesn’t complete the upload? Will the user have a partial file in their storage or should the server discard the file after a certain amount of time?
And related to that: Should a multipart file that is in the process of being uploaded already appear in the directory listing or only after the full upload has been completed?


#37

Those are excellent questions, answers to which are indeed missing from my summary of spec changes!

In my opinion, the client should send a PUT with the intended Content-Length and what the server would assume to be the entire file. Then, if for whatever reason the server doesn’t receive that number of bytes (and the reason can be that the client just doesn’t want to send it at once), it needs to wait for the client to finish the upload until the file is considered complete and thus appears in directory listings and such.

As the same version of a file should never be uploaded by two different clients at the same time, only the uploading client has to know about the partial file’s existence. Which it does, because it needs to keep track of what it sent successfully, until the upload is complete.
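The listing rule described here can be sketched as follows in Python. The `Upload` record and the function name are hypothetical; the only rule taken from the text is that a file counts as complete, and becomes visible, once the received bytes match the length announced in the PUT.

```python
from dataclasses import dataclass


@dataclass
class Upload:
    declared_length: int  # Content-Length announced in the initial PUT
    received_length: int  # bytes the server has stored so far

    @property
    def complete(self) -> bool:
        return self.received_length == self.declared_length


def directory_listing(uploads: dict[str, Upload]) -> list[str]:
    """Return only the names of fully received files, sorted by name."""
    return sorted(name for name, upload in uploads.items() if upload.complete)
```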

Expiring the partially uploaded file is actually specified in the Tus Expiration extension. So the question is if we want to allow one or more extensions from the start or not.


#38

@silverbucket What do you think?


#39

I agree, using Tus seems like the best option for us. However, after reading through the Tus docs again (and maybe I’m slightly misunderstanding @raucao’s comments about POST to replace Creation), most of the extensions seem pretty important:

Creation, Expiration, Checksum, and Termination (basically everything aside from Concatenation) seem fairly important, and we’d want to support them. You could make the case that Termination could be dropped and covered with Expiration.

Does this add more complexity to implementation? What are the main concerns with using extensions?


#40

Maybe that’s because I never wrote anything about POST to replace creation? My proposal outlines how we don’t need the creation extension, because we already create files via PUT. The only change necessary to tell the server that the client wants a resumable upload is to add a Tus-Resumable header to that same PUT request.

I don’t see why any of those would be required to make resumable uploads work. They are optional extensions in Tus for a reason. So I think the question is if we want to allow some or all of them (except for creation) as the optional extensions that they are. (Which could be exactly what you said there, it’s just not entirely clear to me.)

If we all agree on Tus being a reasonable direction, then how about we start prototyping with Tus core and see where it takes us?


#41

Just had a quick browse through https://github.com/tus/tus-node-server/tree/master/lib and it looks like it would be fairly easy to rip out the relevant parts and port them to https://github.com/remotestorage/armadietto.

The caveat being that you’d most likely not want to allow large files with the Redis store, but only the filesystem one.

/cc @les


#42

I see, though I’m not sure how that makes us compatible with Tus if a client that expects a Tus-compatible server to start a multi-part upload initializes the file in a different way than the RS server expects.


#43

I see the confusion, but if you consider how both Tus and RS work, then it should be fairly clear why anyone can create uploads any way they want, which is why Creation is an optional extension in Tus.

The entire Tus protocol is an extension to whatever server and client people are running. A Tus client cannot be compatible with RS out of the box in any case, because it doesn’t do any of the other required RS client functionality, like connecting an account, getting a token, and using that token as a header to authorize the upload requests. Tus is an extension that people can add to their client/server solutions in order to facilitate resumable uploads in addition to whatever other functionality they have.


#44

Yeah, that makes sense. So essentially we’d be leaving as much RS code in place as possible (creating or deleting files) while adding just the resume/multi-part upload functionality.

However, that still raises the question of the checksum and expiration/termination extension headers. To me those seem like features we’d really want to have from the get-go as a failsafe against fragile network conditions. How else would we either verify that a transfer wasn’t corrupted along the way, or abort a transfer (with the resources being removed)? I guess that ties back to @galfert’s question as to where the file lives before it’s completed: how do we reference it, if not via the expiration/termination extensions?


#45

I think there’s a reason those are also optional. They are certainly nice to have, but I’d say it’s arguable if they should be required, as they are not strictly necessary to facilitate uploads.

Corruption is mostly mitigated by knowing the exact byte length of the final upload. It’s not bulletproof, but can be considered good enough for the minimum required verification, imo.

The expiration extension is obviously nice to have, but when it’s not implemented, a file would just be incomplete until it’s deleted. The server can also prune after a while, regardless of an extension spec. I think the idea of the expiration extension is that if a server has heavy load, and especially public uploads enabled, then the client can know when to give up on a slow, ongoing upload. But without the extension, you would just receive back an upload byte offset of 0 for the next PATCH in case the partial file has been pruned. Which is surely an edge case until RS is a widely used standard and clients attempt monster uploads of massive files in the first place.
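A minimal Python sketch of server-side pruning without the extension, under the behaviour described above, where a pruned partial simply shows up as offset 0 to the client. The function names, the `(offset, last_seen)` tuple layout and the idle timeout are all assumptions for the illustration.

```python
def prune_partials(
    partials: dict[str, tuple[int, float]], now: float, max_idle: float
) -> list[str]:
    """Drop partial uploads idle for longer than max_idle seconds.

    partials maps a path to (bytes received, timestamp of last activity).
    Returns the pruned paths, sorted.
    """
    stale = sorted(
        path for path, (_, last_seen) in partials.items() if now - last_seen > max_idle
    )
    for path in stale:
        del partials[path]
    return stale


def resume_offset(partials: dict[str, tuple[int, float]], path: str) -> int:
    """Offset the client should continue from; 0 if the partial is gone."""
    offset, _ = partials.get(path, (0, 0.0))
    return offset
```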

The termination extension is already exactly the same as what RS would do, namely a DELETE request to the resource, which should just remove whatever is there, complete or not, by simple HTTP semantics and the existing RS protocol spec. So any RS server already supports that extension by design.


#46

P.S.: I think we agree that servers should probably be allowed to support some of the extensions, especially checksums and expiration.