
Commit dc0cd66

Fixing issues #6, #7, and #8. Adding concurrent uploads of parts.

1 parent ba31035 commit dc0cd66

4 files changed: +229 -69 lines changed

CHANGELOG.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+Changelog
+=========
+
+### 0.5.0 (2014-08-11)
+
+* Added client caching to reuse an existing S3 client rather than creating a new one for each upload. Fixes #6
+* Updated the maxPartSize to be a hard limit instead of a soft one so that generated ETags are consistent due to the reliable size of the uploaded parts. Fixes #7
+* Added this file. Fixes #8
+* New feature: concurrent part uploads. Now you can optionally enable concurrent part uploads if you wish to allow your application to drain the source stream more quickly and absorb some of the bottleneck when uploading to S3.
+
+### 0.4.0 (2014-06-23)
+
+* Now with better error handling. If an error occurs while uploading a part to S3, or while completing a multipart upload, then the in-progress multipart upload will be aborted (to delete the uploaded parts from S3) and a more descriptive error message will be emitted instead of the raw error response from S3.
+
+### 0.3.0 (2014-05-06)
+
+* Added tests using a stubbed-out version of the Amazon S3 client. These tests ensure that the upload stream behaves properly, calls S3 correctly, and emits the proper events.
+* Added Travis integration.
+* Also fixed a bug in the functionality to dynamically adjust the part size.
+
+### 0.2.0 (2014-04-25)
+
+* Fixed a race condition bug that occurred occasionally with streams very close to the 5 MB size threshold, where the multipart upload would be finalized on S3 prior to the last data buffer being flushed, resulting in the last part of the stream being cut off in the resulting S3 file. (Notice: if you are using an older version of this module I highly recommend upgrading to get this bugfix.)
+* Added a method for adjusting the part size dynamically.
+
+### 0.1.0 (2014-04-17)
+
+* Code cleanups and stylistic goodness.
+* Made the connection parameters optional for those who follow Amazon's best practice of letting the SDK get AWS credentials from environment variables or IAM roles.
+
+### 0.0.3 (2013-12-25)
+
+* Merged pull request #2 to fix an issue where the latest version of the AWS SDK required a strict type on the part number.
+
+### 0.0.2 (2013-08-01)
+
+* Improved the documentation.
+
+### 0.0.1 (2013-07-31)
+
+* Initial release.
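
The client caching mentioned in the 0.5.0 entry is exposed through a new `setClient` method on the module (see the lib/s3-upload-stream.js diff below). A minimal usage sketch, assuming the module is required as `s3-upload-stream` and that the two-argument `Uploader(destinationDetails, callback)` form is used so the cached client is picked up:

```js
var AWS = require('aws-sdk');
var s3Stream = require('s3-upload-stream'); // assumed require name

// Register a single S3 client once; Uploader instances created without
// explicit connection details will reuse it instead of building a new
// AWS.S3 object for every upload.
s3Stream.setClient(new AWS.S3({ apiVersion: 'latest' }));

var upload = new s3Stream.Uploader(
  { "Bucket": "example-bucket", "Key": "example-key" }, // hypothetical destination
  function (err, uploadStream) {
    if (err) return console.error(err);
    // Pipe a readable stream into uploadStream here, as in the README examples.
  }
);
```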

README.md

Lines changed: 31 additions & 5 deletions
@@ -6,13 +6,14 @@ A pipeable write stream which uploads to Amazon S3 using the multipart file uplo
 
 ### Changelog
 
-_June 23, 2014_ - Now with better error handling. If an error occurs while uploading a part to S3, or completing a multipart upload then the in progress multipart upload will be aborted (to delete the uploaded parts from S3) and a more descriptive error message will be emitted instead of the raw error response from S3.
+## 0.5.0 (2014-08-11)
 
-_May 6, 2014_ - Added tests using a stubbed out version of the Amazon S3 client. These tests will ensure that the upload stream behaves properly, calls S3 correctly, and emits the proper events. Also fixed bug with the functionality to dynamically adjust the part size.
+* Added client caching to reuse an existing S3 client rather than creating a new one for each upload. Fixes #6
+* Updated the maxPartSize to be a hard limit instead of a soft one so that generated ETags are consistent due to the reliable size of the uploaded parts. Fixes #7
+* Added a CHANGELOG.md file. Fixes #8
+* New feature: concurrent part uploads. Now you can optionally enable concurrent part uploads if you wish to allow your application to drain the source stream more quickly and absorb some of the backpressure from a fast incoming stream when uploading to S3.
 
-_April 25, 2014_ - Fixed a race condition bug that occured occasionally with streams very close to the 5 MB size threshold where the multipart upload would be finalized on S3 prior to the last data buffer being flushed, resulting in the last part of the stream being cut off in the resulting S3 file. Also added a method for adjusting the part size dynamically. (__Notice:__ If you are using an older version of this module I highly recommend upgrading to get this latest bugfix.)
-
-_April 17, 2014_ - Made the connection parameters optional for those who are following Amazon's best practices of allowing the SDK to get AWS credentials from environment variables or AMI roles.
+[Historical Changelogs](CHANGELOG.md)
 
 ### Why use this stream?
 
@@ -146,6 +147,31 @@ var UploadStreamObject = new Uploader(
 );
 ```
 
+### stream.concurrentParts(numberOfParts)
+
+Used to adjust the number of parts that are concurrently uploaded to S3. By default this is just one at a time, to keep memory usage low and allow the upstream to deal with backpressure. However, in some cases you may wish to drain the stream that you are piping in quickly, and then issue concurrent upload requests to upload multiple parts.
+
+Keep in mind that total memory usage will be at least `maxPartSize` * `concurrentParts`, as each concurrent part will be `maxPartSize` large, so it is not recommended that you set both `maxPartSize` and `concurrentParts` to high values, or your process will be buffering large amounts of data in its memory.
+
+```js
+var UploadStreamObject = new Uploader(
+  {
+    "Bucket": "your-bucket-name",
+    "Key": "uploaded-file-name " + new Date()
+  },
+  function (err, uploadStream)
+  {
+    uploadStream.concurrentParts(5)
+
+    uploadStream.on('uploaded', function (data) {
+      console.log('done');
+    });
+
+    read.pipe(uploadStream);
+  }
+);
+```
+
 ### Tuning configuration of the AWS SDK
 
 The following configuration tuning can help prevent errors when using less reliable internet connections (such as 3G data if you are using Node.js on the Tessel) by causing the AWS SDK to detect upload timeouts and retry.
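
The configuration block that sentence refers to falls outside this hunk, so it is not shown here. As a rough sketch of that kind of AWS SDK tuning (the option names `maxRetries` and `httpOptions.timeout` are standard aws-sdk settings, but the values below are illustrative assumptions, not taken from this commit):

```js
var AWS = require('aws-sdk');

// Illustrative values only: treat a stalled part upload as failed after
// two minutes and let the SDK retry it a few times instead of hanging.
AWS.config.update({
  maxRetries: 5,
  httpOptions: { timeout: 120000 }
});
```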

lib/s3-upload-stream.js

Lines changed: 156 additions & 63 deletions
@@ -1,23 +1,34 @@
 var Writable = require('stream').Writable,
+    util = require("util"),
+    EventEmitter = require("events").EventEmitter,
     AWS = require('aws-sdk');
 
+var cachedClient;
+
 module.exports = {
+  setClient: function (client) {
+    cachedClient = client;
+  },
+
   // Generate a writeable stream which uploads to a file on S3.
   Uploader: function (connection, destinationDetails, doneCreatingUploadStream) {
     var self = this;
 
-    if (arguments.length == 2){
+    if (arguments.length == 2) {
       // No connection passed in, assume that the connection details were already specified using
       // environment variables as documented at http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-configuring.html
       doneCreatingUploadStream = destinationDetails;
       destinationDetails = connection;
-      self.s3Client = new AWS.S3();
+      if (cachedClient)
+        self.s3Client = cachedClient;
+      else
+        self.s3Client = new AWS.S3();
     }
     else {
       // The user already configured an S3 client that they want the stream to use.
       if (typeof connection.s3Client != 'undefined')
         self.s3Client = connection.s3Client;
-      else {
+      else if (connection.accessKeyId && connection.secretAccessKey) {
        // The user hardcodes their credentials into their app
        self.s3Client = new AWS.S3({
          apiVersion: 'latest',
@@ -26,19 +37,33 @@ module.exports = {
          region: connection.region
        });
       }
+      else if (cachedClient) {
+        self.s3Client = cachedClient;
+      }
+      else {
+        throw "Unable to find an interface for connecting to S3";
+      }
     }
 
     // Create the writeable stream interface.
     self.ws = Writable({
       highWaterMark: 4194304 // 4 MB
     });
 
+    // Data pertaining to the overall upload
     self.partNumber = 1;
-    self.parts = [];
+    self.partIds = [];
     self.receivedSize = 0;
     self.uploadedSize = 0;
-    self.currentPart = Buffer(0);
-    self.partSizeThreshold = 5242880;
+
+    // Parts which need to be uploaded to S3.
+    self.pendingParts = 0;
+    self.concurrentPartThreshold = 1;
+
+    // Data pertaining to buffers we have received
+    self.receivedBuffers = [];
+    self.receivedBuffersLength = 0;
+    self.partSizeThreshold = 6242880;
 
     // Set the maximum amount of data that we will keep in memory before flushing it to S3 as a part
     // of the multipart upload
@@ -49,31 +74,138 @@ module.exports = {
       self.partSizeThreshold = partSize;
     };
 
+    // Set the maximum number of parts that may be uploaded to S3
+    // concurrently as part of the multipart upload
+    self.concurrentParts = function (parts) {
+      if (parts < 1)
+        parts = 1;
+
+      self.concurrentPartThreshold = parts;
+    };
+
     // Handler to receive data and upload it to S3.
-    self.ws._write = function (Part, enc, next) {
-      self.currentPart = Buffer.concat([self.currentPart, Part]);
+    self.ws._write = function (incomingBuffer, enc, next) {
+      self.absorbBuffer(incomingBuffer);
 
-      // If the current Part buffer is getting too large, or the stream piped in has ended then flush
-      // the Part buffer downstream to S3 via the multipart upload API.
-      if (self.currentPart.length > self.partSizeThreshold)
-        self.flushPart(next);
-      else
+      if (self.receivedBuffersLength < self.partSizeThreshold)
+        return next(); // Ready to receive more data in _write.
+
+      // We need to upload some data
+      self.uploadHandler(next);
+    };
+
+    self.uploadHandler = function (next) {
+      if (self.pendingParts < self.concurrentPartThreshold) {
+        // We need to upload some of the data we've received
+        upload();
+      }
+      else {
+        // Block uploading (and receiving of more data) until we upload
+        // some of the pending parts
+        self.once('chunk', upload);
+      }
+
+      function upload() {
+        self.pendingParts++;
+        self.flushPart(function (partDetails) {
+          --self.pendingParts;
+          self.emit('chunk'); // Internal event
+          self.ws.emit('chunk', partDetails); // External event
+        });
         next();
+      }
+    };
+
+    // Absorb an incoming buffer from _write into a buffer queue
+    self.absorbBuffer = function (incomingBuffer) {
+      self.receivedBuffers.push(incomingBuffer);
+      self.receivedBuffersLength += incomingBuffer.length;
+    };
+
+    // Take a list of received buffers and return a combined buffer that is exactly
+    // self.partSizeThreshold in size.
+    self.preparePartBuffer = function () {
+      // Combine the buffers we've received and reset the list of buffers.
+      var combinedBuffer = Buffer.concat(self.receivedBuffers, self.receivedBuffersLength);
+      self.receivedBuffers.length = 0; // Trick to reset the array while keeping the original reference
+      self.receivedBuffersLength = 0;
+
+      if (combinedBuffer.length > self.partSizeThreshold) {
+        // The combined buffer is too big, so slice off the end and put it back in the array.
+        var remainder = new Buffer(combinedBuffer.length - self.partSizeThreshold);
+        combinedBuffer.copy(remainder, 0, self.partSizeThreshold);
+        self.receivedBuffers.push(remainder);
+        self.receivedBuffersLength = remainder.length;
+
+        // Return the original buffer.
+        return combinedBuffer.slice(0, self.partSizeThreshold);
+      }
+      else {
+        // It just happened to be perfectly sized, so return it.
+        return combinedBuffer;
+      }
+    };
+
+    // Flush a part out to S3.
+    self.flushPart = function (callback) {
+      var partBuffer = self.preparePartBuffer();
+
+      var localPartNumber = self.partNumber;
+      self.partNumber++;
+      self.receivedSize += partBuffer.length;
+      self.s3Client.uploadPart(
+        {
+          Body: partBuffer,
+          Bucket: destinationDetails.Bucket,
+          Key: destinationDetails.Key,
+          UploadId: self.multipartUploadID,
+          PartNumber: localPartNumber
+        },
+        function (err, result) {
+          if (err)
+            self.abortUpload('Failed to upload a part to S3: ' + JSON.stringify(err));
+          else {
+            self.uploadedSize += partBuffer.length;
+            self.partIds[localPartNumber - 1] = {
+              ETag: result.ETag,
+              PartNumber: localPartNumber
+            };
+
+            callback({
+              ETag: result.ETag,
+              PartNumber: localPartNumber,
+              receivedSize: self.receivedSize,
+              uploadedSize: self.uploadedSize
+            });
+          }
+        }
+      );
     };
 
     // Overwrite the end method so that we can hijack it to flush the last part and then complete
     // the multipart upload
     self.ws.originalEnd = self.ws.end;
     self.ws.end = function (Part, encoding, callback) {
      self.ws.originalEnd(Part, encoding, function afterDoneWithOriginalEnd() {
-        if (self.currentPart.length > 0) {
-          //Check to see if a last ending write might have added another part that we will need o flush.
-          self.flushPart(function () {
-            self.completeUpload();
-          });
-        }
-        else
+        if (Part)
+          self.absorbBuffer(Part);
+
+        // Upload any remaining data
+        var uploadRemainingData = function () {
+          if (self.receivedBuffersLength > 0) {
+            self.uploadHandler(uploadRemainingData);
+            return;
+          }
+
+          if (self.pendingParts > 0) {
+            setTimeout(uploadRemainingData, 50); // Wait 50 ms for the pending uploads to finish before trying again.
+            return;
+          }
+
           self.completeUpload();
+        };
+
+        uploadRemainingData();
 
         if (typeof callback == 'function')
           callback();
@@ -88,7 +220,7 @@ module.exports = {
           Key: destinationDetails.Key,
           UploadId: self.multipartUploadID,
           MultipartUpload: {
-            Parts: self.parts
+            Parts: self.partIds
           }
         },
         function (err, result) {
@@ -98,7 +230,7 @@ module.exports = {
          self.ws.emit('uploaded', result);
        }
       );
-    },
+    };
 
     // When a fatal error occurs abort the multipart upload
     self.abortUpload = function (rootError) {
@@ -115,47 +247,6 @@ module.exports = {
          self.ws.emit('error', rootError);
        }
       );
-    },
-
-    // Flush a single part down the line to S3.
-    self.flushPart = function (callback) {
-      var uploadingPart = Buffer(self.currentPart.length);
-      self.currentPart.copy(uploadingPart);
-
-      var localPartNumber = self.partNumber;
-      self.partNumber++;
-      self.receivedSize += uploadingPart.length;
-      self.s3Client.uploadPart(
-        {
-          Body: uploadingPart,
-          Bucket: destinationDetails.Bucket,
-          Key: destinationDetails.Key,
-          UploadId: self.multipartUploadID,
-          PartNumber: localPartNumber
-        },
-        function (err, result) {
-          if (err)
-            self.abortUpload('Failed to upload a part to S3: ' + JSON.stringify(err));
-          else {
-            self.uploadedSize += uploadingPart.length;
-            self.parts[localPartNumber - 1] = {
-              ETag: result.ETag,
-              PartNumber: localPartNumber
-            };
-
-            self.ws.emit('chunk', {
-              ETag: result.ETag,
-              PartNumber: localPartNumber,
-              receivedSize: self.receivedSize,
-              uploadedSize: self.uploadedSize
-            });
-          }
-
-          if (typeof callback == 'function')
-            callback();
-        }
-      );
-      self.currentPart = Buffer(0);
     };
 
     // Use the S3 client to initialize a multipart upload to S3.
@@ -174,3 +265,5 @@ module.exports = {
     );
   }
 };
+
+util.inherits(module.exports.Uploader, EventEmitter);
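
Pulling the pieces of this commit together, here is a minimal end-to-end sketch: the cached client from `setClient`, the new `concurrentParts` setting, and the per-part `chunk` events now emitted out of `flushPart`. The require name and input file path are assumptions for illustration:

```js
var fs = require('fs');
var AWS = require('aws-sdk');
var s3Stream = require('s3-upload-stream'); // assumed require name

// Reused by every Uploader that is not given explicit connection details.
s3Stream.setClient(new AWS.S3({ apiVersion: 'latest' }));

var read = fs.createReadStream('/tmp/example-input'); // hypothetical source file

var UploadStreamObject = new s3Stream.Uploader(
  {
    "Bucket": "your-bucket-name",
    "Key": "uploaded-file-name " + new Date()
  },
  function (err, uploadStream) {
    if (err) return console.error(err);

    // Allow up to 5 parts in flight at once; memory use will be at least
    // partSizeThreshold * 5 while the source stream is being drained.
    uploadStream.concurrentParts(5);

    uploadStream.on('chunk', function (partDetails) {
      // Emitted for every completed part with ETag, PartNumber,
      // receivedSize and uploadedSize.
      console.log('part', partDetails.PartNumber, 'uploaded');
    });

    uploadStream.on('uploaded', function (result) {
      console.log('done', result);
    });

    read.pipe(uploadStream);
  }
);
```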
