Google Cloud Functions Overview

Categories: Cloud, BigData

Introduction

This is a quick introduction to the Google Cloud Functions service, part of the Google Cloud Platform (GCP).

A cloud function is a kind of application which accepts either http-requests or pubsub messages as input, and is limited to providing a single function as the entry-point to the application (a minimal example is shown below). Thus:

  • a cloud function which accepts HTTP requests can only be bound to one URL (endpoint).
  • a cloud function which accepts Pubsub messages can only be bound to one subscription.

Note that the cloud function does not “register a listener” for http or pubsub; instead external configuration (in the cloud-function deployment step) determines under what conditions the function will be executed and what data it will be passed.

In addition:

  • a cloud function has a limit on how long it may run to handle a single request (default: 1 minute; max configurable: 9 minutes). If any request takes longer than the configured maximum time then the container in which the cloud function is running will be killed.
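
For concreteness, a minimal HTTP-triggered cloud function (node.js runtime) looks roughly like the sketch below; the exported name helloHttp is arbitrary, and the URL the function answers on is decided at deployment time rather than in the code:

// index.js - the exported function is the single entry-point of this cloud function.
// For HTTP triggers, req and res are Express-style request/response objects.
exports.helloHttp = (req, res) => {
    res.status(200).send(`Hello, ${req.query.name || 'world'}\n`);
};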

A cloud function is assumed to be non-threadsafe; multiple requests are never sent concurrently to the same container instance. An instance is reused serially, ie once it has finished executing a request it may be sent another one. This means that caching data between requests in “local memory” can increase performance. However requests are distributed evenly across all instances, so storing client-specific state between requests is not possible (except via an external database or key-value-store).
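
As a sketch of such per-instance caching (the function and variable names are purely illustrative, and the cached value is just a stand-in for something expensive to compute):

// Module-scope variables survive between requests handled by the same container
// instance, so they can act as a simple per-instance cache.
let cachedValue = null;
let cachedAt = null;

exports.cachingExample = (req, res) => {
    if (cachedValue === null) {
        // executed only on the first request handled by this instance ("cold start")
        cachedValue = { answer: 42 };          // stand-in for some expensive computation
        cachedAt = new Date().toISOString();
    }
    // later requests routed to this instance reuse the cached value; requests routed
    // to other instances each build their own copy.
    res.status(200).send(`cached since ${cachedAt}: ${JSON.stringify(cachedValue)}\n`);
};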

Languages

A cloud function must be written in Javascript using the node.js framework.

Execution Environment

Each “instance” of the cloud function that Google starts is actually a docker container running Linux, node.js, and the cloud function code. The container image used is even specified: gcr.io/google-appengine/nodejs. You can download this yourself with docker commands, and investigate what is available inside it.
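
For example, using standard docker commands:

docker pull gcr.io/google-appengine/nodejs
docker run --rm -it gcr.io/google-appengine/nodejs /bin/bash
# then inside the container: node --version, cat /etc/os-release, ls /usr/lib, etc.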

Google builds the container for you, but you submit a list of the files to include - and these are not limited to javascript files.

You can specify the desired amount of RAM and disk for the container environment - cloud functions are currently rather flexible here.

Warning: when a cloud function “completes” (by returning from the invoked function, or by invoking the callback passed to it) and there are no pending requests that the cloud-functions infrastructure can allocate to that container instance, then the container is effectively “frozen” (by setting its CPU-share to zero). This means that any “uncompleted callbacks” in node.js will also be “frozen” until the next request is allocated to that container - or until the container is killed. Node.js code must therefore be very careful to complete all asynchronous work before completing the original request.
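
A sketch of the pitfall and the safer pattern, using a hypothetical pretendAsyncWork function as a stand-in for a real database or API call:

// Stand-in for real asynchronous work (eg a database write); resolves after 2 seconds.
function pretendAsyncWork() {
    return new Promise((resolve) => setTimeout(resolve, 2000));
}

// Risky: the asynchronous work is started, but the request is reported as complete
// immediately - the instance may then be frozen with the work still pending.
exports.riskyExample = (event, callback) => {
    pretendAsyncWork();   // nobody waits for this promise to resolve
    callback();
};

// Safer: report completion only after all asynchronous work has finished.
exports.saferExample = (event, callback) => {
    pretendAsyncWork()
        .then(() => callback())
        .catch((err) => callback(err));
};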

Scaling

One of the nice things about cloud-functions is that they scale down to zero instances if there is no work to do - very useful for applications which are only used during business hours, and for apps still “under development” which are only used intermittently. Having an AppEngine or ComputeEngine instance running 24 hours per day, 365 days per year, is rather costly (around US$1200 per year taking into account the available discounts); for intermittently-used systems, automatic scale-to-zero is financially highly desirable.

Getting ComputeEngine instances to scale down to zero when no Pubsub messages are waiting is possible but moderately complex (I am planning to write an article on this soon). Getting AppEngine Standard instances to scale down to zero when no http-requests are pending is easy, but as far as I know it is not possible to scale-to-zero while still being available to handle Pubsub messages (except possibly by using a push-subscription, but that has significant disadvantages). AppEngine Flexible cannot scale-to-zero at all.

Google can start up new containers running the cloud function in a matter of seconds - and will scale up to many thousands of containers running the same image in parallel if the workload (number of pending http-requests or number of pending pubsub-messages) requires that.

Running binary applications in cloud function environments

Because a cloud function runs in a docker container that contains Linux (currently Debian), and because you can include any desired files in the container, it is possible to bundle applications with the cloud-function, and for the cloud-function to execute those applications with “spawn”. I have tried bundling a complete Java JVM and a Java application with a cloud function, and executing the Java application works.

Note however that a request has a maximum lifetime of 9 minutes, so this approach has only limited uses. It isn't generally possible to use Cloud Functions to perform complex machine-learning tasks, for example!

And due to the way a container is “frozen” on completion of a request (see above), executing external processes is probably best done via require('child_process').spawnSync rather than just .spawn.
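
A sketch of that approach, assuming a Java runtime and application jar have been bundled alongside index.js (the paths jre/bin/java and myapp.jar are only placeholders):

const { spawnSync } = require('child_process');

exports.runBundledApp = (event, callback) => {
    // spawnSync blocks until the child process exits, so all of its work is guaranteed
    // to be done before the request completes and the container is potentially frozen.
    // Paths are placeholders, relative to the deployed function source.
    const result = spawnSync('./jre/bin/java', ['-jar', './myapp.jar'], {
        encoding: 'utf8',
        timeout: 8 * 60 * 1000   // stay safely under the 9-minute request limit
    });

    if (result.error) {
        // the process could not be started at all
        callback(result.error);
    } else {
        console.info('stdout: ' + result.stdout);
        console.info('stderr: ' + result.stderr);
        callback(result.status === 0 ? null : new Error(`exit status ${result.status}`));
    }
};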

An Example Cloud Function

The following node.js-based cloud function program receives events which are triggered when a file is uploaded to a cloud storage bucket, and starts a dataflow (ie Apache Beam) job to process that file:

const google = require('googleapis');

const dfltProjectId = '....';
const jobNameBase = 'dataflow-from-cloud-fn';
const dataflowPath = 'gs://.../path/to/template';

/**
 * Obtain a google authentication token and then send an http request to the dataflow service to run
 * a specific dataflow with the input filenames.
 */
function startDataflow(file, eventId, callback) {
    google.auth.getApplicationDefault(function (err, authClient, projectId) {
        if (err) {
            console.error('Authentication with default failed');
            callback(err, "Unable to authenticate");
            return;   // don't continue after an authentication failure
        }
        console.info('Authentication with default successful');

        const dataflow = google.dataflow({
            version: 'v1b3',
            auth: authClient
        });

        const inputPath = `gs://${file.bucket}/${file.name}`;

        // Oddly, param projectId is sometimes set and sometimes undefined. Better to hard-wire it..
        if (typeof projectId === 'undefined') {
            console.info('Using default projectId');
            projectId = dfltProjectId;
        }

        // two concurrent jobs with the same name are not allowed, so add the file-change eventId (which is unique)
        const jobName = `${jobNameBase}-${eventId}`;

        console.info('Instantiating dataflow template with properties: ' +
            `jobName=${jobName}; projectId=${projectId}; input=${inputPath}`);

        dataflow.projects.templates.create({
            projectId: projectId,
            resource: {
                parameters: {
                    input: inputPath,
                },
                jobName: jobName,
                gcsPath: dataflowPath
            }
        }, function (err, response) {
            if (err) {
                // object err is of type Error
                console.error("problem running dataflow template, error was: " + err);
                console.error(err);
                callback(err, "problem running dataflow template");
            } else {
                console.info("Dataflow template response: ", response);
                callback(null);
            }
        });
    });
}

/**
 * Process a file uploaded to google-storage by starting a dataflow.
 */
function runFlowFor(file, context, callback) {
    if (file.resourceState === 'not_exists') {
        console.info(`File ${file.name} deleted.`);
        callback();
    } else if (file.metageneration === '1') {
        // metageneration attribute is updated on metadata changes; on create value is 1
        console.info(`File ${file.name} uploaded.`);
        startDataflow(file, context.eventId, callback);
    } else {
        console.info(`File ${file.name} metadata updated.`);
        callback();
    }
}

/**
 * Background Cloud Function to be triggered by Cloud Storage.
 *
 * Warning: if a file is uploaded, then overwritten before this cloud-function receives the event, then
 * this code will process the later version of the file twice. This is not expected to be a problem in
 * real-world systems.
 */
exports.processFile = (event, callback) => {
    const file = event.data;
    const context = event.context;

    console.info(`  Event Type: ${context.eventType}`);
    console.info(`  Bucket: ${file.bucket}`);
    console.info(`  File: ${file.name}`);
    console.info(`  Metageneration: ${file.metageneration}`);

    if (file.name.startsWith('some/path/')) {
        console.info(`File ${file.name} being processed.`);
        runFlowFor(file, context, callback);
    } else {
        console.info(`File ${file.name} not relevant.`);
        callback();
    }
};

A package.json file is also needed. When your cloud fn code needs extra node.js packages, run npm install --save {packagename} to update the package.json file. The installed libraries themselves do not need to be included in the files you upload to cloud storage (they will be reinstalled as specified in package.json).

{
  "name": "sample-cloud-storage",
  "version": "0.0.1",
  "dependencies": {
    "googleapis": "22.2.0"
  }
}

The resulting index.js and package.json then just need to be uploaded to some folder in cloud storage. Via the GCP web UI, or the gcloud SDK commandline, a cloud-function is then defined that references those uploaded files and specifies how the function should be “triggered” (eg via http, or pubsub).
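
Purely as an illustration (the function, bucket and path names are placeholders, and exact flags vary between gcloud releases - newer releases also require a --runtime flag, and a cloud-storage source must be a zip archive rather than a folder):

gcloud functions deploy processFile \
    --source gs://my-bucket/path/to/function-source.zip \
    --trigger-bucket my-input-bucket \
    --memory 256MB \
    --timeout 120s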

References