Categories: Cloud, BigData

Introduction

This article briefly covers Google Cloud Storage (a service where files can be stored “in the cloud”) and how to move files between on-premise storage and Cloud Storage.

It is short because Cloud Storage is probably the simplest of all the GCP services, and the official documentation does a pretty good job of describing it.

Cloud Storage Overview

Google Cloud Storage works somewhat like a “shared file server”, and somewhat like a key/value database which supports really large “value objects”. Technically, Cloud Storage is an object store. It does not offer a POSIX-conformant API, nor can it be accessed via standard file-sharing protocols such as SMB or NFS; instead it offers a REST API ¹ (which is rather complex), wrapped in more developer-friendly Google client libraries for many different languages. Symbolic links and hard links are not supported.

The top-level object in cloud storage is the “bucket” (somewhat like a “filesystem instance”). A bucket belongs to a project.

A bucket has an associated location which specifies where the data is stored geographically (often a legal or business requirement).

Each file in Cloud Storage has an associated storage class, which is inherited from a default setting on the bucket if not specified when the file is written. The storage class reflects how frequently the data is expected to be accessed, and determines the price for storing and for retrieving it. The location specified for a bucket can limit the types of storage class permitted for objects within that bucket (in particular, a regional bucket cannot contain objects with storage class set to multi-regional and vice versa).

Each file in a bucket has its own set of access-rights which control who can access it. Files may be completely public (useful for distributing files and hosting websites), or may be limited to specific groups or users. A bucket has a default set of access-rights which are inherited by each new file created in the bucket unless explicit permissions are provided.

Basic versioning can be enabled for a bucket (off by default). When enabled, older versions of files which have been overwritten or deleted can still be retrieved. The downside is that the storage needed to retain these older versions is charged to the project.

The Object Lifecycle Management component allows “rules” to be defined which Cloud Storage applies automatically. These rules can delete or move files with certain properties (eg those older than a specified age).
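As a rough illustration of such rules, the sketch below builds a lifecycle configuration in the JSON layout that gsutil lifecycle set accepts. The two age thresholds and the choice of NEARLINE as the target storage class are assumptions made for the example, not recommendations; here “move” is expressed as a change of storage class.

```python
import json


def build_lifecycle_config(archive_after_days, delete_after_days):
    """Return a lifecycle configuration as a plain dict.

    Both thresholds are illustrative parameters chosen by the caller.
    """
    return {
        "rule": [
            {
                # "Move" objects to a cheaper storage class once they reach this age.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": archive_after_days},
            },
            {
                # Delete objects once they reach this (greater) age.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


# Example values only: archive after 30 days, delete after a year.
config = build_lifecycle_config(archive_after_days=30, delete_after_days=365)
lifecycle_json = json.dumps(config, indent=2)
```

Saving lifecycle_json to a file (say lifecycle.json, a hypothetical name) and running gsutil lifecycle set lifecycle.json gs://my-bucket would then apply the rules to that (equally hypothetical) bucket.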

Each “file” is limited to 5TB, but a bucket itself has unlimited capacity. A file has a set of metadata attributes associated with it; some are used by cloud storage itself but custom attributes can also be stored.

Files in Cloud Storage are immutable; a file can only be replaced by a new version - though the API provides some useful ways to effectively create a new file from an existing file and a “set of changes”. An object can be overwritten no more than once per second.

Transferring Files to and from Storage

A commonly-needed task is to upload files from a corporate datacenter into Google storage in order to then process them with Dataflow, import them directly into BigQuery, etc.

The best options are:

use the gsutil commandline application (which is available as part of the Google Cloud SDK or as a standalone download)
write a program that uses the Google cloud libraries to interact with Cloud Storage (libs available for Python, Java, and many other languages)

While Cloud Storage does provide a REST API, it is rather complex to use correctly - and authentication is also complex. It is therefore really not a good idea to write a new client against the REST API directly; just use the libraries.

The gsutil application is simply a Python program that puts a commandline-parser in front of the Cloud Storage APIs, so anything that gsutil does can also be done programmatically.
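To give a feel for the library route, here is a minimal sketch using the google-cloud-storage Python package. The helper names (split_gs_url, upload_file, download_file) and the bucket and file names in the comments are made up for illustration; the client object is passed in so that the helpers themselves carry no credentials logic.

```python
def split_gs_url(url):
    """Split 'gs://bucket/path/to/object' into (bucket_name, object_name)."""
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"not a gs:// URL: {url!r}")
    bucket_name, _, object_name = url[len(prefix):].partition("/")
    if not bucket_name or not object_name:
        raise ValueError(f"URL must name both a bucket and an object: {url!r}")
    return bucket_name, object_name


def upload_file(client, local_path, gs_url):
    """Upload a local file; `client` is expected to be a google.cloud.storage.Client."""
    bucket_name, object_name = split_gs_url(gs_url)
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)


def download_file(client, gs_url, local_path):
    """Download an object to a local file; `client` as in upload_file."""
    bucket_name, object_name = split_gs_url(gs_url)
    blob = client.bucket(bucket_name).blob(object_name)
    blob.download_to_filename(local_path)


# Typical use (needs `pip install google-cloud-storage` and valid credentials):
#   from google.cloud import storage
#   client = storage.Client()
#   upload_file(client, "data.csv", "gs://my-bucket/incoming/data.csv")
```

Authentication is handled by the library when the client is created, which is a large part of why this is so much simpler than talking to the REST API by hand.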

The gsutil rsync command works similarly to the well-known rsync tool, and can be used to ensure that all files in a specific local directory are mirrored to cloud storage (or vice versa).
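When the mirroring needs to be scheduled from a script, one option is simply to invoke gsutil from Python. The sketch below assumes gsutil is installed and on the PATH; the function names and the example paths are hypothetical. It uses the -m (parallel), -r (recursive), -d (delete extra files at the destination) and -n (dry run) options.

```python
import subprocess


def build_rsync_command(source, destination, delete_extra=False, dry_run=False):
    """Build the argument list for a recursive `gsutil rsync`."""
    cmd = ["gsutil", "-m", "rsync", "-r"]
    if delete_extra:
        # -d removes destination files that no longer exist at the source.
        cmd.append("-d")
    if dry_run:
        # -n only reports what would be copied or deleted.
        cmd.append("-n")
    cmd += [source, destination]
    return cmd


def mirror(source, destination, **options):
    """Run the sync; raises CalledProcessError if gsutil exits with an error."""
    subprocess.run(build_rsync_command(source, destination, **options), check=True)


# Example (not executed here):
#   mirror("/data/export", "gs://my-bucket/export", dry_run=True)
```

Trying the command with dry_run=True first is a cheap safeguard, particularly when delete_extra is set, since a mistyped source directory could otherwise wipe the destination.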

There are commercial ETL tools that work with the Google environment, but they address really complex issues and presumably have very high price-tags. I am not aware of a simple “sync files to cloud storage” application other than gsutil rsync.

Atomic Uploads

Even on non-cloud systems, it is common for one application to write files to disk and for another application to poll a filesystem directory to detect new files. In such cases, the writing and/or reading application needs to take precautions to ensure that the reader does not:

start reading a file before the writer has completely written it, or
read a file that will never be complete because the writer crashed part-way-through

A common solution is for the writer to write to a temporary directory on the target filesystem, or to use a temporary filename (eg starting with a dot, starting with “tmp”, or ending with “.tmp”) and then to atomically rename the temporary file to its final name only after all data has been written. The reading app just has to ensure that it ignores such temporary files - and then partial files are ignored, as desired. A crash leaves a file permanently with a temporary name; this needs to be “cleaned up” at some time, but that is not a complicated task.
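For a local filesystem, the pattern might look like the following sketch. The function names are invented for the example, and the “dot prefix plus .tmp suffix” convention is just one of the naming options mentioned above.

```python
import os
import tempfile


def write_atomically(final_path, data):
    """Write bytes to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the target directory so the rename stays on
    # one filesystem, where it is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        # On an in-process failure, remove the partial file. A hard crash
        # skips this, which is why a periodic cleanup is still needed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def completed_files(directory):
    """Reader side: list files, ignoring anything with a temporary name."""
    return sorted(
        name
        for name in os.listdir(directory)
        if not name.startswith(".") and not name.endswith(".tmp")
    )
```

The reader never needs to coordinate with the writer; it only has to agree on which names count as temporary.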

Fortunately, uploads to Google Cloud Storage effectively implement the above pattern automatically; any call to the Cloud Storage APIs to upload data first writes the data “invisibly”, and the new object becomes visible only when the sequence of http requests is properly terminated. The gsutil application works in the same way.

Triggering Processing on Upload

When a file is uploaded to Cloud Storage, the bucket can be configured to send a Pubsub message with a reference to the new file. This can be used to trigger automated processing of new files - a very convenient feature. See this vonos article and the Google documentation for more information.
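On the receiving side, the subscriber has to work out which file a message refers to. The sketch below assumes the notification was configured with the JSON payload format, and relies on the eventType, payloadFormat, bucketId and objectId message attributes; the function name and the sample values are hypothetical.

```python
import json


def new_object_url(attributes, data):
    """Return 'gs://bucket/name' for a newly written object, else None.

    `attributes` is the Pubsub message's attribute dict and `data` its
    payload bytes.
    """
    # OBJECT_FINALIZE marks a new (or overwritten) object; ignore deletes etc.
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None
    if attributes.get("payloadFormat") == "JSON_API_V1":
        # The payload is the object's metadata as JSON.
        resource = json.loads(data.decode("utf-8"))
        return f"gs://{resource['bucket']}/{resource['name']}"
    # Without a payload, fall back to the attributes alone.
    return f"gs://{attributes['bucketId']}/{attributes['objectId']}"


# Made-up sample message, for illustration only.
sample_attributes = {
    "eventType": "OBJECT_FINALIZE",
    "payloadFormat": "JSON_API_V1",
    "bucketId": "my-bucket",
    "objectId": "incoming/data.csv",
}
sample_data = json.dumps({"bucket": "my-bucket", "name": "incoming/data.csv"}).encode("utf-8")
url = new_object_url(sample_attributes, sample_data)
```

Filtering on the event type matters because the same topic can also carry notifications for deletions and metadata updates, depending on how the notification was configured.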

Hosting a Static Website

One interesting feature of Cloud Storage is that you can use it to host a complete static website. Cloud Storage is able to serve up index pages (ie treat “http://domain/foo” as a request for “http://domain/foo/index.html”), and to deliver customised “file not found” error pages. Metadata attributes can also be set on files to affect the http headers returned for urls matching that file (in particular, cache-control).
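With the Python client library, these settings could be applied roughly as follows. This is a sketch only: the function names, page names and max-age value are assumptions, and the bucket argument is expected to be a google.cloud.storage Bucket obtained elsewhere.

```python
def configure_static_site(bucket, index_page="index.html", error_page="404.html"):
    """Set the index page and the custom 'file not found' page for a bucket."""
    bucket.configure_website(main_page_suffix=index_page, not_found_page=error_page)
    # patch() sends the changed properties to Cloud Storage.
    bucket.patch()


def set_cache_control(bucket, object_name, max_age_seconds):
    """Set the cache-control metadata returned when the object is served."""
    blob = bucket.blob(object_name)
    blob.cache_control = f"public, max-age={max_age_seconds}"
    blob.patch()


# Typical use (needs `pip install google-cloud-storage` and valid credentials):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("www.example.com")
#   configure_static_site(bucket)
#   set_cache_control(bucket, "css/site.css", 3600)
```

The files themselves still need to be publicly readable for the site to be reachable, which ties back to the per-file access rights described earlier.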

Footnotes

¹ Most GCP services actually support two APIs: json-over-http (“rest”) and protobufs-over-http/2 (gRPC). Some older services also support the legacy xml-over-http protocol.
