Python - Distributing and Installing Code

Overview

I recently wrote an article about some of the more interesting things I encountered while learning Python.

This article is a continuation, summarizing my research on various techniques for packaging and uploading, downloading and installing Python-based code.

WARNING: I am an experienced developer, but new to Python. There may therefore be errors or misunderstandings in this article - if so, feedback is welcome!

Overview

As developers, we want to take advantage of the large selection of third-party libraries available.

And as developers, we also want to write libraries and applications, and distribute them (whether to production environments in our own companies, or to third-parties).

This article looks at some of the options for downloading/installing Python code, and for publishing Python code.

Unfortunately, the topic is somewhat complicated as there have been various solutions developed over time. Below I discuss the half-dozen most common methods (as far as I know).

There is good documentation in the official Python Packaging User Guide, although it does sometimes mix information on current and obsolete approaches, particularly in the Packaging Key Projects page.

Terminology: Distributions and Packages

Most of the IT world uses the word “package” to describe a bundle of software that can be downloaded and installed - eg Debian or RedHat packages.

However Python uses the word package to mean a directory containing a set of modules (Python sourcecode files). The Python community and documentation therefore often uses the word “distribution” to mean an installable (distributable) bundle of Python code. In the setuptools documentation, “distribution” is defined as “a Python package at a specific version”.

Of course the word “distribution” in the Linux world typically means an entire operating system and supporting software - this is NOT the intended meaning of the word “distribution” in Python documentation.

And sometimes the Python world does use the term “package” to mean a downloadable bundle of Python code - the Python Package Index and documentation for the pip package installer are the most obvious examples.

I use the terms “distribution” and “package” interchangeably (as synonyms) below.

Summary of Options

Here’s a quick summary of the various solutions discussed in more detail below:

  • distribute sourcecode files directly (as single files, as a git-checkout, or an archive that the receiver unpacks into an arbitrary directory)
  • distribute sourcecode as an “executable zipfile” (assuming python already installed) - ie PEX
  • provide a complete executable, including sourcecode and a Python interpreter (“python freezers”)
  • paas-specific solutions (Heroku, Google AppEngine, etc)
  • a container (docker, etc)
  • native unix packages (rpm/apt/etc)
  • egg bundles
  • use distutils to build:
    • an sdist (source distribution) - tar.gz file
    • a bdist (binary distribution) - rpm, windows-exe or other formats
    • a wheel distribution

This article then looks at the distutils option in more detail, including package index servers (eg pypi) and the standard Python package manager pip.

Distributing Sourcecode Directly

This choice is only useful for simple projects, but should not be forgotten - simple is sometimes better. As an example, the Bottle Web Framework is distributed as a single Python file.

It assumes that the target system has a Python interpreter of an appropriate version.

Distributing as source-code has obvious problems if there is a dependency that includes native “C” code; in that case, one of the more advanced approaches should be used instead.

Distributing as a File or Archive

When the code to be distributed is just one file, and has no dependencies then just copying the file around may be the easiest thing to do. Or for multiple files, a zip or tarfile can be distributed and unpacked by users.

If the code to be distributed needs third-party libraries, then you can distribute an archive-file (eg a zip) which contains the necessary libraries embedded in it. When it is unpacked, and a “script” within the unpack-directory is executed, Python automatically adds the directory of the script to its “module search path”, ie other modules/packages included in the archive will take precedence over files installed elsewhere.
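The search-path behaviour described above can be checked with a small experiment. The following is only a sketch: the file names app.py and bundledlib.py are invented for illustration, and a temporary directory stands in for the unpacked archive.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_bundled_script() -> str:
    """Create a throwaway 'unpacked archive' and run the script inside it."""
    with tempfile.TemporaryDirectory() as unpack_dir:
        root = Path(unpack_dir)
        # A bundled library sitting next to the script (hypothetical name).
        (root / "bundledlib.py").write_text("GREETING = 'hello from bundledlib'\n")
        # The script imports the library without touching sys.path itself;
        # Python adds the script's own directory to the search path.
        (root / "app.py").write_text(
            "import bundledlib\n"
            "print(bundledlib.GREETING)\n"
        )
        result = subprocess.run(
            [sys.executable, str(root / "app.py")],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
```

Calling run_bundled_script() runs app.py in a child interpreter and returns whatever the script printed.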

The PEX format takes this one step further - the archive file is actually executable, ie can be run without unpacking it. See later for info on PEX.

Distribute as Git Checkout

Sometimes, the nicest way to distribute Python code is just to “git clone {url}” on each machine the code should run on.

This makes it obvious to sysadmins where the master copy of the code is (git remote -v). It also allows easy updates: git pull. And it makes “hot fixes” trivial to feed back into the official repo: git push.

When the code has dependencies on third-party libraries, then the “bundle them in the archive” solution suggested above clearly is a bad idea when the code is a checkout of a git repo. Instead, pip freeze can be used by the developer to capture those dependencies, and pip install -r requirements.txt can be used on the target systems. This is starting to get a little inelegant now, but possibly still a reasonable solution in some cases.

Executable Zipfiles and PEX

Two interesting features of Python can be combined with an interesting feature of the zip archive format to allow creation of archive files containing python code which can be directly executed without unpacking, ie “distribution” is just a matter of ensuring the target system has a suitable Python interpreter, and then copying that one (archive) file.

Python can be told to execute a specific file as a script (ie application). However when the location is not a file but a directory, then Python looks for a file named __main__.py within that directory.

And Python can be given a zipfile as the file to execute, in which case it effectively treats that zipfile as a directory. That means it looks for a file __main__.py in its (internal) root directory.

The zip archive format keeps its index (the “central directory”) at the end of the file, and readers locate the archive contents by working backwards from there; arbitrary data may therefore precede the archive within a file. This is not simply a quirk of a specific zip implementation, but follows from the specification, and is what makes “self-extracting zipfiles” possible. The Python interpreter also supports this, ie when given a zipfile, will skip any leading non-zip data.

This means that you can take any tree of Python files, zip it, then prefix it with something like this:

#!/usr/bin/env python3

The result is something that can be executed directly (assuming chmod +x); standard unix behaviour launches the command named on the first line and passes it the path of the file. And voila! Python skips the leading “#!” line, treats the rest of the file as a zipfile, and runs the __main__.py file within that archive. No install step is needed.
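The by-hand procedure can be sketched in Python itself. This is a minimal illustration rather than a recommended build tool: the file names (app.pyz, helper.py) are made up, and the header line is written first and the zip data appended to the same file, which gives the same layout as prefixing an existing zipfile. The standard library module zipapp automates the same steps.

```python
import stat
import subprocess
import sys
import tempfile
import zipfile
from pathlib import Path

SHEBANG = b"#!/usr/bin/env python3\n"


def build_executable_zip(target: Path) -> None:
    """Write a shebang line followed by a zip archive containing __main__.py."""
    with open(target, "wb") as handle:
        handle.write(SHEBANG)
        # The archive is written after the header, in the same file.
        with zipfile.ZipFile(handle, "w") as archive:
            archive.writestr("__main__.py", "import helper\nprint(helper.message())\n")
            archive.writestr("helper.py", "def message():\n    return 'running from a zipfile'\n")
    # Equivalent of chmod +x for the owner (has no real effect on Windows).
    target.chmod(target.stat().st_mode | stat.S_IXUSR)


def run_executable_zip() -> str:
    """Build the file in a temp dir and run it by passing it to the interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "app.pyz"
        build_executable_zip(target)
        result = subprocess.run(
            [sys.executable, str(target)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
```

The sketch launches the file via the interpreter explicitly so that it also works where “#!” lines are not honoured; on unix the generated file can be executed directly.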

Creating such files can be done by hand, but there is also a build-tool named PEX (pip install pex) that adds a few nice-to-have features:

  • creates a __main__.py file that:
    • parses various pex-related commandline options, then delegates to another python file within the archive - ie your “entry point” is in a normal file, not __main__.py
    • modifies sys.path to ensure there are no conflicts between libs in the pex-file and in the external python environment.
  • includes third-party libraries (specified individually on the commandline, or via a file in the same format that pip freeze generates)
  • and does the zipping and prepending of the shellscript header line

By convention, the “executable zipfiles” that PEX generates have suffix .pex.

An “executable zipfile” that includes all its necessary third-party dependencies is roughly equivalent to a “fat jar” in the Java world.

See this ‘WTF is PEX?’ video presentation (only 15 minutes) for more information. Note that PEX was previously part of package twitter.common.python (as described in video) but is now its own separate pypi package.

Freezers

There are several projects that can take Python source-code, and generate an executable application that bundles a complete Python interpreter together with the code.

Unfortunately, the word “freeze” is also used for a number of quite unrelated projects - and in particular pip freeze is quite different (and discussed later).

See the Python Guide for a list of the relevant projects and more info.

PaaS-specific Solutions

A number of “cloud environments” provide a hosted Python “Platform as a Service” runtime environment into which you deploy your Python code without needing to manage the underlying hosts.

There are a number of advantages to such a system - including automatic load-balancing, SSL termination, backups, etc. However each PaaS implementation comes with its own build/packaging tools and rules.

Examples:

  • Heroku
  • Google App Engine

Docker (or other container formats)

If you are looking to distribute a complete python-based application rather than a library, then one of the nicest solutions for end users is to create a container, with all necessary components installed and upload that to a container registry. The end user (which may be your own operations team, or even yourself) just needs the appropriate container runtime (eg Docker) installed, and then to “pull” the container.

Building a container is a whole complicated topic that is out of scope here.

Native Unix Packages (dpkg/rpm/etc)

Unix distributions usually come with their own package-management tools.

Distributing code as a native package (eg RPM or DPKG) is certainly easy for end-users. Such packages can be built with setuptools (see section on bdist later); presumably there are other tools that can also build such packages.

Egg Bundles

An “egg” is a zipfile with name of form *.egg, where the zipfile contains:

  • Python code
  • compiled native libraries (optional)
  • and an EGG-INFO subdir with various metadata files about the package (similar to Java jar’s META-INF subdirectory).

Eggs are actually built by the setuptools library, which has not been discussed yet. However given that using an egg does not necessarily require using setuptools, it seems appropriate to discuss it first.

When an egg does not contain any native code, then the file can simply be placed on the searchpath, ie the file path is added to environment variable $PYTHONPATH (similarly to adding a Java jarfile to $CLASSPATH). Code from the egg can then be run with python -m {module}.
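The “archive on the searchpath” idea can be demonstrated with a plain zipfile. Note the assumptions: the archive below is NOT a real egg built by setuptools (it has no EGG-INFO directory), the names demo-1.0.egg and eggdemo are invented, and sys.path is modified directly instead of setting $PYTHONPATH.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def import_from_archive() -> str:
    """Put a zip archive on sys.path and import a module from inside it."""
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "demo-1.0.egg"
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("eggdemo.py", "def where():\n    return __file__\n")
        # Same effect as listing the archive in $PYTHONPATH before startup.
        sys.path.insert(0, str(archive_path))
        try:
            module = importlib.import_module("eggdemo")
            return module.where()
        finally:
            # Undo the changes so the demo leaves no traces behind.
            sys.path.remove(str(archive_path))
            sys.modules.pop("eggdemo", None)
```

The returned value is the module’s __file__, which points inside the archive rather than at a file on disk.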

It is also possible to place an egg file (ie a zipfile) in a directory that is already on the searchpath (eg site-packages) - without unpacking it. Other Python files can then import modules from it - but the importing code first needs to do some manipulation of sys.path to “activate” the egg. This manipulation involves calling some code from module pkg_resources which is part of the setuptools package - ie setuptools needs to be installed first, in the environment that uses the egg. As discussed later, setuptools has a special status - it is not in the standard library, but is very much part of Python’s core functionality (like pip); if pip is installed, then so is setuptools - and all Python installations of version 3.4 or newer install pip by default.

For development purposes, there are also “unpacked” forms of eggs that are directory-trees rather than zipfiles:

  • a directory with name “*.egg” containing an EGG-INFO subdir
  • a directory with one or more subdirs named “Project-Name.egg-info” (a “development egg”)

Eggs can provide a “plugin-style” framework, where multiple eggs are within directories on the searchpath (ie are findable by pkg_resources but not activated by default). Other code can then use pkg_resources to find all eggs matching specific search-criteria, and activate them. One use is to have different versions of the same library, and select a specific desired version. Another use is to dynamically discover all eggs which implement a specific “plugin interface”, and activate one or all of them. This is something like Java’s “service discovery” or OSGi features.

An egg can specify dependencies on other eggs - in which case those are also located and activated via pkg_resources. If no matching dependency can be found, an exception is thrown - ie all dependencies must have already been installed (they are not automatically downloaded).

Alternately, an egg can embed other eggs (ie its dependencies).

The EasyInstall application (part of setuptools) can download a source-code bundle from a python package index (eg Pypi), build an egg from it, and copy that egg into the site-packages directory in one step. See later for more info.

See this article for an excellent introduction to the EGG format and features.

An egg archive can also be installed into a Python installation, ie the archive is unzipped and the various components copied into the Python directories - in particular, site-packages. After this has been done, any modules in the egg-file are accessible in the normal manner - via import statements in Python code, and by python -m {modulename}.

The newer wheel distribution format is intended to replace the egg format. Wheels are designed to be installed as modules; whether they can also be used in the same way as eggs - as a “plugin” that needs to be activated - is less clear. See later for more on wheel files.

See also: setuptools: Internal Structure of Python Eggs

One problem with eggs is the generation of .pyc files. As noted earlier, Python source-code is ideally compiled into byte-code once, and this byte-code is cached thus making later loads of the module faster. However for “shared” site-packages directories, this step needs to be done during install as the users who run the code later may not have privileges to write to a __pycache__ subdir of the site-packages directory holding the Python source code. The egg format solves this by bundling the .pyc files in the distribution. This however has its own disadvantages: larger files, and the fact that .pyc files are not necessarily portable across Python interpreter implementations (eg CPython/Jython) or across versions of the same interpreter. The wheel format fixes this by delegating the problem of generating .pyc files to the installer.
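As a rough illustration of what “generating .pyc files at install time” means, the standard library module py_compile can compile a source file ahead of time. This is only a sketch of the idea, not what any particular installer actually runs; the file name example_module.py is invented, and a temporary directory stands in for site-packages.

```python
import py_compile
import tempfile
from pathlib import Path


def precompile(source_file: Path) -> Path:
    """Compile one source file and return the path of the cached bytecode."""
    return Path(py_compile.compile(str(source_file), doraise=True))


def compile_in_temp_dir() -> tuple:
    """Write a tiny module, compile it, and report the bytecode file name."""
    with tempfile.TemporaryDirectory() as site_dir:
        source = Path(site_dir) / "example_module.py"
        source.write_text("VALUE = 42\n")
        pyc_path = precompile(source)
        # The name embeds the interpreter implementation and version, which is
        # why bundled .pyc files are not portable between interpreters.
        return pyc_path.name, pyc_path.exists()
```

For whole directory trees, the standard library also offers the compileall module (python -m compileall {dir}).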

Distutils, Setuptools, and Pip

Module distutils is included in the Python standard library, and provides low-level apis for:

  • invoking c-compilers (wrappers for various implementations such as gcc, msvc, etc) - used when building modules which include native code
  • extract data from setup-scripts
  • create tar or zip files
  • various filesystem-related utilities useful when writing build/install tools
  • a basic “plugin framework” for hooking up “command handlers” to be triggered by commandline options
    • and a set of plugins that build windows-specific installers, linux rpm files, etc.
  • unpacking a tar/zipfile into a local site-packages directory tree (python code, executable scripts, datafiles, native libraries, etc. go into different locations)
  • register a package with a python index server, ie make a basic network call to upload metadata about a locally-built package (but not the package contents)

It is not expected that users or developers interact directly with distutils - some of the functionality is obsolete, replaced by better implementations in setuptools while other functionality is low-level and is used by setuptools (and other similar build/install tools).

Setuptools is a project which provides libraries and executable tools for building distributions (installable packages) from source code, and tools for installing such distributions. However setuptools does not do any “dependency management” - it is roughly equivalent to Debian’s “dpkg-buildpackage/dpkg” or RedHat’s “rpmbuild/rpm”.

Pip is a “dependency manager” which inspects a package to be installed, and downloads/installs the other packages that it depends on. Pip integrates with “package indexes” (servers hosting pools of packages), including the standard Pypi (Python package index), but can also install from local directories or download dependencies but not immediately install them. Pip also keeps track of which packages are installed.

While distutils is part of the python standard library, setuptools and pip are installed slightly differently. The standard Python download from python.org does include setuptools and pip - but as wheel packages in a special directory. The standard library includes a very small stub-module called “ensurepip” which finds those packages and unpacks them into the “site-packages” directory for the python install. And the standard install process automatically runs “python -m ensurepip” so that this occurs almost transparently (and without needing network access). After this process is complete, “python -m pip” works just as if pip had been provided as part of the standard libraries. When venv is used to create a project-specific site-packages directory, ensurepip is also run automatically - installing setuptools and pip into that newly-created virtual environment. The benefit of having setuptools/pip as “default-installed packages” rather than in the standard library is that:

  • it is possible to avoid installing them if you really want to
  • and they can be updated independently of the Python standard libraries - pip can be used to download new versions of setuptools and even itself!

In summary:

  • setuptools is something like Debian’s “dpkg-buildpackage” or RedHat’s “rpmbuild” for building packages, together with the dpkg or rpm tools for installing them
  • pip is something like Debian’s apt or RedHat’s dnf/yum which installs not only a package but also its dependencies
  • distutils is something you don’t usually interact with directly, unless you are writing build-tools or installers yourself

EasyInstall

Script easy_install (aka EasyInstall) is part of setuptools. It was one of the first build/installation tools for Python; pip is newer than EasyInstall, and supports many of the same usecases (better) - but not all.

Module pkg_resources

Module pkg_resources is part of setuptools. It provides utilities for managing libraries which are distributed as wheel or egg files, and installed as “plugin libraries” rather than normal libraries.

The section on EGG files earlier describes how EGG (and WHEEL) archives can be placed on PYTHONPATH or in site-packages, at which point they can be found but are not automatically available like modules installed into site-packages. The pkg_resources module provides APIs to find, filter, and activate such libraries.

Module pkg_resources has a function “find_plugins(Environment)” - where Environment is effectively a list of directories. The function scans those directories looking for egg/wheel files and returns a list of such libraries that can potentially be added to sys.path. Items will only be returned when their declared requirements (as specified in their internal metadata-files) are available. When an item is added to sys.path then its declared “entry points” are registered; other code can then look up these entry-points and invoke them. This provides a kind of “service discovery” feature (similar to Java’s discovery convention). And this in turn allows “plugins” to be added to a program just by adding egg-files to the searchpath.

The result is something a little similar to Java OSGi - particularly its “WorkingSet” concept. The primary “working set” of any application is “sys.path”; WorkingSet methods can be used to determine which (of the available pool of eggs) are actually added to sys.path.

See also the pkg_resources documentation on the setuptools site.

Setuptools in Detail

Overview

Setuptools is actually a library, ie provides an API. The usual way to build a distribution is to write a Python script (by default, named setup.py) which calls into the setuptools API to define relevant metadata. When this script is executed, the call to setuptools causes a resulting output-file to be generated - with the output format depending upon what commandline options were passed.

Setuptools includes support for formats sdist (a “source distribution”) and various kinds of “bdists” - binary distributions.

A source distribution (sdist) is an archive containing Python code (of course), plus “data files” if needed - and the original setup.py script. If a project includes native C code, then the sdist archive also includes the relevant files in source form. Installing a sdist file is done by running the setup.py script again - at which point files are copied into the local Python environment (ie site-packages). Any native code included in the archive is compiled on the target machine during install, using the compiler-wrappers included in module distutils.

The bdist (binary) formats supported include:

  • egg
  • wheel
  • rpm/dpkg
  • windows executable installer

The sdist format is platform independent - any platform with Python, setuptools, and a setuptools-compatible compiler can install modules with native code components (assuming the native code compiles on that platform).

The bdist_* formats, excluding wheel, are platform dependent. They are expected to be built on the platform they target (eg windows installers are built on windows).

Simple Example with setup.py

See: https://docs.python.org/3/distutils/introduction.html#distutils-simple-example

Write a simple python file named setup.py with the following format:

from distutils.core import setup
setup(name='foo',
      version='1.0',
      py_modules=['foo'],
      )

Then run python setup.py {target-type}.

For output formats that “embed” their dependencies, the result will include all the files listed in py_modules, and their transitive dependencies. There are additional options for specifying non-Python files to include (eg C sourcecode). A file MANIFEST.in can further customise the set of files included. File setup.py is also included in the bundle.

When target-type = sdist then the generated file is just an archive-file (tar.gz or zipfile or various other options). The user “installs” this by:

  • unpacking the archive
  • running python setup.py install

which causes any “modules” in the local directory to be copied to the local python installation’s “third party modules” directory.

The setup.py file can also specify “scripts”, in which case they are included in the bundle when it is built. On install they are not copied into site-packages alongside the modules (as they are meant to be run directly, rather than imported); they go into the environment’s scripts directory instead. Note however that any file can be a module, and modules can be “executed” via python -m modulename (the module is loaded with __name__ set to "__main__").
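The difference between importing a module and “executing” it can be shown with the standard library module runpy, which is what python -m uses internally. The module name dualuse and its contents are invented for this sketch.

```python
import runpy
import sys
import tempfile
from pathlib import Path

# A module that records how it was loaded (hypothetical example content).
MODULE_SOURCE = 'ROLE = "script" if __name__ == "__main__" else "module"\n'


def run_both_ways() -> tuple:
    """Run the same file as a script (like python -m) and as a plain module."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "dualuse.py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            # Equivalent of `python -m dualuse`: __name__ is "__main__".
            as_script = runpy.run_module("dualuse", run_name="__main__")["ROLE"]
            # Without run_name, __name__ is the module's own name.
            as_module = runpy.run_module("dualuse")["ROLE"]
        finally:
            sys.path.remove(tmp)
    return as_script, as_module
```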

The target-type “bdist_wininst” produces a windows executable file that installs the python code into the local python environment.

Similarly, “python setup.py bdist_rpm” creates a linux rpm package, etc.

File setup.py can specify:

  • requirements, ie packages that need to be installed on the target system first (with versions)
  • provides, ie packages that this distribution “installs”

Running setup.py directly does not automatically download needed dependencies. However using pip to pull down a sdist (or bdist) package from Pypi will cause required packages to be pulled down too. See pip for more details.

For target-type of sdist, any native source-code is just included in the generated distribution - and is compiled when the distribution is installed.

For the bdist_* target types, native source-code is compiled and the result is included in the generated distribution.

Once a “distribution” has been created, you can distribute manually (eg store it on your webserver, or email it around). Or you can use python setup.py register and python setup.py upload to interact with the Pypi package index server (or any other package index server).

Security Issues with Setuptools

Using a Python file (usually named setup.py) to define the metadata needed for building or installing a distribution is elegant in some ways.

However it has a significant security problem: in order to install a sdist package, the script (provided by the distribution packager) must be run on the target machine. And not as just any user, but as the user who owns directory site-packages - often root.

The wheel format instead uses a declarative metadata format, which is then parsed and used as data by the installer (wheel module or pip). This means that although pip needs to be run as the owner of dir site-packages, it doesn’t execute arbitrary Python code as that user.
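To make “declarative metadata parsed as data” concrete: the METADATA file in a wheel uses the same “Key: value” header syntax as email messages, so it can be read without executing anything. The sample text below is invented for illustration and is not taken from a real package.

```python
from email.parser import Parser

# Hypothetical metadata for a distribution called "foo".
SAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: foo
Version: 1.0
Requires-Dist: requests (>=2.0)
Requires-Dist: six
"""


def read_metadata(text: str) -> dict:
    """Parse metadata headers as plain data; no packager-supplied code runs."""
    message = Parser().parsestr(text)
    return {
        "name": message["Name"],
        "version": message["Version"],
        # A key may appear several times, so collect every occurrence.
        "requires": message.get_all("Requires-Dist", []),
    }
```

An installer working from such a file only ever reads strings, which is the security benefit over running setup.py.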

Wheel Format

As noted in the wheel package description, wheel has two different roles:

  • A setuptools extension for building wheels that provides the bdist_wheel setuptools command
  • A command line tool for working with wheel files

Wheel format is defined in PEP 427, which is the best resource for learning what Wheel format actually supports.

Wheel itself is provided as a downloadable/installable distribution on Pypi - though AFAIK it is also included with modern versions of pip - and is thus installed by default on modern Python installations. Pip can install wheel packages since v1.4.

The wheel project provides a “plugin” for setuptools to support building and installing wheel-format archives.

To quote the documentation, wheels:

  • is the new standard of Python distribution, intended to replace eggs
  • offers a superset of the functionality provided by the existing wininst and egg binary formats

However AFAICT, wheel does not offer the “plugins” feature of eggs - it is intended only for installing files into site-packages directories.

PyPi hosts both wheel and egg archives. Many popular packages have been updated to wheel, but not all.

A wheel is:

  • a zip archive with a specially formatted filename and suffix “.whl”.
  • containing a directory {distribution}-{version}.dist-info with further files:
    • METADATA
    • WHEEL
    • RECORD - a list of the files in the wheel archive (a manifest) with a hash for each
    • RECORD.jws - digital signature of the RECORD file (which contains hashes of all files, thus effectively signing everything)
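The “specially formatted filename” follows the pattern given in PEP 427: {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl. The parser below is a simplified sketch of that pattern (the class and function names are my own, and it does no validation of the individual tags).

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into the components defined by PEP 427."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None  # the build tag is optional
    elif len(parts) == 6:
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    return WheelName(distribution, version, build, python_tag, abi_tag, platform_tag)
```

For example, foo-1.0-py3-none-any.whl describes a pure-Python wheel: any Python 3 interpreter, no particular ABI, any platform.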

Pip can convert an sdist distribution into a wheel distribution with the pip wheel command (which needs the wheel package: pip install wheel).

The root of the zipfile may also contain:

  • various files or subdirs to be installed into {site-packages}
  • subdir {distribution}-{version}.data/ with its own subdirs:
    • purelib - plain python code (also to be installed into {site-packages})
    • platlib - platform-specific code (eg windows-specific, or debian-specific)
    • include - c header files
    • scripts - shellscripts that users may directly execute
    • data - nonexecutable reference data files

Simply unpacking a .whl file into site-packages is usually sufficient to “install” the project, though more advanced approaches are recommended as the “unzip” approach does not correctly handle the following:

  • installing files under {distribution}-{version}.data
  • putting bundled scripts into a “scripts” dir which is on $PATH
  • updating the #!python line for files in scripts to point to the exact path of the python interpreter instance into which the code has been installed
  • putting bundled c-header-files into a “headers” dir
  • generating a .pyc bytecode file for each .py source-code file (for a shared python environment, this cannot happen “on demand” when the code is used because the bytecode is cached in dirs which a normal user will not have write-rights to).

Using unzip also does not:

  • validate hashes of the archive
  • validate the signature of the archive

Installing a wheel with pip will cause all of the above to be done.
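The hash validation step can be sketched as follows. PEP 427 specifies that each RECORD line has the form path,hash,size, with the hash written as the algorithm name followed by the urlsafe-base64 digest without padding. The “wheel” built here is a stripped-down stand-in (it has no METADATA or WHEEL file) and all names in it are invented; real installers do considerably more.

```python
import base64
import csv
import hashlib
import io
import zipfile

DIST_INFO = "foo-1.0.dist-info"  # hypothetical distribution name and version


def record_hash(data: bytes) -> str:
    """Format a sha256 digest the way RECORD files store it."""
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def build_demo_wheel(tamper: bool = False) -> bytes:
    """Build a tiny wheel-like zip; optionally store content that differs from RECORD."""
    content = b"VALUE = 1\n"
    record = (
        f"foo.py,{record_hash(content)},{len(content)}\n"
        f"{DIST_INFO}/RECORD,,\n"  # RECORD lists itself without a hash
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("foo.py", b"VALUE = 2\n" if tamper else content)
        archive.writestr(f"{DIST_INFO}/RECORD", record)
    return buffer.getvalue()


def verify_record(wheel_bytes: bytes) -> list:
    """Return the names of files whose contents do not match their RECORD hash."""
    mismatches = []
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as archive:
        record_text = archive.read(f"{DIST_INFO}/RECORD").decode("utf-8")
        for path, expected_hash, _size in csv.reader(io.StringIO(record_text)):
            if not expected_hash:
                continue  # nothing to check for entries without a hash
            if record_hash(archive.read(path)) != expected_hash:
                mismatches.append(path)
    return mismatches
```

A plain unzip would silently accept the tampered archive, whereas a check along these lines reports the modified file.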

The WHEEL file contains both version-info, and some flags that control details of the “install” process.

The wheel builder tool generates egg-info files from its wheel-specific metadata files. It is possible to convert existing egg files into wheel files - but not the reverse.

Still unanswered question: how does wheel deal with native code? It definitely bundles precompiled binaries, ie source-code is not expected to be compiled during installation - but I have the impression a single wheel can include binaries for multiple different target architectures - in which case I am not sure how such a wheel is built.

Pip in Detail

Pip can install from local files, or from the network.

As with most package managers, it is not possible for multiple versions of the same logical package to be installed concurrently; one version must be chosen from the available set. Pip chooses the newest version which is compatible with the package that caused it to be downloaded. (See eggs and working sets, which take a different approach.)

Pip does not currently have true dependency resolution; when a set of packages are to be installed in one “transaction”, pip just installs them one at a time, pulling in needed dependencies as it goes. Not looking at the “global requirements” can lead to a situation where a later item in the transaction cannot be installed because an earlier one pulled in a conflicting dependency.

Pip keeps a local cache of wheel files (like maven) - but not sdist files.

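Pip can download files but not install them with:

  • pip install --download DIR {packagename}|{-r requirements.txt}

In newer pip releases this has been replaced by a separate command: pip download -d DIR {packagename}|{-r requirements.txt}.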

Python supports looking for modules in:

  • the stdlib install dir (global for the installation)
  • a site-specific install dir (global for the installation)
  • or a per-user dir ($PYTHONUSERBASE)
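The actual locations for a given interpreter can be queried from the standard library. This sketch assumes a regular CPython installation; the dictionary keys are just labels I chose to match the list above.

```python
import site
import sysconfig


def search_locations() -> dict:
    """Report where this interpreter looks for stdlib, site, and per-user modules."""
    paths = sysconfig.get_paths()
    return {
        "stdlib": paths["stdlib"],
        # "purelib" is the site-packages directory for pure-Python code.
        "site-packages": paths["purelib"],
        "per-user": site.getusersitepackages(),
    }
```

Inside a virtual environment the site-packages entry points into the venv rather than the global installation.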

Packages can be installed into the per-user dir with:

  • pip install --user SomePackage

However in practice, using venv is recommended - in which case “pip install” automatically installs into the local venv instead of either global or per-user.

pip is a “dependency resolution tool” that supports various underlying package formats, similarly to how:

  • “apt” is a tool that supports dependency resolution for packages in dpkg format
  • “yum/dnf” are tools that support dependency resolution for packages in rpm format

The underlying formats that pip can download and install are:

  • sdist
  • egg
  • wheel

Pip dependency-mgmt features:

  • packages can declare their own dependencies
  • pypi metadata declares dependencies
  • requirements.txt
  • pipfile/pipfile.lock

It is possible to run your own “private index server” with only approved packages (similar to using a maven repo-manager).

Pip Freeze

When developing code, it is usual to just pip install .. packages as you find you need them. However when distributing code using the “distribute source code directly” approach, it is useful to provide other users (often developers) with a specific list of packages to install in order to get the application working. This can be done by running pip freeze, which outputs a list of all the packages currently installed (eg into the active virtual environment). Another user can create a virtual environment and then “replay” this file in order to install exactly the same set of packages. The exact commands are:

  • pip freeze > requirements.txt
  • pip install -r requirements.txt
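The output of pip freeze is a list of name==version lines. Something similar can be approximated from within Python using the standard library module importlib.metadata; this is a sketch of the idea only, not how pip implements the command, and its output may differ from pip’s in details.

```python
from importlib import metadata


def freeze_like() -> list:
    """List installed distributions as 'name==version' lines, sorted by name."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken or missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)
```

Writing "\n".join(freeze_like()) to a file gives something in the same spirit as requirements.txt for the current environment.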

References