Categories: Programming
Overview
Here are some notes I made while learning Python. It is a reasonably easy language to learn (at least the basics), but:
- many books for novice programmers fail to cover the interesting cases, and
- as a Java/C/C++/Perl/Javascript/other programmer, I initially leapt to some incorrect assumptions and had to unlearn/relearn a few things.
Without doubt, the best starting point for experienced developers is the official Python Tutorial. This gives a readable but rapid introduction to the core features of the language - not great for novice developers but perfect for those coming from other languages.
The official Python Language Reference comes in two parts - the core language, and the libraries. The core language part is reasonably readable (certainly compared to specifications for other languages), and the library reference is excellent. I recommend reading these too - yes, really. As an experienced developer, I got more out of these than the “Introduction to Python” books I tried.
The Python Guide is a site that is similar to this article in spirit - although I found some advice there out-of-date (or they are trying to also cater to users with older versions of Python).
Despite the quality of the official documentation:
- There were a few things I felt could have been summarized more briefly;
- there were some things that were not clear to me at first, and
- there were things that I misunderstood (often due to assumptions based on other languages)
This article addresses these three issues.
Interestingly, much of the tricky stuff appears to be related to Python’s object-oriented programming support (Python also supports procedural and functional programming).
This article discusses Python3 only; Python2 is now fairly close to dead (at last). It also assumes you are using some kind of Unix.
IMO, the best thing about Python is not the language, but its large standard library and the huge collection of third-party libraries available for it. This article does not look at any third-party libraries - although it does look at how such libraries are installed and found at runtime.
Note that object-oriented features underlie much of Python, including things such as modules, which are addressed here before Python classes and objects are covered. It is just impossible to present things without forward references to concepts not yet explained. On the other hand, I expect you have read at least the official Python tutorial first, and so know roughly how Python classes work anyway.
I have written two additional articles on Python, covering features that are too complex to describe in this already-long article:
- Packaging Python Code
- Async Programming (ie using keywords async/await and module asyncio)
Why Python?
Here are some reasons why Python might be a good choice for a project:
- is cross-platform
- can replace awk or shellscripts with a “real programming language” that is more readable and has more features (eg exceptions)
- includes a good standard library (maps, lists, etc) - something which shells and similar tools lack
- has GUI libraries (unlike shell, etc)
- no compile cycle (C, C++, Java etc)
- has relatively dense code (ie is not verbose) while being very readable
- is a relatively small language (fewer built-in features to learn than some others)
- has a vast range of third-party libraries, and a good package-manager (pip) to install them with
- has a good and active community
- cpython has relatively easy bindings to native code (unlike shell etc)
- cpython can be embedded into applications
- cpython provides a REPL environment (interactive mode) for interactive code development
- and is completely open-source
Here are some personal opinions on why Python might not be a good choice:
- performance in single-threaded mode is not good (Python is slower than many other comparable languages)
- performance in multi-threaded mode is poor - in general, efficient use of multiple cores is obtained by starting multiple processes instead
- lack of static typing for developer support, documentation, and IDE autocomplete
- relatively easily decompiled (ie keeping an implementation secret is not effective)
Basic Stuff
Python has several implementations:
- the standard interpreter CPython (so named because it is implemented in C)
- pypy which is an interpreter combined with a just-in-time tracing compiler (so named because it is implemented in Python). Do not confuse pypy with pypi the Python package index!
- IronPython - Python 2.x for dot-net runtime (support for Python 3.x is in progress, not yet released)
- Jython - Python 2.x for the JVM runtime
Python has a reasonably large standard library, and a large catalogue of third-party libraries available at pypi.org.
The `pip` package manager tool can download and install packages from the pypi.org catalog. Pip can install into specific environments (eg an environment per application) to avoid version-conflicts between applications on the same host. See later for more on `pip` and `venv`.
Some Python libraries are wrappers around native libraries. The underlying native library might need to be installed manually (either with a native-code package manager such as apt, or from source-code) before the installed Python module actually works. Recently, `pip` and its file-format have been enhanced to support wheel bundles, which can include precompiled binaries for multiple operating systems along with the Python code - if the publisher of the library has made the effort to build the binaries. Various library packaging techniques are examined in a separate article.
PEPs
The Python Enhancement Proposal (PEP) process is how the language evolves - people write PEP documents suggesting changes/enhancements to the language which are then accepted or rejected. Many of the features in the Python language core, and in the standard libraries, are officially defined in PEP documents rather than in a single central "official language specification".
It is useful to know this, as searching for common Python problems on the internet often returns the advice “see PEP xyz”.
Naming Conventions
The Python standard library is, for historical reasons, not completely consistent. The recommended practice for new code is:
- mostly, use `lowercase_with_underscores`
- constants should be `UPPERCASE_WITH_UNDERSCORES`
- however classnames should use CamelCase (even though many classes in the standard library do not)
- and exceptions (which are classes) should use the form `{SomeClassName}Error` (assuming they do represent errors) 1
Underscores
When a name used within a class body has two leading underscores, and does not end with two underscores, then Python mangles the name into the form `_{classname}__{originalname}`.
This ensures that there is never a "name collision" with an attribute defined in a subclass. Such names should obviously not be used for attributes which are intended to be part of the class's public behaviour - ie should only be applied to attributes used for internal purposes by methods on that class. As a side-effect, it makes access to such attributes slightly more complicated for external code - but the goal of mangling is not to provide "access control for attributes" like some object-oriented languages support.
Names that have two leading and two trailing underscores are treated specially by the Python environment; these are sometimes called "dunder names". User code is not supposed to define new names of this form; user code may override methods with such names (when such support is documented). Sometimes it is acceptable to read or write such names (see documentation) - although in most cases, Python provides a wrapper function to access attributes with such names (eg instead of accessing `obj.__dict__`, use `vars(obj)`).
A name with one leading underscore is an indication that the name is intended to be private to the module; a "`from module import *`" will not import such names. However the name is still accessible via explicit use of the module name ("weak privacy").
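As a small illustration (the class and attribute names here are invented), both conventions look like this:

```python
class Account:
    def __init__(self):
        self._cache = None    # single underscore: private by convention only
        self.__balance = 0    # double underscore: mangled to _Account__balance

a = Account()
print(vars(a))                # {'_cache': None, '_Account__balance': 0}
print(a._Account__balance)    # reachable, but clearly "internal"
# print(a.__balance)          # AttributeError - the unmangled name does not exist
```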
Programming Styles
Python supports:
- procedural programming
- functional programming
- object-oriented programming
In my opinion, there are strengths and weaknesses in Python's support for each of these styles.
Procedural programming requires support for functions that take and return data-structures, and a library with mutable collections of various types. All the basics are supported in Python. However IMO the procedural programming support is tricky without proper data-structure types; Python really only offers `dict` (a key/value map) or "named tuples".
Functional programming requires support for closures and functions as first-class entities. Both of these are present. The standard library modules ‘functools’, ‘itertools’ and ‘operator’ also provide some functional-related helpers. However functional programming also benefits from libraries with immutable collections, monads, and various other helpers; Python’s standard library offers little support here. The Python community is also largely object-oriented rather than functional, so searching for help on functional programming in Python is likely to be difficult.
Object-oriented programming is well supported in Python - multiple inheritance is there, and "interfaces" are available via the abstract base class (`abc`) standard library module. I'm still struggling with the idea of non-typed object oriented programming - to the point where a class cannot declare the members that its instances have, and although interfaces exist, no user of an object ever tests the type of a parameter. And this dynamic behaviour means that performance of object-oriented code is extremely poor compared to typed languages.
The dynamic-typed nature of Python just doesn’t feel right to me, coming from a typed programming language background. For small hundred-line scripts, ok. But my interest is in large-scale projects (multiple developers over multiple months to dozens of developers over years) and I fail to see how such applications can be successfully created without the support of a type-system. On the other hand, there are successful Python projects of this scale. The Python tutorial itself states that the primary reasons to use Python include being able to do what shell-scripts, awk, etc. can do - but with the power of a full programming language when needed.
Parsing Source Code Files
When a Python file is first loaded, it is parsed and converted to bytecode format. This is a relatively simple process in which keywords, operators and literals are recognised and an abstract-syntax-tree (or similar) is built for the source. A SyntaxError is reported if the code is clearly not valid (syntax tree cannot be built). However functions, classes, etc are not registered in any namespace in this phase.
Once the file has been converted to bytecode, that bytecode is immediately executed from start to finish. A Python source-code file should thus be seen as a sequence of executable statements. This is quite different from compiled languages, where source-code is “passive input to the compiler”.
One effect of this approach is that Python can be used similarly to shell-scripts.
Another effect in more complex applications is that Python code can affect what in other languages is the “compilation” phase. As examples:
- Python code within a file can generate Python classes at runtime, which can then be referenced later in the file.
- A Python module can dynamically determine, during loading, which other modules to import.
However in most cases, the executable statements in a Python file simply create variables, functions (function-objects), and types (class-objects), and register them with the enclosing module’s namespace.
For `class` statements at top-level within a module, the body of the class is executed when the module is loaded (though in a temporary namespace; see classes later). Usually, the code in the class body simply creates function objects (which are later attached to the class as methods), or objects of other kinds (which are later attached to the class as class members).
For `def` statements at top-level within a module (or top-level within a class-body), the body of the function is (obviously) not executed; instead a function object is created which points at the block of associated bytecode. That bytecode is only evaluated when the function is called. Importantly (and not so obviously), class and function definitions (and even import statements) are allowed within the body of a function, but have no effect (are not evaluated at all) until the containing function is executed.
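As a small sketch (the module content is invented), the following shows that def and import statements are themselves just statements that run in order, and that statements inside a function body only run when the function is called:

```python
print("runs as soon as this module is loaded")

def outer():
    # this import and this nested def are only executed when outer() is called
    import json
    def helper(data):
        return json.dumps(data)
    return helper({"loaded": True})

print(outer())   # only now are the import and the nested def evaluated
```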
Bytecode Caching
By default, the CPython interpreter saves the bytecode for each Python file in a file named `{basefilename}.{interpreter-tag}.pyc` (eg `foo.cpython-38.pyc`); when run with optimisations enabled an extra `.opt-N` tag is added (the old `.pyo` suffix was dropped in Python 3.5). When a Python file is loaded as a module (see later), and the cached bytecode file is newer than the corresponding source, then the bytecode is simply loaded from the cache-file for performance.
These bytecode files are stored in a directory named `__pycache__`, a subdirectory of the directory holding the file that was "compiled". Python 3.8 added an environment variable which allows an alternate location for the cache directory to be specified. The bytecode of the "main file" specified on the Python interpreter commandline is never cached - only modules that such a file loads.
Files in the Python standard library are also compiled to bytecode and cached. However this compilation process is normally triggered during install of the Python environment, rather than "on demand", because normal users usually do not have write access to the `__pycache__` directories within the Python installation directory.
Caching of bytecode can be suppressed if desired.
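For example, caching can be controlled from Python itself (a minimal sketch; the directory name `src` and filename `foo.py` are placeholders):

```python
import sys, compileall, importlib.util

# suppress writing of .pyc files for the rest of this process
# (equivalent to "python -B" or the PYTHONDONTWRITEBYTECODE environment variable)
sys.dont_write_bytecode = True

# explicitly pre-compile a directory tree of sources into __pycache__
compileall.compile_dir("src", quiet=1)

# show where the cached bytecode for a given source file would be stored
print(importlib.util.cache_from_source("src/foo.py"))
```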
Python code can actually be distributed as just bytecode files, without the source-code, and the CPython interpreter will happily run it. However this is not particularly useful, as bytecode:
- is not guaranteed to be portable between different Python interpreters, and
- is not guaranteed to be backwards-compatible between releases
Modules
The contents of each source-code file is represented in memory as a separate module object. The Python interpreter keeps a global map of all modules that have been loaded (in variable `sys.modules`), ie a table of `module-name -> module-object`.
One of the standard attributes of a module object is `__name__`. When the file has been executed directly from the commandline (eg via `python3 foo.py`) then a new module object is created with module-name `__main__` (regardless of what the filename is). When the file has been indirectly loaded via an "import" statement in some other file, then the module object created for that file has property `__name__` set to the filename (without suffix).
All declarations (variables, functions, and classes) within a file are stored in the associated module object. Or in other words, each assignment-statement, function-definition, or class-definition which was executed as the module’s contents were processed caused entries to be added to the module object.
An import-statement at the global level within a module is also an executable statement, and is executed when the module is loaded. First the interpreter-global map is checked; the file is parsed if (and only if) the module is not already present. New entries in the namespace of the importing-module are then created which point to the specified (imported) objects from the specified module.
The list of filesystem locations which are searched to find the source-code for a module is held in variable `sys.path` - and any module can change this, thus affecting where later imports are looked for. It initially includes the directory in which the "main" Python file is stored, the entries of the `PYTHONPATH` environment variable, and a constant path built in to the Python interpreter.
A string occurring at the start of a file becomes the "docstring" for the module (and is stored in attribute `__doc__`).
Statement `import X` loads X if not already loaded, and adds a reference to module X to the local namespace. Statement `import X as Y` does the same, but the reference name in the local namespace is "Y". Statement `from X import A, B` loads X if not already loaded, then adds references `A -> X.A` and `B -> X.B` into the local namespace - but does not add X itself.
Each module object has several attributes:
- `__name__`: string (mentioned above)
- `__all__`: list(str) (controls imports)
- `__author__` and `__version__`: strings - for documentation
These can be read (and sometimes assigned-to) from within the module.
When `from X import *` is executed, all names in the module's `__all__` variable are added (or, if `__all__` is not defined, all names that do not begin with an underscore). As wildcard-imports are not recommended in production code, that isn't very significant - but some IDEs may prefer to recommend (autocomplete etc) names from `__all__` if it is defined.
An import-statement can occur in places other than at the top of a file. When it occurs within a function-definition (def-statement), then the import and namespace-updates occur when (and each time) the function is invoked. As modules are cached globally, that means that the actual import (reading of referenced source) will occur on the first call to the function (assuming nothing has loaded it earlier). Thereafter, just the importing of names into the local namespace (the function-namespace in this case) is done each time.
The `importlib` standard library module can be used for more control over module-loading (eg dynamically choosing the module name).
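A minimal sketch of dynamic loading via `importlib` (the choice of module name here is artificial):

```python
import importlib

name = "json"                        # module name computed at runtime
mod = importlib.import_module(name)  # roughly equivalent to "import json"
print(mod.dumps({"a": 1}))           # {"a": 1}
```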
The function `dir()` effectively lists the attributes of the current module.
Because modules add names to the namespace of the importing module, there is the possibility of naming-conflicts. The `import` feature therefore provides various ways of renaming things as they are imported (ie the name under which something is registered in the importing namespace might be different to the default name).
Modules do not really form a hierarchy; a module may contain references to other modules in its namespace but the modules themselves are really independent entities (“peers”).
Standard library module sysconfig holds various interesting host-specific settings, including the location of the standard library.
There are some hooks for customising how “import” actually works - although the need to do so is presumably extremely rare.
Packages
A package is a group of modules (ie group of source-code files) that “belong together”, all stored in a filesystem directory.
A package can include a file named `__init__.py` 2. This file is executed when:
- code imports the package, or
- code imports any module in the package
The init-file can contain any arbitrary code; it is in effect a standard module, except that it is “auto-loaded” when any module in the package is loaded.
One special case does exist: when `__init__.py` defines variable `__all__` then this is not interpreted as a list of importable names in the local namespace (as for normal modules) but instead as a list of importable modules in the package. Thus `from package import *` will cause all modules referenced by the all-list to be loaded.
A package can also include a file named `__main__.py`, in which case the package can be "run" in the same way as a module (`python -m {packagename}`); see later.
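As a sketch (the package name `mypkg` is invented), the two special files might look like this:

```python
# mypkg/__init__.py - runs when mypkg, or any module inside it, is first imported
print("initialising package mypkg")
__all__ = ["utils"]        # "from mypkg import *" will import mypkg.utils

# mypkg/__main__.py - runs when the package is executed via "python -m mypkg"
print("running mypkg as a program")
```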
Note that the word “package” is sometimes used in another sense: a bundle of software that can be downloaded and installed.
Running Python Code
Python code can be executed in several ways:
- `python {filename.py} ...`
- `python -m {modulename} ...`
- `python -m {packagename} ...`
- `python {zipfile} ...`
- or as an executable script
The first option (explicitly specifying the filename) simply loads that file and executes it. The directory from which the specified file was loaded is added to the module-search-path, ie any "import" statements executed during the program will look for modules relative to that directory first, before looking in the standard locations.
The second option looks for a file named `{modulename}.py` in the current module search-path - usually current-dir, site-packages, and the standard library.
When Python finds that the name used with the "-m" option is a Python package rather than a Python module, then it executes the file `__main__.py` from that package (see section named Packages).
When a zipfile is passed to the Python interpreter, Python treats it as a package, ie looks for a file named `__main__.py` in the root directory of that zipfile.
As usual in Unix, a file which starts with `#!/path/to/interpreter` and is marked as "executable" can be run just like a binary application. As the path to the Python interpreter can differ between operating systems, it may be better to set the first line instead to `#!/usr/bin/env python3`, which will find the command using `$PATH`.
Rather oddly, it is also possible to create a file which is a zipfile appended to an executable script.
Files named `{packagename}.pth` can be added in any of the "site-packages" directories in the Python module search path; the file contents are then added to `sys.path`. Similarly, a file `sitecustomize.py` can be added to any directory in the searchpath, and it will be evaluated before any other imports (and so can modify `sys.path`). See the docs for `site.py` for more details.
Calling Native Code
Python code can call into native (non-Python) code in two ways:
- A Python module can be completely implemented in C (or any language with a C-compatible binary API) and compiled into an object-file. The module can then be “imported” by using code just like pure Python modules can be imported.
- Pure Python code can dynamically load an (unmodified) native shared library, and then call into it.
A module implemented in C (or other language) must use Python-specific calls internally to be able to correctly accept Python input parameters, and correctly deal with them. A number of standard libraries are implemented this way.
Python code which calls into a normal native shared library must instead map variables into forms that the called code can understand before invoking the underlying code. For example, when a function expects an int and a C struct, then Python code must map a Python int object into the equivalent native form, and build the expected C struct as a byte-array before invoking the target function. The necessary code is verbose and ugly but this approach is often used.
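As a minimal sketch of the second approach, the standard `ctypes` module can load an unmodified shared library and map Python values to C types (this assumes a Linux host where the C library is available as `libc.so.6`):

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")     # load a plain native shared library
libc.printf.restype = ctypes.c_int  # declare the return type of the target function

# Python values must be mapped to C-compatible forms before the call
count = libc.printf(b"%s %d\n", b"value:", ctypes.c_int(42))
print("printf wrote", count, "characters")
```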
The SWIG project provides tools to generate Python wrappers for arbitrary C libraries, as an alternative to hand-writing such logic.
It is common for Python to be used as "high level glue" between logic implemented in C libraries which do heavy numerical processing. The `NumPy` mathematical library is a very widely-used example; mathematical computation is done in native code and Python is used to "orchestrate" the calls. This gives a combination of performance at low-level and readability at high-level.
Namespaces
Whenever code is executed, there is always an implicit “current namespace” (a mapping of variable-names to variable-values).
When Python starts loading a module for the first time, it creates a namespace. All code at the “top level” within the file for a module uses the module-level namespace to find variables or add new ones. A module-namespace is also called a “global” namespace - though it is not global-for-all-modules, just per-module.
When Python invokes a function, it creates a new local namespace and makes it the default. Any function parameters are then added to the new namespace before the code associated with the function is executed. On return from the function, that newly-created namespace is discarded. Name lookups which fail in the local namespace continue in any enclosing function scopes (fixed when the function was defined), then in the namespace of the module in which the function was defined, and finally in the builtins namespace. Scoping is therefore lexical, as in most programming languages: the name-lookup-chain depends on where the function was defined, not on the application's call-flow.
The root namespace of any chain of namespaces is the `builtins` namespace that holds various core functions and variables, and references to auto-loaded modules from the standard library. Functions like `print` and types like `int` are defined as entries (varname-to-value mappings) in the `builtins` namespace.
Globals and Nonlocals
Variables are just entries in a namespace (varname mapping to value). Python uses a consistent pattern for looking up namespaces, and modifying them:
- a read looks first in the local (current) namespace; if not found then look in the parent of that namespace, and so on.
- a write always writes directly to the local namespace (creates a binding of that name in the local namespace, ie add a mapping from varname to value)
This means that constants (including method-names) can be effectively defined at module-level; from there they are readable by all code that needs them.
However if code using a child namespace tries to assign to that name, a “shadowing entry” is created in that child (local) namespace. In particular, this means that when a variable is declared at module scope, and code in a function tries to assign to it, then the assignment creates a value only in the function scope (namespace) - and no change occurs in the module.
Within any scope (eg within a function), the keyword `global` can be used to modify this behaviour; the line of code `global foo` ensures that any assignment to `foo` in the local scope actually modifies the namespace of the enclosing module (effectively updating the module's global namespace, as returned by `globals()`). This occurs regardless of how many layers of namespaces lie between the current namespace and the module namespace.
WARNING: if a global variable is of a mutable type, then its state can be changed by reading it and calling a mutating method, without needing to rebind the variable itself. A `global` declaration is thus not needed for such changes to be visible outside of the namespace in which the change was made.
The keyword `nonlocal` does something similar; it causes whatever existing "parent namespace" holds the variable of that name to be updated. Or, as the original PEP states: "`nonlocal` prevents the name from becoming local" (even when being assigned to).
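A short sketch (the names are invented) showing both keywords:

```python
counter = 0              # module-level ("global") variable

def increment():
    global counter       # without this, the assignment below would create a local binding
    counter += 1

def make_accumulator():
    total = 0
    def add(x):
        nonlocal total   # rebind "total" in the enclosing function's namespace
        total += x
        return total
    return add

increment()
acc = make_accumulator()
print(counter, acc(5), acc(3))   # prints: 1 5 8
```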
Nested Functions
Python supports functions defined within functions, eg:
```python
def outer(lname, fname):
    print("lname=", lname)
    def inner(fname):
        print("full name: {}, {}".format(lname, fname))
    inner(fname)

outer("smith", "joe")
```
When a module with this content is loaded:
- executing the “def” line causes an entry named “outer” to be created in the module namespace which points to a function-object that has a block of bytecode.
- executing the last line of the above example actually invokes the function-object bound to name “outer” (ie calls function outer).
Executing outer first causes a new namespace to be created, with two initial entries: `lname` and `fname`. The parent of this namespace is the module namespace.
Then the bytecode associated with function outer is executed:
- An object named `print` is looked up in the current namespace (not found), its parent namespace (the module, not found), and finally the `builtins` namespace (found). The resulting function-object is then invoked.
- A new function object is created which points to a block of bytecode, and this object is registered in the current namespace under name `inner`.
- The object named `inner` is looked up in the current namespace (found) and invoked. This causes a new namespace to be created, with an initial entry `fname`, etc.
Note that the function `inner` only exists temporarily. Each time outer's bytecode is executed, the "def" bytecode causes a new function-object to be created whose `__code__` attribute points to the bytecode for function `inner`. That new function-object contains a reference to the "enclosing namespace" (and is thus a closure). That namespace will be a different object on each call to `outer`. This process is moderately fast - the source-code does not need to be re-parsed, only the bytecode needs to be executed.
Protocols and Duck Typing
In a language like Java, we might define `interface Foo {..some methods}`, and then define a method `doSomething` which accepts an instance of type `Foo`.
However this has the disadvantage that we might well have an object that has the necessary methods, but does not have type `Foo` as an ancestor - and so cannot be passed to that target method.
Python supports something called “duck typing” - if a function expects a specific parameter to provide a method with a specific signature, then it should just invoke that method without caring about what type the object is. When the object has such a method, fine. When not, an error is reported. Simple.
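A tiny sketch of duck typing (the classes and the `describe` function are invented): any object with a suitable `speak` method is acceptable, regardless of its ancestry.

```python
class Duck:
    def speak(self):
        return "quack"

class Robot:
    def speak(self):
        return "beep"

def describe(thing):
    # no isinstance() check - just call the method and let it fail if absent
    print(thing.speak())

describe(Duck())     # quack
describe(Robot())    # beep
# describe(42)       # AttributeError: 'int' object has no attribute 'speak'
```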
However this raises the question: how does the caller of a function know what methods are expected on the parameters?
Python uses the word “Protocol” to describe the situation where interfaces meet duck-typing. A protocol is effectively an informal text document that provides exactly the same information that the Java interface above does - what methods need to exist, what parameters they should take, and (perhaps) what logical behaviour is expected. Any Python object that is consistent with this informal text document is said to implement the protocol. A function then just declares what protocols specific parameters are expected to support.
In short, protocols are documentation-based interfaces that are not enforced by the compiler by way of types. This gives flexibility and removes “boilerplate” code - at the price of less compile-time support for developers.
See also the section on Abstract Base Classes below.
Annotations
Annotations can be added to function params and return-values:
```python
def f(x: 'my x', y: 'my y') -> 'my return':
    ...
```
The value following the colon or `->` is an object of any type - including a callable-object. In the above example, the objects used are just strings.
Annotations on function parameters and return-values have been part of the language since Python 3.0 (PEP 3107); Python 3.5 standardised their use as type hints (PEP 484), and Python 3.6 added support for annotations on variable-declarations.
Python itself does nothing with these values except store them in the function-object, for use by other code (in an object attribute called `__annotations__`).
This feature can be used for many different purposes, including aspect-oriented programming, or quality-assurance, or documentation. However the standard use for these annotations is type-hinting; when this is not the case then the file (module) should explicitly mark itself with a comment of form `# type: ignore` or use one of the other approved mechanisms (see PEP 484).
Type hinting is when the object following the colon represents a type (ie is a reference to a class). Standard-library classes such as `int` and `str` are obvious candidates. The `Mypy` tool is a static type-checker that scans Python programs and reports apparently inconsistent code. Other tools also exist which take advantage of type-hints. The `@dataclass` decorator added to the Python 3.7 standard library (see later) depends on type hinting.
Because `def` statements are actually code that is executed, the annotation can be an expression (including a function-call) that returns the annotation value. The expression is evaluated when the function-object representing the function is built, not each time the resulting function-object is invoked.
The Python standard library includes a module `typing` which allows syntax like `Sequence[int]` - which at runtime is actually just an alias for `Sequence`, as Python itself is not typed. The typing module supports using the more expressive declaration in source-code - in fact, it also supports "generics", ie can define collections of other types - with covariant/contravariant constraints. None of this is validated at runtime.
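A brief sketch of type hints in practice (function and parameter names invented); Mypy would flag the commented-out call, but the interpreter itself never checks:

```python
from typing import Optional, Sequence

def total(values: Sequence[int], scale: Optional[float] = None) -> float:
    factor = 1.0 if scale is None else scale
    return sum(values) * factor

print(total([1, 2, 3], 2.0))   # 12.0
# total("not ints")            # accepted at runtime (until sum() fails); Mypy reports it statically
```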
Note that Java annotations use an `@Name` syntax and thus look superficially like Python decorators (see later); however they are really more similar to Python annotations - passive data that the compiler attaches to code, for use by other code (static external tools or runtime internal code-transformers).
Descriptors
When looking up an attribute on an object, the Python base class `object` (the ancestor of every Python object) has some special code to support descriptors.
A descriptor is any object that supports the descriptor protocol by implementing one or more of the following methods:
- `__get__(self, obj, objtype=None)`
- `__set__(self, obj, value)`
- `__delete__(self, obj)`
Storing an object that supports the descriptor protocol as a class attribute allows user code to write “obj.attr” as if it were accessing a simple data value, while Python internally invokes the associated getter or setter function. This feature can also be seen as allowing “virtual attributes” on a class, backed by functions.
When code invokes `x = someobj.someattr` then that read-operation ends up within the base object class. The object class searches for the attribute, then (assuming it is found) it is checked to see if it has a get-method from the descriptor protocol; if so then the getter method is invoked, otherwise the attribute is returned directly.
Similarly, when code invokes `someobj.someattr = val` then the base object class searches for an existing attribute (as if a read was being done), and (assuming it is found) checks whether the target of the assignment (someattr) has a set-method from the descriptor protocol; if so then the setter method is invoked, otherwise the local namespace is updated to map someattr to the new value.
Note that it was described earlier how reads can reference mappings from ancestor namespaces, but writes are always done in the local namespace. Writing an attribute that is a Descriptor is subtly different; the write will be done via the setter on the descriptor instance even when it was found in an ancestor namespace. However in practice, descriptor setters will almost always actually store the data in the “self” object (ie the local namespace) anyway - after validation or whatever else the setter needs to do.
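A minimal sketch of a custom descriptor (class names invented; `__set_name__` is a convenience hook added in Python 3.6):

```python
class Positive:
    """A data descriptor that rejects non-positive values."""
    def __set_name__(self, owner, name):        # called when the owning class is created
        self.storage = "_" + name
    def __get__(self, obj, objtype=None):
        return getattr(obj, self.storage)
    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError("must be positive")
        setattr(obj, self.storage, value)       # store the value on the instance itself

class Order:
    quantity = Positive()                       # descriptor stored as a class attribute
    def __init__(self, quantity):
        self.quantity = quantity                # routed through Positive.__set__

o = Order(3)
print(o.quantity)     # 3, via Positive.__get__
# o.quantity = -1     # would raise ValueError
```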
When a descriptor provides set and delete methods, and is stored as an attribute on a class, then it is impossible to later replace that descriptor - any attempt to delete or override the attribute will instead be delegated to the descriptor itself!
A descriptor is particularly useful when the class is already in use, and what used to be a plain attribute now needs associated logic on get or set; the approach means that such logic can be added without needing to change existing code.
Because descriptors look like simple attribute-access, it is recommended that the functions also behave similarly to attribute-access (to not surprise callers), ie:
- be side-effect-free (ie not change application state)
- be relatively fast
IMO, descriptors are somewhat of a hack - they just don't feel very elegant. However in practice they work, and are widely used - in user code most commonly via the `@property` decorator (see below).
Interestingly, all methods on a class are actually descriptors; when `someobj.somemethod()` is invoked, "somemethod" is usually an attribute on the object's class (or an ancestor class) which implements the descriptor protocol. The get-method returns an object which is the underlying function with the first "self" parameter appropriately bound (in functional language, partial application has been performed). Thus `obj.method(1)` is actually:
- find attribute method - which will be a descriptor object
- invoke `descriptor.__get__` - which returns a function-object where the first parameter is bound to `obj`
- invoke `fnobj(1)` - which delegates to the original function with two args: `(obj, 1)`.
See later for more information on classes and the process that constructs them.
See also:
- https://docs.python.org/3/howto/descriptor.html
- https://www.smallsurething.com/python-descriptors-made-simple/
Decorators
The syntax `@somecallable` can be applied to any function or class declaration. After the referenced object has been created, the Python interpreter passes the created object to the specified callable, and stores the returned value (in the class or module namespace) rather than the original one.
This very generic tool is called a decorator and can be used for a wide range of code-transformations.
Decorator `@property` is probably the most common decorator from the standard library; it is applied to a getter function on a class and replaces the function-object with an object that implements the descriptor protocol (see above) and delegates get operations to the original function. A somewhat odd syntax can be used to similarly delegate set (or delete) operations. The result is that all instances of the class appear to have a specific data attribute ("obj.attr") but this is in fact implemented via functions. This is also sometimes called "virtual attributes", because what looks like a simple data-value stored on a class is actually produced by a function when read, and processed by a function when written.
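A short sketch of `@property` and its "somewhat odd" setter syntax (class and attribute names invented):

```python
class Circle:
    def __init__(self, radius):
        self._radius = radius

    @property
    def radius(self):            # read via "c.radius"
        return self._radius

    @radius.setter
    def radius(self, value):     # written via "c.radius = value"
        if value < 0:
            raise ValueError("radius must be non-negative")
        self._radius = value

c = Circle(2)
c.radius = 5        # goes through the setter
print(c.radius)     # 5, via the getter
```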
Decorators `@classmethod` and `@staticmethod` are also available for use on functions declared within classes. As noted earlier, functions declared within a class body are usually transformed into descriptor objects during class initialization; the classmethod/staticmethod decorators immediately wrap functions in descriptors, providing their own logic and blocking the default transformation at the same time.
- `@staticmethod` wraps the target method in a descriptor that basically does nothing (except return the function from its get-method); the primary point is to prevent creation of the usual descriptor which binds "self".
- `@classmethod` wraps the target method in a descriptor whose get-method returns the target function with the first parameter bound to the class (ie to `obj.__class__` when accessed via an instance, or to the class itself when accessed via the class).
A static method is effectively the same as invoking a module-level function, ie `Foo.staticmethod(args)` is equivalent to `staticmethod(args)`. So why would you want one? Well, primarily so that the method is easily available/accessible when you have imported type Foo, but potentially nothing else. It can also be invoked via an instance, ie `foo.staticmethod(args)`, and is therefore available even if you just have an object without an explicit type.
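A brief sketch (class name invented) showing both decorators in use:

```python
class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    @classmethod
    def from_fahrenheit(cls, f):       # first parameter is the class, not an instance
        return cls((f - 32) * 5 / 9)   # also does the right thing for subclasses

    @staticmethod
    def is_freezing(celsius):          # no implicit first parameter at all
        return celsius <= 0

t = Temperature.from_fahrenheit(212)
print(t.celsius)                       # 100.0
print(Temperature.is_freezing(-4))     # True
print(t.is_freezing(5))                # also callable via an instance: False
```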
Decorators `property`, `staticmethod` and `classmethod` are also available as plain functions. A class body is executed at module load-time, so calling these functions is just as easy as applying a decorator (though perhaps not quite as readable - depending on your tastes).
Decorator `@dataclass` was added in Python 3.7. It is applied to classes, not to methods, and provides:
- a generated init-method which creates an instance attribute for each typehint-annotation on the class (ie dataclasses rely on the variable-annotation support introduced in Python 3.6)
- a bunch of generated standard methods (equals, repr, optionally lessthan, etc) that use those attributes
An example dataclass (from the official docs):
```python
from dataclasses import dataclass

@dataclass
class InventoryItem:
    '''Class for keeping track of an item in inventory.'''
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand
```
Callables
Any object with method `__call__` can be invoked via syntax `obj(params)`.
Python represents all standard callable objects (functions, lambdas) as objects of type `collections.abc.Callable`, but that type is not absolutely required to invoke an object.
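A small sketch of a user-defined callable (class name invented):

```python
class Multiplier:
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, x):      # makes instances invokable with function-call syntax
        return x * self.factor

double = Multiplier(2)
print(double(21))               # 42
print(callable(double))         # True
```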
The two standard ways of producing a callable object are:
- via `def fn(..)`
- via `lambda x: ..`

A `def` creates an object:

- whose `__name__` attribute is the function-name used in the declaration
- whose `__module__` attribute is also set appropriately
- which has a docstring in attribute `__doc__` if one is defined
A lambda-expression also produces a callable object, but the above attributes are not set - and the body of a lambda is limited to a single expression.
Closures
Python supports closures; when a callable object is created at runtime from a lambda or a “nested def”, then the object “captures” variables from the enclosing scope.
Example:
```python
class destructable:
    """A class that creates objects which print a message when their refcount drops to zero"""
    def __init__(self, name):
        self.name = name
    def __del__(self):
        print("destroying {}".format(self.name))

def mkclosure(d):
    def closure(x):
        """A callable object that captures a reference to object d"""
        print("x={}, d={}".format(x, d.name))
    return closure

# create some objects that report when their refcount drops to zero
d1 = destructable("d1")
d2 = destructable("d2")

# create an object which holds a reference to d2
fn = mkclosure(d2)

# show that the closure works
fn(12)

# remove references to the following objects from the current (module) namespace
print("removing d1 and d2 from module namespace..")
del d1  # refcount drops to zero, so destructor is called
del d2  # refcount does not drop to zero because closure "fn" holds a reference to it

# show that the closure still works, even though d2 is no longer accessible from this scope
fn(13)

# delete the closure, at which point the refcount for d2 drops to zero and its destructor is called..
print("deleting closure..")
del fn
print("done")
```
and the output is:
x=12, d=d2
removing d1 and d2 from module namespace..
destroying d1
x=13, d=d2
deleting closure..
destroying d2
done
The above example assumes CPython is being used, which uses refcounting rather than mark-and-sweep garbage collection, and thus destructors are called immediately. The concept of closures also works in other Python implementations, but the output might not show so clearly how things work, due to delayed invocation of the destructor (method `__del__`).
Exceptions
Python’s support for exceptions is pretty standard - very much like other languages. Exceptions should be objects, and catch-expressions select specific exception-types using the type hierarchy.
Exception handling is also similar to other languages; Python’s try/except/finally is equivalent to Java’s try/catch/finally.
Somewhat unusually, Python’s standard libraries often use exceptions to implement flow-control, eg an iterator object indicates “no more data available” by throwing an exception.
It is considered good Python programming style to follow the EAFP principle - “easier to ask forgiveness than permission”. This means that it is preferable to perform an action that should succeed and catch a resulting exception if a problem occurred, than to check in advance whether the operation would succeed (LBYL - “look before you leap”).
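A small sketch contrasting the two styles (the dictionary and key names are invented):

```python
config = {"host": "localhost"}

# LBYL - check before acting
if "port" in config:
    port = config["port"]
else:
    port = 8080

# EAFP - just act, and handle the failure
try:
    port = config["port"]
except KeyError:
    port = 8080

print(port)   # 8080
```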
Tuples
Python tuples are identical to those found in many other languages - a fixed-length immutable sequence of references to other objects. Elements are accessed by index-number, or by using "tuple unpacking" (aka "destructuring assignment") like `(a, b, c) = (1, 2, 3)`, which assigns the different components of the tuple `(1, 2, 3)` to different variables.
They are only mentioned here briefly as the primary alternative to classes, which are discussed below.
Python also provides “named tuples” - basically a mix of tuples and classes; these are discussed later.
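As a brief preview (field names invented), the standard `collections.namedtuple` factory creates such tuple/class hybrids:

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)
print(p.x, p[1])    # fields are accessible by name or by index: 3 4
x, y = p            # still unpacks like a plain tuple
```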
The Dict Type
A dict provides a mapping of key to value. Python refers to any such class as a “Mapping type”, and dict is the standard implementation.
Objects used as keys of a dict must be "hashable" - which in practice usually means immutable; the dict verifies hashability when a key is inserted. A dict is therefore a constrained kind of mapping.
The dict type is itself implemented as a class (naturally); each dict instance holds the user's key/value data, which is stored and retrieved via syntax like `mydict["foo"] = bar`.
Because the dict is so widely used in Python, its implementation is actually in native C code.
Python classes often use a dict-like structure to store custom (non-builtin) attributes - and even use the special attribute name `__dict__` for this. However IMO it is better to think of the class implementation as something separate from general dicts.
Classes and Instances
All Objects
Every value in Python is an object. In CPython, every object is represented by a native (non-Python) in-memory structure that contains:
- a reference count
- a native field `__class__` which holds a reference to the object which represents the type of the object
- zero or more additional native fields that depend upon the object type
These “native fields” are simply entries in a C “struct” declaration (for CPython at least). Class objects have different “native fields” in the in-memory representation than instances of that class, function objects have their own set of custom fields, etc. At the Python level, some native fields can be accessed from Python - in which case they have names with two leading and trailing underscores. See here for the full set of such fields.
Complex objects often have a native field named `__dict__` which is a (key, value) mapping, used to store non-builtin (ie non-native) methods and attributes.
Class Objects
A class is an object which is “a factory for instances”. A class can also be used as a kind of mini-module, ie a namespace.
Python documentation often uses the term “type object” for what I call “class object”; it is the same thing - an in-memory structure that doesn’t itself represent user data, but instead represents a type associated with objects that do store such data.
It may be helpful to think of a class-object as having multiple separate roles:
- holding a reference to a "shared parent namespace" (mapping of `name -> value` where the values might be constants, variables, or method-descriptors)
- holding a reference to a set of base classes that provide additional "shared parent namespaces"
- being a factory that creates new objects which use that shared parent namespace - and applies common “init logic” to each object
- being a “type tag” that can be used in “isinstance” calls at runtime to identify all objects created via the class
When a declaration `class SomeClass(SomeAncestor, metaclass=SomeMetaClass)` is executed in a source-code file, then logically:
- a new (temporary) namespace is created, and made the default namespace
- the body of the class is executed (as a sequence of statements), adding various objects to the temporary namespace
- method `__new__(...)` is invoked on the metaclass, passing the temporary namespace as one of the parameters
The metaclass then allocates a suitable in-memory structure for the kind of object being created (ie with the appropriate "native fields"). It then initialises those native fields from the data in the (temporary) namespace, and if field `__dict__` is part of this type, then copies data into it from the namespace.
For class objects, the native fields include:
- `__name__` - set to the class name
- `__class__` - points to the metaclass
- `__bases__` - list of parent classes (one for single inheritance; more than one means multiple inheritance)
- `__call__` - points to method `__new__` on the class (if it exists), or in an ancestor type (base type `object` provides a default implementation which is usually good enough)
- `__doc__` - points to the "docstring" for the class (if present); see later
- `__annotations__` - holds any annotations (eg typehints) declared in the class body
- `__dict__` - holds a mapping that can store arbitrary (name, val) pairs - eg methods or class-level variables.
As part of the copying of data from the temporary namespace into the class dict, any objects of type function are wrapped in a descriptor which ensures that anyone trying to read the attribute does not actually get the original function (with self parameter) but instead a wrapper function where the first parameter is bound to the object on which the lookup was done. This ensures that `obj.methodname` returns a function-object that is implicitly bound to the correct self-instance.
When the class object has been completely processed, it is then passed as a parameter to any class-decorators that may be attached to it. Finally, the resulting object is placed into the module’s namespace using the declared name.
Note that we have created an object which represents a class, with a set of attributes which are appropriate for a class type. The actual creation/initialisation of such an object is performed by the metaclass - ie it is the metaclass that decides what native fields exist. However in practice, metaclass `type` is almost always the one responsible.
The actual CPython datastructure used to represent a class object (ie objects created by metaclass `type`) is documented here. The in-memory representation of a class object is somewhat "denormalized" - ie the memory structure allocates lots of fields which might not be populated. However given that a program creates relatively few class objects at runtime, this is not an issue; instance objects are less "denormalized" in memory.
If no metaclass is explicitly specified, then the metaclass of the first base class is used. If no base class is explicitly specified, then class `object` is used. Class `object` has as its metaclass class `type` - and thus almost all classes have `type` as their metaclass. See later for more discussion on metaclasses.
Instance objects can have a "constructor" defined on the factory-object that creates them, ie their class. Similarly, a class object can have "constructor logic" defined on the factory-object that creates it, ie its metaclass (methods `__new__` and `__init__`). However there are a couple of other ways to implement per-class-initialization-logic that are not available for instance objects:
- code in the body of the class
- an `__init_subclass__` method on a parent class
Remember that Python files are executed from top to bottom the first time a module is loaded. The "body" of a class is therefore actually executable code; what in other languages are "static member initialisation expressions" are simply code that is executed. Similarly, method-declarations (`def`) are actually being executed, returning function objects, which are also then stored into the current namespace.
If the first statement in the class body is a string, this becomes the "docstring" for the class (and is stored in member `__doc__`).
Some object-oriented languages support the concept of "destructor methods" which are called when the object is "destroyed". Python has something similar: method `__del__(self)` on a class is called when the refcount on an instance reaches zero.
Note that it is possible to be a good Python programmer without actually understanding metaclasses; it is seldom necessary to write a custom metaclass.
Instance Objects
To create an instance of the class, a class object (SomeClass in our example above) is used like a callable object.
Python invokes the `__call__` method defined on the class's metaclass (usually `type`, the default metaclass). That implementation first calls method `__new__` on the class - usually simply inherited from base class `object` - which creates an in-memory structure to hold the new instance, and then calls method `__init__` on the class (if such a method exists).
A class seldom overrides method `__new__`. Method `__init__` is very commonly overridden and is the equivalent of a "constructor" in other object-oriented programming languages - it initialises each instance as it is created.
Each instance has an in-memory structure containing some "native fields", including at least `__class__`, and usually `__dict__` 3. Note that:
- an instance object has a single associated class (its type, ie the factory that created it, and the holder of its “inherited attributes”)
- a class object has a single associated class (its metaclass, ie the factory that created it) and a list of base classes. It is the base classes which provide inherited properties for instances of that class, not the metaclass.
Note that objects which are instances of a class are created by calling their class like a function, and have a constructor method (`__init__`) which is defined on their class. Classes, although they are also objects, are usually created via the `class` keyword, and their constructor is defined on their metaclass (or a few other options; see above).
Interestingly, while native attributes on classes are generally (always?) immutable, some native attributes on instance objects can be modified after the object has been created - including `__class__`!
Reading and Writing Object Attributes
Whenever an attribute on an object is read ("someobj.someattr"), Python first checks whether the name refers to one of the "native fields" of the object - ie those that are directly embedded in the in-memory structure that represents that object. If the attribute being read is not one of the "native fields", and the object has a `__dict__` native field, then the attribute is searched for as an entry in that dict. If not found in the dict associated with the instance, then the search continues in the object referenced by native attribute `__class__`.
The class in turn searches recursively for the attribute in each base-class (native attribute `__bases__`). Thus, a new instance of a class implicitly has all methods of the associated class (and its ancestors) even when it has a nearly-empty dict (or no dict at all); just the `__class__` native attribute is sufficient to find those inherited methods (and attributes).
A method-call on an object is simply an attribute-lookup (as above) followed by an invocation operation on the returned object - ie the attribute lookup is expected to return a callable object.
However when an attribute on an object is written, it is always stored directly in the dict associated with the instance. If it has the same name ("key") as an inherited method or attribute then the inherited value will be masked (though it can be accessed indirectly via `obj.__class__.attrname`).
In summary, when a read operation is performed using dot-notation `obj.attr` then Python looks:
- in the "hard-wired" attributes associated with the instance (`__class__`, etc)
- in the `__dict__` of the local instance
- in the object referenced by `__class__`
- in the base classes of that class (recursively)
Due to this chained-lookup, constants can be effectively defined at class-level, and used at instance-level (they are only read). The difference between read and write also means that primitive types can effectively be defined on a class to provide “defaults” to instances; when they are read then they get the inherited value but as soon as they are written, the instance gets its own copy. However mutable types defined at the class level are dangerous - all instances of the class will share the same state because reading returns the object (eg a map) and modifications of this object then change the shared instance.
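A small sketch of that pitfall (class and attribute names invented):

```python
class Registry:
    items = []      # mutable class-level attribute - shared by every instance!
    count = 0       # immutable class-level default

    def add(self, item):
        self.items.append(item)   # mutates the shared list
        self.count += 1           # reads the inherited value, then writes an instance attribute

a = Registry()
b = Registry()
a.add("x")
print(b.items)   # ['x'] - b sees a's data, because the list is shared
print(b.count)   # 0     - the int was copied onto instance a when written, so b is unaffected
```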
For programmers used to Java and similar languages, it is particularly tempting to treat type-hinted assignments in a class body as declarations of instance variables:
```python
class Foo:
    name: str = "default"
    age: int
    mydata: dict = dict()
```
However these are effectively equivalent to Java "static" members, ie are class-level attributes (stored in the `__dict__` associated with the class). There is no way to declare the members that an instance of a class has; these are only defined implicitly via assignment-statements executed in the constructor (init-method) and other methods of the class. Well, unless the `__slots__` mechanism is used (see later).
There is an official syntax that allows a class to define which attributes the init-method should initialise, but this is just info for external validators - it has no effect at runtime (all declared fields with initial values are added to the class, not the instance, and declared fields without values are merely recorded in the class's `__annotations__`).
Note however that the `@dataclass` decorator (since Python 3.7) does use typehints on the decorated class to define fields for instances. In other words, it uses the information stored in `__annotations__` by the Python interpreter during initial code processing to guide itself when generating the init-method, comparison methods, etc.
Because methods and variables both live in the `__dict__`, it is not possible to have a variable and a method with the same name.
Invoking the Superclass
Unlike some languages, a constructor function (`__init__(self)`) does not automatically invoke the constructor of the parent class; it is considered good style for every constructor function to start with:
super().__init__(...)
When a class has multiple base classes, then the super calls need to be explicit:
Base1.__init__(self, ...)
Base2.__init__(self, ...)
Classes With Slots
Previously, we have described the default implementation of instance objects: instance-level attributes are stored in a dictionary (map) structure named `__dict__`.
However a class may define an attribute `__slots__` which is a list of attribute names (somewhat like namedtuple). When instances of this class are created, the `__new__` method that allocates memory for each instance of the class ensures space is allocated directly in the core in-memory structure, next to the "native fields". Descriptors are automatically added to the class (by its metaclass) so that code like `obj.myattr` retrieves the value from the core in-memory structure rather than using the inherited behaviour of looking in the `__dict__`.
When the slots list includes __dict__ then the instances also have a dict as usual. When the slots list does not include __dict__ then that object simply does not have a dict - and thus cannot have any “dynamically added” attributes or methods. However it still inherits methods and attributes from the type referenced via native field __class__.
Note that at the class-level, __slots__ is just a list of strings. It is the instances of that type that have their in-memory layout significantly modified, not the class that creates the instances.
Class inheritance does still mostly work when the base class uses slots, but there are some quirks. See the Python reference manual for more details.
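A minimal sketch of the difference in behaviour (class names invented for illustration):

class WithSlots:
    __slots__ = ("x", "y")

class WithDict:
    pass

a = WithSlots()
a.x = 1
# a.z = 3           would raise AttributeError - 'z' is not in __slots__
# print(a.__dict__) would raise AttributeError - there is no per-instance dict

b = WithDict()
b.z = 3              # fine - dynamically added attribute stored in b.__dict__
print(b.__dict__)    # {'z': 3}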
Classes and Method Objects
Declaring a function within a module produces a function object - an object with attributes (in dict) like name and doc.
Declaring a function within a class also produces a function object which is added to the class dict. However by default, the code that initialises the class object wraps the function in a descriptor, as described in the section on descriptors. The descriptor’s get-method returns a wrapper around the original function which automatically sets the first parameter (self).
This auto-self behaviour means that the returned function-object can be passed around - doSomething(obj.method1) will give function doSomething an invokable object which “knows” that it is bound to obj.
This wrapping only occurs when the method is retrieved from the class (because the descriptor was set up during class instantiation). If a function-object is stored directly in the dict of an instance then there is no descriptor involved, and so the object is returned without any post-processing. See instance methods for more info.
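A minimal sketch of passing a bound method around (names invented for illustration):

class Greeter:
    def __init__(self, name):
        self.name = name
    def greet(self):
        print("hello from", self.name)

def call_later(fn):
    fn()                 # no instance needed here - the bound method remembers it

g = Greeter("alice")
m = g.greet              # retrieving via the class descriptor yields a bound method
call_later(m)            # prints "hello from alice"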
Multiple Inheritance
A class can have multiple ancestors, eg class Foo(base1, base2, base3): pass
.
Python has a moderately complicated algorithm called MRO (method resolution order) which takes these bases, and their ancestors, and produces a plain linear list of classes used when performing attribute lookups. In general, this is equivalent to a depth-first search of the base classes in the order they are listed. However when the same class is found multiple times in the ancestry tree, then it is added just once to the list - and the exact position in the list is carefully chosen to provide “stable lookup ordering” when changes to the hierarchy occur. In some cases, no reasonable “stable” ordering can be found, in which case an exception is thrown while defining the class.
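The computed ordering can be inspected via the __mro__ attribute of a class; a minimal sketch:

class A: pass
class B(A): pass
class C(A): pass
class D(B, C): pass

print(D.__mro__)
# prints (<class '__main__.D'>, <class '__main__.B'>, <class '__main__.C'>, <class '__main__.A'>, <class 'object'>)
# class A appears only once, after both B and C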
Metaclasses
Every object has a reference to its type. But a type is also a class - and has its own reference to a type. A “metaclass” is simply the class that a class object references via __class__. In most cases, classes (eg int, str, or your own custom classes) have the class “type” as their metaclass. However it is also possible to create a subclass of type with special behaviour, and then specify that via “metaclass=” for new classes.
A new class object can be created programmatically by using class type as a function - just like a custom SomeClass is used to create instances of that class. The class keyword in Python source code effectively does just that. Here is an example of creating a new class object (ie a new type) programmatically:
class Parent: pass
# create class in usual implicit manner
class Foo1(Parent):
    i = 1
    def m(self): print("m1")
# create class in more explicit manner
bases = (Parent,) # tuple with one member
d = dict()
d["i"] = 1
d["m"] = lambda self: print("m2")
Foo2 = type("Foo2", bases, d) # using "type" as the metaclass
# and verify behaviour
f1 = Foo1()
print(type(f1))
print(dir(f1))
f1.m()
f2 = Foo2()
print(type(f2))
print(dir(f2))
f2.m()
Custom metaclasses are primarily useful for changing the way the new/init methods work on a class, ie changing the way new instances of a class are created. In many cases, an alternative is to just create a base class with custom new/init methods and inherit from that. However that approach does sometimes change the “inheritance structure” in a way which might not be desirable.
A specific class is an instance of a metaclass, just like a specific string is an instance of the type str. Or looking at this from another angle, class type is a factory for objects which are “classes” - a factory of factories. And just like class str is used to create many instances at runtime, class type is used to create many different class-objects - which are then used to create objects which are “of that type”.
In Python 3, a class specifies its metaclass via a keyword argument in the class statement, eg class Foo(base, metaclass=xyz): ...; the metaclass is the class invoked to transform the “temporary namespace” created by the class body into an actual class object. A class which does not specify a metaclass simply inherits the metaclass of its base classes. (The __metaclass__ class attribute and module-level __metaclass__ variable were Python 2 mechanisms and are ignored in Python 3.) See the official docs for details.
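As a minimal sketch of a custom metaclass (class names invented for illustration), here the metaclass simply logs each class it creates:

class LoggingMeta(type):
    def __new__(mcls, name, bases, namespace):
        print("creating class", name)
        return super().__new__(mcls, name, bases, namespace)

class Widget(metaclass=LoggingMeta):   # prints "creating class Widget" at definition time
    pass

w = Widget()                           # instances are created as usual
print(type(Widget))                    # <class '__main__.LoggingMeta'>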
Interestingly, while an instance object does not inherit methods from the metaclass associated with its class (searching for methods only looks in base-classes), invoking a method directly on a class object searches not only the class and its base classes but also the metaclass (and the metaclass’s own bases).
Note that it is possible to be a good Python programmer without actually understanding metaclasses; it is seldom necessary to write a custom metaclass.
The Difference Between a Class and an Instance
Because classes are objects, and instances of classes are also objects, it can be helpful to look at the differences between them:
Lifecycle:
- Instance objects are created during the program lifetime, as needed
- Instance objects are garbage-collected when their reference-count drops to zero.
- Class objects are (usually) created only when a module is loaded
- Theoretically, class objects can be garbage-collected, but they are usually referenced from modules and modules are rarely unloaded.
Initialisation:
- An instance object is initialised by an init-method on its class (a “constructor”)
- A class object can be initialised in several ways:
  - code in the class body
  - code in the metaclass
  - code in a base-class (method __init_subclass__)
  - a class-decorator
Hierarchy:
- An instance object has exactly one class
- A class object has both a class (its metaclass) and a set of base classes
Lookups:
- When an attribute is queried on an instance object:
  - lookup is done on the object itself
  - when the attribute is not found, lookup is delegated to its associated class
  - when the attribute is still not found on the class object, lookup is delegated to its base classes
- When an attribute is queried on a class object:
  - lookup is done on the object itself and its base classes
  - when the attribute is still not found, lookup is delegated to its class (ie the metaclass)
Slots:
- An instance object can store its data in __slots__ to reduce memory usage when lots of instances exist
- A class could maybe do this, but it would be pretty crazy - and not necessary, as there are relatively few class-objects in a program
Abstract Base Classes
Python’s use of types is a little bit odd for people (like me) coming from other programming languages.
Python deals with object types in three ways:
- Duck Typing
- Declaration-time Inheritance
- Abstract Base Classes
Duck typing is effectively no typing - a function expects that one of its parameters is an object with methods x(a, b) and y(c), and as long as the parameter has those methods then all is good. The “isinstance” function is not useful, as the function simply does not care what actual type the parameter has.
Declaration-time inheritance is the kind of typing that programmers from C++, Java, and similar languages expect. When a class is defined, it declares its ancestor classes. Attributes and methods are inherited from those declared ancestors. The “isinstance” function can be used to determine whether a parameter is of the expected type (or a subtype thereof). The Python interpreter itself never performs any type-checks, but code can do so explicitly - and lots of standard-library functions do [4].
Python’s Abstract Base Class feature (also known as ABC) is a very interesting third approach. Using metaclass abc.ABCMeta (or the convenience base class abc.ABC), a class can be declared as “abstract” - with abstract method declarations too. Then either:
- a class can be created with the abstract class declared as a parent (similar to Java, etc), or
- an existing class can be “registered” with the abstract class to retrospectively make it a subclass of the abstract class
The simple declaration-time subclassing is similar to most other object-oriented languages (Java, etc). Instances of a subclass cannot be created unless all methods declared abstract in the ancestor classes have been overridden with non-abstract implementations.
The register-based approach is more interesting; any class can be passed to the abstract class’s register method, which simply adds it to an internal list of “registered implementing classes” (note that register does not verify that the class actually implements the required methods). Later, when isinstance(obj, someabstractclass) is invoked, someabstractclass just checks whether the object’s class is in its list of registered classes. Note that the registered class is not modified - it does not inherit any attributes or methods from the “abstract base” it was registered with, it just gets added to the “implementing classes list”.
The Abstract Base Class feature is used extensively throughout the Python standard libraries; there are ABCs for Number (all numeric types), and various other useful groupings of classes. Standard library type declarations are often followed by one or more calls to register to link the types to their logical ancestors.
One nice use of ABCs is to create restricted subsets of existing types. For example the standard library might have a base class B with 3 methods, but you want a method that accepts any object that implements at least two of those methods. You can simply declare an ABC with those two methods, and register all relevant implementations of B with your new type; they are guaranteed to be compatible. You can then also provide your own classes that implement the same ABC.
The PEP document for ABCs states (reinterpreted by me) that with respect to types:
- using declaration-time inheritance trees provides “false negatives” - an object might well be suitable for a specific purpose even when it does not inherit from an expected base type
- duck-typing provides “false positives” - an object might NOT be suitable for a specific purpose even when it happens to have a method with the right name and param-count
The ABC approach gives flexibility - a specific abstract type may be required by called code, but the caller has the ability to mark any type as “being compatible” if they wish, regardless of its ancestry.
In Java and similar languages, when you have an object that really does provide the functionality required by a specific interface, but was not statically declared with the appropriate ancestor, then the adapter pattern needs to be applied - ie a trivial wrapper object needs to be created, which is somewhat ugly. With the “register” functionality of abstract base classes, no such adapter is needed; any code asking “is this object really an instance of the expected base type” gets the answer “yes” and can continue in relative confidence that the programmer did indeed pass an object of an appropriate type. Using isinstance to test for types that are not abstract base classes should be avoided, as that does really force the caller to use an adapter or similar workaround. And as with any object-oriented language, excessive use of isinstance suggests that perhaps polymorphism (ie a method with different implementations in different classes) might be a better solution. The author of the abc spec suggests their best use is for “sanity checks”, in order to provide good error messages etc., and that creating new abc types is probably not necessary - the core library types should be enough.
Interestingly, the widely-used Zope web application framework provides an implementation of traditional object-oriented interfaces on top of Python. And the well-known Twisted framework also uses Zope’s interface system. It appears that “duck typing” is not always considered the best solution. However one of the developers of Zope’s interfaces prefers them to Abstract Base Classes.
The namedtuple Function
Function collections.namedtuple(...) takes a “data structure definition” as parameter and returns a class that can act as a factory for objects matching that definition.
Unlike tuples (and like classes) the namedtuple attributes can be referenced by name (tuple.attr). This works because the class generated by function namedtuple contains the “getter” methods needed to support lookup-by-name. The instance objects that the class creates have no __dict__, just an embedded tuple; the instances are therefore very space-efficient.
However because tuples are immutable, the generated class does not provide “setter” methods to update individual attributes. Instead, it provides a number of helper methods, the most useful of which is _replace(keywordargs) which generates a new instance, replacing each attribute specified in the keywordargs with the associated value. Don’t be fooled by the leading underscore - it is not there because the method is “private”, but to avoid collisions with user-chosen field names.
Enums
Most languages provide explicit support for “enumerated types”. Python provides this via a standard-library class; basic usage looks like the following:
from enum import Enum
class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3
Lots of interesting options exist in the enum support; see the docs for details.
Local Variables
Every module has its own namespace (a dict of names to objects), as does every “def”. The function “locals()” returns the namespace dict - which can be useful for looking up variables dynamically. Function locals() is quite often used as a parameter to str.format to elegantly interpolate the values of variables into format-strings:
def foo():
    owner = "charlie"
    dog = "snoopy"
    print("{owner} is the owner of {dog}".format(**locals()))
Function vars is actually a superset of function locals:
- vars(obj) is equivalent to obj.__dict__
- vars() is equivalent to locals(), ie returns the dict for the current stack frame
Formatted string literals (f-strings) were introduced in Python 3.6; these can reference names from the current namespace directly:
print(f"{owner} is the owner of {dog}")
Ternary Expressions
The C programming language supports “ternary expressions” of form a = (boolexpr) ? valwhentrue : valwhenfalse
, and this has been copied by many other languages. Python does not have this exact syntax, but similar effects can be achieved.
If-expressions are of form var = 10 if a else 20
. The order is a little weird (values left and right, with the boolean clause in the middle) but Python often uses weird ordering - another is “import foo” but “from bar import baz”.
An alternative approach is to use chains of and and or:
- in most languages, a or b or c evaluates to either true or false, but in Python it evaluates to one of the values (a, b, c) - either the “first true-like value”, or the last value in the sequence
- similarly, a and b and c evaluates to either the “first false-like value” or the last value in the sequence
Thus:
# x = (a==1) ? "yes" : "no"
x = (a==1) and "yes" or "no"

Note that this and/or idiom breaks when the “true” value is itself false-like (eg an empty string or zero); the if-expression form does not have that problem.
Functional Programming
Standard library modules functools, itertools and operator provide utilities for various functional-style goodies such as:
- map/filter/reduce
- lambda (anonymous functions)
- closures
- partial function application (see the sketch below)
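A small sketch of a couple of these (the data is invented for illustration):

from functools import partial, reduce
import operator

nums = [1, 2, 3, 4, 5]

evens = list(filter(lambda n: n % 2 == 0, nums))   # [2, 4]
doubled = list(map(lambda n: n * 2, nums))         # [2, 4, 6, 8, 10]
total = reduce(operator.add, nums, 0)              # 15

add3 = partial(operator.add, 3)                    # partial function application
print(add3(10))                                    # 13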
Static Typechecking
As noted earlier, Python does have a type-hierarchy, and typehints (Python 3.6) allow variables and parameters to be annotated with specific types.
It is therefore possible to write in a style where everything is annotated, and then use an external typechecker program such as Mypy to check the code for consistency. There are no plans for the CPython interpreter to support typechecking at runtime.
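A small sketch of annotated code that a checker such as Mypy could verify (function names invented for illustration):

def greet(name: str, times: int = 1) -> str:
    return ", ".join([f"hello {name}"] * times)

count: int = 3
message: str = greet("world", count)
# message = greet(42)   # Mypy would flag this; CPython itself would not complain at runtime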
Persistence
Storing data on disk (persistence) is the goal of various features of the Python standard library:
- pickle (serializing Python objects to bytearrays, and deserializing them again)
- shelve (combines pickles with disk-storage)
- dbm (standard API for key/value stores)
- dbapi2/sqlite - a simple relational database bundled with the Python standard library!
The standard way of talking to a relational database from Python is the DBAPI2 api. This is similar to Java’s JDBC - although code is not quite so portable between databases. DBAPI2 is not actually part of the standard library; it is effectively a protocol (ie a documented convention).
There are various third-party ORM (object-relational mapping) libraries that make interacting with SQL databases from Python easier than using the very low-level DBAPI2; the best-known appear to be SQLAlchemy and the Django ORM.
The sqlite embedded database (and matching DBAPI2 driver) is included in the Python standard libraries, meaning basic SQL support is available “out of the box”.
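A minimal sketch of the DBAPI2 style using the bundled sqlite3 module (table and column names invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")           # in-memory database, nothing written to disk
conn.execute("CREATE TABLE pets (name TEXT, species TEXT)")
conn.execute("INSERT INTO pets VALUES (?, ?)", ("snoopy", "dog"))
conn.commit()

cursor = conn.execute("SELECT name FROM pets WHERE species = ?", ("dog",))
for (name,) in cursor:
    print(name)                              # snoopy
conn.close()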
Regular Expressions
The re standard library module provides regular expression support - very similar to Java’s regular expression library. Unlike Perl or Javascript, regular expression support is not built into the language itself.
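A minimal sketch (pattern and input invented for illustration):

import re

pattern = re.compile(r"(\w+)=(\d+)")
m = pattern.search("retries=3;timeout=30")
if m:
    print(m.group(1), m.group(2))                   # retries 3
print(pattern.findall("retries=3;timeout=30"))      # [('retries', '3'), ('timeout', '30')]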
Logging
Python provides a standard logging module for code to emit debug/info/warn/error messages. The module is similar to the various logging libraries for Java (log4j, java.util.logging, etc).
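A minimal sketch of typical usage (the logger name follows the common per-module convention):

import logging

logging.basicConfig(level=logging.INFO)   # simple configuration, fine for small scripts
log = logging.getLogger(__name__)         # one logger per module is the usual convention

log.debug("not shown at INFO level")
log.info("processing started")
log.warning("disk space low: %s%% used", 93)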
Parsing External File Formats
Python’s standard library includes support for reading and writing a wide range of data formats, including:
- Delimiter-separated data with the csv module
- Windows-style .ini files with the configparser module
- JSON data with the json module (a small sketch follows)
- XML (DOM and SAX approaches)
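As a small sketch of one of these, the json module (data invented for illustration):

import json

config = {"name": "snoopy", "retries": 3, "verbose": True}

text = json.dumps(config, indent=2)   # serialize to a JSON string
print(text)

loaded = json.loads(text)             # parse back into Python dicts/lists/values
print(loaded["retries"])              # 3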
For configuration-files, it is also possible to just use a Python file, and use eval to “read it” - though this is only advisable when the file comes from a trusted source.
Context Managers
It is often necessary to run code of form setup/dosomething/cleanup. Ensuring the cleanup is always executed, even in the presence of errors or exceptions, can be tricky. Python’s answer to this is the context manager protocol and the with statement.
Example (using path.open which returns a file object that is a valid context-manager):
with path.open() as file:
    ...
The expression after with should return an object that implements the context manager protocol. This simply requires an __enter__ and an __exit__ method; method __enter__ is invoked at the start of the with-clause, and the returned object is assigned to the “as-variable”. When control leaves the scope of the with-statement for any reason, method __exit__ is invoked.
Sometimes the __enter__ method just returns self, ie the expression following with is both the resource to access, and a context-manager at the same time (as is the case for a file object). In other cases, the context-manager is a “wrapper” around a resource.
This is equivalent to Java’s try-with-resources feature.
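A minimal sketch of a hand-written context manager (class name invented for illustration):

import time

class Timer:
    def __enter__(self):
        self.start = time.monotonic()
        return self                       # assigned to the "as" variable

    def __exit__(self, exc_type, exc_value, traceback):
        print("elapsed:", time.monotonic() - self.start)
        return False                      # returning False means exceptions are not suppressed

with Timer() as t:
    sum(range(1000000))                   # the __exit__ cleanup runs even if this block raises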
Generators
Warning: some of the details in this section regarding the internal implementation of generators are speculation. Given the behaviour of generators, I have tried to guess how they are implemented because I grasp concepts better that way (understanding the underlying implementation) than simply memorizing what features generators provide. If you prefer to just stick to the “what” rather than the “why”, then see the official documentation on generators. And if you know more about how generators work than I do, then feedback/corrections are very welcome!
Normal functions get allocated a namespace when they are called, and that namespace is destroyed when the function returns. Any local variables that function defines are therefore lost at function return; the next call starts with a new namespace. A generator instead has a long-lived namespace; when the yield operator is executed within a function, the caller receives the specified value (like a return-statement) but the current namespace is saved away. The function can later be resumed with the same set of local variables it previously had.
Anything a generator function can do can also be implemented as an object that stores state in its object attributes rather than in local variables - but the code is far less elegant.
When Python is about to create a “function object” as the result of executing a line of source-code starting with def, then it first checks whether operator yield occurs anywhere within the bytecode body of that function; if so then it does not return a normal function object, but instead a generator-constructor function (ie a function that is a factory for generator instances) which wraps that function.
When the generator-constructor-function is invoked for the first time, it just creates a generator object and returns it.
The generator object implements the following methods:
- __iter__() - returns an iterator; actually just returns self, as a generator object *is* an iterator object
- __next__() - the iterator protocol; equivalent to send(None)
- send(value) - causes the generator to run the target fn (on first call), or resume from yield; on the first call, only None is allowed as the sent value
- close() - causes the yield expression within the generator to throw a GeneratorExit exception
- throw(type, ..) - causes the yield expression within the generator to throw the specified exception
The generator object also contains:
- an attribute holding the set of local variables (a dict) - initialised to the parameters passed to the generator-constructor-function
- an attribute holding the “last instruction address” - the offset within the function’s bytecode of the most recent yield executed (initially zero)
The real function definition that the generator wraps is not directly accessible; it can be invoked only indirectly, via the generator methods __next__ or send.
Originally, generators did not support send/close/throw - those features were added later (by PEP 342), in an attempt to support coroutines in Python (as was yield from). However the community eventually reached the conclusion that this was the wrong approach, and instead added keywords async and await to the language; async def produces objects that are similar to generators, but nevertheless independent. As far as I can tell, the use of send/close/throw on generators (and yield from) is effectively deprecated - I will therefore only mention very briefly how they work.
Note that programmers should not usually mess with system names of form “__name__” - here, that means that __iter__ and __next__ should really not be called directly. Built-in functions iter(obj) and next(obj) internally delegate to the underscore-based methods, but are considered more elegant.
Generators as Producers of Data
A generator which uses yield to output data, and expects its users to call __next__ to obtain such data, is a “producer” of values. An example is:
def myrange(first, last):
    curr = first
    while curr <= last:
        yield curr
        curr += 1

# prints values 6,7,8,9,10,11,12
g = myrange(6, 12)
for i in g:
    print(i)
Note that the line for i in g: actually triggers a call to g.__iter__() followed by repeated calls to g.__next__() (that is how for-loops work).
When __next__ is invoked on a generator object, the generator object simply tells the Python interpreter to:
- set the “current namespace” to the saved variables held by the generator object, and
- start executing bytecode at the “last instruction offset”.
On the first call to __next__, this means that execution starts at the beginning of the function, and the only local variables defined are the function-parameters. The function continues until it executes a yield operation, at which point the offset is saved and the value passed to yield is returned as the output of the __next__ call.
On the next call to __next__, the function continues at the bytecode instruction following the yield command - and with all its local variables intact.
A function with a single yield statement is effectively a function with two entry-points: the function start, and the line after the yield statement. It would therefore be possible to write a class whose instances behave similarly - all local vars would need to be self-attributes, and the function body would need an if-statement to select the appropriate “entry-point”. With N yield statements, N+1 entry points are needed - also doable, though somewhat more complex. However the object-based implementation would just not look as elegant as the generator approach, as the sketch below shows.
The most common use-case for generators is to provide “lazy data providers” - data is computed as needed.
An extension of a simple single “lazy data provider” is to build a chain of such generators, in order to create a “pull-based data processing workflow”. Each generator in the chain has a constructor that takes an iterable object representing a “data source”; the first generator receives a real iterator (eg a file-object providing lines from a file) while other generators receive a generator (which is an iterator) as parameter. Each generator function consumes data from its input iterator, and outputs data using yield
. The resulting code is readable, and looks like usual imperative programming logic. However at runtime, as the “main thread” of the application calls next on the last generator in the chain, this triggers a sequence of calls that effectively “pull” a single item of data through the processing chain. This is elegant, particularly when the end consumer may decide to stop processing when it encounters the data it needs. This resembles the default “lazy” behaviour of functions in Haskell.
Generators as Coroutines (deprecated)
The addition of the send method to generators means that a generator function can look like:
def mygencoro(arg1, arg2):
    count = 0
    while True:
        x = yield
        count += 1
        print("{}: {}".format(count, x.upper()))

mygen = mygencoro("val1", "val2")
mygen.send(None)     # starts underlying function, which blocks at the first call to yield
mygen.send("hello")  # resumes from yield
mygen.send("world")  # resumes from yield
mygen.close()
Note the statement x = yield, in contrast to the earlier generator example which contained yield curr.
As usual, the initial call to the generator just saves the params away, and does not invoke the actual underlying function.
The first call to ‘send’ must always pass None as the value; the invoked function starts running at the beginning (rather than resuming from a previous yield) so the send-value is ignored - and thus providing a value is non-sensical and almost certainly a bug. The function then starts at the beginning, and is suspended when it encounters the yield operator, at which point the call to “send” returns.
On the next call to send, x is set to whatever value was sent, and mygencoro continues on the line following the yield.
Note that all processing occurs on a single thread; when “send” is called, the current thread is used to “resume” the generator. And when the generator executes yield then the current thread continues at the point after the send.
Theoretically it is possible for the yield operator here to have an associated expression (like the earlier generator example), in which case the expression becomes the return-value of the call to send. However a tutorial on generators that I found warned about trying to both consume and produce data using yield (ie code of form x = yield {expr}) - it “may make your head explode”.
Methods close and throw cause the function associated with the generator to resume, just like a call to __next__ or send does - but instead of executing the code following yield, an exception is thrown. The function may catch the exception, and perform some cleanup (eg closing an “upstream data source” that the generator reads from). See the official docs for generators for more details.
PEP 342 gives a good description of the reasoning that drove creation of the send and x = yield syntaxes (to extend conventional generators to be useable as “coroutines”):
Coroutines are a natural way of expressing many algorithms, such as simulations, games, asynchronous I/O, and other forms of event-driven programming or co-operative multitasking. Python’s generator functions are almost coroutines – but not quite – in that they allow pausing execution to produce a value, but do not provide for values or exceptions to be passed in when execution resumes.
Early versions of module asyncio then provided “asynchronous programming” APIs based on these generator-based coroutines.
As noted earlier, my current understanding is that this approach is an evolutionary dead-end, and that keywords async/await, together with function asyncio.run() which is provided in the latest versions of module asyncio, should be used for all use-cases that this approach was intended to cover. See the following documents for details:
- PEP 492 - Coroutines with async/await - the currently recommended way to use coroutines
- PEP 342 - Coroutines via enhanced generators - the now-deprecated specification for generator-based coroutines
Async Coroutines
Prefixing a function declaration with async causes a “coroutine constructor function” to be returned; the standard def behaviour is executed, ie a function-object is created and then another function-object is created which wraps the original. This behaviour is similar to how a “generator constructor function” is created to wrap a function-object. However to support generators, Python needs to inspect the bytecode of every function to see whether yield is present; the presence of async marks coroutines clearly (and triggers the appropriate wrapping logic) without need for such “bytecode scanning”.
Invoking the coroutine constructor function creates and returns a coroutine object - without invoking the underlying function. This is similar to invoking a generator constructor function and getting back a generator object.
To actually invoke the underlying (wrapped) function of a coroutine object, the application needs to start an “event loop” via asyncio.run(initial_coroutine) (or run some third-party framework that provides a similar event-loop). That initial coroutine can then invoke other coroutines by:
- using await some_coroutine_obj
- passing the coroutine object to asyncio.create_task(..)
- using async for
Async-based coroutines (ie the async/await keywords) were added to Python in version 3.5. Module asyncio itself dates back to Python 3.4; the simple asyncio.run() entry-point was added in version 3.7.
See below for brief coverage of asynchronous programming with coroutines and this separate article on asynchronous programming with coroutines in Python for more details.
Concurrent Programming
There are many operations that a program can do which will “block” waiting for some operating system operation to complete; file or network IO are typical examples. However a program may well have other things it could do while waiting for that blocking operation to complete. There are three different approaches to doing this:
- os-level threads
- multiprocessing
- user-level “asynchronous programming”
Python’s support for these alternatives is discussed below.
Threading
Threading allows code to be written in a linear style. Calls to operating-system level functions that may block can just be written in the obvious way; the code will indeed block (be suspended) until the operation completes.
Obviously, while a thread is blocked, it is making no progress. To make an application efficiently use the CPU or CPUs, while still allowing code to perform blocking operations, the application must start multiple threads. Starting threads is easy; the hard part is safely exchanging data between threads.
Threads allow a correctly-designed program to take advantage of all the CPUs on a host system. Threading also provides “preemption”, ie makes sure that each (runnable) thread gets a fair share of the CPU.
However threads also have some problems:
- each thread has significant overhead related to memory usage (stack)
- each thread has overhead related to cpu-usage (context-switching)
- setting up a thread has significant overhead, ie it is not advisable to start and stop threads rapidly
- safely exchanging data between code in different threads requires synchronization (mutexes, etc).
Sometimes other approaches end up being more efficient (particularly when only one physical CPU exists) - eg asyncio.
The Python standard library provides functions for creating threads, and provides various “synchronization primitives” for exchanging data between threads (a threadsafe queue, mutexes, etc).
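A minimal sketch of a worker thread and a threadsafe queue (names invented for illustration):

import threading
import queue

tasks = queue.Queue()

def worker():
    while True:
        item = tasks.get()       # blocks until an item is available
        if item is None:         # sentinel value means "shut down"
            break
        print("processing", item)
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

for i in range(3):
    tasks.put(i)
tasks.put(None)                  # tell the worker to stop
t.join()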
However there is a problem with threads that is specific to CPython - the Python interpreter itself is not threadsafe. Any time a thread tries to perform an operation such as looking up a variable, it must first become owner of the global interpreter lock (GIL). In practice, this means that when multiple threads try to run Python code at the same time, all but one of them will be blocked waiting for the GIL. The result is that the application uses at most one CPU at a time. The threaded program is still faster than a non-threaded one, because while one thread is waiting for a blocking operation to complete, another thread can be using the CPU. But really scalable, it is not.
When a Python thread calls into a native-code library, and that library is doing CPU-intensive work that does not involve calling back into the Python interpreter, then threading can be efficient. An example is the Numpy numeric library; multiple threads that call from Python into the numpy native code can make good use of multiple CPUs.
Multiprocessing
Given the problems with threads, a good option for scalable processing of data is sometimes to just start multiple operating-system-level processes which communicate with each other over sockets or similar. The startup overhead is of course high, but as each process has its own complete Python interpreter, the GIL is no longer a limiting factor.
The Python standard library includes some features that makes building multiprocessing applications a little easier; see:
- module multiprocessing
- class concurrent.futures.ProcessPoolExecutor – provides an interesting API for distributing workload across a pool of processes rather than threads (see the sketch below)
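A minimal sketch of the ProcessPoolExecutor API (the workload function is invented for illustration):

from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i * i for i in range(n))   # each call runs in a separate process, with its own GIL

if __name__ == "__main__":                # guard needed when worker processes are spawned
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_heavy, [10**5, 10**6, 10**7]))
    print(results)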
PEP 554 (currently in draft) is proposing an interesting variant on multiprocessing - starting multiple independent Python interpreter instances within the same operating-system process. Each such interpreter has its own GIL, and so there is no competition for the lock. Of course, Python code in one interpreter cannot see variables in another interpreter, but the proposal also includes a mechanism for inter-interpreter communication.
Asynchronous Programming with Coroutines
The goal of asynchronous programming is not to run Python code in multiple threads, but instead to make the best possible use of a single thread. Because multiple os-level threads are not involved, the CPython GIL is not a limiting factor when using async programming.
Async programming is quite useful for a number of common problems, including:
- implementing an http server which needs to handle multiple concurrent http requests
- implementing a message-broker server which needs to handle multiple concurrent client requests
- handling incoming requests which trigger calls to a remote database
As noted previously, async code is still limited to 1 CPU - it just makes better use of that CPU than blocking-style programming. However when a system has N cpus, then spawning N instances of the Python app and using async programming within each instance is an effective way to get good performance.
Asynchronous programming requires code be broken up into chunks of non-blocking code joined together by blocking operations. A “scheduler” (event loop) then runs each chunk; when a blocking operation is reached it is handed off - typically to the operating system via non-blocking IO, or occasionally to a pool of background threads - and the scheduler thread switches to running a different chunk which is not waiting for any blocking operations. When a pending operation completes, any chunks that are waiting for it are marked as “runnable” and will (eventually) be executed by the scheduler thread.
Support for asynchronous programming in Python has gone through four major revisions:
- using callback functions (not coroutines)
- using coroutines based on generators with generator.send
- improved coroutines based on generators with yield from
- a new implementation of coroutines based on new keywords async/await
Module asyncio has provided support for all these approaches, as have various third-party libraries.
Hopefully the most recent approach based on keyword async will be a long-term solution; the chances are good as the async approach has been copied from other languages where it has been successful (eg in node.js). Keywords async/await were added in Python 3.5; function asyncio.run() was added in Python 3.7.
See this separate article on asynchronous programming with coroutines in Python for more details.
Garbage Collection
CPython reclaims no-longer-used memory using reference-counting combined with cycle-detection. In practice this means that objects are destroyed immediately after their last reference goes out of scope - unlike systems with mark-and-sweep garbage collection, where destruction time is indeterminate. Nevertheless, relying on prompt destruction of objects is a bad idea, as it is not guaranteed in the Python language - and other Python implementations (eg Jython) do use mark-and-sweep.
Unit Testing
The Python standard library includes a unit-testing framework similar to JUnit/NUnit. The usual convention is for a directory containing Python modules (ie a package) to have a subdirectory named tests containing unit-test files with names of form test_*.py, ie the directory structure looks like:
mypackage/
    __init__.py
    __main__.py
    mymodule1.py
    mymodule2.py
    tests/
        test_mymodule1.py
        etc
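A minimal sketch of such a test module using the standard unittest framework (module and function names invented for illustration, assuming mymodule1 provides an add() function):

# tests/test_mymodule1.py
import unittest
from mypackage import mymodule1    # hypothetical module under test

class TestAdd(unittest.TestCase):
    def test_add_positive(self):
        self.assertEqual(mymodule1.add(2, 3), 5)

    def test_add_rejects_strings(self):
        with self.assertRaises(TypeError):
            mymodule1.add("a", 3)

if __name__ == "__main__":
    unittest.main()    # or run via: python -m unittest discover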
The standard library also includes an interesting framework doctest which looks for unit-tests in doc-strings; there is a standard format which makes such code look like examples for documentation purposes while also being executable/validatable. This feature is not really meant for testing, but rather for ensuring that example code does not get out-of-date.
There are a few third-party libraries for unit-testing which are popular with the Python community.
Monkeypatching
This term refers to dynamically adding/removing/replacing methods on classes at runtime.
Methods are normally declared once when the class is created:
class Foo:
    def method1(self, *args): ...
However due to Python’s dynamic nature, it is possible to add extra methods:
Foo.method2 = lambda self, val: ...
or to replace methods with alternative implementations:
Foo.method1 = lambda self, *args: ...
Monkeypatching can be useful for testing.
It can also sometimes be useful to force existing libraries to do things they weren’t designed to do. This can be good (flexible) and horrible (very difficult to debug and understand code) at the same time.
Virtual Environments (venv)
When running (or developing) multiple different Python applications on the same machine, it is very difficult to manage a single global pool of third-party Python libraries. Different applications depend on different (and sometimes mutually incompatible) versions of the same libraries. Linux distributions (eg RedHat, Debian) do manage to define a base set of Python libraries, and applications that use those libraries, but it requires significant effort. As soon as you wish to run an arbitrary Python app that is not part of the standard packages for that distribution, conflicts can occur.
The solution is venv: a command that sets up an isolated pool of libraries. A command-shell simply needs to specify which pool is “currently active” and all Python code run in that environment then sees only libraries in that pool.
Module venv is implemented in the Python standard libraries (since v3.4). There were previously other similar commands, but I won’t talk about historical solutions here.
Python has two primary locations it looks in for modules: the standard library dir, and the site-packages dir. Anything installed via setuptools, pip, rpm/dpkg, etc. goes into a site-packages directory.
When command “{python} -m venv {dirname}” is executed, the venv module creates the target directory, and fills it with a small number of files:
- bin/
  - activate – a shellscript that “activates this virtual environment”
  - python3 – a symlink to the Python interpreter that was run
  - pip – a simple script that runs pip
  - and a few other commands
- lib/
  - {pythonversion}/
    - site-packages/
- pyvenv.cfg
It then automatically runs “python -m ensurepip” which results in pip and setuptools being installed into site-packages; see later.
Running “source {venvdir}/bin/activate” modifies environment-variables in the current shell; primarily it inserts directory {venvdir}/bin into the front of environment variable $PATH so that running python3 then uses the symlink from that virtual environment.
That’s about all that needs to occur to create (and enable) a virtual environment.
Standard library file site.py (eg /usr/lib/python3.6/site.py on my Ubuntu system) has the necessary logic to support using the “virtual environments” that venv creates. When Python is started, site.py is automatically loaded (see comments in the file). At this point, Python has already defined sys.path (the module search path) to include the standard libraries, and site.py then prepends additional locations to it. The algorithm is (simplified):
- determine the path through which Python was started
- look in the parent directory for a file named “pyvenv.cfg”
- if found, prepend subdir lib/{python-version}/site-packages to sys.path
When no venv is active, then command python3 is a symlink in some global location (eg /usr/bin/python3) and the corresponding (global) site-packages directory for that Python installation is used.
When a venv is active, then python3 is a symlink in some venv directory-structure (see earlier), and the corresponding site-packages dir is used instead.
When packages are installed via pip, it uses the same method to determine where to write its packages - ie when a venv is active then “pip install ..” puts files in the site-packages directory for the active environment. When no virtual environment is active, files are written into the site-packages directory associated with the “global” pip install (which might require sudo to write into..).
When packages are installed via tools like apt install python3-{modulename} then they simply go into the global site-packages directory.
Packaging and Distributing Python Code
As users of code, we want to download libraries or applications written by others. As developers of code, we want to be able to bundle our code for installation in production environments, or for others to use.
This topic is too complex to include in this (already large) article; I have therefore summarized what I have learned about this topic in a separate article.
IDEs
I have been using Jetbrains Intellij IDEA (licensed) with the Python plugin as an IDE, and it works great. Of course, it is a bit ..odd.. to install a Java-based IDE to develop Python if you are not using IDEA for other reasons.
The Python Guide page on IDEs is a good source of info on alternatives.
Installing Python Manually
The easiest way to install Python is of course to use your Linux distribution’s packages, eg “apt install python3”. However sometimes:
- you are not admin on the system, or
- the version of python provided by the distribution is not what you want (too old or too new)
The options I am aware of are building Python from source, or using the conda (miniconda) installer; both are described below.
Note that Ubuntu 18.04 LTS comes with python3.6 by default (and python3 is an alias for python3.6). However python3.7 is available if desired; sudo apt install python3.7 will make command python3.7 available - although python3 remains an alias for python3.6, as other Python software on that distribution version expects the slightly older version.
To install from source, you will need a c-compiler available, and some time.
The conda approach instead downloads prebuilt binaries from the conda project. Note that conda is more than just a provider of binary builds of Python - it also:
- hosts its own builds of most of the Pypi package archive
- includes its own package-manager conda (ie an alternative to pip)
However after installing Python using conda, you can use pip to install packages instead of the conda package manager.
To quote the conda docs:
If you have used pip and virtualenv in the past, you can use conda to perform all of the same operations. Pip is a package manager, and virtualenv is an environment manager. Conda is both.
Both the from-source and from-conda installation approaches just write into your local user directory, and are reasonably easy to remove if desired. The conda solution optionally modifies your user’s path, thus redirecting command python3 to the conda-managed installation rather than the system default installation - but as an alternative, just choose not to modify bashrc during install, then source ~/miniconda3/bin/activate when you want to use the conda Python install.
Other
The Python type int is not limited to 64 bits. Instead, the wrapper instance dynamically changes the representation of the data depending on its size - 64 bits if the value fits, otherwise a more complex structure. This change in representation happens automatically.
Python has two distinct non-integer number implementations: float (64-bit native representation) and the library type decimal.Decimal. The type float does NOT work like the integer type (there is no automatic expansion of the number of bits available).
The easiest way to copy a list is with list(src) where src is any iterable (including a list). Similarly, dict(src). These are shallow-copies; for deep copies use standard library module copy.
Keyword del removes a binding: del obj.attr invokes __delattr__ on the object (which may in turn call a descriptor’s __delete__ method), while a bare del name simply removes the name from the current namespace dict (thus removing a reference to the object).
A list-comprehension is just a short-hand for a (filter, map) pair of operations. Similiarly for dict-comprehensions.
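A small sketch of that equivalence (data invented for illustration):

nums = [3, -1, 4, -1, 5]

squares_of_positives = [n * n for n in nums if n > 0]
# equivalent filter/map spelling:
same_thing = list(map(lambda n: n * n, filter(lambda n: n > 0, nums)))

print(squares_of_positives)   # [9, 16, 25]
print(same_thing)             # [9, 16, 25]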
Flow-control structures for, while, and try all support an optional else: clause. They all work the same, if somewhat counter-intuitively: if the “main clause” runs to completion, then the else-clause is executed. If the main clause terminates in an unusual way (eg via the break or return keyword, or an exception) then the else is not executed. This is actually quite useful; see the example below.
The CPython interpreter can run in various modes:
- “standard”
- -O - discards asserts from the code
- -OO - discards asserts and docstrings from the code
Python has inbuilt support for the “assert” keyword.
Function id(obj) returns an identifier which is unique for each object during its lifetime (in CPython it is based on the object’s address in memory).
Two adjacent strings are automatically concatenated. This is useful when a long string literal needs to be split across multiple lines, and the triple-quote format is for some reason not desired.
Python does not support “function overloading”, ie treating two methods with the same name but different signatures as different methods. Python’s positional-args, keyword-args, and default values for params allow a lot of flexibility that in most cases makes “function overloading” unnecessary. Decorator @functools.singledispatch does provide a way to do basic function-overloading if you really want it.
It is possible to put multiple Python statements on the same line by separating them with semicolons, eg: msg="hello"; print(msg, "world"). All statements are considered to be “in the same suite”, ie effectively at the same indentation.
Module array provides a way of storing sequences of primitive values (numbers, characters) in a more memory-dense format than a Python list.
Python has no switch/case statement - use a chain of if .. elif .. elif .. else instead.
All functions return a value; “falling off the end of a function” effectively returns None. There is no explicit check that every path in a function returns the same datatype.
Python does not support Tail Call Optimisation.
References
- The Python Language Reference – the official language specification
- Real Python: modules and packages
- Python Packaging
- StackOverflow: What Good are Python Function Annotations?
- PEP 8: Style Guide for Python - also contains some good “best practice” hints
- Stackoverflow: what are metaclasses in Python?
- Fluent Python, Ramalho, O’Reilly, 2015 - good coverage of moderately advanced topics such as abc, coroutines (yield-based) and asyncio (all discussed above)
Footnotes
- Python tends to use exceptions for “flow control” as well as error-signalling, something that other languages warn against. As an example, iterator implementations throw an exception to terminate iteration. ↩
- For Python versions earlier than 3.4, the __init__.py file is mandatory for packages; from v3.4 onwards it is optional. ↩
- Instances of types like int and float do not have a __dict__, instead embedding the value directly into their in-memory struct. A type like str also does not have a __dict__, although the actual string data is not in-line. ↩
Although Python promotes “duck typing” over “static typing”, the fact that the standard libraries are full of isinstance checks perhaps shows that static types are not so useless after all.. ↩