Java Serialization and Synthetic Methods

Categories: Java


Introduction

In this article I describe why manually setting serialVersionUID is generally a bad idea, and why the JVM should instead be allowed to compute the value itself.

In my opinion, this is still a sound approach. However, a few projects may need the same source code built with different compilers to be serialization-compatible; in this case a small subset of classes may need explicitly defined serialVersionUID values. The rest of this article explains why, and how to identify these classes.

The Problem

The javadoc for class java.io.Serializable states the following:

.. the default serialVersionUID computation is highly sensitive to class details that may vary depending on compiler implementations, and can thus result in unexpected InvalidClassExceptions during deserialization. Therefore, to guarantee a consistent serialVersionUID value across different java compiler implementations, a serializable class must declare an explicit serialVersionUID value.

Note that the problem is not related to different JVMs at runtime; this kind of incompatibility is instead triggered by different compilers at compile time, together with an odd decision by the designers of the Java serialization specification.

Normally, this is not a problem: it is not common for the same source code to be compiled by different compilers, and for objects instantiated from the resulting binaries then to be exchanged via Java serialization. It is far more common for a software producer (commercial or open source) to generate a binary jarfile which both client and server applications then use. However, there are a few cases where this can happen; examples are:

  • when a developer compiles a server application with one tool (eg javac via maven) and compiles a client application with another tool (eg the Eclipse IDE).

  • when somebody applies a bugfix to a software version, compiles a jarfile from it, and then attempts to communicate with a non-patched (ie original) version. If the person who generated the patched jar has used a different compiler, then problems can occur.

  • when the binary release of the latest version of a library is compiled with a newer compiler (or possibly a totally different compiler) than the previous version.

What actually happens

The Java Serialization specification tightly defines the algorithm used to compute a serialVersionUID. And, strangely, the algorithm includes the name, modifiers and descriptor of every non-private method on the serializable class.

The specification doesn’t explicitly state whether “synthetic” methods (ie methods added automatically by a compiler) should be included, but in practice the Oracle (formerly Sun) JVM has always included them.

The result of these decisions is that if a compiler decides (quite reasonably) to automatically add some “synthetic” methods to the class for its own personal reasons, then the names of these methods affect the serialVersionUID output. And sadly, different compilers generate different methods (or just choose different names). As the javadoc quote above states, even different versions of compilers from the same organisation can theoretically output different synthetic methods.
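The value that serialization would actually use for a class can be inspected at runtime, which is handy when comparing the output of two compilers. The following is a minimal sketch using java.io.ObjectStreamClass; the class names SerialUidDemo, Point and PinnedPoint are hypothetical and exist only for illustration.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class SerialUidDemo {

    // Hypothetical example class: no serialVersionUID is declared, so the
    // runtime computes a default value from the structure of the class.
    static class Point implements Serializable {
        int x;
        int y;
        int sum() { return x + y; }
    }

    // Hypothetical example class with an explicitly declared value.
    static class PinnedPoint implements Serializable {
        private static final long serialVersionUID = 42L;
        int x;
        int y;
    }

    // Returns the serialVersionUID that serialization uses for the class:
    // the declared value if there is one, otherwise the computed default.
    static long uidOf(Class<?> cls) {
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls);
        if (desc == null) {
            // lookup() returns null for classes that are not serializable
            throw new IllegalArgumentException(cls.getName() + " is not serializable");
        }
        return desc.getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("Point:       " + uidOf(Point.class));
        System.out.println("PinnedPoint: " + uidOf(PinnedPoint.class));
    }
}
```

Running the same program against classfiles produced by two different compilers and comparing the printed values for the class without a declared serialVersionUID is one way to see whether that class is affected.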

When a compiler does generate a method automatically (ie adds a method which was never in the source-code), it sets a special flag in the generated bytecode to mark that method as “synthetic”. This can be seen at runtime via java.lang.reflect.Method.isSynthetic(). It also usually uses a ‘$’ character as part of the method name (the Java Language Specification recommends that this character be used only in mechanically generated code); this shows up clearly in the output of the javap command. The ‘$’ is not universal, though: the bridge methods generated for generic types are flagged as synthetic but keep the name of the method they bridge to.
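The synthetic flag described above can be read with plain reflection. The sketch below lists the synthetic methods declared directly on a class; SyntheticMethodLister, Plain and Ordered are hypothetical names used only for this example. Ordered is included because implementing a generic interface makes the compiler add a synthetic bridge method, which gives a case that does not depend on inner-class tricks.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SyntheticMethodLister {

    // Returns the sorted names of all compiler-generated methods that are
    // declared directly on cls (inherited methods are not inspected).
    static List<String> syntheticMethodNames(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (m.isSynthetic()) {
                names.add(m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    // Hypothetical class containing only a hand-written method.
    static class Plain {
        int twice(int n) { return 2 * n; }
    }

    // Hypothetical class for which the compiler adds a bridge method
    // compareTo(Object) that delegates to compareTo(Ordered).
    static class Ordered implements Comparable<Ordered> {
        int rank;
        @Override
        public int compareTo(Ordered other) { return Integer.compare(rank, other.rank); }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        System.out.println("Plain:   " + syntheticMethodNames(Plain.class));
        System.out.println("Ordered: " + syntheticMethodNames(Ordered.class));
        // Any further fully-qualified class names on the command line are checked too.
        for (String name : args) {
            System.out.println(name + ": " + syntheticMethodNames(Class.forName(name)));
        }
    }
}
```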

Why would synthetic methods be needed?

According to the Java language specification, a class is permitted to declare a nested inner class, and then access private properties or methods of that inner class from within methods of the outer (containing) class. However, the JVM doesn’t support this! To the JVM, an inner class is simply a normal class with a name like Outer$Inner. And no class can ever access private methods or properties of another class. To work around this mismatch between language and runtime, a Java compiler must add package-scoped “helper” methods to the inner class and rewrite the code that accesses private properties or methods to instead call the helper methods.

A compiler may also choose to add methods for performance optimisation reasons. When a method contains a long switch-statement, a compiler may choose to generate a static “lookup table” which is then indexed by the variable being switched on. This table must then somehow be made accessible to the method in which the switch statement exists; one solution is to add a synthetic method which returns the table.

A few other source-code structures may cause a compiler to emit new methods into the generated bytecode; in a large application I have found that about 1% of serializable classes are affected.

UPDATE: Java 11 adds the concept of “nests” (JEP 181), which makes it possible for an inner class to access private properties and methods of its enclosing class without needing to go via a synthetic helper method added by the compiler. Nests are a bytecode-level feature, not a source-level feature - ie a compiler can mark outer/inner classes as belonging to the same “nest”.
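The private-access case can be reproduced with a few lines. In this sketch (NestedAccessDemo and Reader are hypothetical names) an inner class reads a private field of its serializable enclosing class, and the program then prints whatever synthetic methods the compiler added to the enclosing class. The result is deliberately not stated here because it depends on the compiler and target version: older javac releases typically emit a helper named along the lines of access$000, while a compiler targeting Java 11 or later can rely on nests and emit none.

```java
import java.io.Serializable;
import java.lang.reflect.Method;

class NestedAccessDemo implements Serializable {

    // Private to the enclosing class, yet readable from the inner class
    // as far as the Java language is concerned.
    private int secret = 7;

    // Inner class that reads the private field of its enclosing class.
    class Reader {
        int read() { return secret; }
    }

    int readViaInner() {
        return new Reader().read();
    }

    public static void main(String[] args) {
        System.out.println("value read: " + new NestedAccessDemo().readViaInner());
        // Print any helper methods the compiler added to make the access work.
        for (Method m : NestedAccessDemo.class.getDeclaredMethods()) {
            if (m.isSynthetic()) {
                System.out.println("synthetic: " + m);
            }
        }
    }
}
```

Compiling this once with an old target and once with a current one, and comparing the output, shows the kind of difference that changes a default serialVersionUID.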

Why does the serialization algorithm include synthetic method names when computing the serialVersionUID?

If anyone has a good answer to this question, I’d very much like to hear it.

Changing the Java specification so that method names are not included at all would seem like a major improvement to me. At the very least, excluding methods marked as “synthetic” would be sensible and do absolutely no harm at all. However it is unlikely to ever happen, as changing the algorithm used by a JVM version will break serialization compatibility with any other JVM that still uses the old algorithm. Possibly the ObjectOutputStream could take a ‘serialVersionUID algorithm version’ parameter, so that at least applications that know the receiver is running on a new enough JVM could opt in to the newer approach. This is not likely to get to the head of the JVM team’s feature list in the near future though.

Workarounds

The approach I’ve used is to create custom ObjectOutputStream/ObjectInputStream classes that override ObjectOutputStream.annotateClass and ObjectInputStream.resolveClass to throw an exception if the class being written or read:

  • has one or more synthetic methods, and
  • has no explicit serialVersionUID

Then running the integration-test suite points out any problems; such classes (sadly) are updated with manually-assigned serialVersionUID values.
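The write side of such a check might look like the sketch below. This is an illustration of the idea rather than the code used in the project; StrictObjectOutputStream, Unpinned, Pinned and tryWrite are hypothetical names. One assumption goes beyond the two conditions listed above: private synthetic methods are ignored, because private methods do not take part in the default serialVersionUID computation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class StrictObjectOutputStream extends ObjectOutputStream {

    StrictObjectOutputStream(OutputStream out) throws IOException {
        super(out);
    }

    // Called by ObjectOutputStream when it writes the descriptor of a class.
    @Override
    protected void annotateClass(Class<?> cls) throws IOException {
        if (isRisky(cls)) {
            throw new InvalidClassException(cls.getName(),
                    "has synthetic methods but no explicit serialVersionUID");
        }
        super.annotateClass(cls);
    }

    static boolean isRisky(Class<?> cls) {
        return hasSyntheticMethod(cls) && !declaresSerialVersionUID(cls);
    }

    static boolean hasSyntheticMethod(Class<?> cls) {
        for (Method m : cls.getDeclaredMethods()) {
            // Assumption: private methods are skipped, as they are excluded
            // from the default serialVersionUID computation.
            if (m.isSynthetic() && !Modifier.isPrivate(m.getModifiers())) {
                return true;
            }
        }
        return false;
    }

    static boolean declaresSerialVersionUID(Class<?> cls) {
        try {
            Field f = cls.getDeclaredField("serialVersionUID");
            int mods = f.getModifiers();
            return Modifier.isStatic(mods) && Modifier.isFinal(mods)
                    && f.getType() == long.class;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    // Hypothetical test class: the generic interface gives it a synthetic
    // bridge method, and it declares no serialVersionUID.
    static class Unpinned implements Serializable, Comparable<Unpinned> {
        int value;
        @Override
        public int compareTo(Unpinned o) { return Integer.compare(value, o.value); }
    }

    // Same shape, but with an explicit serialVersionUID.
    static class Pinned implements Serializable, Comparable<Pinned> {
        private static final long serialVersionUID = 1L;
        int value;
        @Override
        public int compareTo(Pinned o) { return Integer.compare(value, o.value); }
    }

    // Convenience helper for the demo: true if the object passes the check.
    static boolean tryWrite(Object obj) {
        try (ObjectOutputStream out =
                     new StrictObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Unpinned accepted: " + tryWrite(new Unpinned()));
        System.out.println("Pinned accepted:   " + tryWrite(new Pinned()));
    }
}
```

The read side could apply the same isRisky test in an ObjectInputStream subclass, inside an overridden resolveClass, to the class returned by super.resolveClass.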

Note that it isn’t necessary to test libraries compiled with different compilers against each other; the problem can be detected simply by looking for synthetic methods on any serializable class. Possibly this could even be detected statically (ie by direct analysis of the generated classfiles) but it requires finding every class which has a Serializable type in its ancestry.

Summary

The original implementers of Java’s Serialization have unfortunately made life rather difficult for developers. They made an incorrect design decision (to include all method names in the serialVersionUID calculation) and then stated that the solution is to manually maintain serialVersionUID values on classes - a task that is impossible for any real-world project. Yay.

However it is possible to detect the 1% of problem classes and add explicit serialVersionUID values on only those classes, with the remaining classes using default serialVersionUIDs (the sane approach for most projects).

See section 4.6 of the Java Object Serialization Specification: all methods except private methods are included as part of the serialVersionUID calculation.

References