Categories: Java, Programming
Here you can find an algorithm for taking a sequence of tokens of form “term OP term OP term OP …” and evaluating it in memory, where some operators (OP) bind more tightly (have higher precedence) than others. While the solution isn’t terribly complicated, I didn’t find it trivial to build an efficient solution either, so I’m documenting it here in case I run up against this myself in a few years time - and maybe you’ll find it useful or interesting too.
The actual problem I needed to deal with was evaluating an SQL where-clause in memory, but the solution is fairly generic. In my case, the “term” objects were either simple tests of form “id = 1” or “name like foo%” or “length is not null”, or a nested term eg “(id=1 or id=2)”. The OP items were simply AND and OR.
In SQL, AND binds more tightly than OR, so “a or b and c” must be evaluated as “a or (b and c)”. As an additional twist, boolean operators have “short circuit” behaviour; “true or X” is always true and “false and X” is always false without having to evaluate X.
I suspect this approach also works for evaluating things like arithmetic expressions such as “1 + 4 / 2 - (5 + 3)”.
The solution was developed in Java, but the same approach could be used in many languages.
Note that if the input datastructure is a tree whose structure already models the precedence of the operators (ie a standard syntax tree), then this kind of processing isn’t needed; walking the tree is sufficient. And if the input is a list in postfix (reverse polish) format, then evaluation is also trivial. The code described here is for the case where the input is a simple ‘infix’ format that does not directly reflect the precedence of the operators it contains.
A general description
A stack keeps track of the terms found so far. A separate stack keeps track of “unreduced” operators encountered.
When a new operator is processed, the top operator on the stack is evaluated for as long as its precedence is higher than or equal to the new operator. The two top terms on the term-stack are popped and used as the LHS and RHS of the operator popped from the stack, and the result is pushed back onto the stack. The new operator is then also pushed onto the stack.
Encountering end-of-input is equivalent to reading an operator with the lowest possible priority - which ensures that all operators remaining on the stack are evaluated, and the “term stack” ends up being reduced to holding a single item - which is the final output.
With this approach, the maximum depth of the operator stack is equal to the number of different operator precedence levels - just two in the case where OP is limited to AND and OR. The term stack max depth is limited to twice this value (in the case where all operators are binary, eg AND and OR).
Performing Short-Circuit evaluation
In my particular use-case, the “term” parts were already represented as nodes in a linked list. Therefore a reference to the node could simply be pushed onto the term stack - without actually evaluating the term at that time. When an AND operator is applied, it is necessary to actually evaluate the LHS term. However if the result of the LHS term is FALSE then a trivial term-node representing the literal FALSE value pushed back onto the term stack as the “result” - without bothering to evaluate the RHS term at all. Similarly, when the operator is OR and the LHS evaluates to TRUE then a literal TRUE can be pushed onto the term stack without needing to evaluate the RHS.
Avoiding evaluating the RHS term of an operator can be a significant saving; it may be a “compound term” such as (id=1 or name like a% or …), and the whole thing can simply be skipped over.
The Code
You can find a complete class implementing this here: