Next: Probabilistic de Bruijn Up: The calculus of Previous: Factors for Rule

A geometrical representation

Regular expressions represent sequences of symbols---letters taken from some alphabet. When the sequence is relatively short, the sequence is readily perceived and its properties may be examined. The notation is especially nice for summarizing cycles which can be repeated an arbitrary number of times, since a single star signifies this infinite collection of alternatives. When the regular expression is complicated, and particularly when it subsumes several distinct cycles at the same time, it rapidly becomes more complicated, both to write down completely and to comprehend once it has been written.

Although difficult of typography, the diagram representing a regular expression is preferable for expressions of moderate complexity, although eventually the many crossing lines and increased density of nodes take their toll, and the figure can no longer be comprehended.

Another representation has some advantages, although it is typographically even more difficult to deal with. To begin with, suppose that the symbols are chosen to be consecutive numbers, just as we are accustomed to assuming when dealing with automata. If a decimal point were to be placed in front of a finite sequence, it could be interpreted as a number with a positional representation in base k if there were k symbols running from 0 to Finite sequences would be finite decimals, infinite sequences could be interpreted after the fashion of real numbers. The principal difference is that in base 10 and are ordinarily considered to be the same real number; but have to be understood to be distinct regular expressions--- in the first case and in the second.

The advantage of this representation is that every regular expression, no matter how complicated, is a point in the interval and so is represented in a finite region of space. The counterbalancing penalty is that this region can be filled more densely than is convenient, often too densely for a clear presentation, since the individual points are always fairly bulky and can never be presented visually with the mathematical precision of occupying zero area.

We are interested in regular expressions describing configurations of cellular automata, but these are sequences infinite in both directions. To capture both directions in such a string, break it at some point, then write the left hand side backwards. Inserting a figurative decimal point as before now produces two intervals of unit length; the natural thing is to treat them as two coordinates generating a unit square. The vertical, or y-coordinate represents the left half of the sequence, the horizontal or x-coordinate represents the right half.

Shifting, or changing the decision as to where to break the sequence, will naturally change the point which represents the sequence; in any event it is a geometric operation whose general nature can be described. Indeed this is a fundamental operation on sequences: a right shift splits the unit square into k horizontal strips and reassembles them into k vertical strips after stretching them in one direction and compressing them in the other.

The numerical value associated with each sequence defines a distance, and consequently the distance between two sequences is just the sequence of differences. At this point the difference between this representation and that of the real numbers manifests itself; although different terms are included in the sum in the normal way, there are no carries from one term to another and so there is no confusion that a number may have two different representations as a repeating decimal.

In mathematical terms, suppose that there are two sequences

Their distance is defined as

The distance is small, or the sequences are close, when their central regions agree; the closer they are, the longer the matching core must be.

To have an idea of how this representation works, let us consider Rule 18 for a moment. We have seen that the sequence 111, amongst others, is excluded from the second generation. If we divide the interval into eighths, any sequence beginning with .111 will occupy some position in the last eighth of this interval. This interval will include sequences of the form , but sequences of the form will occupy the last eighth of the interval . Recursively, the last eighth of any binary subdivision of the unit interval is excluded. Similar comments apply to sequences extending to the left, so that as the unit square represents various doubly infinite sequences the excluded sequences will come to resemble a Scotch plaid.

Additionally it is necessary to consider sequences whose central point lies amongst the three excluded 1's. For example 1.11 will refer to points within the upper half of the square, but in the rightmost quarter; 11.1 to the upper quarter but right half. These regions have to be added to the plaid, as well as the further subdivisions implied by further decimals. Thus the same basic pattern is repeated recursively in each quadrant, 16-ant, and so on of the unit square.

Sequences of three 1's are not the only configuration excluded from the second generation, since it is also impossible to encounter a single 1 in a sequence bounded by pairs of 1's. Taking into account all of the excluded sequences simply gives an even finer structure to the plaid. In general the plaid is a two dimensional version of the classical Cantor set, which is the residue remaining after the middle thirds of the unit interval and surviving subintervals have been removed. In arithmetic terms, the Cantor set is the set of all decimals in the unit interval which contain no 1's when they have been displayed with respect to the base 3; which is one of the simplest possible examples of an excluded sequence.

The plaid can be used as the starting point for further discussions. For example the third generation must occupy a subset of the permitted portion of the second generation plaid, and so on as the automaton evolves through further generations; it is even meaningful to think of a limiting set which consists of just those points, if any, which remain after an infinite number of generations. The limiting set will include all the points representing cycles of whatever period, as well as gliders, fuses, and a variety of structures which may have no finite or cyclical representation.

Metrizing regular expressions, and in general, has the advantage of creating a topological space, to which the whole machinery of analysis may be applied. As a compact topological space, even stronger conclusions, such as the necessary existence of certain limits, can often be drawn. An extensive literature concerns the application of these ideas to probability and symbolical dynamical systems; but there is no reason that it cannot be broadened to include regular expressions as well. In particular, the interaction between probability measures and linear cellular automata can be studied through this model, to the mutual enlightenment of both disciplines.

Next: Probabilistic de Bruijn Up: The calculus of Previous: Factors for Rule

Harold V. McIntosh
E-mail:mcintosh@servidor.unam.mx