Friday, December 20, 2013

Abstractive power

each extensible language is surrounded by an envelope of possible extensions reachable by modest amounts of labor by unsophisticated users.
Thomas A. Standish, "Extensibility in Programming Language Design", SIGPLAN Notices 10 no. 7 (July 1975) [Special Issue on Programming Language Design], p. 20.

I said in an earlier post I should blog about abstractive power "eventually".  This is it.

This material is very much a work in progress.  Throughout this post I'll emphasize insight and intuition — but while the post starts out non-technical, it will get more mathematical by increments as it goes along.  The relation between the math and the insights works both ways:  insights into abstraction guide the mathematical development, and the mathematical development is pursued partly in hopes of eventual further insights into abstraction.  I also hope, by presenting the mathematical development in an intuitive form (as opposed to the much drier form in my 2008 techreport), to get insights into the mathematical development itself.  The post ends, as does the current state of the work, with an elementary test case to show feasibility.

Contents
The idea
The goal
There is no semantics
The second derivative of semantics
Recasting expressiveness
Abstractiveness
Test case
The idea

The extensible languages movement peaked around 1970, and was on its way out when Standish wrote the above.  Extensibility enthusiasts had hoped, frankly, that by means of language-extension mechanisms it would become possible for everyone to use a single base language and transform it into anything anyone needed for any particular purpose.  Standish was noting that the extension mechanisms primarily used by the movement — macro preprocessors and perhaps the ability to add new syntax rules — had a limited range, after which it became quite difficult to extend the language further.

Macro preprocessing, in particular, cannot easily be used to build a series of extensions, one on top of another, because as extension follows extension, the programmer is rapidly overcome by accumulating complexity.  In order to use a macro, you have to be able to see whatever underlying facility the macro uses.  Thus, to add a second layer of macros to a base language, you have to understand and account for the base language and all of its first layer of macros; to add a third layer of macros, you have to understand the base language, the first layer of macros, and the second layer of macros; and so on.  The visibility of the underlying layers also limits how different the extended language can be from the base language.

The extensibility movement was supplanted by the abstraction movement, which had a more semantic focus, and came to be dominated — at least for a while — by the Object-Oriented Paradigm.  Something of the spirit of the new movement is visible in this remark on lexical scoping from Steele and Sussman's 1978 The Art of the Interpreter; or, The Modularity Complex (p. 24):

What is interesting about this is that we can write procedures which construct other procedures.  This is not to be confused with the ability to construct S-expression representations of procedures; that ability is shared by all of the interpreters we have examined.  The ability to construct procedures was not available in the dynamically scoped interpreter.  In solving the violation of referential transparency we seem to have stumbled across a source of additional abstractive power.

Abstraction helps with the problem of accumulating complexity, because you can — at least, ideally — use an extension without having to worry about all the details of what underlies it.  There is still some accumulated complexity, though.  I noted this in my earlier blog post on types:

In mathematics, there may be several different views of things any one of which could be used as a foundation from which to build the others.  That's essentially perfect abstraction, in that from any one of these levels, you not only get to ignore what's under the hood, but you can't even tell whether there is anything under the hood.  Going from one level to the next leaves no residue of unhidden details: you could build B from A, C from B, and A from C, and you've really gotten back to A, not some flawed approximation of it that's either more complicated than the original, more brittle than the original, or both.
The central point of that blog post is that typing, which is evidently meant to help us manage complexity, can easily become itself a source of complexity.  The same danger applies to other tools we use to manage complexity; the tools become effectively part of the language, and are thus added complexity and subject to accumulation of further complexity.

Another four decades of experience (since Standish's post-mortem on the extensible languages movement) suggests that all programming languages have their own envelopes of reachable extensions.  However, some languages have much bigger, or smaller, envelopes than others.  What factors determine the size, and shape, of the envelope?  What, in particular, can a programming language designer do to maximize this envelope, and what are the implications of doing so?

As a useful metaphor, I call the breadth of a language's envelope its radius of abstraction.  Why "abstraction"?  Well, consider how the languages in this envelope are reached.  Starting from the base language, you incrementally modify the language by using facilities provided within the language.  That is, the new (extended) language is drawn out from the old (base) language, in which the new language had been latently present.  (Latin abs-, "out", and trahere, "pull/draw")  The terminology of layers of abstraction, in programming, goes back to the 1970s.  One also finds this use of abstraction in philosophy, for drawing out something latently present; here's a passage from Locke's 1689 An Essay Concerning Human Understanding (you may recognize this, as it's quoted at the front of the Wizard Book):

The acts of the mind, wherein it exerts its power over its simple ideas, are chiefly these three :  (1) Combining several simple ideas into one compound one ; and thus all complex ideas are made.  (2) The second is bringing two ideas, whether simple or complex, together, and setting them by one another, so as to take a view of them at once, without uniting them into one ; by which way it gets all its ideas of relations.  (3) The third is separating them from all other ideas that accompany them in their real existence : this is called abstraction : and thus all its general ideas are made.
Our programming-language use of the term abstraction does take a bit of getting used to, because we usually expect something abstracted to be smaller than what it was abstracted from.  The abstract of a paper is a short summary of it.  An abstract thought has left behind the details of concrete instances — though it seems the abstract thought may be somehow "bigger" than the more concrete thoughts from which it was drawn.  In our case, the extended language is probably equi-powerful with the base language, and therefore, even if some specific implementation details are hidden during extension, the two languages still feel as if they're the same size.  This is not really strange; recall Cantor's definition of an infinite set — a set whose elements can be put in one-to-one correspondence with those of a proper subset of itself.  Since we rarely work with finite languages, it shouldn't surprise us if we abstract from one language another language just as "big".

Ironically, though, the reason we're discussing this at all is that, despite our best efforts, the extended language is smaller than the base in the sense that its "envelope" of reachable extensions is smaller.  We'd really rather it weren't smaller.

What general principles govern radius of abstraction?  My first candidate is smoothness, a term I borrowed from M. D. McIlroy, one of the founders of the extensible languages movement.  I mean by it the property of a language that its abstractive facilities apply to the language in a free and uniform way.  This concept is also close kin to Strachey's first-class objects, and van Wijngaarden's orthogonality.  I proposed the following principle in my dissertation:

(Smoothness Conjecture)  Every roughness (violation of smoothness) in a language design ultimately bounds its radius of abstraction.
When a base language contains a defect of smoothness, I suggest, successive extensions magnify the defect, creating unbounded complexity that drags down the programmer.

The goal

The Smoothness Conjecture is a neat expression of a design priority shared by a number of programming language designers; but, looking at programming language designs over the decades, clearly many designers either don't share the priority, or don't agree on how to pursue it.

What, though, if we could develop a mathematical framework for studying the abstractive power of programming languages — a theory of abstraction?  One might then have an objective basis for discussing design principles such as the Smoothness Conjecture, which to date have always been largely a matter of taste.  I wouldn't expect the Smoothness Conjecture itself to be subject to formalization, let alone proof, at least not until the study of the subject reached quite a mature phase; but the Conjecture may inspire any number of more specific claims that could then be weighed objectively.

This, in my humble opinion, would be very cool and, as an added bonus, immensely useful.

It is not, however, a short-term goal.  For a ballpark estimate, say people have been pursuing abstractive power (under whatever name) since the founding of the extensible languages movement, circa 1960.  When I got into the game, it had already been going on for about three decades.  The more extreme OOP advocates were making claims for it that could have been lifted nearly verbatim from the more extreme extensibility advocates of two decades earlier, and by my assessment then (understandably unpopular with the OOP advocates) there was still more we didn't know than that we did.  I didn't expect to tie it all up quickly; but I'm still excited about the prospects, because every few years my thinking on it has moved (forward, I hope) slightly — and by my estimate of the difficulty, any progress at all is a very encouraging sign.

There is no semantics

Shortly after I started thinking on abstractive power, Matthias Felleisen's classic paper "On the Expressive Power of Programming Languages" was published (in the proceedings of ESOP '90), and I was encouraged by this evidence that I wasn't the only person in the world crazy enough to try to mathematize traditionally informal aspects of language design.  Felleisen's treatment has been quite a successful meme in the years since, and has some features of interest for abstraction theory — both features that apply to abstractive power, and features that offer insight because of why they don't apply to abstractive power.

Felleisen's expressiveness works roughly thus:  A programming language is a set of programs together with a partial mapping from programs to semantic values.  Language A can express language B if there is a simple way to rewrite B-programs as A-programs that preserves the overall pattern of semantics of programs — not only the semantic values of valid programs, but also which programs are valid (i.e., halt) and which are invalid (don't halt).  (In this case, the overall pattern of behavior is sufficiently captured by the pattern of halting/not-halting, so one might as well say there is just a single semantic value, or technically replace the "partial mapping from programs to semantic values" with a "subset of programs designated as halting".)

That is, A can express B when there exists a computable function φ mapping each B-program p to an A-program φ(p) such that φ(p) halts iff p halts.  A can weakly express B when φ(p) halts if p halts (but not necessarily only if p halts).

How readily A can express B depends on how disruptively φ is allowed to rearrange the internal structure of p.  Felleisen's paper particularly focuses on the class of "macro", a.k.a. "polynomial", transformations φ, which correspond to Landin's notion of syntactic sugar.  Each syntactic operator σ in language B is replaced by a polynomial (a "macro", or "template") σφ in language A; thus,

φ(σ(e1, ... en))  =  σφ(φ(e1), ... φ(en))
When φ is of this form, one says A can macro-express B.
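
To make the shape of such a transformation concrete, here is a small Python sketch (my own illustration, not an example from Felleisen's paper) of the classic piece of syntactic sugar: desugaring let into application of a lambda.  Each operator of the source language is replaced by a fixed template whose holes are filled with the recursively transformed subterms, exactly the φ(σ(e1, ... en)) = σφ(φ(e1), ... φ(en)) shape above.

# Terms of language B: ('var', x), ('lam', x, e), ('app', e1, e2), ('let', x, e1, e2).
# Terms of language A: the same minus 'let'.
def phi(term):
    """Macro (polynomial) transformation: each operator maps to a fixed template
    over the transformed subterms."""
    op = term[0]
    if op == 'var':
        return term
    if op == 'lam':
        _, x, body = term
        return ('lam', x, phi(body))
    if op == 'app':
        _, e1, e2 = term
        return ('app', phi(e1), phi(e2))
    if op == 'let':
        _, x, e1, e2 = term                      # (let x e1 e2) => ((lam x e2') e1')
        return ('app', ('lam', x, phi(e2)), phi(e1))
    raise ValueError(op)

# (let x one (f x))  becomes  ((lam x (f x)) one)
print(phi(('let', 'x', ('var', 'one'), ('app', ('var', 'f'), ('var', 'x')))))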

I described the criterion for A can express B as preserving the "overall pattern of semantics of programs".  I meant to suggest that this is more than each individual mapping p ↦ φ(p) preserving semantics; it involves preserving, across mapping φ, how the shape of program texts affects their semantics.  This preservation-of-shape is more apparent when considering macro-expressiveness, which demands similarities of program shape between p and φ(p), because this implies that the similarities and differences between φ(p1) and φ(p2) would be akin to the similarities and differences between p1 and p2; but it's not clear that polynomial/macro rewriting would be the only useful measure of similarity of program shape.  (Cf. Steele and Sussman's 1976 Lambda: The Ultimate Imperative.)  To explore more general aspects of expressiveness, one might parameterize the theory by what class of transformations are allowed.

In preparing for an analogous treatment of abstractiveness, the first thing to recognize is that while expressiveness views each program as inducing a semantic value, abstractiveness views each program as inducing a programming language.  When comparing two B-programs p1 and p2, we don't just ask whether they induce the same programming language, because they almost certainly do not.  Rather, we want to compare the induced programming languages to each other, probably using some measurement at least as sophisticated as expressiveness.

Consider what this means for the definition of programming language.  Picture a base language as the center of a web of languages connected by directed arrows —abstractions— each arrow labeled by a program text.  The whole thing is a sort of state machine, where the states are languages, the state transitions are abstractions, and the input "alphabet" is the set of program texts.  We could also integrate semantics into this model, by adding transitions labeled with reserved "observable" terms — and then there isn't really any need for the states of the machine at all.  Everything we could ever want to know about a given programming language is contained in the set of all possible sequences of labels on paths starting from that language; so we might as well define a language to be that set of sequences.  That is,

(D1)  A programming language over set of terms T is a set of sequences of terms P ⊆ T* such that for all sequences of terms x and y, if xy ∈ P then x ∈ P.
This approach also appeals to the recognition that although computation theory tends to look only at computations that halt, a great many of our software processes are open-ended.
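
As a minimal illustration of (D1), here is a Python sketch (a toy with a finite set of sequences, and entirely my own framing; the languages of interest are of course infinite) of a language as a prefix-closed set of term sequences:

# A "language" in the sense of (D1): a set of sequences (tuples) of terms,
# closed under taking prefixes.  Terms here are just strings.
def is_prefix_closed(language):
    """Check the (D1) condition: if xy is in the language, then so is x."""
    return all(x[:i] in language for x in language for i in range(len(x)))

P = {
    (),
    ('decl f=1',),
    ('decl f=1', 'query f'),
    ('decl f=1', 'query f', 'report 1'),
}
print(is_prefix_closed(P))           # True
print(is_prefix_closed(P - {()}))    # False: the empty prefix is missing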

This purely syntactic, unbounded view of programming languages is foundational.  The expectation of halting — what one might call the terminal-semantic assumption — is ubiquitous:  the assumption, hardwired into one's core definitions, that a computation is meant to get an answer and stop.  Denotational semantics is a terminal-semantic model.  Theory of computation, and complexity theory, are founded on the terminal-semantic assumption.

To my mind, an essential difficulty with the terminal-semantic approach is that, patently, it prefers to disregard properties that relate to unbounded sequences of future developments.  Abstractive power is directly concerned with such sequences, but one suspects all computation really should take them into account, as most macroscopic software processes are interactive (in one or another sense) and open-ended rather than self-contained and merely producing a final result.  (Cf. Dina Goldin et al.)

The unbounded-syntax approach does not, of course, really "eliminate" semantics; but it does cause semantics to become a dependent concept, grounded in syntax.  For abstraction theory as I'm currently developing it, semantics is a set of sequences of terms; in RAGs (from my master's thesis), semantics is a nonterminal symbol of a grammar.  (In a modern treatment of RAGs I'd be inclined to replace the term "metasyntax" with "co-semantics"; but I digress... sort-of.)

Note:  I've described RAGs in a later blog post, here.
The second derivative of semantics

The expressiveness of a programming language —what we ask φ to conserve when mapping B to A— is about the contours of the overall pattern of semantics of programs.  That is, it's about how variations in the text of a program change the semantics induced by the program; in short, expressiveness is the first derivative of semantics.

The abstractiveness of a programming language —which we will want conserved when we assert that one language is "as abstractively powerful" as another— is about how variations in the text of a program change the expressiveness of the language induced by the program.  Thus, as expressiveness looks at variations in semantics, and is in this sense the first derivative of semantics, abstractiveness looks at variations in expressiveness, and is thus the second derivative of semantics.

As I've set out to mathematize this, I've found the treatment becomes rather off-puttingly elaborate (a trend I mean to minimize in this post).  I observe a lesser degree of the same effect even in Felleisen's paper, which was apparently trying to stick to just a few simple ideas and yet somehow got progressively harder to keep track of.  Some of this sort of thing is a natural result of breaking new ground:  appropriate simplifications may be recognized later.  I omitted one substantial complication from my description of Felleisen's treatment, that I suspect was simply a consequence of techniques he'd inherited from other purposes.  However, I've also introduced one new complication into expressiveness, namely parameterization by the class of transformations allowed — and I'm about to introduce a second complication, as I adapt Felleisen's treatment to my unbounded-syntax strategy.

Recasting expressiveness

In adapting expressiveness to the unbounded-syntax definition of programming language (D1), the first consideration is that a mapping φ between two languages of this sort has to respect the all-important prefix structure of the languages:

(D2)  For programming languages P and Q, a morphism from P to Q is a function φ : P → Q that preserves prefixes.
That is, for every xy ∈ P, there exists z such that φ(xy) = φ(x)z.  We write P/x for the language reached from language P by text sequence x ∈ P; and similarly, when φ : P → Q, we write φ/x for the corresponding morphism from P/x to Q/φ(x).  Thus,  P/x = { y | xy ∈ P }  and  φ(xy) = φ(x) (φ/x)(y).
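
In the same toy, finite style (again just a sketch of the definitions, with a morphism given as an explicit lookup table), the derived language P/x, the prefix-preservation condition of (D2), and the derived morphism φ/x can be written down directly:

# Derived language:  P/x = { y | xy in P }.
def derive(P, x):
    return {y[len(x):] for y in P if y[:len(x)] == x}

# (D2): phi (a dict from sequences of P to sequences of Q) preserves prefixes.
# Assumes P is prefix-closed, as (D1) requires.
def is_morphism(phi, P, Q):
    if set(phi) != P or not set(phi.values()) <= Q:
        return False
    return all(phi[x][:len(phi[x[:i]])] == phi[x[:i]]    # phi(prefix) prefixes phi(x)
               for x in P for i in range(len(x) + 1))

# phi/x : P/x -> Q/phi(x), the "rest" of phi after x.
def derive_morphism(phi, x):
    return {y[len(x):]: phi[y][len(phi[x]):] for y in phi if y[:len(x)] == x}

P = {(), ('a',), ('a', 'b')}
Q = {(), ('A',), ('A', 'B'), ('A', 'B', 'C')}
phi = {(): (), ('a',): ('A',), ('a', 'b'): ('A', 'B', 'C')}
print(is_morphism(phi, P, Q))        # True
print(derive(Q, ('A',)))             # the rests: (), ('B',), ('B', 'C')
print(derive_morphism(phi, ('a',)))  # {(): (), ('b',): ('B', 'C')}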

As remarked earlier, we parameterize expressiveness relationships by the class of allowable morphisms.  We won't allow arbitrary classes of morphisms, though.

(D3)  A category (over programming languages with terms T) is a set C of morphisms between languages over T, closed under composition and including the identity morphism of each language over T.
For given terms T, some useful categories:
Any = category of all morphisms.
Map = category of morphisms that perform a term transformation uniformly, φ(t1...tn) = φ(t1)...φ(tn).
Macro = category of map morphisms whose term transformation is macro/polynomial.
Inc = category of inclusion morphisms, φ : P → Q with φ(x)=x
Id = category of identity morphisms, φ : P → P with φ(x)=x
Where Felleisen could avoid complicated semantic values by relying on halting behavior, we need a set of observable terms, O ⊆ T.
(D4)  Morphism φ : P → Q respects observables O if
  • φ maps observables, and nothing else, into those same observables — that is, (φ/x)(y)=o iff y=o — and
  • φ transforms each derived language into a language with exactly the same observables — that is, o∈(P/x) iff o∈(Q/φ(x)).
Morphism φ : P → Q weakly respects observables O if it satisfies all the conditions for respecting observables O except that o∈(P/x) implies o∈(Q/φ(x))  (rather than implication going both ways).
ObsO = category of all morphisms that respect observables O
WObsO = category of all morphisms that weakly respect observables O
Our basic expressiveness relation is then
(D5)  For category C and languages A and B,  A can C-express B for observables O if there exists φ : B → A with φ ∈ C ∩ ObsO.
We say A is as C,O-expressive as B.  Weak C,O-expressiveness uses WObsO in place of ObsO.  Evidently, the as C,O-expressive as relation is transitive, as is its weak variant.  Expressiveness implies weak expressiveness (because ObsO ⊆ WObsO).

There is a simple theorem that the expressiveness relation is preserved if the category is made bigger, and if the set of observables is made smaller.  That is, if A is as C1,O1-expressive as B, category C1 ⊆ C2, and observables O2 ⊆ O1, then A is as C2,O2-expressive as B.

Note:  Although the notation here is simplified from the techreport, it still poorly handles the codomain of a derived morphism, producing such un-self-explanatory expressions as Q/φ(x).  Belatedly I see that by adopting for φ: P → Q the additional notation φ(P)=Q, one would have the more mnemonic φ/x : P/x → φ(P)/φ(x).
Abstractiveness

As expressiveness depends on φ : B → A preserving the semantic landscape of B, abstractiveness should depend on φ preserving the expressive landscape of B.  A semantic landscape for B arose from specifying a set of observables.  An expressive landscape for B arises from specifying a category of expressiveness morphisms by which to compare different languages B/x.

If it wasn't entirely certain what category to use when comparing expressiveness (hence the introduction of parameter C into the expressiveness relation, where Felleisen's treatment only looked at two choices of C), the choice of expressiveness category becomes much more obscure when considering abstractiveness.  Note in Standish's remark the phrase by modest amounts of labor; dmbarbour has remarked that multiple degrees of difficulty are of interest, and this suggests that no one choice of category will give us a full picture.  One might even imagine that a φ : B → A of one category would map expressive relations on B of a second category onto expressive relations on A of yet a third category.  The point is that, without prior experience with this mathematics, we have no clear notion which of the myriad imaginable relations we want to look at.  We hope that experience working with it may give us insight into what special case we really want, but meanwhile the endeavor seems to require a lot of parameters.  It would look rather silly to have an abstractiveness relation with, say, four parameters, so we redistribute the parameters by bundling the choice of expressive category with the language.

(D6)  A programming language with expressive structure over terms T is a pair A=〈S,C〉 of a programming language S over T with a category C over programming languages with terms T.
We call these expressive languages for short.  We may write seq(A) and cat(A) for the components of expressive language A.

For φ to preserve the expressive landscape from B to A is more involved than preserving the semantic landscape.  The expressive landscape of B=〈S,C〉 is a web of relations between languages derived from S.  Suppose g ∈ C with g : S/x → S/y.  Through φ : 〈S,C〉 → 〈R,D〉, derived languages S/x and S/y correspond to R/φ(x) and R/φ(y).  Moreover, we already have three morphisms between these four derived languages:

g : S/x → S/y
φ/x : S/x → R/φ(x)
φ/y : S/y → R/φ(y)
If only we had a fourth morphism h ∈ D with h : R/φ(x) → R/φ(y), we could ask these four morphisms to commute, h ∘ φ/x = φ/y ∘ g.
(D7)  For expressive languages P and Q,  a morphism from P to Q is a morphism φ : seq(P) → seq(Q) such that
  • for all g ∈ cat(P) with g : seq(P)/x → seq(P)/y, there exists h ∈ cat(Q) with h ∘ φ/x = φ/y ∘ g, and
  • for all x,y ∈ seq(P), if there is no g ∈ cat(P) with g : seq(P)/x → seq(P)/y, then there is no h ∈ cat(Q) with h : seq(Q)/φ(x) → seq(Q)/φ(y).
For category C and expressive languages A and B,  A can C-abstract B for observables O if there exists φ : B → A with φ ∈ C ∩ ObsO.
Note:  I've corrected a typo in this definition:  the second condition read for all x,y ∈ seq(Q) rather than seq(P). (24 April 2014)
We say A is as C,O-abstractive as B; and weak C,O-abstractiveness uses WObsO in place of ObsO.  The as C,O-abstractive as relation (and its weak variant) is transitive, and is preserved by making C bigger or O smaller.  Abstractiveness implies weak abstractiveness (because, again, ObsO ⊆ WObsO).  The techreport proves several other rather generic theorems; for example, if A is as C,O-abstractive as B, and cat(A) respects observables O, then cat(B) respects observables O.

When the expressive languages get their expressive structure from the identity category, C,O-abstractiveness devolves to C,O-expressiveness.  That is, for all languages A and B, category C, and observables O,  〈A,Id〉 is as C,O-abstractive as 〈B,Id〉  iff  A is as C,O-expressive as B.

Test case

An obvious question at this point is, now that we've got this thing put together, does it work?  In the techreport, my sanity check was a test suite of toy languages, chosen to minimally capture the difference between a language in which local declarations are globally visible, and one in which local declarations can be private.  The wish list at the end of the techreport, of results to try for, was much more impressive — I targeted encapsulation of procedures; hygienic macros; fexprs; and strong typing — but the sanity check, as the only example thus far actually worked out in full detail, has some facets of interest.

In language L0, records are declared with constant fields, and the fields can then be queried.  When a query occurs in a text sequence, the next text in the sequence reports the value contained in the queried field; these reports are the observable terms.  Language Lpriv is identical except that individual fields may be declared "private"; private fields cannot be queried, though they can be used when specifying the value of another field in the same record.

It's easy to transform an Lpriv text sequence into a valid L0 sequence:  just remove all the "private" keywords from the field declarations.  This is a macro/polynomial transformation, and all the queries that worked in Lpriv still work and give the same results in L0.  Therefore, L0 is weakly as Macro-expressive as Lpriv.
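
Here is a rough Python rendering of that transformation (the concrete syntax for texts is hypothetical, invented for this sketch rather than lifted from the techreport); the point is that it acts term by term, which is what makes it a macro-style map:

# Toy texts:
#   ('record', name, [(field, defining_expr, is_private), ...])
#   ('query', record, field)
#   ('report', value)
def strip_private(text):
    """Map one Lpriv text to an L0 text by dropping the private markers."""
    if text[0] == 'record':
        _, name, fields = text
        return ('record', name, [(f, expr, False) for f, expr, _ in fields])
    return text                      # queries and reports are unchanged

def phi(sequence):
    """The transformation acts on each text separately, like a macro expansion."""
    return tuple(strip_private(t) for t in sequence)

u = (('record', 'r', [('hidden', '2', True), ('visible', 'hidden+1', False)]),
     ('query', 'r', 'visible'),
     ('report', 3))
print(phi(u))   # same sequence, with both fields public; the query still reports 3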

Whether or not L0 is as Macro-expressive as Lpriv (without the "weakly" qualifier) depends on just how one sets up the languages.  Each query is a separate text in the sequence, and the only text that can follow a query is a report of the result from the query.  Expressiveness hinges on whether or not it is permissible to query a field that is not visible; and the reason this matters is that although strong expressiveness requires each derived language B/x map into a derived language A/φ(x) that has no extra observable terms, it does not require that there be no extra non-observable terms.  To see how this works, suppose u ∈ Lpriv is a sequence of declarations, after which query q1 is valid but q2 attempts to query a private field.  Following u, the result of q1 is r1, and the result of q2 would be r2 if the field were made public.  Let v be the result of removing all "private" keywords from u.

In the techreport, a query is not permitted unless its target is visible.  Therefore, Lpriv sequences include uq1 and uq1r1, but not uq2.  L0 sequences include vq1, vq1r1, vq2, and vq2r2.  But expressiveness does not object to vq2r2 ∈ L0 unless there is some x ∈ Lpriv such that φ(x)=vq2 but xr2 ∉ Lpriv.  Since there is no such x, the observable r2 in vq2r2 causes no difficulty.  Although it is true that vq2 ∈ L0 even though uq2 ∉ Lpriv, this does not interfere with expressiveness because q2 is not an observable.  The upshot is that in the techreport, L0 is as Macro-expressive as Lpriv.

Alternatively, one could allow queries regardless of whether they're valid; this is essentially a "dynamic typing" approach, where errors are detected lazily.  In that case, uq2 ∈ Lpriv; the allowance of the observable r2 in L0 is then a violation of strong expressiveness, and L0 is not as Macro-expressive as Lpriv (although it's still weakly as Macro-expressive).

What about Macro-abstractiveness?  Given the relative triviality of the languages, I chose to use weak category Inc for the expressive structure of the languages:  〈L0,Inc〉 and 〈Lpriv,Inc〉.  L0 ⊆ Lpriv, so using the inclusion morphism φ(x)=x from L0 to Lpriv, evidently 〈Lpriv,Inc〉 is as Inc-abstractive as 〈L0,Inc〉; and consequently, as a weaker result, 〈Lpriv,Inc〉 is as Macro-abstractive as 〈L0,Inc〉.

As for the other direction, we've already observed that L0 is as Macro-expressive as Lpriv, which is to say that 〈L0,Id〉 is as Macro-abstractive as 〈Lpriv,Id〉.  So if 〈L0,Inc〉 isn't as Macro-abstractive as 〈Lpriv,Inc〉, the difference has to be in the additional expressive structure.  This expressive structure gives us a g : L/x → L/y when, and only when, L/x ⊆ L/y.  Therefore, in order to establish (contrary to our intention) that 〈L0,Inc〉 is as Macro-abstractive as 〈Lpriv,Inc〉, we would have to demonstrate a MacroObsO morphism φ : Lpriv → L0 that preserves all of the subsetting relationships between derived languages in Lpriv.

For example, the transformation remove-all-"private"-keywords, which gave us Macro-expressiveness, will not work for Macro-abstractiveness, because it washes out subsetting relationships.  Suppose u is a sequence of declarations in Lpriv that contains some private fields, and v is the result of removing the "private" keywords from u.  Since v can be followed by some things that u cannot, Lpriv/u is a proper subset of Lpriv/v.  However, the remove-"private" transformation maps both of these languages into L0/v, which evidently isn't a proper subset of itself; so remove-"private" doesn't preserve the expressive structure.

In fact, no Macro morphism will preserve the expressive structure.  Consider a particular private field declaration, call it d.  When d occurs within a record declaration in an Lpriv sequence, it can affect later expansions of the sequence, through its use in defining other, non-private fields of the same record, changing the results of queries on those non-private fields.  The only thing that can affect later queries in L0 is a field declaration; therefore, a Macro morphism φ must map d into one or more field declarations.  But if d is embedded in a record declaration that doesn't use it, it won't have any effect at all on later queries — and the only way that can be true in L0 is if φ(d) doesn't declare any fields at all.  So there is no choice of φ that preserves the expressive structure (which is to say, the subsetting structure) in all cases.

So 〈Lpriv,Inc〉 is strictly more Macro-abstractive than 〈L0,Inc〉.

Wednesday, July 31, 2013

Explicit evaluation

...if you cannot say what you mean, you can never mean what you say.
— Centauri Minister of Intelligence, Babylon 5.

Ever since I started this blog, I've had in mind to devote a post to the relationship between the strong theory of vau-calculus and the no-go theorem of Mitchell Wand's 1998 paper The Theory of Fexprs is Trivial.  Some of my past attempts to explain this relationship have failed miserably, though, so before trying again I wanted some new insight into how to approach the explanation.  Following my mention of this issue in my recent post on bypassing no-go theorems, and some off-blog discussion it provoked, I'm going to take a stab at it.

The trouble has always been that while my result is very simple, and Wand's result is very simple, and both treatments use quite conventional mathematical techniques (playing the game strictly according to Hoyle), they're based on bewilderingly different conceptual frameworks.  There's not just one difference, but several, and if you try to explain any one of them, your audience is likely to get hung up on one or more of the others.

Here are some highlights of how these conceptual differences stack up; I'll look at things in more detail hereafter.

  • Wand's paper assumes that the only equivalences of interest are those between source expressions; he acknowledges this explicitly in the paper.
  • Wand's paper isn't really about fexprs — it's about computational reflection, as can be seen in the Observations and Conclusions section at the end of the paper.
  • Wand's operational semantics has no terms that aren't source expressions.  This isn't a realistic depiction of fexprs as they occur in Lisp, but it is consistent with both his exclusive interest in the theory of source expressions and his interest in computational reflection.
  • Wand's operational semantics uses implicit evaluation rather than explicit evaluation:  that is, to distinguish between an expression to be evaluated and an expression not to be evaluated, he uses special contexts marking expressions that aren't to be evaluated, rather than special contexts marking expressions that are to be evaluated.  From a purely mathematical standpoint, a rewriting system using implicit evaluation always has a trivial theory, while a rewriting system using explicit evaluation can have quite a strong theory.  However, Wand's conceptual commitments prevent him both from using explicit evaluation and from caring about its consequences.  He cannot use explicit evaluation because the explicit evaluation contexts would not be representable in source expressions, violating his prohibition against non-source terms.  And, even if such expressions were allowed, all of the nontrivial equivalences in the theory would involve non-source terms, so that all the equivalences Wand is interested in would still be trivial (though for a different technical reason).

The deceptively simple math

Suppose we want a term-rewriting system in which any source expression S may sometimes appear as data and other times appear as program code to be executed.  We can do this in either of two ways:  either use a special context to mark a subterm that isn't to be evaluated, or use a special context to mark a subterm that is to be evaluated.

What Wand calls contextual equivalence, and I call operational equivalence, is a binary relation on terms defined thus (up to unimportant differences between treatments):

T1 ≅ T2  iff for every context C and observable V,  C[T1] ↦* V  iff   C[T2] ↦* V
But this formal definition interacts differently with the two different approaches to marking whether S is to be evaluated.

Suppose we use a special context to mark a subterm that isn't to be evaluated.  I call this strategy implicit evaluation, because a subterm is implicitly to be evaluated unless we go out of our way to say otherwise.  This strategy would naturally occur to a traditional Lisp programmer, who is used to quotation.  If this special context is identified by an operator, likely we'd call it "quotation"; but whatever we call it, suppose Q is a context that marks its subterm as data.  This means, formally, that Q[S] is itself an observable, and that Q[S1] and Q[S2] are different observables unless S1 and S2 are syntactically identical.  It immediately follows, from our definition of operational equivalence, that no two source terms S1 and S2 can ever be operationally equivalent unless they are syntactically identical.  When you add all the trappings in Wand's paper, this trivialization of theory might look more complicated, but it still comes down to this:  if there's a context that converts arbitrary terms into irreducible observables, then the operational equivalence relation is trivial.
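
Here is a tiny Python illustration of the effect (a toy arithmetic language of my own, nothing like Wand's actual calculus): once a quoting context makes the text of a term observable, operational equivalence collapses to syntactic identity, even for terms that always compute the same value.

# Toy terms: integers, ('add', t1, t2), and ('quote', t).
def run(term):
    """Reduce a term to an observable."""
    if isinstance(term, int):
        return term
    if term[0] == 'add':
        return run(term[1]) + run(term[2])
    if term[0] == 'quote':
        return term                  # observable: the quoted text itself, unevaluated
    raise ValueError(term)

t1, t2 = ('add', 1, 1), 2            # the same value in every evaluating context
print(run(t1) == run(t2))                        # True
print(run(('quote', t1)) == run(('quote', t2)))  # False: the quoting context tells them apart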

But suppose we use the second strategy, a special context to mark a subterm that is to be evaluated.  I call this strategy explicit evaluation because a subterm is only evaluated if it's immediately surrounded by a context that explicitly says to evaluate it.  Then the cause of Wand's trivialization of theory "simply" vanishes.  Except that the consequences aren't at all simple.  The operational equivalence relation is really measuring something different under explicit evaluation than it did under implicit evaluation (under explicit evaluation, source terms have trivial theory because they're irreducible, whereas under implicit evaluation they have trivial theory even though they aren't irreducible).  So now I'll go back to square one and build the mathematical machinery with more care.

Dramatis personae

In classical small-step operational semantics, we have a term set, and a number of binary relations on terms.

The term set includes both source expressions and the observable results of computations, and possibly also some terms that represent states of computation that are neither source expressions nor observables.

Six key properties that a binary relation on terms might have: 

  • Reflexive:  T1 > T1
  • Transitive:  if  T1 > T2  and  T2 > T3  then  T1 > T3
    The reflexive transitive closure ("zero or more steps") is written by suffixing "*" to the relation, as  T1 >* T2
  • Symmetric:  if  T1 > T2  then  T2 > T1
    A reflexive transitive symmetric relation is called an equivalence.  (A quick sketch of building such closures appears just after this list.)
  • Compatible:  if  T1 > T2  then for all contexts C,  C[T1] > C[T2]
    A compatible equivalence is called a congruence.
  • Church-Rosser:  If  T1 >* T2  and  T1 >* T3  then there exists T4 such that  T2 >* T4  and  T3 >* T4
  • Deterministic:  For each T1, there is at most one T2 such that T1 > T2
    This is not usually a mathematically desirable property, because it interferes with other properties like compatibility.  However, determinism may be semantically desirable (i.e., we want a program to behave the same way each time it's run).
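
As promised above, a quick Python sketch of how such closures are built for a finite relation (a generic utility of my own, not tied to any particular calculus): suffixing "*" takes the reflexive transitive closure, and throwing in symmetry as well gives the smallest equivalence containing the relation.

def closure(pairs, symmetric=False):
    """Reflexive transitive (optionally also symmetric) closure of a finite
    relation, given as a set of (a, b) pairs."""
    elems = {x for p in pairs for x in p}
    rel = set(pairs) | {(x, x) for x in elems}           # reflexive
    if symmetric:
        rel |= {(b, a) for a, b in rel}
    while True:
        extra = {(a, d) for a, b in rel for c, d in rel if b == c} - rel
        if symmetric:
            extra |= {(b, a) for a, b in extra}
        if not extra:                                    # transitively closed
            return rel
        rel |= extra

one_step = {('(λx.x)0', '0')}                            # a single reduction step
print(('(λx.x)0', '(λx.x)0') in closure(one_step))               # True: zero steps
print(('0', '(λx.x)0') in closure(one_step, symmetric=True))     # True: equality runs both ways
print(('0', '(λx.x)0') in closure(one_step))                     # False: reduction does not
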
Four key binary relations in the small-step semantics of a programming language:
  • Operational step, written  ↦
    A directed relation meant to "obviously" describe what programs in the language are supposed to do.  Because it's meant to be obviously right, it doesn't try to have nice mathematical properties like Church-Rosser-ness.  Expected to be deterministic and non-compatible.
  • Calculus step, written  →
    A relation meant to have lots of nice mathematical properties, like the reduction step relation of lambda-calculus.  The calculus step is generally defined to be compatible, and if it isn't Church-Rosser, something's wrong.
  • Calculus equality, written  =
    The reflexive transitive symmetric closure of the calculus step (thus, the smallest equivalence relation containing the calculus step).
  • Operational equivalence, written  ≅
    T1 ≅ T2  iff for every context C and observable V,  C[T1] ↦* V  iff   C[T2] ↦* V
The master theorems one wants to prove about these relations are
  • completeness:
    ↦*  implies  →*
  • Church-Rosser-ness:
    →  is Church-Rosser
  • soundness:
    =  implies  ≅
(There's another "master theorem" we could include on this list, called standardization, but that really is way more infrastructure than we need here.)

Wand's operational semantics

The syntax of terms in Wand's semantics is:

T   ::=   x | (λx.T) | (TT) | (fexpr T) | (eval T)
That is, a term is either a variable, or a lambda-expression, or a combination, or a fexpr, or a call to eval.  Wand explained he didn't want to have two different binding constructs, one for ordinary procedures and one for fexprs.  Well, I didn't either; it's characteristic of our different approaches, though, that where Wand made his binding construct a constructor of applicatives (that evaluate their argument), and put a wrapper around it to cause the operand to be quoted, I made my binding construct a constructor of operatives (i.e., fexprs), and put a wrapper around it to cause the operand to be evaluated.

Wand has three rewriting rules: one for applying a λ-expression, one for applying a fexpr, and one for applying eval.  If we were being naive, we might try to define a calculus step like this:

((λx.T)V)   →   T[x ← V]
(fexpr V)T   →   (V encode(T))
(eval T)   →   decode(T)
Here, T is a term, and V is a value, that is, a term that's been reduced as much as it can be.  (Obscure point:  a value is a term that's irreducible by the operational step.  Don't worry about it.)

The encode/decode functions require some explanation.  The point of not evaluating certain subterms is to be able to use those unevaluated subterms as data.  So somehow you have to represent them in an accessible data form.  One way to do this would be to use a special quotation operator, and then have accessors that act on quoted expressions, like  (car (quote (T1 . T2))) → (quote T1).  However, when you're proving a no-go theorem, as Wand was doing, you want to minimize the assumptions you have to make to demonstrate the no-go result.  So Wand would naturally want to avoid introducing those extra operators, quote, car, and so on.  Instead, Wand used a Mogensen-Scott encoding, which, given an arbitrary term T, produces a term built up out of lambda-expressions that is not itself reducible at all but which can be queried to extract information about T.
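
To give just the flavor of such an encoding, here is a Scott-style encoding of a plain pair/atom datatype, sketched in Python (only an analogy: Wand's Mogensen-Scott encoding is of λ-terms and also handles binding, which this toy omits).  The encoded term is nothing but a function, built without quote or car, yet it can be queried for its parts.

def enc_atom(a):
    return lambda on_atom, on_pair: on_atom(a)

def enc_pair(x, y):                  # x and y are already-encoded terms
    return lambda on_atom, on_pair: on_pair(x, y)

# Queries are just applications; no primitive accessors are needed.
def car(t):
    return t(lambda a: None, lambda x, y: x)

def atom_value(t):
    return t(lambda a: a, lambda x, y: None)

q = enc_pair(enc_atom('+'), enc_pair(enc_atom(1), enc_atom(2)))
print(atom_value(car(q)))            # prints: +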

Unfortunately this definition of a calculus step doesn't work, exactly because these rules assume → is compatible, and the whole point of Wand's exercise is that subterm rewriting isn't allowed in all contexts.  However, we can define the operational semantics step, which isn't compatible anyway.  For the deterministic operational step, we define an "evaluation context", which is a context that determines where the next redex (reducible expression) can occur.  We have

E   ::=   ⎕ | (ET) | ((λx.T)E) | (fexpr E) | (eval E)
That is, starting from the top of the syntax tree of a term, we can reduce a redex at the top of the tree, or we can descend into the operator of a combination, or into the operand of a λ-expression, or into the body of a fexpr, or into the operand of eval.  Using this, we can define the operational step.
E[((λx.T)V)]   ↦   E[T[x ← V]]
E[(fexpr V)T]   ↦   E[(V encode(T))]
E[(eval T)]   ↦   E[decode(T)]

Preprocessing as a cure for trivialization

Wand credited Albert Meyer for noting the trivialization effect.  Meyer, though, noted that quote-eval (as opposed to fexprs) need not cause trivialization.  (Meyer made this supplementary observation at least as early as 1986; see here, puzzle 3.)

The key to this non-trivialization by quotation is that quotation in traditional Lisp is a "special form", which is to say that the quotation operator, which identifies its argument as not to be evaluated, is fixed at compile-time (that is, before program evaluation begins).  So, given a Lisp source program, we can run a preprocessor over it and rewrite each quoted expression to avoid quoting more than a single symbol at a time.  (For example, one might rewrite  ($quote (+ 1 2))  as  (cons ($symbol-quote +) (cons 1 (cons 2 ()))), a style of rewriting that requires only quotation of individual symbols.)  And then the term-rewriting system doesn't have to have quotation in it at all (only, in the example, symbol-quotation, a much more restricted facility that would only trivialize the formal theory of individual symbols, not of arbitrary source expressions.)
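
A sketch of such a preprocessor, in Python over S-expressions represented as nested lists (the operator names $quote and $symbol-quote follow the example above; the rest of the scaffolding is my own):

def expand_quote(expr):
    """Rewrite each ($quote X) so that only individual symbols are ever quoted."""
    if isinstance(expr, list) and expr[:1] == ['$quote']:
        return quote_datum(expr[1])
    if isinstance(expr, list):
        return [expand_quote(e) for e in expr]
    return expr

def quote_datum(d):
    """Build an expression that constructs d, quoting only symbols."""
    if isinstance(d, list):
        result = []                              # () ends the constructed list
        for item in reversed(d):
            result = ['cons', quote_datum(item), result]
        return result
    if isinstance(d, str):                       # a symbol
        return ['$symbol-quote', d]
    return d                                     # numbers etc. need no quoting

print(expand_quote(['$quote', ['+', 1, 2]]))
# ['cons', ['$symbol-quote', '+'], ['cons', 1, ['cons', 2, []]]]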

Wand's Mogensen-Scott encoding is a similar transformation to this preprocessing of Lisp expressions using cons etc. (setting aside how the encoding treats variables, which is a whole other can of worms because lambda-calculus variables behave very differently from Lisp symbols).

However, the encoding in Wand's paper doesn't prevent trivialization, because it's used during term reduction.  Our Lisp preprocessor eliminated general quotation before term reduction ever started, so that general quotation didn't even have to be represented in the term syntax for the operational semantics.  If you wait until reduction has already started, it's too late for the encoding to matter to trivialization; you already have a trivializing context, and the encoding merely facilitates data access.

This, however, raises a curious possibility:  what if we preprocessed Wand's language by encoding the arguments to all the operators?  Then, instead of encoding an operand when applying a fexpr to it, we decode an operand when applying a non-fexpr to it (or when evaluating it, a case that Wand already provides for — except that when evaluating a term, we would only decode the operators, not their operands, since we don't decode operands until we know they're to be evaluated).

One thing about this:  if we're going to decode an operand of a non-fexpr, we need to keep track of that by somehow changing the operator to indicate we've already done the decoding.  The simplest way to do this is to wrap non-fexprs (as my treatment does) instead of wrapping fexprs (as Wand's does).  There are other, clumsier ways to do it without altering Wand's syntax, but I'll go ahead and change the syntax.

T   ::=   x | (λx.T) | (TT) | (wrap T) | (eval T)
Notice the wrapper is now called wrap instead of fexpr, but the basic binding constructor of combiners is still called λ.  Actually, in my dissertation it's called vau rather than lambda (hence "vau-calculus"), but there's also an interesting point to be made by leaving it as λ (setting aside that unicode doesn't have a character for the way I write a lower-case vau).  It's just arbitrarily chosen syntax, after all... right?

Our evaluation contexts for the operational step change slightly because, beyond the switch from fexpr to wrap, we're now willing for the operational step to reduce an operand whenever the operator is a value, regardless of whether the operator is a fexpr.

E   ::=   ⎕ | (ET) | (VE) | (wrap E) | (eval E)
The operational step rules differ mainly in their treatment of encoding and decoding.  There is no encoding during reduction.  As for decoding, it is only partial.  We decode everything except the operands of combinations.  (Specifying this precisely would be tedious and contribute nothing to the discussion, but since that was already true of the encoding/decoding, we haven't specified it in the first place.)
E[((λx.T)V)]   ↦   E[T[x ← V]]
E[(wrap V)T]   ↦   E[(V partial-decode(T))]
E[(eval T)]   ↦   E[partial-decode(T)]
Note that the first rule, for applying a λ-expression to an operand, hasn't changed at all, even though a λ-expression is now an operative where in Wand's treatment it was an applicative.

Given a source program, before we set the operational step loose on the term, we preprocess the program by fully encoding it, and then partial-decoding it.  Hence the complete absence of encoding in the rules of the operational semantics.  This has the expected, but to me still rather stunning, effect that there is no longer any context in the rewriting system that has to be off-limits to rewriting.  We can therefore now define a valid compatible calculus step relation by simply removing the evaluation contexts from the operational step rules:

((λx.T)V)   →   T[x ← V]
(wrap V)T   →   (V partial-decode(T))
(eval T)   →   partial-decode(T)
I'll go only slightly out on a limb and say that all three master theorems probably hold for this arrangement — completeness, Church-Rosser-ness, and soundness.  (Completeness is obvious by construction.)  But what's really interesting here is that this arrangement both violates Wand's restriction on the term set and produces an equivalence relation that doesn't correspond to the one in his paper.

The term-set restriction is violated because source expressions, which Wand was studying exclusively, are now the things that are input to the preprocessor, and the preprocessor maps those source expressions into a proper subset of the terms in our semantics.  And the contextual equivalence on source expressions that Wand studied is concerned with these source expressions that don't even effectively belong to our term set at all, since they're mapped through the refractive lens of this encode/partial-decode preprocessor.  Yes, our ≅ is no longer a trivial equivalence — but Wand's notion of contextual equivalence of source expressions would require those source expressions to have interchangeable full encodings (in case they both appear as an operand to a fexpr), and that still won't happen unless the source expressions before preprocessing were syntactically identical.  The source-expression equivalence Wand was interested in is still trivial, and our non-trivial ≅ is simply not relevant under the treatment used in his paper.

One last thought about this preprocessed, pseudo-explicit-evaluation variant of Wand's semantics.  The first rule of the calculus step is the call-by-value β-rule.  So this calculus is (if I've not made a goof somewhere) a conservative extension of call-by-value lambda-calculus, and its equational theory includes all the equations in the theory of call-by-value lambda-calculus plus others.

Vau-calculus

Although my pseudo-explicit-evaluation variant of Wand's system does demonstrate some of the principles involved, the encoding/decoding, and the preprocessor, are obstacles to extracting clear insights from it.  Unlike Wand, I'm not trying to prove a no-go theorem; I'm showing that something can be done, so my interest lies in making it clear how to do it, and I'll happily introduce more syntax in order to avoid a complication such as Mogensen-Scott encoding, or preprocessing.

Instead of distinguishing data from to-be-evaluated expressions by means of a deep encoding (in which the entire term is transformed, from its root to its leaves), I simply introduce syntax for fully representing Lisp source expressions as data structures, entirely separate from the syntax for things like combinations and procedures (noting that procedures cannot be expressed by Lisp source expressions:  you write a source expression that would evaluate to a procedure, but until you commit to evaluating it, it's not a procedure but a structure built out of atomic data, symbols, pairs, and the empty list.)

S   ::=   d | ()
T   ::=   S | s | (T . T)
Here, d is any atomic datum (archetypically, a numeric constant), and s is any symbol (which is another kind of datum, and not to be confused with a variable which is not a source-code element at all).  I've separated out the atomic data and the empty list into a separate nonterminal S, mnemonic for "Self-evaluating", because these kinds of terms will be treated separately for purposes of evaluation.

All of the terms constructed by the above rules are data, by which I mean, they're irreducible observables; none of those terms can ever occur on the left side of a calculus step.  (Therefore, given the master theorems, the operational theory of those terms is trivial.)  There are just two syntactic contexts that can ever form a redex; I'll use nonterminal symbol A for these, mnemonic for "active":

S   ::=   d | ()
A   ::=   [eval T T] | [combine T T T]
T   ::=   S | s | (T . T) | A
The intent here (to be realized by calculus step rules) is that  [eval T1 T2]  represents scheduled evaluation of term T1 in environment T2, while  [combine T1 T2 T3]  represents scheduled calling of combiner T1 with parameters T2 in environment T3.

There's also one other class of syntax:  computational results that can't be represented by source code.  There are three kinds of these:  environments (which we will clearly need to support our evaluation rules), operatives, and applicatives.  We'll use "e" for environments and "O" for operatives.

O   ::=   [vau x.T] | ...
S   ::=   d | () | e | O
A   ::=   [eval T T] | [combine T T T]
T   ::=   x | S | s | (T . T) | [wrap T] | A
The syntactic form of operative expressions, O, is interesting in itself, and in fact even the inclusion of traditional λ-expressions (which is what you see there, although the operator is called "vau" instead of "λ") is worthy of discussion.  I figure those issues would distract from the focus of this post, though, so I'm deferring them for some future post.  [Those issues arise in a later post, here.]

Here, then, are the rules of the calculus step (omitting rules for forms of operatives that I haven't enumerated above).

[eval S e]   →   S
[eval s e]   →   lookup(s,e)     if lookup(s,e) is defined
[eval (T1 . T2) e]   →   [combine [eval T1 e] T2 e]
[eval [wrap T] e]   →   [wrap [eval T e]]
[combine [vau x.T] V e]   →   T[x ← V]
[combine [wrap T0] (T1 ... Tn) e]   →   [combine T0 ([eval T1 e] ... [eval Tn e]) e]
In the fully expounded vau-calculus, the additional forms of operatives do things like parsing the operand list, and invoking primitives (such as car, and of course $vau).

Under these rules, every redex —that is, every term that can occur on the left side of a calculus step rule— is either an eval or a combine:  reduction occurs only at points where it is explicitly scheduled via eval or combine; and if it is explicitly scheduled to occur at a point in the syntax tree, that reduction cannot be prevented by any surrounding context.

The calculus step rules are, in essence, the core logic of a Lisp interpreter.
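
To make that concrete, here is a small Python sketch of a reducer driven by the rules above (a toy rendering of my own, not Kernel: naive substitution with no capture avoidance, innermost-first normalization, and only the rule forms listed here).

# Terms: integers (atomic data d), ('nil',) for (), ('sym', name), ('pair', a, b),
# ('var', name), ('vau', name, body), ('wrap', t), ('env', {name: term}),
# and the active forms ('eval', t, e) and ('combine', t, operands, e).
def is_self_evaluating(t):                       # S ::= d | () | e | O
    return isinstance(t, int) or (isinstance(t, tuple)
                                  and t[0] in ('nil', 'env', 'vau'))

def subst(t, x, v):
    """Naive substitution t[x <- v]; no capture avoidance."""
    if isinstance(t, tuple):
        if t[0] == 'var' and t[1] == x:
            return v
        if t[0] == 'vau' and t[1] == x:          # x is shadowed: leave the body alone
            return t
        return tuple(subst(c, x, v) if isinstance(c, tuple) else c for c in t)
    return t

def map_eval(operands, e):
    """Wrap each element of an operand list (a chain of pairs) in an eval."""
    if isinstance(operands, tuple) and operands[0] == 'pair':
        return ('pair', ('eval', operands[1], e), map_eval(operands[2], e))
    return operands                              # ('nil',) ends the list

def step(t):
    """Apply one rule at the root of t, or return None if no rule applies there."""
    if isinstance(t, tuple) and t[0] == 'eval':
        _, term, e = t
        if is_self_evaluating(term):
            return term                                          # [eval S e] -> S
        if isinstance(term, tuple):
            if term[0] == 'sym' and term[1] in e[1]:
                return e[1][term[1]]                             # symbol lookup
            if term[0] == 'pair':
                return ('combine', ('eval', term[1], e), term[2], e)
            if term[0] == 'wrap':
                return ('wrap', ('eval', term[1], e))
    if isinstance(t, tuple) and t[0] == 'combine':
        _, combiner, operands, e = t
        if isinstance(combiner, tuple):
            if combiner[0] == 'vau':     # the rule wants V a value; innermost
                return subst(combiner[2], combiner[1], operands)  # reduction ensures it
            if combiner[0] == 'wrap':
                return ('combine', combiner[1], map_eval(operands, e), e)
    return None

def normalize(t):
    """Reduce subterms first, then the root, until no rule applies."""
    if isinstance(t, tuple):
        t = tuple(normalize(c) if isinstance(c, tuple) else c for c in t)
    r = step(t)
    return normalize(r) if r is not None else t

# Evaluate the source expression (f 1 2), where f names an applicative
# that returns its (evaluated) operand list.
e0 = ('env', {'f': ('wrap', ('vau', 'x', ('var', 'x')))})
prog = ('eval', ('pair', ('sym', 'f'), ('pair', 1, ('pair', 2, ('nil',)))), e0)
print(normalize(prog))               # ('pair', 1, ('pair', 2, ('nil',)))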

This calculus allows us to make much more fluent statements about the evaluation-behavior of terms than mere operational equivalence.  We can say, for example, that two terms T1 and T2 would be indistinguishable if evaluated in any environment, no matter what that environment is (and even though T1 and T2 may not themselves be operationally equivalent).

For all environments e,   [eval T1 e] ≅ [eval T2 e]
Thanks to the definition of ≅, when we say  [eval T1 e] ≅ [eval T2 e], we mean
For all contexts C and observables V,   C[eval T1 e] →* V  iff  C[eval T2 e] →* V
This is evidently not the question Wand's treatment asks with its contextual equivalence.  However, we can also define Wand's relation in this framework.  Let e0 be a standard environment.  Then,
T1∼T2   iff   for all contexts C,  [eval C[T1] e0] ≅ [eval C[T2] e0]
Relation ∼ is Wand's contextual equivalence.  And, indeed, for all source expressions S1 and S2,  S1∼S2  iff  S1 and S2 are syntactically identical.

Finally, note what happens if we omit from this calculus all of the syntax for symbols and lists, along with its associated machinery (notably, eval and environments; the only change this makes to the remaining elements is that we drop the third operand to [combine ...]).

O   ::=   [vau x.T]
S   ::=   d | O
A   ::=   [combine T T]
T   ::=   x | S | A
[combine [vau x.T] V]   →   T[x ← V]
Do you recognize it?  You should.  It's the call-by-value lambda-calculus.

[edit: besides some minor typos here and there, this post had originally omitted the vau-calculus rule for combining an applicative, [combine [wrap T0] ...].]

Thursday, July 18, 2013

Bypassing no-go theorems

This is not at all what I had in mind.
— Albert Einstein, in response to David Bohm's hidden variable theory.

A no-go theorem is a formal proof that a certain kind of theory cannot work.  (The term no-go theorem seems to be used mainly in physics; I find it useful in a more general context.)

A valid no-go theorem identifies a hopeless avenue of research; but in some cases, it also identifies a potentially valuable avenue for research.  This is because in some cases, the no-go theorem is commonly understood more broadly than its actual technical result.  Hence the no-go theorem is actually showing that some specific tactic doesn't work, but is interpreted to mean that some broad strategy doesn't work.  So when you see a no-go theorem that's being given a very broad interpretation, you may do well to ask whether there is, after all, a way to get around the theorem, by achieving what the theorem is informally understood to preclude without doing what the theorem formally precludes.

In this post, I'm going to look at four no-go theorems with broad informal interpretations.  Two of them are in physics; I touch on them briefly here, as examples of the pattern (having explored them in more detail in an earlier post).  A third is in programming-language semantics, where I found myself with a result that bypassed a no-go theorem of Mitchell Wand.  And the fourth is a no-go theorem in logic that I don't actually know quite how to bypass... or even whether it can be bypassed... yet... but I've got some ideas where to look, and it's good fun to have a go at it:  Gödel's Theorem.

John von Neumann's no-go theorem

In 1932, John von Neumann proved that no hidden variable theory can make all the same predictions as quantum mechanics (QM):  all hidden variable theories are experimentally distinguishable from QM.  In 1952, David Bohm published a hidden variable theory experimentally indistinguishable from QM.

How did Bohm bypass von Neumann's no-go theorem?  Simple, really.  (If bypassing a no-go theorem is possible at all, it's likely to be very simple once you see how).  The no-go theorem assumed that the hidden variable theory would be local; that is, that under the theory, the effect of an event in spacetime cannot propagate across spacetime faster than the speed of light.  This was, indeed, a property Einstein wanted out of a hidden variable theory:  no "spooky action at a distance".  But Bohm's hidden variable theory involved a quantum potential field that obeys Schrödinger's Equation — trivially adopting the mathematical infrastructure of quantum mechanics, spooky-action-at-a-distance and all, yet doing it in a way that gave each particle its own unobservable precise position and momentum.  Einstein remarked, "This is not at all what I had in mind."

John Stewart Bell's no-go theorem

In 1964, John Stewart Bell published a proof that all local hidden variable theories are experimentally distinguishable from QM.  For, of course, a suitable definition of "local hidden variable theory".  Bell's result can be bypassed by formulating a hidden variable theory in which signals can propagate backwards in time — an approach advocated by the so-called transactional interpretation of QM, and which, as noted in my earlier post on metaclassical physics, admits the possibility of a theory that is still "local" with respect to a fifth dimension of meta-time.

Mitchell Wand's no-go theorem

In 1998, Mitchell Wand published a paper, The Theory of Fexprs is Trivial.

The obvious interpretation of the title of the paper is that if you include fexprs in your programming language, the theory of the language will be trivial.  When the paper first came out, I had recently hit on my key insight about how to handle fexprs, around which the Kernel programming language would grow, so naturally I scrutinized Wand's paper very closely to be sure it didn't represent a fundamental threat to what I was doing.  It didn't.  I might put Wand's central result this way:  If a programming language has reflection that makes all computational states observable in the syntactic theory of the language, and if computational states are in one-to-one correspondence with syntactic forms, then the syntactic theory of the language is trivial.  This isn't a problem for Kernel because neither of these conditions holds:  not all computational states are observable, and computational states are not in one-to-one correspondence with syntactic forms.  I could make a case, in fact, that in S-expression Lisp, input syntax represents only data:  computational states cannot be represented using input syntax at all, which means both that the syntactic theory of the language is already trivial on conceptual grounds, and also that the theory of fexprs is not syntactic.

At the time I started writing my dissertation, the best explanation I'd devised for why my theory was nontrivial despite Wand was that Wand did not distinguish between Lisp evaluation and calculus term rewriting, whereas for me Lisp evaluation was only one of several kinds of term rewriting.  Quotation, or fexprs, can prevent an operand from being evaluated; but trivialization results from a context preventing its subterm from being rewritten.  It's quite possible to prevent operand evaluation without trivializing the theory, provided evaluation is a specific kind of rewriting (requiring, in technical parlance, a redex that includes some context surrounding the evaluated term).
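To make that concrete, here is quotation written as a fexpr in Kernel (just an illustration, not anything from Wand's paper):

($define! $quote ($vau (x) #ignore x))
($quote (+ 1 2))   ; evaluates to the list (+ 1 2)
($quote 3)         ; evaluates to 3

The operands (+ 1 2) and 3 evaluate to the same value, yet the operand position of $quote tells them apart, so as source expressions they are not contextually equivalent; that much is just Wand's observation.  The theory only trivializes if one further assumes that every term of the calculus is a source expression and every derivation step is an evaluation step.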

Despite myself, though, I was heavily influenced by Wand's result and by the long tradition in which it followed.  Fexprs had been rejected circa 1980 as a Lisp paradigm, in favor of macros.  A rejected paradigm is usually ridiculed in order to rally followers more strongly behind the new paradigm (see here).  My pursuit of $vau as a dissertation topic involved a years-long process of gradually ratcheting up expectations.  At first, I didn't think it would be worth formulating a vau-calculus at all, because of course it wouldn't be well-enough behaved to justify the formulation.  Then I thought, well, an operational semantics for an elementary subset of Kernel would be worth writing.  Then I studied Plotkin's and Felleisen's work, and realized I could provide a semantics for Kernel that would meet Plotkin's well-behavedness criteria, rather than the slightly weakened criteria Felleisen had used for his side-effectful calculus.  And then came the shock, when I realized that the vau-calculus I'd come up with, besides being essentially as well-behaved as Plotkin's call-by-value lambda-calculus, was actually (up to isomorphism) a conservative extension of call-by-value lambda-calculus.  In other words, my theory of fexprs consisted of the entire theory of call-by-value lambda-calculus plus additional theorems.

I was boggled.  And I was naively excited.  I figured, I'd better get this result published, quick, before somebody else notices it and publishes it first — because it's so incredibly obvious, it can't be long before someone else does find it.  Did I say "naively"?  That's an understatement.  There's some advice for prospective graduate students floating around, which for some reason I associate with Richard Feynman (though I could easily be wrong about that [note: a reader points out this]), to the effect that you shouldn't be afraid people will steal your ideas when you share them, because if your ideas are any good you'll have trouble getting anyone to listen to them at all.  In studying this stuff for years on end I had gone so far down untrodden paths that I was seeing things from a drastically unconventional angle, and if even so I had only just come around a corner to where I could see this thing, others were nowhere close to any vantage from which they could see it.

[note: I've since written a post elaborating on this, Explicit evaluation.]

Kurt Gödel's no-go theorem

Likely the single most famous no-go theorem around is Gödel's Theorem.  (Actually, it's Gödel's Theorems, plural, but the common informal understanding of the result doesn't require this distinction — and Gödel's result lends itself spectacularly to informal generalization.)  This is what I'm going to spend most of this post on, because, well, it's jolly good fun (recalling the remark attributed to Abraham Lincoln: People who like this sort of thing will find it just the sort of thing they like).

The backstory to Gödel was that in the early-to-mid nineteenth century, mathematics had gotten itself a shiny new foundation in the form of a formal axiomatic approach.  And through the second half of the nineteenth century mathematicians expanded on this idea.  Until, as the nineteenth century gave way to the twentieth, they started to uncover paradoxes implied by their sets of axioms.

A perennial favorite (perhaps because it's easily explained) is Russell's Paradox.  Let A be the set of all sets that do not contain themselves.  Does A contain itself?  Intuitively, one can see at once that if A contains itself, then by its definition it does not contain itself; and if it does not contain itself, then by its definition it does contain itself.  The paradox mattered for mathematicians, though, for how it arose from their logical axioms, so we'll be a bit more precise here.  The two key axioms involved are reductio ad absurdum and the Law of the Excluded Middle.

Reductio ad absurdum says that if you suppose a proposition P, and under this supposition you are able to derive a contradiction, then not-P.  Supposing A contains itself, we derive a contradiction, therefore A does not contain itself. Supposing A does not contain itself, we derive a contradiction, therefore —careful!— A does not not contain itself. This is where the Law of the Excluded Middle comes in:  A either does or does not contain itself, therefore since it does not not contain itself, it does contain itself.  We have therefore an antinomy, that is, we've proved both a proposition P and its negation not-P (A both does and does not contain itself).  And antinomies are really bad news, because according to these axioms we've already named, if there is some proposition P for which you can prove both P and not-P, then you can prove every proposition, no matter what it is.  Like this:  Take any proposition Q.  Suppose not-Q; then P and not-P, which is a contradiction, therefore by reductio ad absurdum, not-not-Q, and by the Law of the Excluded Middle, Q.
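The explosion argument is short enough to formalize.  Here is a sketch of it in Lean, purely as an illustration (the name explosion is mine; Classical.byContradiction packages exactly the reductio-plus-excluded-middle step just described):

-- From an antinomy (both P and not-P provable), any Q follows.
theorem explosion (P Q : Prop) (hp : P) (hnp : ¬P) : Q :=
  -- Suppose not-Q; the antinomy yields a contradiction, hence not-not-Q,
  -- and the classical step (excluded middle) then gives Q.
  Classical.byContradiction (fun _ => hnp hp)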

When Russell's Paradox was published, the shiny new axiomatic foundations of mathematics were still less than a human lifetime old.  Mathematicians started trying to figure out where things had gone wrong.  The axioms of classical logic were evidently inconsistent, leading to antinomies, and the Law of the Excluded Middle was identified as a problem.

One approach to the problem, proposed by David Hilbert, was to back off to a smaller set of axioms that were manifestly consistent, then use that smaller set of axioms to prove that a somewhat larger set of axioms was consistent.  Although clearly the whole of classical logic was inconsistent, Hilbert hoped to salvage as much of it as he could.  This plan to use a smaller set of axioms to bootstrap consistency of a larger set of axioms was called Hilbert's program, and I'm remarking it because we'll have occasion to come back to it later.

Unfortunately, in 1931 Kurt Gödel proved a no-go theorem for Hilbert's program:  that for any reasonably powerful system of formal logic, if the logic is consistent, then it cannot prove the consistency of its own axioms, let alone its own axioms plus some more on the side.  The proof ran something like this:  For any sufficiently powerful formal logic M, one can construct a proposition A of M that amounts to "this proposition is unprovable".  If A were provable, that would prove that A is false, an antinomy; if not-A were provable, that would prove that A is true, again an antinomy; so M can only be consistent if both A and not-A are unprovable.  But if M were able to prove its own consistency, that would prove that A is unprovable (because A must be unprovable in order for M to be consistent), which would prove that A is true, producing an antinomy, and M would in fact be inconsistent.  Run by that again:  If M can prove its own consistency, then M is in fact inconsistent.

Typically, on completion of a scientific paradigm shift, the questions that caused the shift cease to be treated as viable questions by new researchers; research on those questions tapers off rapidly, pushed forward only by people who were already engaged by those questions at the time of the shift.  So it was with Gödel's results.  Later generations mostly treated them as the final word on the foundations of mathematics:  don't even bother, we know it's impossible.  That was pretty much the consensus view when I began studying this stuff in the 1980s, and it's still pretty much the consensus view today.

Going there

Having been trained to think of Gödel's Theorem as a force of nature, I nevertheless found myself studying it more seriously when writing the theoretical background material for my dissertation.  I found myself discoursing at length on the relationship between mathematics, logic, and computation, and a curious discrepancy caught my eye.  Consider the following Lisp predicate.

($define! A ($lambda (P) (not? (P P))))
Predicate A takes one argument, P, which is expected to be a predicate of one argument, and returns the negation of what P would return when passed to itself.  This is a direct Lisp translation of Russell's Paradox.  What happens when A is passed itself?

The answer is, when A is passed itself, (A A), nothing interesting happens — which is really very interesting.  The predicate attempts to recurse forever:  in theory it simply never terminates, and in practice it eventually fills up all available memory with a stack of pending continuations and halts with an error.  What it won't do is cause mathematicians to despair of finding a solid foundation for their subject.  If asking whether set A contains itself is so troublesome, why is applying predicate A to itself just a practical limit on how predicate A should be used?
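Concretely, the regress under eager evaluation looks like this (a sketch of the reduction behavior, not output from any implementation):

(A A)
; to apply A, evaluate its body:       (not? (A A))
; but not? needs its argument first:   (A A)
; which again requires:                (not? (A A))
; ...and so on; not? is never actually applied, and the pending
; continuations just pile up until memory runs out.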

That question came from my dissertation.  Meanwhile, another question came from the other major document I was developing, the R-1RK.  I wanted to devise a uniquely Lisp-ish variant of the concept of eager type-checking.  It seemed obvious to me that there should not be a fixed set of rules of type inference built into the language; that lacks generality, and is not extensible.  So my idea was this:  In keeping with the philosophy that everything should be first-class, let theorems about the program be an encapsulated type of first-class objects.  And carefully design the constructors for this theorem type so that you can't construct the object unless it's provable.  In effect, the constructors are the axioms of the logic.  Modus ponens, say, is a constructor:  given a theorem P and a theorem P-implies-Q, the modus-ponens constructor allows you to construct a theorem Q.  As desired, there is no built-in inference engine:  the programmer takes ultimate responsibility for figuring out how to prove things.
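To give the flavor of the idea, here is a sketch in Kernel of what such a theorem type might look like.  This is only an illustration, not the design in the R-1RK:  propositions are represented as lists tagged with the string "implies", and the constructors lean on the report's encapsulation facility, make-encapsulation-type, which returns a constructor, a type predicate, and an accessor.

; Only code holding make-theorem can create theorem objects.
($define! (make-theorem theorem? theorem-prop) (make-encapsulation-type))

; Axiom as constructor:  for any proposition p, p implies p.
($define! implies-self
  ($lambda (p) (make-theorem (list "implies" p p))))

; Modus ponens as constructor:  from a theorem of p and a theorem of
; (p implies q), construct a theorem of q.
($define! modus-ponens
  ($lambda (thm-p thm-pq)
    ($let ((p  (theorem-prop thm-p))
           (pq (theorem-prop thm-pq)))
      ($if (equal? (list (car pq) (cadr pq)) (list "implies" p))
           (make-theorem (caddr pq))
           #inert))))   ; a real constructor would signal an error here

Since nothing outside these constructors can forge a theorem object, whatever the constructors guarantee (here, derivability from the chosen axioms) holds of every theorem in the program.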

Of the many questions in designing such a first-class theorem type, one of the early ones has to be:  what system of axioms should we use?  Clearly not classical logic, because we know that would give us an inconsistent system.  This, though, was pretty discouraging, because it seemed I would find myself directly confronting in my design the very sort of problems that Gödel's Theorem says are ultimately unsolvable.

But then I had a whimsical thought; the sort of thing that seems at once not-impossible and yet such a long shot that one can just relax and enjoy exploring it without feeling under pressure to produce a result in any particular timeframe (and yet, I have moved my thinking forward on this over the years, which keeps it interesting).  What if we could find a way to take advantage of the fact that our logic is embedded in a computational system, by somehow bleeding off the paradoxes into mere nontermination?  So that they produce the anticlimax of functions that don't terminate instead of the existential angst of inconsistent mathematical foundations?

Fragments

At this point, my coherent vision dissolves into fragments of tentative insight, stray puzzle pieces I'm still pushing around hoping to fit together.

One fragment:  Alfred Tarski —who fits the aforementioned profile of someone already engaged by the questions when Gödel's results came out— suggested post-Gödel that the notion of consistency should be derived from common sense.  Hilbert's program had actually pursued a formal definition of consistency, as the property that not all propositions are provable; this does have a certain practicality to it, in that the practical difficulty with the classical antinomies was that they allowed all propositions Q to be proven, so that "Q can be proven" ceased to be an informative statement about Q.  Tarski, though, remarked that when a non-mathematician is told that both P and not-P are true, they can see that something is wrong without having to first receive a lecture on the formal consequences of antinomies in interaction with reductio ad absurdum.

So, how about we resort to some common sense, here?  A common-sensical description of Russell's Paradox might go like this:

A is the set of all sets that do not contain themselves.  If A contains itself, then it does not contain itself.  But if it does not contain itself, then it does contain itself.  But if it does contain itself, then it does not contain itself.  But if it does not contain itself...
And that is just what we want to happen:  instead of deriving an antinomy, the reasoning just regresses infinitely.  A human being can see very quickly that this is going nowhere, and doesn't bother to iterate beyond the first four sentences at most (and once they've learned the pattern, next time they'll probably stop after even fewer sentences), but they don't come out of the experience believing that A both does and does not contain itself; they come out believing that there's no way of resolving the question.

So perhaps we should be asking how to make the conflict here do an infinite regress instead of producing a (common-sensically wrong) answer after a finite number of steps.  This seems to call for some sort of deep structural change to how logical reasoning works, possibly not even a modification of the axioms at all but rather of how they are used.  If it does involve tampering with an axiom, the axiom tampered with might well be reductio ad absurdum rather than the Law of the Excluded Middle.

This idea — tampering with reductio ad absurdum rather than the Law of the Excluded Middle — strikes a rather intriguing historical chord.  Because, you see, one of the mathematicians notably pursuing Hilbert's program pre-Gödel did try to eliminate the classical antinomies by leaving intact the Law of the Excluded Middle and instead modifying reductio ad absurdum.  His name was Alonzo Church —you may have heard of him— and the logic he produced had, in retrospect, a curiously computational flavor to it.  While he was at it, he took the opportunity to simplify the treatment of variables in his logic, by having only a single structure that binds variables, which (for reasons lost in history) he chose to call λ.  Universal and existential quantifiers in his logic were higher-order functions, which didn't themselves bind variables but instead operated on functions that did the binding for them.  Quite a clever device, this λ.

Unfortunately, it didn't take many years after Church's publication to show that antinomies arose in his system after all.  Following the natural reflex of Hilbert's program, Church tried to find a subset of his logical axioms that could be proven consistent — and succeeded.  It turned out that if you left out all the operators except λ, you could prove that each proposition P was equivalent to at most one irreducible form.  This result was published in 1936 by Church and one of his students, J. Barkley Rosser, and today is known as the Church–Rosser Theorem (you may have heard of that, too).  In the long run, Church's logic is an obscure historical footnote, while its λ-only subset turned out to be of great interest for computation, and is well-known under the name "λ-calculus".

So evidently this idea of tampering with reductio ad absurdum and bringing computation into the mix is not exactly unprecedented.  Is it possible that there is something there that Alonzo Church didn't notice?  I'd have to say, "yes".  Alonzo Church is one of those people who (like Albert Einstein, you'll recall he came up in relation to the first no-go theorem I discussed) in retrospect appears to have set a standard of intelligence none of us can possibly aspire to — but all such people are limited by the time they live in.  Einstein died years before Bell's Theorem was published.  Heck, Aristotle was clearly a smart guy too, and just think of everything he missed through the inconvenience of being born about two millennia before the scientific revolution.  And Alonzo Church couldn't, by the nature of the beast, have created his logic based on a modern perspective on computation and logic since it was in part the further development of his own work over many decades that has produced that modern perspective.

I've got one more puzzle piece I'm pushing around, that seems like it ought to fit in somewhere.  Remember I said Church's logic was shown to have antinomies?  Well, at the time the antinomy derivation was rather baroque.  It involved a form of the Richard Paradox, which concerns the use of an expression in some class to designate an object that by definition cannot be designated by expressions of that class.  (A version due to G.G. Berry concerns the twenty-one syllable English expression "the least natural number not nameable in fewer than twenty-two syllables".)  The Richard Paradox is naturally facilitated by granting first-class status to functions, as λ-calculus and Lisp do.  But, it turns out, there is a much simpler paradox contained in Church's logic, involving less logical machinery and therefore better suited for understanding what goes wrong when λ-calculus is embedded in a logic.  This is Curry's Paradox.

I'll assume, for this last bit, that you're at least hazily familiar with λ-calculus, so it'll come back to you when you see it.

For Curry's Paradox, we need one logical operator, three logical axioms, and the machinery of λ-calculus itself.  Our one logical operator is the binary implication operator, ⇒.  The syntax of the augmented λ-calculus is then

T   ::=   x | c | (λx.T) | (TT) | (T⇒T)
That is, a term is either a variable, or a constant, or a lambda-expression, or an application, or an implication.  We don't need a negation operator, because we're sticking with the generalized notion of inconsistency as the property that all propositions are provable.  Our axioms assert that certain terms are provable:
  1. For all terms P and Q, if provably P and provably (P⇒Q), then provably Q.    (modus ponens)
  2. For all terms P, provably P⇒P.
  3. For all terms P and Q, provably ((P⇒(P⇒Q))⇒(P⇒Q)).
The sole rewriting axiom of λ-calculus, lest we forget, is the β-rule:
(λx.F)G → F[x ← G]
That is, to apply function (λx.F) to argument G, substitute G for all free occurrences of x in F.

To prove inconsistency, first we need a simple result that comes entirely from λ-calculus itself, called the Fixpoint Theorem.  This result says that for every term F, there exists a term G such that FG = G (that is, every term F has a fixpoint).  The proof works like this:

Suppose F is any term, and let G = (λx.(F(xx)))(λx.(F(xx))), where x doesn't occur in F.  Then G = (λx.(F(xx)))(λx.(F(xx))) → (F(xx))[x ← (λx.(F(xx)))] = F((λx.(F(xx)))(λx.(F(xx)))) = FG.
Notice that although the Fixpoint Theorem apparently says that every F has a fixpoint G, it does not actually require F to be a function at all:  instead of providing a G to which F can be applied, it provides a G from which FG can be derived.  And —moreover— for most F, derivation from G is a divergent computation (G → FG → F(FG) → F(F(FG)) → ...).
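The connection to the earlier (A A) example is direct.  Transcribed into Kernel (just to show the shape; this is not a usable fixpoint combinator):

($define! fix-attempt
  ($lambda (f)
    (($lambda (x) (f (x x)))
     ($lambda (x) (f (x x))))))
; Under eager evaluation, (x x) must be evaluated before f is ever
; applied, so (fix-attempt f) diverges for every f, just like (A A).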

Now we're ready for our proof of inconsistency:  that for every term P, provably P.

Suppose P is any term.  Let Q = λx.(x⇒P).  By the Fixpoint Theorem, let R be a term such that QR = R.  By writing out the definition of Q and then applying the β-rule, we have QR = (λx.(x⇒P))R → (R⇒P), therefore R = (R⇒P).

By the second axiom, provably (R⇒R); but R = R⇒P, so, by replacing the right hand R in (R⇒R) with (R⇒P), provably (R⇒(R⇒P)).

By the third axiom, provably ((R⇒(R⇒P))⇒(R⇒P)); and we already have provably (R⇒(R⇒P)), so, by modus ponens, provably (R⇒P). But R = (R⇒P), so provably R.

Since provably R and provably (R⇒P), by modus ponens, provably P.

Note: I've had to fix errors in this proof twice since publication; there's some sort of lesson there about either formal proofs or paradoxes.
So, why did I go through all this in detail?  Besides, of course, enjoying a good paradox.  Well, mostly, this:  The entire derivation turns on the essential premise that derivation in the calculus, as occurs (oddly backwards) in the proof of the Fixpoint Theorem, is a relation between logical terms — which is to say that all terms in the calculus have logical meaning.

And we've seen something like that before, in my earlier explanation of Mitchell Wand's no-go theorem:  trivialization of theory resulted from assuming that all calculus derivation was evaluation.  So, if we got around Wand's no-go theorem by recognizing that some derivation is not evaluation, what can we do by recognizing that some derivation is not deduction?