Concepts of Programming Languages: Parsing, Interpretation, Compilation, Type Systems, and Programming Paradigms


1. Introduction, Background, and Motivation

1.1. Overview

Much of modern technology, from mobile devices and personal computers to datacenters and entire infrastructures, is programmable. These entities are controlled, operated, maintained, and monitored using a variety of interfaces and languages, which might take the form of graphical user interfaces, libraries, APIs, embedded languages, domain-specific languages, and general-purpose programming languages. Many of these interfaces and languages are defined and designed according to principles and conventions that have been developed over the last several decades. There are three major themes around which this course is organized. Each of the concepts, examples, and problems discussed in this course will relate to one or more of these themes:
  1. Defining and working with programming languages: We will define and become familiar with the basic concepts that underlie the definition and design of common programming languages, and we will learn to identify and implement the common components found in static analysis tools, interpreters, and compilers.
  2. Programming paradigms: You should now be familiar with the imperative and object-oriented programming paradigms. In this course, we will provide a high-level picture of the landscape of different programming paradigms, and will focus in particular on the declarative and functional programming paradigms. We will learn to identify problems that can be solved more efficiently, more concisely, and with greater generality using programming languages that support the functional and declarative paradigms. This includes gaining experience taking advantage of features such as comprehensions, algebraic data types, static type checking, and higher-order functions, and techniques such as recursion.
  3. Languages as solutions: The purpose of a programming language is to provide a data representation and storage format, a communication protocol, or a control interface that can be used across time and space, between different humans and/or devices. Thus, as with any language, any programming language can be viewed as a solution to a representation, communication, or control problem. We will learn to identify when a programming language may be an appropriate solution for a problem, what trade-offs should be considered when deciding whether to design a point solution or a language, and which mathematical concepts and software tools are appropriate for doing so.
Throughout this course, we use formal mathematical conventions and notations developed by the community of computer scientists and mathematicians to communicate with one another about the definitions and implementations of programming languages. Some of these notations are used throughout mathematics and computer science, while others are more obscure and are used primarily by the community of programming language theorists. Throughout your career, you will be required to learn a variety of new notations for a variety of new technologies. This course provides an opportunity to practice learning new notations quickly. This is a valuable skill to possess even if your area of expertise lies or will lie outside the topics covered in this course.

1.2. Background and prerequisites

Basic logic. It is expected that the reader is familiar with the basic concepts of mathematical logic, including formulas, predicates, relational operators, and terms.
Basic set theory. It is expected that the reader is familiar with the notion of a set (both finite and infinite), set size, and the set comprehension notation, e.g.,
{1, 2, 3, 4} is a set
{ x | x ∈ ℤ, x > 0 } = {1, 2, 3, ...}
and the membership and subset relations between elements and sets, e.g.,
1 ∈ { 1, 2, 3 }
{2, 3} ⊆ { 1, 2, 3 }
Programming using an imperative, object-oriented language. It is expected that the reader is familiar with at least one imperative, object-oriented language, such as Java, C++, or Python.

2. Defining Programming Languages

In order to define and reason about programming languages, and in order to write automated tools and algorithms that can operate on programs written using programming languages, we must be able to define formally (i.e., mathematically) what a programming language is and what a program is.

2.1. Sets of character strings

In computer science, one common way to mathematically model a formal language is to introduce a finite set of symbols (an alphabet). A language is then any subset of the set of all strings consisting of symbols from that alphabet.
Definition: An alphabet is a finite set A. We will call the elements a ∈ A of the set characters. The typical alphabet we will use in this course is the set of 128 ASCII characters. However, any finite set can be an alphabet.
Definition: Given an alphabet A, a character string or string is any ordered finite sequence of characters from that alphabet. We will denote the empty string (containing no characters) using the symbol ε, and we will denote the length of a string s using the notation |s|.
Example: The set A = {A,B,C} is an alphabet. AAA, AB, B, CBA, and ε are all examples of strings built using the alphabet A.
Definition: Given an alphabet A, let U be the set of all finite strings that can be built using characters from A (including the empty string, which we will call ε). In other words, we can say:
U = { s | s is a finite string of characters from alphabet A }
Any subset L ⊆ U is a language. That is, any set of character strings (whether the set is finite or infinite) is a language.
Example: The set A = {X, Y, Z} is an alphabet. The set of strings { ε, X, Y, Z } is one example of a language. The infinite set of strings { ε, X, XX, XXX, XXXX, ... } is also a language. The infinite set of strings { YZ, YZYZ, YZYZYZ, ... } is also a language.
Example: The set A = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is an alphabet. The following set of strings can be a language for representing positive integers between 1 and 9999 (inclusive):
L = { s | s is a string of characters from A, s does not begin with 0, 1 ≤ |s| ≤ 4 }
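For instance, a membership test for this particular language can be written directly from the definition. The function below is a minimal sketch (the name in_L is made up for this illustration), assuming the alphabet A of ASCII digits given above.

# A minimal sketch: test whether a character string s belongs to the
# language L of integers between 1 and 9999 defined above.
def in_L(s):
    return (1 <= len(s) <= 4
            and all(c in "0123456789" for c in s)
            and not s.startswith("0"))

# For example: in_L("42") is True, while in_L("0042") and in_L("10000") are False.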

2.2. Regular expressions

If languages are merely subsets of the set of all finite character strings that can be built using the characters in an alphabet, then we can always specify any finite subset (and thus language) by simply writing down every string in the language. However, what if we want to write down the definition of that subset using a shorter representation, or we want to specify an infinitely large subset of character strings (e.g., all strings consisting of one or more copies of the character a: {a, aa, aaa, ...})?
Regular expressions are a notation for precisely and concisely defining certain sets of character strings.
Definition: Given an alphabet A, a regular expression over A can contain any of the characters in A as well as any number of the following special characters:
  • the "or" symbol |, consisting of a single vertical bar;
  • the grouping symbols ( and ), which must always be balanced the same way that parentheses must be balanced in arithmetic expressions;
  • the + and * symbols.
The subset of character strings that a regular expression defines can be determined in the following way:
  • a regular expression r that contains no special characters defines a set containing just one string: {r};
  • if a regular expression r1 defines a set S1 and a regular expression r2 defines a set S2, then the regular expression r1 | r2 defines the set of character strings S1 ∪ S2;
  • if a regular expression r defines a set S, then the regular expression (r) defines the set of character strings S (i.e., there is no difference);
  • if a regular expression r defines a set S, then the regular expression r+ defines the set of character strings S' consisting of all possible concatenations of one or more strings from S;
  • if a regular expression r defines a set S, then the regular expression r* defines the set of character strings S' consisting of all possible concatenations of zero or more strings from S (in particular, S' contains the empty string).
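As an informal sanity check of these rules, we can already use Python's re module (covered in more detail below); re.fullmatch succeeds only if the entire string matches the regular expression. The specific patterns below are illustrative examples rather than part of the definitions above.

import re

# (ab)+ matches one or more concatenated copies of "ab", but not the empty string.
assert re.fullmatch(r"(ab)+", "ababab") is not None
assert re.fullmatch(r"(ab)+", "") is None

# (ab)* additionally matches the empty string.
assert re.fullmatch(r"(ab)*", "") is not None

# a|b matches either "a" or "b".
assert re.fullmatch(r"a|b", "b") is not None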
Example: Assume that our alphabet consists of all alphanumeric characters. For each of the following regular expressions, describe verbally what set of character strings they define:
  • abcd
  • (abcd)+
  • red | green | blue
  • (red | green | blue)+
  • (0+) | (1+)
  • (0 | 1)+
Example: Assume that our alphabet consists of all alphanumeric characters. For each of the following verbal descriptions of sets of character strings, find a regular expression that defines the described set:
  • the set of character strings that consist entirely of vowels;
  • the set of all integers (negative and positive);
  • the set of all arithmetic expressions involving binary numbers, addition, and subtraction;
  • the set of all words that begin with a vowel and end with a vowel;
  • the set of all numbers of even length.
Example: In practice, programming languages that provide libraries of functions and procedures for working with regular expressions also support other special characters. For example, Python regular expressions may contain some of the following special characters:
  • the \(, \) special characters make it possible to include parentheses in expressions in a way that does not cause them to be interpreted as regular expression grouping symbols;
  • the special symbol \s matches a single whitespace character;
  • the special symbol [0-9] matches a single numeric digit;
  • the special symbol [a-z] matches a single lowercase letter;
  • the special symbol [A-Z] matches a single uppercase letter;
  • the special symbol [a-zA-Z] matches a single letter;
  • the special symbol [a-zA-Z0-9] matches a single alphanumeric character.
Example: In the Python programming language, the re module provides functionality for automatically checking whether a string matches a particular regular expression. In order to check whether a string exactly matches a regular expression, it is necessary to wrap the regular expression in parentheses and then add special markers to ensure that the regular expression matches from the beginning of the string to the end of the string.
If a match succeeds, a match object is returned that contains additional information (e.g., the position of the match); otherwise, None is returned.

>>> import re

>>> re.match(r"^(a|b|c)$", "a")   # Succeeds.
<_sre.SRE_Match object; span=(0, 1), match='a'>

>>> print(re.match(r"^(a|b|c)$", "def")) # Fails.
None

>>> re.match(r"^((red|green|blue)+)$", "redgreenblueredblue")
<_sre.SRE_Match object; span=(0, 19), match='redgreenblueredblue'>

>>> re.match(r"^([a-zA-Z0-9]+)$", "redgreenblueredblue")
<_sre.SRE_Match object; span=(0, 19), match='redgreenblueredblue'>

>>> print(re.match(r"^([a-zA-Z0-9]+)$", "!@#$"))
None
        
Fact: Regular expressions are not powerful enough to describe many common languages in which we may be interested. Examples of sets of character strings that cannot be defined using regular expressions include:
  • the set of arithmetic expressions with balanced parentheses;
  • the set of all palindromes;
  • the set of all strings in which the same string is repeated exactly two times in a row.
For languages such as the above, a more powerful tool for describing sets of character strings is needed.

2.3. Sets of token sequences

Unlike human languages, programming languages usually have a relatively small collection of symbol strings (e.g., commands or instructions) that are used to construct programs. Thus, we can adjust the definition of what constitutes a language to account for this.
Definition: Given an alphabet A, a token is a finite, non-empty (usually short) string of characters from that alphabet.
Definition: Given a set of tokens T, let U be the set of all finite sequences that can be built using tokens from T (including the empty sequence, which we will call ε). In other words, we can say:
U = { s | s is a finite sequence of tokens from T }
Any subset L ⊆ U is a language. That is, any set of token sequences (whether the set is finite or infinite) is a language.
Example: Consider the following set of tokens:
T = { true, false, or, and, not, (, ), , }
The set of token sequences that represent valid boolean formulas is a language:
L = { or ( false , and ( true , false ) ),
      and ( true , false ),
      not ( false ),
      true,
      false,
      ... }
If a language is just a subset of the set of all possible token sequences, how do we mathematically specify interesting subsets?

2.4. Language syntax and Backus-Naur Form (BNF) notation

If a language is just a set of finite token strings, then defining a language amounts to defining this set. How can we define this set? By introducing a collection of rules or constraints governing how characters and/or tokens can be assembled to form strings or sequences in the language.
Definition: Given a token set T and a language L consisting of finite sequences of tokens from T, the syntax of L is the set of rules defining exactly which token sequences are in L. These rules are sometimes also called syntactic rules, a formal grammar, or simply a grammar.
There are many possible ways by which we could represent syntactic rules, and these rules can be classified, or stratified, according to their expressive power. A more extensive coverage of this topic is beyond the scope of this course, and is normally presented in courses on the theory of computation and automata. In this course, we will focus on two particular kinds of grammar: regular grammars and context-free grammars. The most common representation for such grammars is Backus-Naur Form, abbreviated henceforward as BNF.
Definition: A grammar definition consists of one or more productions (or production rules). Each production has the following form:
non-terminal ::= terminal_or_non-terminal ... terminal_or_non-terminal
              |  terminal_or_non-terminal ... terminal_or_non-terminal
              |  terminal_or_non-terminal ... terminal_or_non-terminal
The left-hand side (to the left of the ::= symbol) in each production is called a non-terminal. The right-hand side of each production is an unordered collection of choices separated by the | symbol. Each choice is a sequence of non-terminals (each of which must appear on the left-hand side of exactly one production) or terminals (a terminal is a token).
The production rules in a grammar's BNF representation can be viewed either as a way to construct an element of the language (i.e., a token sequence that is a program), or as a way to break a token sequence down piece by piece until nothing is left.
Example: Let T be a token set:
T = { true }
The following is a very simple programming language that contains only a single possible token sequence consisting of the single token true:
program ::= true
In this case, the language is finite and small, so we can actually write it down as a set:
L = { true }
Example: We can extend the language by adding another token:
T = { true, false }
The following BNF grammar definition now contains two choices (each choice is a sequence consisting of a single terminal):
program ::= true | false
This programming language now contains two token sequences:
L = { true, false }
Example: We can extend the language definition further:
T = { true, false, or, and, not, (, ), , }
The following BNF grammar definition now contains five choices (each choice is a sequence consisting of non-terminals and terminals):
program ::= true
         |  false
         |  and ( program , program )
         |  or ( program , program )
         |  not ( program )
This programming language now contains infinitely many finite token sequences:
L = { or ( false , and ( true , false ) ),
      and ( true , false ),
      not ( false ),
      true,
      false,
      ... }
Example: Let us consider another example: a language of positive integers.
T = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
We can define the following grammar:
number ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        |  1 number | 2 number | 3 number | 4 number
        |  5 number | 6 number | 7 number | 8 number | 9 number
However, the above grammar only allows 0 to appear as the final digit of a number (for example, 100 cannot be derived, because every recursive choice begins with a nonzero digit). One way to fix this (there are many other ways) is to introduce more productions into the grammar:
nozero ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
digit ::= 0 | nozero
digits ::= digit | digit digits
number ::= digit | nozero digits
Note that the grammar may contain multiple productions. Any non-terminal defined in a production can appear on the right-hand side of any of the productions.
Example: We can extend the language by adding a production for statements, and allowing a program to be a sequence of one or more statements.
T = { true, false, or, and, not, (, ), ,, print, skip, ; }
The following is the new grammar definition for this language.
program ::= statement
         |  statement program
statement ::= print formula ;
           |  skip ;
formula ::= true
         |  false
         |  not ( formula )
         |  and ( formula , formula )
         |  or ( formula , formula )

3. Parsing

Given a programming language definition, we want to have the ability to operate on programs written in that language using a computer. To do so, we need to convert the character string representations of programs in that programming language into instances of a data structure; each data structure instance would then be a representation of the program as data.

3.1. Concrete and abstract syntaxes

Definition: Given an alphabet, token set, and grammar definition (e.g., represented using BNF notation), we define the concrete syntax to be the set of all character strings that conform to the grammar definition. We call a particular character string that conforms to the grammar definition a concrete syntax instance.
Definition: For a particular programming language definition, we define the abstract syntax to be the set of all data structure instances that correspond to a character string that conforms to the grammar definition for that language. An instance of the abstract syntax is sometimes called a parse tree.
Example: Consider again the language that conforms to the following grammar:
program ::= true | false | not ( program )
         |  and ( program , program ) | or ( program , program )
The following is the character string of one possible program in the language. This character string is an instance of the concrete syntax of the language.

and (or (and (true, false), not(false)), or (true, false))
        
The above character string might be converted into a structured representation of the program within some host language (i.e., the programming language being used to operate on these programs: checking for errors, interpreting, or compiling the program). Below, we present one possible Python representation of the instance of the abstract syntax (i.e., the parse tree) corresponding to the concrete syntax instance above. This representation uses nested Python dictionaries to represent the parse tree, with strings being used to represent node labels and leaves.

{ "And": [
    { "Or": [
        { "And": ["True","False"]}, 
        { "Not": ["False"]}
      ]
    }, 
    { "Or": ["True","False"]}
  ]
}
        

3.2. Lexing (a.k.a. tokenizing) and parsing

Definition: A lexer or tokenizer is an algorithm that converts an instance of the concrete syntax of a language (i.e., a character string that conforms to the grammar definition for the language) into a sequence of tokens.
Example: Consider again the language that conforms to the following grammar:
program ::= true | false | not ( program )
         |  and ( program , program ) | or ( program , program )
The following Python implementation of a tokenizing algorithm for this language uses regular expressions to split a character string into a sequence of individual tokens from the token set.

import re
def tokenize(s):
    # Use a regular expression to split the string into
    # tokens or sequences of zero or more spaces.
    tokens = [t for t in re.split(r"(\s+|true|false|and|or|not|,|\(|\))", s)]

    # Throw out the spaces and return the result.
    return [t for t in tokens if not t.isspace() and not t == ""]
        
Below is an example input and output.

>>> tokenize("and (or (and (true, false), not(false)), or (true, false))")

['and', '(', 'or', '(', 'and', '(', 'true', ',', 'false', ')', ',', 'not', 
 '(', 'false', ')', ')', ',', 'or', '(', 'true', ',', 'false', ')', ')']
        
Definition: A parser is an algorithm that converts a token sequence into an instance of the abstract syntax (i.e., a parse tree).
The tokenizer and parser are then composed to transform a character string into a parse tree.
character string (concrete syntax) → lexer/tokenizer → token sequence → parser → parse tree (abstract syntax)
Often, the tokenizer and parser are together called a parser. In situations where this can cause confusion, we will refer to the actual process that converts token sequences into parse trees as the parsing algorithm.
character string (concrete syntax) → lexer/tokenizer → token sequence → parsing algorithm → parse tree (abstract syntax)
(the lexer/tokenizer together with the parsing algorithm constitute the parser)
The BNF representation of a grammar can be converted into a parsing algorithm that turns a token sequence into an abstract syntax data structure instance (i.e., a parse tree). How easily this can be done depends on the properties of the grammar.
Fact: Given a BNF representation of a grammar, if for every production in the grammar, each choice begins with a unique terminal (i.e., a terminal that is not the first terminal in any other choice within that production), then we say the grammar is in LL(1) form, and we can implement a predictive recursive descent parsing algorithm to parse any token sequence that conforms to this grammar (note that these are only sufficient conditions for the grammar to be in LL(1) form; less restrictive conditions also exist).
A predictive recursive descent parser can effectively run in linear time; it decomposes the token sequence from left to right while assembling a parse tree.
Example: Consider again the language that conforms to the following grammar:
program ::= true | false | not ( program )
         |  and ( program , program ) | or ( program , program )
The following Python implementation of a predictive recursive descent parsing algorithm for this language builds a parse tree using the nested dictionary representation seen in a previous example. This recursive algorithm takes a single argument: a sequence of tokens. It returns two results: a parse tree, and the remainder of the token sequence. Note that the order in which the choices in the production are being handled is not determined by the order of the choices in the production. The choices in a production are unordered; any parser implementation that captures all the possible choices conforms to the grammar definition.

def parse(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])

    if tokens[0] == 'false':
        return ('False', tokens[1:])

    if tokens[0] == 'not' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ')':
          return ({'Not':[e1]}, tokens[1:])

    if tokens[0] == 'or' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ',':
            (e2, tokens) = parse(tokens[1:])
            if tokens[0] == ')':
                return ({'Or':[e1,e2]}, tokens[1:])

    if tokens[0] == 'and' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ',':
            (e2, tokens) = parse(tokens[1:])
            if tokens[0] == ')':
                return ({'And':[e1,e2]}, tokens[1:])
        
Below is an example input and output. Notice that no tokens are left in the token sequence once the result is returned.

>>> import json
>>> (tree, tokens) =\
      parse(tokenize(\
        "and (or (and (true, false), not(false)), or (true, false))"\
        )\
      )
>>> print(json.dumps(tree, sort_keys=True, indent=2))
{
  "And": [
    {
      "Or": [
        {
          "And": [
            "True", 
            "False"
          ]
        }, 
        {
          "Not": [
            "False"
          ]
        }
      ]
    }, 
    {
      "Or": [
        "True", 
        "False"
      ]
    }
  ]
}
>>> print(tokens)
[]
        
Fact: If we relax the conditions on the grammar definition that make it LL(1) by dropping the requirement that the first terminal of each choice within a production must be unique, we can no longer use a predictive recursive descent parsing algorithm, because the first terminal of a choice no longer uniquely determines which choice within a production should be used to continue parsing a token sequence. However, as long as every choice within every production starts with a terminal, we can implement a backtracking recursive descent parsing algorithm to parse token sequences in the language.
Example: In the previous example, the grammar only allowed prefix logical operators. Suppose we wanted to parse token sequences for a grammar with infix operators.
program ::= true | false | not ( program )
         |  ( program and program ) | ( program or program )
It is no longer possible to implement a predictive recursive descent parser. We must instead employ backtracking, and we must also keep track of whether we have consumed all the tokens.

def parse(tmp, top = True):
    tokens = tmp[0:]
    if tokens[0] == 'true':
        tokens = tokens[1:]
        if not top or len(tokens) == 0:
            return ('True', tokens)

    tokens = tmp[0:]
    if tokens[0] == 'false':
        tokens = tokens[1:]
        if not top or len(tokens) == 0:
            return ('False', tokens)

    tokens = tmp[0:]
    if tokens[0] == 'not' and tokens[1] == '(':
        tokens = tokens[2:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == ')':
                tokens = tokens[1:]
                if not top or len(tokens) == 0:
                    return ({'Not':[e1]}, tokens)

    tokens = tmp[0:]
    if tokens[0] == '(':
        tokens = tokens[1:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'or':
                tokens = tokens[1:]
                r = parse(tokens, False)
                if not r is None:
                    (e2, tokens) = r
                    if tokens[0] == ')':
                        tokens = tokens[1:]
                        if not top or len(tokens) == 0:
                            return ({'Or':[e1,e2]}, tokens)

    tokens = tmp[0:]
    if tokens[0] == '(':
        tokens = tokens[1:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'and':
                tokens = tokens[1:]
                r = parse(tokens, False)
                if not r is None:
                    (e2, tokens) = r
                    if tokens[0] == ')':
                        tokens = tokens[1:]
                        if not top or len(tokens) == 0:
                            return ({'And':[e1,e2]}, tokens)
        
The above code is far too repetitive. However, we can take the parts that repeat and turn them into a loop body that loops over all the possible choices in the production.

def parse(tmp, top = True):
    seqs = [\
        ('True', ['true']), \
        ('False', ['false']), \
        ('Not', ['not', '(', parse, ')']), \
        ('And', ['(', parse, 'and', parse, ')']), \
        ('Or', ['(', parse, 'or', parse, ')']) \
        ]

    # Try each choice sequence.
    for (label, seq) in seqs:
        tokens = tmp[0:]
        ss = [] # To store matched terminals.
        es = [] # To collect parse trees from recursive calls.
        
        # Walk through the sequence and either
        # match terminals to tokens or make
        # recursive calls depending on whether
        # the sequence entry is a terminal or
        # parsing function.
        for x in seq:
            if type(x) == type(""): # Terminal.

                if tokens[0] == x: # Does terminal match token?
                    tokens = tokens[1:]
                    ss = ss + [x]
                else:
                    break # Terminal did not match token.

            else: # Parsing function.

                # Call the parsing function recursively.
                r = x(tokens, False)
                if not r is None:
                    (e, tokens) = r
                    es = es + [e]
                else:
                    # The recursive call failed to parse; this choice
                    # cannot match, so move on to the next one.
                    break

        # Check that we got either a matched token
        # or a parse tree for each sequence entry.
        if len(ss) + len(es) == len(seq):
            if not top or len(tokens) == 0:
                return ({label:es} if len(es) > 0 else label, tokens)
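Below is one possible interaction with this generalized implementation, assuming the tokenize() function defined earlier is also in scope.

>>> parse(tokenize("(true and (false or not(true)))"))
({'And': ['True', {'Or': ['False', {'Not': ['True']}]}]}, [])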
        

3.3. More parsing examples and building parsers for other classes of grammar

Exercise: Consider the following grammar definition:
command ::= start
         |  suspend
         |  wake
         |  terminate
         |  reboot
         |  if condition then command
         |  if condition then command else command
         |  repeat command
         |  while condition then command
condition ::= power low
           |  temperature high
           |  temperature very low
           |  user input
Suppose we want to implement a parser for the above grammar. A partial implementation of a parser for the above grammar (what has been presented in lecture so far) is provided below.

def command(tokens, top = True):
    seqs = {\
       ("Start", "start"),\
       ("Suspend", "suspend"),\
       ("Wake", "wake"),\
       ("Terminate", "terminate"),\
       ("Reboot", "reboot")\
       }
    for (key, value) in seqs:
        if tokens[0] == value:
            tokens = tokens[1:]
            if not top or len(tokens) == 0:
                return (key, tokens)
    
    if tokens[0] == 'repeat':
        r = command(tokens[1:], False)
        if not r is None:
            (e, tokens) = r
            if not top or len(tokens) == 0:
                return ({"Repeat": [e]}, tokens)

    if tokens[0] == 'while':
        r = condition(tokens[1:], False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'then':
                 r = command(tokens[1:], False)
                 if not r is None:
                     (e2, tokens) = r           
                     if not top or len(tokens) == 0:
                         return ({"While": [e1,e2]}, tokens)

def condition(tokens, top = True):
    seqs = [\
       ("PowerLow", ["power", "low"]),\
       ("TempHigh", ["temperature", "high"]),\
       ("UserInput", ["user", "input"])\
       ]
    for (key, seq) in seqs:
        if tokens[0] == seq[0]:
            if tokens[1] == seq[1]:
                tokens = tokens[2:]
                if not top or len(tokens) == 0:
                    return (key, tokens)
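Below is one possible interaction with this partial implementation (the token sequences are written out by hand here rather than produced by a tokenizer).

>>> command(['repeat', 'wake'])
({'Repeat': ['Wake']}, [])
>>> command(['while', 'power', 'low', 'then', 'reboot'])
({'While': ['PowerLow', 'Reboot']}, [])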
        
It is possible to replace the condition() function with a generic parser for base cases that contain an arbitrary number of terminals in each sequence. Any call to condition(...) can then be replaced with a call to parseBaseCases(seqsCondition, ...).

seqsCondition = [\
       ("PowerLow", ["power", "low"]),\
       ("TempHigh", ["temperature", "high"]),\
       ("TempVeryLow", ["temperature", "very", "low"]),\
       ("UserInput", ["user", "input"])\
       ]

def parseBaseCases(seqs, tokens, top = True):
    for (key, seq) in seqs:
         # Check if token sequence matches sequence.
        i = 0
        for terminal in seq:
            # Stop if we run out of tokens or a token does not match.
            if i < len(tokens) and terminal == tokens[i]:
                i = i + 1
            else:
                break
                
        # Check if the previous loop succeeded.
        if i == len(seq):
            tokens = tokens[len(seq):]
            if not top or len(tokens) == 0:
                return (key, tokens)
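Below is one possible interaction with this generic base-case parser, using the seqsCondition choices defined above.

>>> parseBaseCases(seqsCondition, ['temperature', 'very', 'low'])
('TempVeryLow', [])
>>> parseBaseCases(seqsCondition, ['power', 'low', 'wake'], False)
('PowerLow', ['wake'])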
        
Example: Consider the following grammar definition:
formula ::= true
         |  false
         |  not formula
         |  ( formula )
         |  formula and formula
         |  formula or formula
Implementing a naive recursive descent parser, predictive or backtracking, would not work for this grammar. Consider what would happen if we ran the following code on any input:
 
def parse(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])

    if tokens[0] == 'false':
        return ('False', tokens[1:])

    # ...

    # Recursive call, but no tokens consumed.
    (e1, tokens) = parse(tokens)
    if tokens[0] == 'and':
        (e2, tokens) = parse(tokens[1:])
        return ({'And':[e1,e2]}, tokens)

    # ...
        
The above code never terminates: the recursive call is made without consuming any tokens, so after a large number of recursive calls the Python interpreter raises an error indicating that the maximum recursion depth has been exceeded. To get around this problem, one option is to perform left-recursion elimination on the grammar so that no choice within a production begins with a recursive reference to the non-terminal being defined.
formula ::= left and formula
         |  left or formula
         |  left
left ::= true
      |  false
      |  not formula
      |  ( formula )
The above is usually acceptable if the operator, such as and, is commutative or right-associative. However, if the operator is left-associative, the above strategy would not necessarily lead to a correct parse tree. Can you explain why? In such a scenario, other techniques would need to be employed. If the operator is indeed commutative or right-associative, however, the parser implementation could then look something like the following:
 
def formula(tmp):
    tokens = tmp[0:]
    (e1, tokens) = left(tokens)
    if tokens[0] == 'and':
        (e2, tokens) = formula(tokens[1:])
        return ({'And':[e1,e2]}, tokens)

    # ...

def left(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])

    if tokens[0] == 'false':
        return ('False', tokens[1:])

    # ...
        
Note that performing left-recursion elimination does not necessarily change the definition of the concrete syntax, the definition of the abstract syntax of a language (i.e., the set of parse trees), or the meaning of a language. Left-recursion elimination is a strategy for converting a grammar definition into a definition that is easier to implement using a recursive descent parser; it is an implementation strategy. Thus, the resulting parse trees should contain no record or indication that left-recursion elimination was performed on the grammar before the parser was implemented.
Also note that the resulting parser implementation formula() uses backtracking because the formula production rule has three choices that all begin with the same non-terminal left. It would be possible to perform left factoring on the formula production rule as follows, which would then make it possible to implement formula() as a predictive parser:
formula ::= left rest
rest ::= and formula
      |  or formula
      |  ε   (the empty sequence of tokens)
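A minimal sketch of what such a predictive implementation might look like is shown below; it assumes the left() function from the previous example is in scope and folds the rest production into formula() rather than defining a separate function for it.

def formula(tokens):
    (e1, tokens) = left(tokens)

    # The rest production: and formula, or formula, or the empty choice.
    if len(tokens) > 0 and tokens[0] == 'and':
        (e2, tokens) = formula(tokens[1:])
        return ({'And': [e1, e2]}, tokens)
    if len(tokens) > 0 and tokens[0] == 'or':
        (e2, tokens) = formula(tokens[1:])
        return ({'Or': [e1, e2]}, tokens)

    # The empty choice: no additional tokens are consumed.
    return (e1, tokens)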
4. Semantics, Evaluation, and Interpretation

4.1. Formally defining an abstract syntax

While the abstract syntax of a programming language is the set of data structure instances that represent programs, it is also useful to model the abstract syntax as a mathematical object in its own right. This makes it possible to define formally (i.e., mathematically, independently of any implementation language, platform, operating system, and so on) the meaning of a language, and how programs can be run. It also makes it possible to formally define analyses on programs, as well as properties of transformations over programs.
Typically, an abstract syntax definition will closely match a concrete syntax definition, except that there is no need to specify the token set, and redundant syntactic constructs and syntactic sugar will be eliminated. For example, parentheses are a syntactic convention that is not necessary if one is working with parse trees because parse trees are already grouped implicitly due to the tree structure of the abstract syntax instance.
Example: The following is an example of an abstract syntax definition.
formula ::= true | false | not formula | formula and formula | formula or formula
Notice the omission of the parentheses. Also, there is no need to be concerned with the fixity (i.e., infix vs. prefix) of binary operators, since this definition is not being used to implement a parsing algorithm.

4.2. Denotational semantics and operational semantics

The abstract syntax of a programming language is a set of symbolic objects (i.e., the abstract syntax instances, such as programs) that have no meaning unless a meaning is assigned to them. There are two ways in which we can assign meaning to these objects. We can choose to assign a mathematical object to each abstract syntax instance, or we can define a collection of deterministic transformations that specify how we can convert each abstract syntax instance into another abstract syntax instance. Roughly speaking, assigning a mathematical object to each program tells us what it means, while specifying symbolic conversion rules tells us how to run the program.
Definition: The denotational semantics of an abstract syntax is a mapping from the set of abstract syntax instances A to some mathematical set of objects D, which is often called a semantic domain or just domain. The mapping from A to D itself is often denoted using the circumfix Oxford double bracket notation [[ ... ]], and the definition of a denotational semantics of A (i.e., the definition of this mapping [[ ... ]]) is often specified using a collection of inference rules.
Definition: Let A be an abstract syntax of a programming language. The operational semantics of A is a set of rules that specify how each abstract syntax instance a ∈ A can be transformed into some kind of object that represents the result of performing the computation described by a.
There are distinct kinds of operational semantics, such as small-step semantics and big-step semantics (also known as natural semantics). In this course, the operational semantics we will be using is closest to big-step semantics, with some simplifications. We adopt this particular approach to defining operational semantics because it corresponds more closely to a functional, recursive implementation of an algorithm for interpreting programs.

The operational semantics for a programming language represents a contract, a set of constraints, or a set of requirements that any algorithm implementing an interpreter or compiler of that language must respect in order to be considered correct. However, whoever builds an implementation of an interpreter or compiler for a language has full freedom and flexibility in how they choose to implement it in all other respects, as long as it conforms to the operational semantics. This is what makes it possible to introduce optimizations into interpreters and compilers (such as optimizations to improve performance or reduce use of memory) while preserving the correctness of their behavior. The operational semantics for a programming language is defined using a collection of inference rules.
Definition: An inference rule is a notation used within mathematics and computer science to define relationships between mathematical facts and formulas. Each inference rule consists of a horizontal line with zero or more logical formulas above the line and one logical formula below the line. The logical formulas above the line are called premises, and the formula below the line is called the conclusion.
[Name-of-Inference-Rule]
  premise        premise
  -------------------------
  conclusion

[Example]
  sun is out        sky is clear
  --------------------------------
  it is not raining
An inference rule can be interpreted as a portion of a larger algorithm. The premises specify the recursive calls, or calls to other functions, that may need to be made, and the results that are obtained from those invocations. The conclusion specifies what inputs can be handled by that inference rule, and what outputs should be returned given those inputs and the premises.
[Algorithm-Case]
  input1 ⇓ output1        input2 ⇓ output2
  -------------------------------------------
  input0 ⇓ output0
Note that in the above, input1 and input2 may depend on input0, and output0 may depend on output1 and output2. In other words, one could rewrite an inference rule in the following way using natural language:
[Algorithm-Case]
  invoking this or another algorithm with input1 yields output1
  ----------------------------------------------------------------
  given input0, and if the premises above are true, then output output0

4.3. Evaluation of expressions

The abstract syntax, or a subset of the abstract syntax, of a programming language is considered to be a set of expressions if the language's operational semantics do not impose any restrictions on the order in which a computation can operate on the expression to produce a result, called a value. This is possible because expressions usually represent operations that have no side effects (side effects include emitting output to a screen, reading or writing files, looking at a clock, controlling a device, and so on).
Definition: Let A be an abstract syntax of a programming language, and let V be some subset of A that we will call the value set. This set will represent the possible meanings of parse trees in A, and it will represent the possible results of evaluating parse trees in A. Values that can occur directly within abstract syntax trees of the language (e.g., numeric and string literals, constructors, and so on) are usually called constants.
Definition: An evaluation algorithm converts any abstract syntax tree that represents an expression into an abstract syntax tree that represents a value.
expressions (abstract syntax) → evaluation algorithm → values (abstract syntax)
Example: Define the abstract syntax according to the following grammar, with A consisting of all formula abstract syntax instances, and V = {true, false} consisting of all value abstract syntax instances:
formula ::= value | not formula | formula and formula | formula or formula
value ::= true | false
The following is a definition of an operational semantics for this language.
[True]
  --------------
  true ⇓ true

[False]
  ----------------
  false ⇓ false

[Not-True]
  f ⇓ true
  -----------------
  not f ⇓ false

[Not-False]
  f ⇓ false
  ----------------
  not f ⇓ true

[And-True-True]
  f1 ⇓ true        f2 ⇓ true
  ----------------------------
  f1 and f2 ⇓ true

[And-True-False]
  f1 ⇓ true        f2 ⇓ false
  -----------------------------
  f1 and f2 ⇓ false

[And-False-True]
  f1 ⇓ false       f2 ⇓ true
  -----------------------------
  f1 and f2 ⇓ false

[And-False-False]
  f1 ⇓ false       f2 ⇓ false
  -----------------------------
  f1 and f2 ⇓ false

[Or-True-True]
  f1 ⇓ true        f2 ⇓ true
  ----------------------------
  f1 or f2 ⇓ true

[Or-True-False]
  f1 ⇓ true        f2 ⇓ false
  ----------------------------
  f1 or f2 ⇓ true

[Or-False-True]
  f1 ⇓ false       f2 ⇓ true
  ----------------------------
  f1 or f2 ⇓ true

[Or-False-False]
  f1 ⇓ false       f2 ⇓ false
  -----------------------------
  f1 or f2 ⇓ false
Example: The rules in the above example are numerous, and having this many rules in a definition becomes impractical (especially with more complex operators and language constructs). To address this, we can define a meta-language on the set of values V = {true, false}.
¬ true = false
¬ false = true
true ∧ true = true
true ∧ false = false
false ∧ true = false
false ∧ false = false
true ∨ true = true
true ∨ false = true
false ∨ true = true
false ∨ false = false
The definition then becomes more concise.
[True]
  --------------
  true ⇓ true

[False]
  ----------------
  false ⇓ false

[Not]
  f ⇓ v
  ----------------
  not f ⇓ ¬ v

[And]
  f1 ⇓ v1        f2 ⇓ v2
  --------------------------
  f1 and f2 ⇓ v1 ∧ v2

[Or]
  f1 ⇓ v1        f2 ⇓ v2
  --------------------------
  f1 or f2 ⇓ v1 ∨ v2
We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this simple language. The functions vnot, vand, and vor correspond to ¬, ∧, and ∨, respectively, in the operational semantics.

def vnot(v):
    if v == 'True':  return 'False'
    if v == 'False': return 'True'

def vand(v1, v2):
    if v1 == 'True'  and v2 == 'True':  return 'True'
    if v1 == 'True'  and v2 == 'False': return 'False'
    if v1 == 'False' and v2 == 'True':  return 'False'
    if v1 == 'False' and v2 == 'False': return 'False'

def vor(v1, v2):
    if v1 == 'True'  and v2 == 'True':  return 'True'
    if v1 == 'True'  and v2 == 'False': return 'True'
    if v1 == 'False' and v2 == 'True':  return 'True'
    if v1 == 'False' and v2 == 'False': return 'False'

Node = dict
Leaf = str

def evaluate(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Not':
                f = children[0]
                v = evaluate(f)
                return vnot(v)
            elif label == 'And':
                # Notice that we can make the recursive calls
                # below in any order we like; the order of the
                # calls is not specified by the operational
                # semantics.
                f1 = children[0]
                v1 = evaluate(f1)
                f2 = children[1]
                v2 = evaluate(f2)
                return vand(v1, v2)
            elif label == 'Or':
                f2 = children[1]
                v2 = evaluate(f2)
                f1 = children[0]
                v1 = evaluate(f1)
                return vor(v1, v2)
    elif type(e) == Leaf:
        if e == 'True':
            return 'True'
        if e == 'False':
            return 'False'
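Below is one possible interaction with this interpreter; the parse tree argument is written out by hand here, but it could also be produced by a parser like the ones in the previous section.

>>> evaluate({'And': [{'Or': ['True', 'False']}, {'Not': ['False']}]})
'True'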
        
In fact, we could convert the above so that it returns a Python value directly, rather than an abstract syntax tree corresponding to a value.

def evaluate(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Not':
                f = children[0]
                v = evaluate(f)
                return not v # Use the Python not operator.
            elif label == 'And':
                f1 = children[0]
                v1 = evaluate(f1)
                f2 = children[1]
                v2 = evaluate(f2)
                return v1 and v2 # Use the Python and operator.
            elif label == 'Or':
                f2 = children[1]
                v2 = evaluate(f2)
                f1 = children[0]
                v1 = evaluate(f1)
                return v1 or v2 # Use the Python or operator.
    elif type(e) == Leaf:
        if e == 'True':
            return True # Use the Python True constant.
        if e == 'False':
            return False # Use the Python False constant.
        

4.4. Execution of sequences of statements

The abstract syntax, or a subset of the abstract syntax, of a programming language is considered to be a set of statements if the operational semantics for that syntax impose a sequential ordering on the computation they represent. This usually corresponds to the notion of a single point of control flow traversing the program's abstract syntax as the program is executed (which is a good model of a computer architecture that contains only a single processor that can perform one instruction at a time).

Languages that have statements and a notion of control flow are called imperative. Languages that have only expressions and no statements can be called pure or functional, although that terminology will often have additional meanings in such a context that we will go over in subsequent sections.
Definition: Let A be an abstract syntax of a programming language, and let V be some subset of A. Let O be a set of possible outputs (e.g., a text terminal or file).
  • In this course, we will denote a possible output in O using the symbol o.
  • We denote the empty output using o0.
  • We denote an output o preceded by the value v as v;o (that is, v is the first value in the output v;o, and o is the rest).
  • Given two outputs o1 and o2, o1;o2 represents two outputs in sequence: o1 followed by o2.
Definition: An execution algorithm converts any abstract syntax tree that represents a sequence of statements into an abstract syntax tree that represents an output.
statement sequence (abstract syntax) → execution algorithm → output (record of side effects)
Thus, we can represent an output in Python using, for example, a list of values.
Example: We present an operational semantics for a simple programming language with statements.
program ::= print formula ; program | end ;
formula ::= value | not formula | formula and formula | formula or formula
value ::= true | false
Recall that O represents the set of outputs (e.g., to the terminal), with o0 representing the empty output and v;o representing a terminal o with the value v as its first output.
[Statement-Print]
  p ⇓ o        f ⇓ v
  ----------------------
  print f ; p ⇓ v;o

[Statement-End]
  ----------------
  end ; ⇓ o0

[Formula-True]
  --------------
  true ⇓ true

[Formula-False]
  ----------------
  false ⇓ false

[Formula-Not]
  f ⇓ v
  ----------------
  not f ⇓ ¬ v

[Formula-And]
  f1 ⇓ v1        f2 ⇓ v2
  --------------------------
  f1 and f2 ⇓ v1 ∧ v2

[Formula-Or]
  f1 ⇓ v1        f2 ⇓ v2
  --------------------------
  f1 or f2 ⇓ v1 ∨ v2
We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this simple language. We represent outputs in O as Python lists of values, because lists preserve order. Note that we refer to the implementation of evaluate() from a previous example.

def execute(s):
    if type(s) == Leaf:
        if s == 'End':
            return []
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(f) # Implemented elsewhere.
                o = execute(p)
                return [v] + o
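Below is one possible interaction with this implementation, corresponding to the program print true ; print false ; end ; (the parse tree is written out by hand, and evaluate() is assumed to be the version that returns value abstract syntax trees, i.e., the strings 'True' and 'False').

>>> execute({'Print': ['True', {'Print': ['False', 'End']}]})
['True', 'False']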
        
As before, we can actually use the Python print() function and turn this into an interpreter that uses the Python runtime system directly to interpret the program. However, we must be careful about where we put the call to print(): it must be invoked in a way that conforms to a pre-order traversal of the syntax tree.

def execute(s):
    if type(s) == Leaf:
        if s == 'End':
            return []
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(f)
                print(v) # Must appear before recursive call.
                o = execute(p)
                return [v] + o
        
In an operational semantics for a language with variables and variable assignment, a data structure called an environment is usually used to represent what variables are in scope, as well as the values assigned to variables. We introduce a mathematical definition of an environment so that we can employ it within our operational semantics definitions.
Definition: Given a programming language definition with a set of values V and a set of possible variable names X (as defined by the concrete syntax), an environment or context is a mapping from a subset of the variables X to the set of values V.
  • In this course, we will usually denote an environment using the symbol Σ.
  • Given a variable name x ∈ X, we denote by Σ(x) the value that the environment Σ assigns to x.
  • An empty environment (one that does not map any variable names to any values) is denoted using Σ0.
  • Given an environment Σ, a variable x ∈ X, and a value v ∈ V, we denote by Σ ⊎ {x ↦ v} the extension of the environment Σ with a new mapping from x to v (i.e., if Σ' = Σ ⊎ {x ↦ v} then Σ'(x) = v).
We can update our diagram for an execution algorithm.
environment, statement sequence (abstract syntax) → execution algorithm → output (record of side effects)
Within implementations of interpreters and compilers, environments will often be represented with finite maps, associative lists, dictionaries, or stacks, depending on the structure and semantics of the language.
Example: We present an operational semantics for a simple programming language with variables and variable assignment.
program ::= print formula ; program | assign variable := formula ; program | end ;
formula ::= value | variable | not formula | formula and formula | formula or formula
variable ::= any valid variable name as defined by the concrete syntax
value ::= true | false
Notice the use of an environment Σ, also known as a context, that maps variable names to values.
[Statement-Assign]
  Σ ⊎ {x ↦ v}, p ⇓ o        Σ, f ⇓ v
  --------------------------------------
  Σ, assign x := f ; p ⇓ o

[Statement-Print]
  Σ, p ⇓ o        Σ, f ⇓ v
  ----------------------------
  Σ, print f ; p ⇓ v;o

[Statement-End]
  -------------------
  Σ, end ; ⇓ o0

[Formula-True]
  ------------------
  Σ, true ⇓ true

[Formula-False]
  --------------------
  Σ, false ⇓ false

[Formula-Not]
  Σ, f ⇓ v
  -------------------
  Σ, not f ⇓ ¬ v

[Formula-And]
  Σ, f1 ⇓ v1        Σ, f2 ⇓ v2
  --------------------------------
  Σ, f1 and f2 ⇓ v1 ∧ v2

[Formula-Or]
  Σ, f1 ⇓ v1        Σ, f2 ⇓ v2
  --------------------------------
  Σ, f1 or f2 ⇓ v1 ∨ v2

[Formula-Variable]
  Σ(x) = v
  --------------
  Σ, x ⇓ v
We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this language. We represent environments as dictionaries in which variable names are labels, and these labels map to values in the language. The functions vnot, vand, and vor are the same as those in a previous example.

def evaluate(env, e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Not':
                f = children[0]
                v = evaluate(env, f)
                return vnot(v)
            elif label == 'And':
                f1 = children[0]
                v1 = evaluate(env, f1)
                f2 = children[1]
                v2 = evaluate(env, f2)
                return vand(v1, v2)
            elif label == 'Or':
                f2 = children[1]
                v2 = evaluate(env, f2)
                f1 = children[0]
                v1 = evaluate(env, f1)
                return vor(v1, v2)
            elif label == 'Variable':
                x = children[0]
                if x in env:
                    return env[x]
                else:
                    print(x + " is unbound.")
                    exit()
    elif type(e) == Leaf:
        if e == 'True':
            return 'True'
        if e == 'False':
            return 'False'

def execute(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return []
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(env, f)
                o = execute(env, p)
                return [v] + o
            if label == 'Assign':
                children = s[label]
                x = children[0]['Variable'][0]
                f = children[1]
                p = children[2]
                v = evaluate(env, f)
                env[x] = v
                o = execute(env, p)
                return o
        
If an abstract syntax parse tree node for a statement can have two or more children that are also statement sequences, and if the operational semantics of the language require that statements must be executed consecutively (i.e., not in parallel), then the execution algorithm must also return a new copy of the environment, modified to reflect the variable assignments that might have occurred in the statement sequence.
environment, statement sequence (abstract syntax) → execution algorithm → environment, output (record of side effects)
Example: We present an operational semantics for an extension of a language in a previous example.
program ::= print formula ; program
         |  assign variable := formula ; program
         |  repeat twice { program } program
         |  end ;
An example of a program in this language is presented below:

print true;
repeat twice {
  print false;
  end;
}
print true;
end;
        
The operational semantics inference rules are the same as those in a previous example. However, we must modify the inference rules for statements. In particular, the predicate ⇓ that defines execution of statements must also specify the new environment that exists after a statement is executed. This is to capture the fact that a program may have variable assignments in a block of code that may be executed multiple times (or, in other languages with branching constructs, may be executed conditionally).
[Statement-Assign]
  Σ1 ⊎ {x ↦ v}, p ⇓ Σ2, o        Σ1, f ⇓ v
  ---------------------------------------------
  Σ1, assign x := f ; p ⇓ Σ2, o

[Statement-Print]
  Σ1, p ⇓ Σ2, o        Σ1, f ⇓ v
  ----------------------------------
  Σ1, print f ; p ⇓ Σ2, v;o

[Statement-Repeat-Twice]
  Σ1, p1 ⇓ Σ2, o1        Σ2, p1 ⇓ Σ3, o2        Σ3, p2 ⇓ Σ4, o3
  -----------------------------------------------------------------
  Σ1, repeat twice { p1 } p2 ⇓ Σ4, o1;o2;o3

[Statement-End]
  ----------------------
  Σ, end ; ⇓ Σ, o0
We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this language. The implementation of evaluate() is the same as the one in a previous example, while the implementation of execute() from that example must be modified so that it also returns an updated version of the environment data structure.

def execute(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return (env, [])
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(env, f)
                (env, o) = execute(env, p)
                return (env, [v] + o)
            if label == 'Assign':
                children = s[label]
                x = children[0]['Variable'][0]
                f = children[1]
                p = children[2]
                v = evaluate(env, f)
                env[x] = v
                (env, o) = execute(env, p)
                return (env, o)
            if label == 'RepeatTwice':
                children = s[label]
                body = children[0]
                rest = children[1]
                env1 = env
                (env2, o1) = execute(env1, body)
                (env3, o2) = execute(env2, body)
                (env4, o3) = execute(env3, rest)
                return (env4, o1 + o2 + o3)
        
Example: We consider inference rules for a few common looping constructs.
[Statement-While-False]
 Σ1, p2 ⇓ Σ2, o1           Σ1, e ⇓ false 
 Σ1, while e { p1 } p2 ⇓ Σ2, o1 
[Statement-While-True]
  Σ1, p1 ⇓ Σ2, o1           Σ2, while e { p1 } p2 ⇓ Σ3, o2           Σ1, e ⇓ true  
 Σ1, while e { p1 } p2 ⇓ Σ3, o1;o2 
[Statement-Until-True]
 Σ1, p2 ⇓ Σ2, o1           Σ1, e ⇓ true 
 Σ1, until e { p1 } p2 ⇓ Σ2, o1 
[Statement-Until-False]
  Σ1, p1 ⇓ Σ2, o1           Σ2, until e { p1 } p2 ⇓ Σ3, o2           Σ1, e ⇓ false  
 Σ1, until e { p1 } p2 ⇓ Σ3, o1;o2 
The following code snippet illustrates one way to implement the [Statement-While-True] and [Statement-While-False] rules.

if label == 'While':
    [cond, body, rest] = s[label]
    env1 = env
    v = evaluate(env1, cond)
    if v == 'False':
        (env2, o1) = execute(env1, rest)
        return (env2, o1)
    if v == 'True':
        (env2, o1) = execute(env1, body)
        (env3, o2) = execute(env2, {'While':[cond, body, rest]})
        return (env3, o1 + o2)
        
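The [Statement-Until-True] and [Statement-Until-False] rules could be handled in the same style. The sketch below is not part of the original listing; it assumes an abstract syntax node labeled 'Until' with children [cond, body, rest], mirroring the 'While' node above.

# A sketch of the until rules, assuming an 'Until' node shaped like the
# 'While' node above.
if label == 'Until':
    [cond, body, rest] = s[label]
    v = evaluate(env, cond)
    if v == 'True':
        # [Statement-Until-True]: skip the body and execute the rest.
        (env2, o1) = execute(env, rest)
        return (env2, o1)
    if v == 'False':
        # [Statement-Until-False]: execute the body, then re-execute the loop.
        (env2, o1) = execute(env, body)
        (env3, o2) = execute(env2, {'Until': [cond, body, rest]})
        return (env3, o1 + o2)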

[link]  4.5. Interpreters

An interpreter is an algorithm that invokes evaluation and execution algorithms as required by the operational semantics for a language.
[Diagram: the interpreter takes a program (abstract syntax) and invokes the evaluation algorithm and the execution algorithm to produce output and/or value(s).]
Often, the term interpreter will refer to the entire toolchain of components that make it possible to convert the character string (i.e., concrete syntax) representation of a program in the source language into value(s) and/or a record of the side effects of a program. In these cases, we will refer to the individual component that performs interpretation as the interpretation algorithm or interpretation component.
[Diagram: a character string (concrete syntax) is converted by a parser into abstract syntax, which the interpretation algorithm turns into output and/or value(s); together, the parser and the interpretation algorithm constitute the interpreter.]
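As a minimal sketch (assuming a parse() function, as developed in earlier sections, that maps a character string to the nested-dictionary abstract syntax, together with the environment-threading execute() defined above), an interpretation toolchain might be assembled as follows.

# A minimal sketch of an interpreter toolchain; parse() is an assumed helper
# that converts concrete syntax into abstract syntax.
def interpret(source):
    ast = parse(source)                # concrete syntax -> abstract syntax
    (env, output) = execute({}, ast)   # run under an initially empty environment
    return output                      # the record of side effects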
It is important to consider the relationship between the interpreted language and the underlying language used to implement the interpreter.
[Diagram: the interpreted language's abstract syntax trees, boolean values, and outputs over time might be represented in one implementation language by nested dictionaries, strings, and ordered lists, and in other implementation languages by other data structures (e.g., unbounded integers, boolean values, or lists of bounded integers).]

[link]  5. Compilation

Compilation involves translating or transforming one programming language (sometimes called the source language) into another (often called the target language). It is often the case that the target language is a more "low-level" language that more closely corresponds to the capabilities of a physical (i.e., electronic) device, such as the central processor (CPU), memory, and other devices that constitute a modern computer. However, today, compilation could also refer to any transformation from one programming language to another. For example, it might be necessary to compile PHP to ASP, which are both server-side web application scripting languages, in order to migrate software from one operating system to another.

[link]  5.1. History, background, and context

One way to summarize a very thin slice of the history of general-purpose computers and their relationship with programming languages is to consider how the increasingly complex structure of a computer both enabled and required increasingly abstract programming languages and programming language features.

The three major categories of concept involved are the program, the computing device, and the physical laws upon which the computing device is based. As long as the physical laws being used are predictable (i.e., consistent across time and space), we can build computing devices that behave the same way even if they are operated in different places or at different times. We can then interpret what that device does in some meaningful way (e.g., as arithmetic operations). To perform different kinds of tasks using the device, we would need some kind of way to change its initial configuration (i.e., to "program" it).
[Diagram: three layers — the program (e.g., input/configuration, instructions, etc.), the computing device (e.g., difference engine, modern CPU, quantum computer, etc.), and the physical laws upon which the device is based (mechanics, electricity, chemistry, biology, quantum mechanics, etc.).]
Until the 1940s, all artifacts that could be considered computing devices (e.g., an abacus, Babbage's difference engine, etc.) were single-purpose machines that could perform only a fixed operation (though possibly on different inputs, which corresponded to different initial configurations of the device). In the early 1940s, a number of computers were built (such as the ENIAC and Colossus) that could be physically reconfigured to perform different operations. Arguably, the "program" was the machine's physical configuration. In the late 1940s, computers were developed (or improved, as in the case of ENIAC) that could operate according to a program stored in a memory device. At this point, both the program and the input were stored in a single, unified memory component; the program told the central processing unit (CPU) what operations to perform, and what modifications to make to memory. This characteristic is part of what's known as the Von Neumann architecture.
[Diagram: the Von Neumann architecture — a single memory holds both the input and the program, and is connected to a central processing unit (CPU).]
Subsequent improvements to computing device components in the 1950s and onward primarily served to make the components smaller, cheaper, faster, and more reliable while still conforming to the Von Neumann architecture. However, the programs controlling these devices continued to become more and more sophisticated.

Machine languages were developed in the 1940s (particularly in the very late 1940s) and the 1950s. These consisted of small collections of instructions that corresponded to operations that could be performed by the CPU. A program was then a list of machine language instructions, and the abstraction this supplied to the programmer was that of a block of memory indexed by an integer range, and a single processor that could perform a sequence of operations on memory. The diagram below provides a stylized depiction of the view from a user's or programmer's perspective.
[Diagram: a machine program, written in a machine language, runs directly on the hardware (CPU and memory).]
In part because using machine languages to specify the desired behavior of computers was time-consuming and error-prone, more human-friendly and legible languages were developed in the mid and late 1950s, such as FORTRAN, LISP, and COBOL. These programs were mechanically transformed into machine language programs using a compiler, which was itself a program. For example, after a human wrote a FORTRAN program, it could be compiled once into a machine language program by running the compiler once. The machine language program this produced could then be run any number of times or copied and reused later.
[Diagram: a FORTRAN program is translated by the FORTRAN compiler into a machine program in a machine language, which runs on the hardware (CPU and memory).]
Special-purpose applications could be written using either a machine language or a high-level language.
[Diagram: an application written as a high-level program is compiled into a machine program in a machine language, which runs on the hardware (CPU and memory).]
Once it became possible for non-experts and multiple users to use a computer with a variety of different components beyond the memory and CPU (due to advances in many areas, including reductions in cost and the availability of better interface devices and special-purpose components), operating systems were developed to make it possible to manage multiple applications that might be run by different users and might employ different components. However, all the applications would need to be either written using a machine language, or written in a language that could be compiled into a machine language. The operating system would then manage how these machine programs could use the various components and resources available to the computer (including the resource of time).
[Diagram: users A and B each run several applications; each application is a high-level program compiled into a machine program in a machine language, and the operating system mediates access by the machine programs to the hardware (CPUs, memory, input and output devices, and other devices).]
One could collapse this into a relatively simple diagram by omitting the portions of the diagram that do not represent components that either appear to the user to be running, or are actually running on the device.
[Diagram: the collapsed view — applications running on an operating system, which runs on the hardware.]
The above diagram might represent a single computer in a single physical location running a single operating system and a few applications. Today, this picture is much more complicated. While many devices are still captured by the above diagram, in many other cases the actual abstractions provided are much richer and more diverse, and they reflect the diverse ecosystem of capabilities and components available to a user, programmer, or application.
[Diagram: a modern layered stack — applications and web applications run on browsers, platforms, and APIs; web services and virtual machines run on distributed platforms, web servers, and virtualization layers; controllers and operating systems run on a host operating system; and everything runs on server hardware (a web/cloud server) consisting of many nodes.]
Many applications running on a device actually provide capabilities that are assembled from a collection of underlying layers of potentially interdependent components. Every component in every layer must provide an interface and a collection of abstractions that capture its capabilities and communicate them to users (and expose them to the components on the layers above). Effectively, every component in every layer has a programming language, and assembling multilayered applications may require many compilation steps at various layers.

[link]  5.2. Transformations between abstract syntaxes

Definition: Given two abstract syntaxes A and B (i.e., both A and B are sets of parse trees that conform to two distinct grammars), an algorithm that always terminates and maps parse trees in A to parse trees in B is called a compilation algorithm, and it is said to compile programs in the language corresponding to A into programs in the language corresponding to B. Note that while a compiler must always terminate on all inputs, it is allowed to return an error on some inputs.
Note the similarities and distinctions between an interpreter and a compiler. Both interpreters and compilers map some set of abstract syntax trees to some other set of abstract syntax trees (in the case of interpreters, the resulting abstract syntax trees are usually values that are simpler than programs). However, an interpreter might not terminate (e.g., if the program it is interpreting does not terminate when it is evaluated and/or executed), while a compiler must terminate. Furthermore, an interpreter must keep track of both values and side effects (such as output), while a compiler is simply transforming one program into another program without concerning itself with what the program might do when evaluated/executed.
[Diagram: a compiler maps source language abstract syntax to target language abstract syntax.]
Often, the term compiler will refer to the entire toolchain of components that make it possible to convert the character string (i.e., concrete syntax) representation of a program in the source language into the concrete syntax representation of the program in the target language. In these cases, we will refer to the individual component that performs compilation as the compilation algorithm or compilation component.
[Diagram: source language concrete syntax is converted by a parser into source language abstract syntax; the compilation algorithm produces target language abstract syntax; and a linker (together with libraries) produces the target language concrete syntax. The entire toolchain constitutes the compiler.]
Furthermore, a compilation algorithm will often consist of multiple stages. Each stage may perform some specific task, such as simplifying the features used in a program (e.g., eliminating procedures and replacing them with labels and branching statements, or eliminating variable names and replacing them with explicit memory addresses), or it may perform optimizations (e.g., elimination of unreachable code blocks). The intermediate representation (IR) between each stage might be a distinct abstract syntax data structure.
[Diagram: a multi-stage compilation algorithm — source language abstract syntax is transformed by compilation algorithm A into intermediate representation (IR) #1, by compilation algorithm B into IR #2, and by compilation algorithm C into target language abstract syntax.]
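For instance, a pipeline of this kind might be assembled by simple function composition. The sketch below assumes hypothetical stage functions stageA(), stageB(), and stageC(), each mapping one abstract syntax (or IR) to the next; they are not defined in these notes.

# A sketch of a multi-stage compilation algorithm built from hypothetical
# stage functions.
def compileProgram(sourceAst):
    ir1 = stageA(sourceAst)   # e.g., eliminate procedures using labels and branches
    ir2 = stageB(ir1)         # e.g., replace variable names with memory addresses
    return stageC(ir2)        # e.g., emit target language abstract syntax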

[link]  5.3. Machine languages and bytecodes

Definition: Given a model of a computer that consists of a single central processing unit (CPU) that can perform some basic operations, and a block of memory (accessible to the CPU) that is organized using some kind of sequential and/or hierarchical addressing scheme (e.g., positive integers in some range represent memory locations), a machine language, assembly language, or machine code is an imperative programming language that consists of instructions that can directly be executed by a single CPU.
Note that in modern systems and networks, machine code might not be executed by a CPU directly. Instead, a program might be running on an operating system that might itself be running on a virtual machine, which is being simulated by a program that runs on a physical computer but emulates the behavior of a (perhaps very different) CPU and memory.
Definition: A bytecode is a low-level programming language that is similar to a machine language (i.e., the instructions in the language are primarily for manipulating memory and performing basic arithmetic and logical operations) but is not platform-dependent (i.e., dependent on a particular CPU model), and itself needs to be either interpreted (e.g., by a run-time system or virtual machine) or compiled into a machine language.
Note that there is a distinction between machine code and bytecode, though this distinction is increasingly harder to discern in practice with the advent of widespread virtualization. Usually, machine code is said to be executed when it is run on its native CPU, and simulated if it is being executed by some software application that emulates the behavior of a CPU; bytecode is typically executed by an interpreter, which might be called a run-time system or a virtual machine (depending on the features and capabilities of the interpreter, some of which will be discussed in more detail in subsequent sections). Examples of bytecodes include the Java bytecode (interpreted by the Java Virtual Machine, or JVM) and the Python bytecode.
Example: Consider the following target language, which has many characteristics of a machine language or bytecode. A program consists of an ordered sequence of instructions that can manipulate a block of memory addresses.
program ::= instruction program | instruction
instruction ::= label label
              | branch label address
              | branch label
              | set address number
              | copy address address
              | add
label ::= any valid label
address ::= any positive integer or zero
number ::= any positive integer or zero
We can implement a simple Python simulator (or virtual machine, if the above is a bytecode) for this language.

# Simulate a machine running the program represented by the string s.
def simulate(s):
    # Split the program into instructions, ignoring blank lines and extra whitespace.
    instructions = [l.split() for l in s.split("\n") if l.strip() != '']
    mem = {0: 0}
    control = 0
    while control < len(instructions):
        inst = instructions[control]
        if inst[0] == 'label':
            pass
        # For two arguments, test the memory address in the second
        # argument; given one argument, branch unconditionally.
        if inst[0] == 'branch':
            if len(inst) == 2 or mem[int(inst[2])] != 0:
                control = instructions.index(['label', inst[1]])
                continue
        if inst[0] == 'set':
            mem[int(inst[1])] = int(inst[2])
        if inst[0] == 'copy':
            mem[int(inst[2])] = mem[int(inst[1])]
        if inst[0] == 'add':
            mem[0] = mem[1] + mem[2]
        control = control + 1
    return mem[0]
        
A program in this language might look as follows:

>>> p = '''
  label start
  set 1 2
  set 2 3
  branch skip 1
  set 2 40
  label skip
  add
  '''
>>> simulate(p)
5
        
Example: In this example, the target language will be a slightly more complex machine language than the one presented in a previous example. A program consists of an ordered sequence of instructions that can manipulate a block of memory addresses. The abstract syntax for the language is presented below.
program ::= instruction program | instruction
instruction ::= label label
              | goto label
              | branch label address
              | jump address
              | set address number
              | copy
              | add
label ::= any alphanumeric character string
address ::= any integer
number ::= any integer
When a machine is running a program, the state of the machine consists of a single block of memory (indexed by integers: both negative and positive) and a single integer representing the machine's position in the sequence of instructions (this integer represents which instruction in the program currently has control, i.e., which instruction is being executed by the CPU). The memory has seven special memory locations, which are defined below:
  • memory address 0: the address of the result of an add operation;
  • memory address 1: the address of the first input to an add operation;
  • memory address 2: the address of the second input to an add operation;
  • memory address 3: the address containing the "from" address for a copy operation;
  • memory address 4: the address containing the "to" address for a copy operation;
  • memory address 5: the output buffer (set to −1 after every step);
  • memory address 6: the address of the program's control index.
The meaning of each instruction is described below:
  • label l: there is no effect and control is passed to the next instruction;
  • goto l: control immediately moves to the program instruction label l;
  • branch l a: control immediately moves to the program instruction label l if memory address a contains a non-zero integer;
  • jump a: if the integer i is stored in memory address a, control moves immediately to the ith instruction in the program;
  • set a n: memory address a is set to the integer n;
  • copy: if a is the integer stored in memory address 3, and b is the integer stored in memory address 4, then the contents of memory address a are copied to memory address b;
  • add: the contents of memory location 1 and memory location 2 are added and stored in memory location 0.
We can implement a simple Python simulator for this language.
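One possible implementation is sketched below; it is only a sketch, and it assumes the seven special memory locations described above (in particular, it resets the output buffer at address 5 to -1 after recording anything written to it, and keeps address 6 updated with the control index).

# A sketch of a simulator for the extended machine language.
def simulate(s):
    instructions = [l.split() for l in s.split("\n") if l.strip() != '']
    mem = {i: 0 for i in range(7)}
    mem[5] = -1                          # The output buffer starts empty.
    output = []
    control = 0
    while control < len(instructions):
        mem[6] = control                 # Expose the control index at address 6.
        inst = instructions[control]
        if inst[0] == 'goto':
            control = instructions.index(['label', inst[1]])
            continue
        elif inst[0] == 'branch' and mem.get(int(inst[2]), 0) != 0:
            control = instructions.index(['label', inst[1]])
            continue
        elif inst[0] == 'jump':
            control = mem.get(int(inst[1]), 0)
            continue
        elif inst[0] == 'set':
            mem[int(inst[1])] = int(inst[2])
        elif inst[0] == 'copy':
            # Copy from the address stored in 3 to the address stored in 4.
            mem[mem[4]] = mem.get(mem[3], 0)
        elif inst[0] == 'add':
            mem[0] = mem.get(1, 0) + mem.get(2, 0)
        # Record anything placed in the output buffer, then reset it to -1.
        if mem[5] != -1:
            output.append(mem[5])
            mem[5] = -1
        control = control + 1
    return (mem, output)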

As an example of how this language might be used, suppose we want to write a program in this language to copy the contents of memory address 10 to memory address 20. The machine language program presented below would have this behavior:

set 3 10
set 4 20
copy
        
In fact, we can write a Python program that can generate an instance of the above for any two addresses. Notice that the copy() function below can be viewed as a macro that generates a specific machine language program.

def copy(frm, to):
   return [\
      'set 3 ' + str(frm),\
      'set 4 ' + str(to),\
      'copy'\
   ]
        
Suppose we have another macro increment() that takes a single argument, a memory address, and generates a machine program that increments the value at the specified memory address. We could then implement the following macro, which will set an entire region of memory to a particular constant.

def setAllToInRegion(constant, startAddr, endAddr):
  # Memory location 8 contains the constant.
  # Memory location 9 contains the counter.
  return (
      [
        'set 8 ' + str(constant),
        'set 9 ' + str(startAddr),
        'label loop'
      ]
      # Set second argument (destination) of copy.
      + copy(9, 4)
      + [
        # Set first argument (source) for copy.
        'set 3 8',
        'copy'
      ]
      # Increment the counter.
      + increment(9)
      # If counter is endAddr, break out of loop;
      # otherwise, go back to the beginning of the loop.
      # We check by adding -endAddr to the counter.
      # First argument of add.
      + copy(9, 1)
      + [
        # Second argument of add.
        'set 2 -' + str(endAddr),
        'add',
        'branch loop 0'
      ]
  )
        
Running setAllToInRegion() on specific inputs generates a machine language program for a particular region.

>>> setAllToInRegion(3, 10, 100)
['set 8 3', 
 'set 9 10', 
 'label loop', 
 'set 3 9', 
 'set 4 4', 
 'copy', 
 'set 3 8', 
 'copy', 
 '# increment() done as homework.',
 'set 3 9', 
 'set 4 1', 
 'copy', 
 'set 2 -100', 
 'add', 
 'branch loop 0']
        
The version below works with the web-based machine language simulator (but it only initializes a region up to address 15). Note that #increment# is a macro provided by the simulator and not part of the machine language itself; you cannot use #increment# when working on an assignment because the simulator provided to you will not support such macros.

set 8 3
set 9 10
label loop
set 3 9
set 4 4
copy
set 3 8
copy
#increment# 9
set 3 9
set 4 1
copy
set 2 -15
add
branch loop 0

Example: Suppose we need to copy a value 123 from memory address 9 to memory address 12. However, we do not know that the source address is memory address 9. Instead, 9 is stored in memory address 0 at the beginning of our machine language program. Can we still copy from memory address 9?

set 0 9
set 9 123
set 3 0
set 4 3
copy
set 4 12
copy

We do so in the above machine language program by first copying from memory address 0 to the memory address corresponding to the source parameter for the copy instruction (notice that we are copying an address, but the machine language program does not know this and simply copies the integer in memory address 0 to memory address 3).
Now, memory address 3 contains the source address we want (which is 9), so we set the destination address for the copy instruction by putting 12 into memory address 4. The final copy instruction now copies the value 123 from memory address 9 to memory address 12, as desired.
Example: Suppose the value we want to copy to address 0 is in address 13. The value is actually 999 but we do not know this. We also do not know that it is in address 13. We only know that memory address 7 contains the address that holds the address that holds the value 999 (in reality, address 7 holds the address 11, and address 11 holds the address 13, and address 13 holds the value 999). In other words, what if there are two layers of indirection between the value we want and the address we have?
We can set this situation up using our machine language by setting memory addresses 7, 11, and 13 to appropriate values. How can we then write a machine language program that copies the value from address 13 to address 0 when all it knows about is address 7?

set 7 11
set 11 13
set 13 999
set 3 7
set 4 3
copy
set 4 3
copy
set 4 0
copy

We do so in the above machine language program by first copying the address 11 contained in address 7 to the source copy parameter address 3. We can then issue another copy command that copies 13 from memory address 11 to the source copy parameter address 3. Finally, once we've set up a situation in which we can copy from address 13, we can issue one more copy command to copy the integer value 999 from address 13 to address 0.

[link]  5.4. Compiling expressions to a machine language or bytecode

In order to compile a high-level language to a machine language that only provides the CPU and indexed memory abstractions, it is necessary to transform the high-level language's parse tree into a series of instructions that manipulate memory in a way that appears identical to the behavior of the high-level program when it is being interpreted.

Since machine language programs describe how to manipulate memory, it is a good idea to plan how the machine program will utilize the memory available to it to store intermediate results it needs to compute while the program is running.
Definition: The region of memory that a machine program uses to store values (i.e., results of computing expressions) is called the heap.
As a machine program continues running and computing more results, it grows the heap by adding values to new memory locations. To keep track of this at the time of compilation, the compiler might maintain a counter that stores the current "top" of the heap. When a machine program is running, it may itself maintain a heap pointer to keep track of the last item stored on the heap, primarily so that it knows where to look for more space if it wants to store more values in memory.
Example: Suppose we are starting with the following source language:
formula ::= true | false | not formula | formula and formula | formula or formula
How can we implement a compiler that converts formulas in the source language into sequences of instructions in the target language in the previous example?

We can approach this problem by constructing a function compileFormula() that takes as its input the parse tree, a variable heap that represents the highest address in memory that contains a computed value (in our case, 1 can represent true and 0 can represent false), and a counter fresh that is used to generate unique label names. The function returns a list of machine language instructions that computes the formula's value, the memory address where that value would be stored if anyone actually executed (or simulated) the machine language program, and the updated label counter.

def compileFormula(f, heap = 3, fresh = 0):
    if type(f) == Leaf:
        if f == 'True':
            # Find a new memory address on the heap.
            heap = heap + 1
            # Generate instruction to store the integer representing 'True' on the heap.
            inst = 'set ' + str(heap) + ' 1'
            # Return the instruction list and "top" of the heap.
            return ([inst], heap, fresh)
        if f == 'False':
            # Find a new memory address on the heap.
            heap = heap + 1
            # Generate instruction to store the integer representing 'False' on the heap.
            inst = 'set ' + str(heap) + ' 0'
            # Return the instruction list and "top" of the heap.
            return ([inst], heap, fresh)
    if type(f) == Node:
        for label in f:
            children = f[label]
            if label == 'Not':
                # Compile the subtree f to obtain the list of
                # instructions that computes the value represented
                # by f.
                f = children[0]
                (insts, heap, fresh) = compileFormula(f, heap, fresh)
                # Generate more instructions to change the memory
                # location in accordance with the definition of the
                # 'Not' operation.
                instsNot = \
                   ["branch setZero" + str(fresh) + " " + str(heap),\
                    "set " + str(heap) + " 1",\
                    "branch finish" + str(fresh),\
                    "label setZero" + str(fresh),\
                    "set " + str(heap) + " 0",\
                    "label finish" + str(fresh)\
                   ]
                return (insts + instsNot, heap, fresh + 1)
            if label == 'Or':
                # Compile the two subtrees and get the instructions
                # lists as well as the addresses in which the results
                # of computing the two subtrees would be stored if someone
                # were to run those machine instructions.
                f1 = children[0]
                f2 = children[1]
                (insts1, heap2, fresh1) = compileFormula(f1, heap, fresh)
                (insts2, heap3, fresh2) = compileFormula(f2, heap2, fresh1)
                # Increment the heap counter so we store the
                # result of computing 'Or' in a new location.
                heap4 = heap3 + 1
                # Add instructions that compute the result of the
                # 'Or' operation.
                instsOr = \
                   ["copy " + str(heap2) + " 1",\
                    "copy " + str(heap3) + " 2",\
                    "add",\
                    "branch setOne" + str(fresh2) + " 0",\
                    "branch finish" + str(fresh2),\
                    "label setOne" + str(fresh2),\
                    "set 0 1",\
                    "label finish" + str(fresh2),\
                    "copy 0 " + str(heap4)\
                   ]
                return (insts1 + insts2 + instsOr, heap4, fresh2 + 1)
        
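The source grammar also includes and, which the listing above does not handle. One possible way to handle it in the same style, sketched below (and not part of the original listing), uses nested branches; the label names checkSnd, setOne, and finish are generated using the fresh counter.

# A sketch of the 'And' case for compileFormula(), using nested branches.
if label == 'And':
    f1 = children[0]
    f2 = children[1]
    (insts1, heap2, fresh1) = compileFormula(f1, heap, fresh)
    (insts2, heap3, fresh2) = compileFormula(f2, heap2, fresh1)
    # Store the result of the 'And' operation in a new heap location.
    heap4 = heap3 + 1
    instsAnd = \
       ["branch checkSnd" + str(fresh2) + " " + str(heap2),
        "set " + str(heap4) + " 0",
        "branch finish" + str(fresh2),
        "label checkSnd" + str(fresh2),
        "branch setOne" + str(fresh2) + " " + str(heap3),
        "set " + str(heap4) + " 0",
        "branch finish" + str(fresh2),
        "label setOne" + str(fresh2),
        "set " + str(heap4) + " 1",
        "label finish" + str(fresh2)
       ]
    return (insts1 + insts2 + instsAnd, heap4, fresh2 + 1)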

[link]  5.5. Compiling statement sequences and procedures to a machine language or bytecode

Example: Suppose we want to compile the following program in an imperative programming language that supports while loops to a machine language program in our machine language.

x := true; 
while x { 
  x := false; 
}
print false;
        
If our heap begins at memory address 8 and true is represented as 1, then our compiler would first emit the instruction set 8 1 and would store inside the environment a mapping {x ↦ 8}. It would then use this information to compile the loop's condition expression and the assignment statement in the body of the loop.

set 8 1
label startLoop
branch bodyLoop 8
goto endLoop
label bodyLoop
set 9 0
set 3 9
set 4 8
copy
goto startLoop
label endLoop
set 5 0

Definition: The region of memory that a machine program uses to store information that makes it possible to simulate calls to named procedures is called the stack.
In its simplest form, the stack is simply a region of memory that holds integers corresponding to program locations. The top of the stack always corresponds to the program location to which control must move once the procedure that is currently being executed (or simulated) reaches the end of its body.
Example: Suppose we are starting with the following source language:
program ::= variable := formula ; program
          | print formula ; program
          | procedure variable { program } program
          | call variable ; program
          | end ;
formula ::= true | false | variable | not formula | formula and formula | formula or formula
How can we implement a compiler that converts programs in the source language into sequences of instructions in the target language from a previous example?
As an example, suppose we begin with the following program.

procedure printProc {
  print true;
  print false;
}
call printProc;
print false;
call printProc;
print true;
        
If we want to compile the above program, we will need to ensure that the machine language program passes control to the block of machine code representing the procedure body, and then passes it back when the body finishes running.
However, we must compile the procedure into a single block of machine code. How will that block know whether to come back to the point in the machine language program corresponding to the first call printProc; statement, or to the point corresponding to the second call printProc; statement?
We will need to make sure that the call printProc; statements are compiled into machine code that stores the index of the current instruction at each call point at a particular memory location (in this example, suppose it is memory address 7). Then, the block of machine code into which the procedure body is compiled can always consult the same memory location to determine to which point it should pass control back after the machine code for the procedure body finishes executing. Because control should not go back to the goto instruction but beyond it, it may be necessary to increment the value representing the program location before using the jump instruction. We provide machine language pseudocode below that does just that.

# The procedure.
goto printProcEnd   # Skip procedure body.
label printProc
set 5 1             # print true;
set 5 0             # print false;

# Return control to the last calling point.
< increment integer stored in address 7 by 2 >
jump 7
label printProcEnd

# The first call.
< copy integer from address 6 to address 7 >
goto printProc

set 5 0             # print false;

# The second call.
< copy integer from address 6 to address 7 >
goto printProc

set 5 1             # print true;
        
The above approach might work if we never call a procedure from inside another procedure. However, it will not work if a call instruction appears inside the body of a procedure because that call instruction will be compiled into machine language instructions that overwrite the contents of memory address 7, making it impossible to determine the program location to which control must return after the outermost procedure body finishes. To address this, it is possible to have the machine language program maintain a stack, with a stack pointer stored at memory address 7, that will keep track of all the program locations to which control must return after each nested procedure call finishes. We do so in the machine language pseudocode below.

# Set stack pointer to top of stack at address 8.
set 7 8

# The procedure.
goto printProcEnd   # Skip procedure body.
label printProc
set 5 1             # print true;
set 5 0             # print false;

# Return control to the last calling point.
< copy integer from top of stack 
  (i.e., integer/address contained in address 7) to address 1 >

< set address 2 to some constant
  (to skip the calling machine instructions) >

< add contents of address 1 and address 2 >
jump 0
label printProcEnd

# The first call.
< increment integer in address 7 (the stack pointer) >
< copy integer from address 6 to top of stack
  (i.e., address/integer specified in address 7) >

goto printProc

set 5 0             # print false;

# The second call.
< increment integer in address 7 (the stack pointer) >
< copy integer from address 6 to top of stack
  (i.e., address/integer specified in address 7) >

goto printProc

set 5 1             # print true;
        
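As an illustration of how one of the pseudocode placeholders above might be expanded into actual instructions, the step that increments the integer stored in address 7 (the stack pointer) by one could be written as follows. This is only a sketch; it relies on the add convention (addresses 1 and 2 as inputs, address 0 as the result) and the copy convention (addresses 3 and 4 as the source and destination addresses) defined earlier in this section.

# Increment the integer stored in address 7 by one.
set 3 7      # Source of copy: address 7 (the stack pointer).
set 4 1      # Destination of copy: address 1 (first input of add).
copy         # Address 1 now holds the stack pointer's value.
set 2 1      # Second input of add: the constant 1.
add          # Address 0 now holds the incremented value.
set 3 0      # Source of copy: address 0 (the sum).
set 4 7      # Destination of copy: address 7.
copy         # The stack pointer has been incremented.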

[link]  5.6. Common compiler optimizations and register allocation

Optimization algorithms are typically transformations of the abstract syntax of a language. In a compiler, this abstract syntax is usually that of a particular intermediate representation used within the compiler, but optimizations might occur at any stage in the compilation process.
[Diagram: optimization algorithms (#1 through #4) interleaved with compilation algorithms A, B, and C — optimizations may be applied to the source language abstract syntax, to intermediate representations (IR) #1 and #2, and to the target language abstract syntax.]
Definition: Loop unrolling is an optimization that involves duplicating loop bodies a constant number of times, typically to reduce the number of conditional branch instructions that must occur if the number of loop iterations is known at compile time. Depending on the particular language and compiler, the loop unrolling optimization could be applied to high-level language abstract syntax trees, intermediate representation abstract syntax trees, and sometimes to machine language programs.
Example: Suppose we are compiling a program written in a high-level language that looks like the following.

i = 0;
while (i < 1000000) {
  i++;
}
        
One option is to compile the above program into the following sequence of machine language instructions (assume that the variable i corresponds to memory address 123).

label whileLoop
< increment integer in address 123 >
< compare integer in address 123 to integer 1000000 and store result in address 200 >
branch whileLoop 200
        
If we execute the above machine language program, we will execute 1,000,000 comparison operations. We can reduce the cost of the comparison operations in this case by duplicating the body of the loop some number of times.

i = 0;
while (i < 1000000) {
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
}
        
Notice that in the above program, we have reduced the number of comparison operations by a factor of 10. However, to achieve this we had to increase the size of the loop body in the machine language program by a factor of 10.

label whileLoop
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< compare integer in address 123 to integer 1000000 and store result in address 200 >
branch whileLoop 200
        
Definition: Constant folding involves replacing subtrees (typically expressions) within an abstract syntax tree with values (the value should normally correspond to the result of evaluating the expression).
Example: Suppose we have the following abstract syntax tree.

{'Print': [
  {'Plus': [
    {'Mult':[{'Number':[3]}, {'Number':[4]}]},
    {'Variable':['x']}
  ]}
]}
        
An optimization algorithm that performs constant folding might traverse the above tree, recognize that the subtree {'Mult':[{'Number':[3]}, {'Number':[4]}]} can be evaluated in a finite amount of time, and then replace it with the value to which it evaluates. The overall tree would then look as follows.

{'Print': [
  {'Plus': [
    {'Number':[12]},
    {'Variable':['x']}
  ]}
]}
        
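A constant folding pass over the nested-dictionary abstract syntax used throughout these notes might be sketched as follows (assuming the Node = dict and Leaf = str conventions from earlier examples; the function name foldConstants() is introduced here for illustration).

# A sketch of a constant-folding pass over nested-dictionary abstract syntax.
def foldConstants(a):
    if type(a) != dict:
        return a    # Leaves and literal children are returned unchanged.
    for label in a:
        # Fold the children first (bottom-up).
        children = [foldConstants(c) for c in a[label]]
        ops = {'Plus': lambda x, y: x + y, 'Mult': lambda x, y: x * y}
        if label in ops and \
           all(type(c) == dict and 'Number' in c for c in children):
            [c1, c2] = children
            return {'Number': [ops[label](c1['Number'][0], c2['Number'][0])]}
        return {label: children}

Applying foldConstants() to the first tree above would yield the second.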
Definition: Dead code elimination involves removing subtrees (typically statements) within an abstract syntax tree that can never be reached (e.g., because a branching statement or loop condition is always false).
Definition: An abstract syntax tree is in static single assignment or SSA form if every variable appears exactly once on the left-hand side of an assignment statement.
An abstract syntax tree can be converted into SSA form by renaming the variable on the left-hand side of each assignment statement so that it is unique, and then replacing every use of a variable with the renamed version introduced by the most recent assignment that precedes the use.
Example: Suppose we have the following program in a high-level programming language.

a := 2;
b := 5;
print a + b;
f := true and false;
print f;
a := 7;
g := true;
h := false;
b := 0;
print b;
print h;
print g;
print a;
        
If we converted the abstract syntax tree that represents the above program into SSA form, we might obtain an abstract syntax tree that represents the following program.

a_0 := 2;
b_0 := 5;
print a_0 + b_0;
f_0 := true and false;
print f_0;
a_1 := 7;
g_0 := true;
h_0 := false;
b_1 := 0;
print b_1;
print h_0;
print g_0;
print a_1;
        
Notice that each variable in the original program has been associated with an index so that no variable in the new program ever appears on the left-hand side of an assignment more than once.
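A sketch of this renaming procedure over the statement abstract syntax used in earlier examples (with 'Assign', 'Print', and 'Variable' nodes) is shown below; the helper renameUses() and the counts dictionary are introduced here for illustration, and uses of variables that have not yet been assigned are not handled.

# A sketch of SSA conversion; Node = dict as in earlier examples, and counts
# maps each original variable name to its most recent version number.
Node = dict

def renameUses(f, counts):
    if type(f) != Node:
        return f
    for label in f:
        if label == 'Variable':
            x = f[label][0]
            return {'Variable': [x + '_' + str(counts.get(x, 0))]}
        return {label: [renameUses(c, counts) for c in f[label]]}

def toSSA(s, counts = None):
    counts = {} if counts is None else counts
    if type(s) != Node:
        return s                          # e.g., the 'End' leaf.
    for label in s:
        if label == 'Assign':
            [var, f, p] = s[label]
            x = var['Variable'][0]
            f = renameUses(f, counts)     # Uses refer to the previous version.
            counts[x] = counts.get(x, -1) + 1
            newVar = {'Variable': [x + '_' + str(counts[x])]}
            return {'Assign': [newVar, f, toSSA(p, counts)]}
        if label == 'Print':
            [f, p] = s[label]
            return {'Print': [renameUses(f, counts), toSSA(p, counts)]}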
Definition: Register allocation involves assigning a CPU register to each variable within the abstract syntax tree representing a program (typically an intermediate representation in static single assignment form). This usually requires building an interference graph of the variables within the program, and then solving the graph coloring problem where each register represents a color.
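As an illustration of the graph coloring step, a greedy coloring over an interference graph might be sketched as follows; the dictionary-of-sets graph representation and the function allocateRegisters() are assumptions introduced here, not part of these notes.

# A sketch of greedy register allocation by graph coloring. interference maps
# each variable to the set of variables it interferes with; registers is a
# list of available register names.
def allocateRegisters(interference, registers):
    assignment = {}
    for x in interference:
        used = {assignment[y] for y in interference[x] if y in assignment}
        free = [r for r in registers if r not in used]
        if free == []:
            return None        # Spilling to memory would be required; not handled here.
        assignment[x] = free[0]
    return assignment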
Example: When compiling list or set comprehensions in languages such as Haskell and Python, the operational semantics of comprehensions allows for parallelization of the individual steps in a way that iterative constructs do not allow. This is because in a comprehension, each "iteration" or "step" of the comprehension is evaluated under the same environment. In an imperative looping statement, the modified environment emitted by one iteration is fed into the execution of the next iteration, so the effects of each iteration accumulate over time. This requires executing each iteration one after the other, and keeping track of the changing environment.
Thus, it is possible to compile a program so that the individual "iterations" or "steps" of a comprehension are evaluated on different processors, or even different computers, and then reassembled into the final results. This is not necessarily possible with a looping statement.
You can see this in the two different operational semantics rules below (the first for an imperative looping statement, the second for a comprehension expression).
[Stmt-For]
  Σ0 ⊎ {x ↦ 1}, p ⇓ Σ1           Σ1 ⊎ {x ↦ 2}, p ⇓ Σ2           Σ2 ⊎ {x ↦ 3}, p ⇓ Σ3  
 Σ0, for x in {1,2,3}: p ⇓ Σ3 
[Exp-Comprehension]
  Σ ⊎ {x ↦ 1}, e ⇓ v1           Σ ⊎ {x ↦ 2}, e ⇓ v2           Σ ⊎ {x ↦ 3}, e ⇓ v3  
 Σ, [ e for x in {1,2,3} ] ⇓ [v1, v2, v3] 
The following diagram illustrates how the various optimization and compilation algorithms described in this section might fit together within the architecture of a compiler.
[Diagram: an example compiler architecture — high-level source language abstract syntax undergoes loop unrolling and SSA conversion to produce an IR in SSA form; constant folding and compilation produce a low-level IR; and register allocation and instruction selection produce the low-level target machine language abstract syntax.]

[link]  5.7. Correctness of compilation algorithms

Compilers represent a unique kind of software application because their implementation can affect the quality of all the software applications that they compile. This means that compiler flaws can be particularly expensive; as a result, there may be many practical justifications for investing heavily in verifying that a compiler, or components thereof, behave correctly. However, to determine whether a compiler behaves correctly, it is first necessary to define what constitutes correct behavior.

One way to define correct behavior is to appeal to the operational semantics of the source and target languages.
[Diagram: the source program (source language abstract syntax) can either be compiled by the compilation algorithm into target language abstract syntax and then run under the target language operational semantics (a simulator), or run directly under the source language operational semantics (an interpreter); the resulting source language values/outputs are mapped by a simple conversion algorithm to target language values/outputs, and the two paths should agree.]
Definition: Suppose we are given a source and target language (including the definitions of the abstract syntax and the operational semantics for each language, as well as corresponding interpretation/simulation functions interpret() and simulate()), a simple conversion algorithm convert() between the sets of values and outputs in each language, and a compilation algorithm compile().

The compilation algorithm compile() is correct (or behaves correctly) if for all possible input programs p in the source language, the following equation is true:
simulate(compile(p)) = convert(interpret(p))
Notice that the above definition of correctness is functional but perhaps not practical. If the operational semantics of one or both languages does not capture running time, security (e.g., stack overflows), or other concerns, the definition may be inadequate. We will discuss these issues in more detail in a subsequent section.

If the number of possible programs in the source language is infinite, it is impossible to check that the equation in the above definition holds for all possible programs through testing alone, though testing can provide some confidence. Nevertheless, it may be possible to mathematically prove that the equation holds without doing any testing at all.

Software testing and verification is a broad area of research and practice that spans multiple (sometimes disparate) disciplines and communities. In this subsection, we will focus on three specific categories of testing and verification that are useful for determining the correctness of compiler implementations. These three categories represent only a small (but important) subset of the approaches currently being developed and used.
Definition: There are at least three possible approaches to verifying that a compiler's behavior conforms to its specification. These include:
  • suites of individual test cases;
  • bounded exhaustive testing;
  • formal (usually inductive) proof of correctness over all inputs.
Example: Bounded exhaustive testing involves defining a metric on inputs (e.g., a mapping from inputs to an integer), and then testing all inputs for which the metric is below a certain integer bound. For example, let us consider a simple language for formulas:
formula ::= true | false | not formula | formula and formula | formula or formula
We can define a metric on the set of abstract syntax trees that conform to the above definition: the depth of the tree.

Node = dict
Leaf = str

def metric(f):
    if type(f) == Leaf:
        return 1
    if type(f) == Node:
        for label in f:
            return 1 + max([metric(child) for child in f[label]])
        
We can then generate all trees whose metric falls within a specified bound.

def formulas(n):
    if n <= 0:
        return []
    elif n == 1:
        return ['True', 'False']
    else:
        fs = formulas(n-1)
        fsN = []
        fsN += [{'Not':[f]} for f in fs]
        fsN += [{'And':[f1,f2]} for f1 in fs for f2 in fs]
        fsN += [{'Or':[f1,f2]} for f1 in fs for f2 in fs]
        return fs + fsN
        
It is then possible to use the above code to generate test cases automatically, and to check the equation specifying compiler correctness against each test case.
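For instance, a bounded exhaustive test might look like the sketch below, which assumes that the compile(), simulate(), interpret(), and convert() functions from the correctness definition above exist for this formula language.

# A sketch of bounded exhaustive testing of a compiler against the
# correctness equation simulate(compile(p)) = convert(interpret(p)).
def testCompiler(bound):
    for f in formulas(bound):
        if simulate(compile(f)) != convert(interpret(f)):
            return (False, f)     # Report a counterexample.
    return (True, None)           # All test cases within the bound agree.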

[link]  6. Static Analysis and Abstract Interpretation

Most compilers and interpreters perform basic error checking and static analysis before running or compiling a program. This serves a variety of purposes, some of which include:
  • catching errors in the program before incurring the cost (in time, power, etc.) of actually running it;
  • returning more detailed and user-friendly error messages;
  • determining information about the program that is required for the optimization and compilation algorithms (e.g., detecting opportunities for constant folding);
  • improving performance of interpreted or compiled programs by eliminating run-time error checking;
  • eliminating invalid test cases when doing bounded exhaustive testing.
Static analysis is related to a broader concept: abstract interpretation. It is possible to analyze programs along a variety of other dimensions that may be of interest, including, for example, the program's range of outputs, the program's running time, the monetary cost of running the program, the program's power consumption, and so on.

Both static analysis and abstract interpretation algorithms are conventionally expected to terminate in a finite, and typically relatively fast, amount of time (e.g., it should be possible to run a new static analysis or abstract interpretation each and every time the program is compiled, even during an interactive programming and/or testing session).

In this section we present a notation for defining static analysis and abstract interpretation algorithms. This notation is very similar to the one we introduced for defining the operational semantics of programming languages.
[Diagram: source language concrete syntax is converted by a parser (which may report syntax errors) into source language abstract syntax; static analysis algorithms examine the abstract syntax or IR (and may report other errors, such as unbound variables or type errors) before the compilation algorithms produce target language abstract syntax.]

[link]  6.1. Monomorphic type systems and type checking

A type checking algorithm is a specific kind of static analysis algorithm. It represents a "low-resolution" evaluation of the abstract syntax tree of a program to determine the type that can be assigned to every subtree in that program.
Example: Suppose we are working with a programming language with the following abstract syntax.
expression ::= true | false | number
             | not expression
             | expression and expression
             | expression or expression
             | expression + expression
The abstract syntax for types is defined as follows:
type ::= integer | boolean
We can define a set of type inference rules for this language. These rules constitute the type system for the language.
[Formula-True]
  
 ⊢ true: boolean 
[Formula-False]
  
 ⊢ false: boolean 
[Formula-Not]
 ⊢ e: boolean 
 ⊢ not e: boolean 
[Formula-And]
 ⊢ e1: boolean           ⊢ e2: boolean 
 ⊢ e1 and e2: boolean 
[Formula-Or]
 ⊢ e1: boolean           ⊢ e2: boolean 
 ⊢ e1 or e2: boolean 
[Term-Number]
  
 ⊢ number: integer 
[Term-Plus]
 ⊢ e1: integer           ⊢ e2: integer 
 ⊢ e1 + e2: integer 
We can convert the above inference rules into a type checking algorithm; the process is almost identical to the process used for converting an operational semantics into an interpreter implementation.

def typeCheck(e):
    if type(e) == Leaf:
        if e == 'True' or e == 'False':
            return 'boolean'
    if type(e) == Node:
        for label in e:
            if label == 'Number':
                 return 'integer'
            if label == 'Not':
                [e1] = e[label]
                t1 = typeCheck(e1)
                if t1 != 'boolean':
                    print("not must be applied to a boolean")
                    return None
                return 'boolean'
            
            # ...
            
            if label == 'Plus':
                [e1,e2] = e[label]
                t1 = typeCheck(e1)
                t2 = typeCheck(e2)
                if t1 != 'integer' or t2 != 'integer':
                    print("+ requires integer arguments")
                    return None
                return 'integer'
        
Example: Suppose we are working with a programming language with the following abstract syntax.
program ::= variable := expression ; program
          | end
expression ::= variable
             | constant
             | not expression
             | expression and expression
             | expression or expression
             | expression * expression
             | expression + expression
constant ::= true | false | number
number ::= any valid integer
variable ::= any valid variable
type ::= integer | boolean | void
We can define a set of type inference rules for this language. These rules constitute the type system for the language.
[Statement-Assign]
 Γ ⊎ {x ↦ τ} ⊢ p: void           Γ ⊢ e: τ 
 Γ ⊢ x := e ; p : void 
[Statement-End]
  
 Γ ⊢ end: void 
[Variable]
 Γ(x) = τ 
 Γ ⊢ x: τ 
[Formula-True]
  
 Γ ⊢ true: boolean 
[Formula-False]
  
 Γ ⊢ false: boolean 
[Formula-Not]
 Γ ⊢ e: boolean 
 Γ ⊢ not e: boolean 
[Formula-And]
 Γ ⊢ e1: boolean           Γ ⊢ e2: boolean 
 Γ ⊢ e1 and e2: boolean 
[Formula-Or]
 Γ ⊢ e1: boolean           Γ ⊢ e2: boolean 
 Γ ⊢ e1 or e2: boolean 
[Term-Number]
  
 Γ ⊢ number: integer 
[Term-Plus]
 Γ ⊢ e1: integer           Γ ⊢ e2: integer 
 Γ ⊢ e1 + e2: integer 
[Term-Mult]
 Γ ⊢ e1: integer           Γ ⊢ e2: integer 
 Γ ⊢ e1 * e2: integer 
Example: Let's consider the difference between the operational semantics inference rule and the type checking rule for a while construct in a hypothetical programming language. Notice that the execution rule can result in an infinite recursion because a premise above the line contains an exact copy of the abstract syntax tree that we are trying to execute. Unlike the execution rule, the typing rule cannot result in an infinite recursion (it only requires a linear-time recursive examination of the abstract syntax tree).
[Statement-While-Eval-True]
  Σ1, p1 ⇓ Σ2, o1           Σ2, while e { p1 } p2 ⇓ Σ3, o2           Σ1, e ⇓ true  
 Σ1, while e { p1 } p2 ⇓ Σ3, o1;o2 
[Statement-While-Type]
  Γ ⊢ e : boolean           Γ ⊢ p1 : void           Γ ⊢ p2 : void  
 Γ ⊢ while e { p1 } p2 : void 
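A sketch of how the [Statement-While-Type] rule might be implemented is shown below; it assumes hypothetical checkers tyExpr() and tyProg() in the style of the type checking algorithms in this section, an environment Γ represented as a dictionary env, and a node labeled 'While' with children [cond, body, rest].

# A sketch of the [Statement-While-Type] rule.
if label == 'While':
    [cond, body, rest] = s[label]
    if tyExpr(env, cond) != 'boolean':
        return None    # The loop condition must be a boolean.
    if tyProg(env, body) == 'void' and tyProg(env, rest) == 'void':
        return 'void'  # Both statement sequences must be well-typed.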
Example: Suppose we are working with a programming language that supports the definition of functions that take a single integer argument.
program p ::= function f ( int x ) { return e ; } p
            | x := e ; p
            | end
expression e ::= x | n | e + e | f ( e )
number n ::= any valid integer
variable f, x ::= any valid variable
type τ ::= int | int → int | void
We can define a set of type inference rules for this language. These rules constitute the type system for the language.
[Statement-Assign]
  Γ ⊎ {x ↦ int} ⊢ p: void           Γ ⊢ e: int  
 Γ ⊢ x := e ; p : void 
[Statement-Function]
  Γ ⊎ {f ↦ int → int} ⊢ p: void           Γ ⊎ {x ↦ int} ⊢ e: int  
  Γ ⊢ function f ( int x ) { return e ; } p : void  
[Statement-End]
  
 Γ ⊢ end: void 
[Term-Variable]
 Γ(x) = int 
 Γ ⊢ x: int 
[Term-Number]
  
 Γ ⊢ n: int 
[Term-Plus]
 Γ ⊢ e1: int           Γ ⊢ e2: int 
 Γ ⊢ e1 + e2: int 
[Term-Apply]
 Γ(f) = int → int           Γ ⊢ e: int 
 Γ ⊢ f ( e ): int 
Before implementing a type checking algorithm, suppose we want to implement an interpreter for this language. Below is an example of how it might be implemented.

def evaluate(env, e):
   if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Number':
                n = children[0]
                return n
            elif label == 'Plus':
                [e1,e2] = children
                return evaluate(env, e1) + evaluate(env, e2)
            elif label == 'Variable':
                x = children[0]
                value = env[x]
                return value
            elif label == 'Apply':
                 [f, eArg] = children
                 f = f['Variable'][0] # Unpack.
                 x = env[f]['Function'][2]['Variable'][0]
                 eBody = env[f]['Function'][3] # Body.
                 vArg = evaluate(env, eArg)
                 envF = env.copy()
                 envF[x] = vArg
                 vResult = evaluate(envF, eBody)
                 return vResult
        
def execute(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return (env, [])
    elif type(s) == Node:
        for label in s:
            if label == 'Assign':
                [var, e, p] = s[label]
                x = var['Variable'][0]
                env[x] = evaluate(env, e)
                (env, o) = execute(env, p)
                return (env, o)
            if label == 'Function':
                [f, ty, x, e, p] = s[label]
                f = f['Variable'][0] # Unpack.
                env[f] = s
                (env, o) = execute(env, p)
                return (env, o)
        
A type checking algorithm implementation is presented below.

def tyExpr(env, e):
   if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Number':
                return 'TyInt'
            elif label == 'Variable':
                x = children[0]
                return env[x]
            elif label == 'Apply':
                 [f, eArg] = children
                 f = f['Variable'][0] # Unpack.
                 tyArg = tyExpr(env, eArg)
                 tyFunArg = env[f]['TyArrow'][0]
                 if tyArg == tyFunArg:
                     return env[f]['TyArrow'][1]
        
def tyProg(env, s):
   if type(s) == Leaf:
        if s == 'End':
            return 'TyVoid'
   elif type(s) == Node:
        for label in s:
            if label == 'Assign':
                [x,e,p] = s[label]
                x = x['Variable'][0] # Unpack.
                tExpr = tyExpr(env, e)
                env[x] = tExpr
                tProg = tyProg(env, p)
                if tExpr == 'TyInt' and tProg == 'TyVoid':
                    return 'TyVoid'
            if label == 'Function':
                [f, tyArg, x, e, p] = s[label]
                name = f['Variable'][0]
                x = x['Variable'][0]
                envF = env.copy()
                envF[x] = tyArg
                tBody = tyExpr(envF, e)
                if tBody == 'TyInt':
                    env[name] = {'TyArrow':[tyArg, tBody]}
                    tProg = tyProg(env, p)                  
                    return tProg
        
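The tyExpr() listing above omits the case for the [Term-Plus] rule; a sketch of how it might be handled, following the same conventions, is shown below.

# Sketch: the [Term-Plus] case for tyExpr(), omitted from the listing above.
if label == 'Plus':
    [e1, e2] = children
    if tyExpr(env, e1) == 'TyInt' and tyExpr(env, e2) == 'TyInt':
        return 'TyInt'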
Example: Suppose we are working with a programming language that supports the definition of functions that take a single string argument. The type system of this language supports explicit tracking of the sizes of strings.
program p ::= function f ( string[k] x ) { return e ; } p
            | x := e ; p
            | end
expression e ::= x | s | e + e | f ( e )
string s ::= any valid string literal
variable f, x ::= any valid variable
type τ ::= string[k] | string[k] → string[k] | void
type size k ::= any valid integer | k + k
We can define a set of type inference rules for this language. These rules constitute the type system for the language.
[Statement-Assign]
  Γ ⊎ {xstring[k]} ⊢ p: void           Γ ⊢ e: string[k]  
 Γ ⊢ x := e ; p : void 
[Statement-Function]
  Γ ⊎ {fstring[k1] string[k2]} ⊢ p: void           Γ ⊎ {xstring[k1]} ⊢ e: string[k2]  
 Γ ⊢ function f ( string[k1] x ) { return e ; } p : void 
[Statement-End]
  
 Γ ⊢ end: void 
[Term-Variable]
 Γ(x) = string[k] 
 Γ ⊢ x: string[k] 
[Term-String]
 |s| = k 
 Γ ⊢ s: string[k] 
[Term-Concat]
  Γ ⊢ e1: string[k1]           Γ ⊢ e2: string[k2]  
 Γ ⊢ e1 + e2: string[k1+k2] 
[Term-Apply]
  Γ(f) = string[k1] → string[k2]           Γ ⊢ e: string[k1]
  Γ ⊢ f ( e ): string[k2]  
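For instance, consider the program function f ( string[2] x ) { return x + x ; } y := f ( s ) ; end, where s is a string literal with |s| = 2. By [Term-Variable] and [Term-Concat], {x ↦ string[2]} ⊢ x + x : string[4], so [Statement-Function] records f ↦ string[2] → string[4]. By [Term-String] and [Term-Apply], f ( s ) : string[4], so [Statement-Assign] records y ↦ string[4], and [Statement-End] completes the derivation that the whole program has type void.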

[link]  6.2. Abstract interpretation

An abstract interpretation algorithm is a specific kind of interpreter for a programming language: it must always terminate (and, typically, it must run quickly), and it typically provides information about the given program along some dimension of interest (i.e., not the value(s) it produces).
Example: Suppose we are working with a simple general-purpose programming language with the following abstract syntax.
program p
::=
print e ; p
|
for x to n { p } p
|
procedure x { p } p
|
call x ; p
|
end
expression e
::=
n | e * e
number n
::=
any valid integer
variable x
::=
any valid variable
We want to define an abstract interpretation over the above language to a language that describes running times of programs. The language for running times has the following syntax:
running time t
::=
n | t + t | t * t | ( t )
We can define the following abstract interpretation of the language:
[Expression-Number]
  
 ⊢ n : 1 
[Expression-Mult]
 ⊢ e1: t1           ⊢ e2: t2
 ⊢ e1 * e2: 1 + t1 + t2 
[Statement-Print]
 Γ ⊢ p: t1           ⊢ e: t2 
 Γ ⊢ print e ; p : 1 + t1 + t2 
[Statement-For]
 Γ ⊢ p1: t1           Γ ⊢ p2: t2 
 Γ ⊢ for x to n { p1 } p2 : (n * t1) + t2 
[Statement-Procedure]
 Γ ⊢ p1: t1           Γ ⊎ {x ↦ t1} ⊢ p2: t2
 Γ ⊢ procedure x { p1 } p2 : t2
[Statement-Call]
 Γ(x) = t2           Γ ⊢ p: t1 
 Γ ⊢ call x ; p : t1 + t2 
[Statement-End]
  
 Γ ⊢ end : 0 
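For instance, consider the program for x to 3 { print 2 * 2 ; end } end. By [Expression-Number] and [Expression-Mult], the expression 2 * 2 has running time 1 + 1 + 1 = 3; by [Statement-Print] and [Statement-End], the loop body print 2 * 2 ; end has running time 1 + 0 + 3 = 4; and by [Statement-For], the whole program has running time (3 * 4) + 0 = 12.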
We can implement the above inference rules as an abstract interpretation algorithm.

def timeExpr(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Number':
                return 1
            elif label == 'Mult':
                [e1,e2] = children
                return 1 + timeExpr(e1) + timeExpr(e2)

def timeProg(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return 0
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                [e,p] = s[label]
                return 1 + timeExpr(e) + timeProg(env, p)
            if label == 'For':
                [var, num, p1, p2] = s[label]
                n = num['Number'][0]
                return n * timeProg(env, p1) + timeProg(env, p2)

                # ...
        
Example: Suppose we are working with a simple programming language for reserving cloud computing resources with the following abstract syntax.
program p
::=
reserve d ; p
|
for x to n { p } p
|
procedure x { p } p
|
call x ; p
|
end
duration d
::=
n minutes
|
n hours
|
n days
number n
::=
any valid integer
fixed-point real number r
::=
any valid real number
variable x
::=
any valid variable
We want to define an abstract interpretation over the above language to a language that describes the monetary (i.e., dollar) cost of running a program. The language for costs has the following syntax:
cost c
::=
$r | c + c | n * c | ( c )
Suppose that a single reserve d ; statement reserves a distinct computing resource for the duration specified, and that the cost of any resource is always $0.60 per hour. We can define the following abstract interpretation of the language:
[Duration-Minutes]
  
 ⊢ n minutes : n ⋅ $0.01 
[Duration-Hours]
  
 ⊢ n hours : n ⋅ $0.60 
[Duration-Days]
  
 ⊢ n days : n ⋅ $14.40 
[Statement-Reserve]
 ⊢ d: c1           Γ ⊢ p: c2 
 Γ ⊢ reserve d ; p : c1 + c2 
[Statement-For]
 Γ ⊢ p1: c1           Γ ⊢ p2: c2 
 Γ ⊢ for x to n { p1 } p2 : (n * c1) + c2 
[Statement-Procedure]
 Γ ⊢ p1: c1           Γ ⊎ {x ↦ c1} ⊢ p2: c2
 Γ ⊢ procedure x { p1 } p2 : c2 
[Statement-Call]
 Γ(x) = c2           Γ ⊢ p: c1 
 Γ ⊢ call x ; p : c1 + c2 
[Statement-End]
  
 Γ ⊢ end : $0 
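For instance, consider the program for x to 2 { reserve 3 hours ; end } end. By [Duration-Hours], the duration 3 hours has cost 3 ⋅ $0.60 = $1.80; by [Statement-Reserve] and [Statement-End], the loop body reserve 3 hours ; end has cost $1.80 + $0 = $1.80; and by [Statement-For], the whole program has cost (2 * $1.80) + $0 = $3.60.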
We can implement the above inference rules as an abstract interpretation algorithm.

def costDuration(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Minutes':
                [num] = children
                n = num['Number'][0]
                return n * 0.01
            if label == 'Hours':
                [num] = children
                n = num['Number'][0]
                return n * 0.60
            if label == 'Days':
                [num] = children
                n = num['Number'][0]
                return n * 14.40

def costProg(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return 0
    elif type(s) == Node:
        for label in s:
            if label == 'Reserve':
                [d,p] = s[label]
                return costDuration(d) + costProg(env, p)
            if label == 'For':
                [var, num, p1, p2] = s[label]
                n = num['Number'][0]
                return n * costProg(env, p1) + costProg(env, p2)

            # ...
        
Example: Suppose we are working with a simple programming language in which the type system tracks whether values are secret or public.
program p
::=
print e ; p
|
secret x := e ; p
|
public x := e ; p
|
end
expression e
::=
n | x | e + e
number n
::=
any valid integer
variable x
::=
any valid variable
We want to define an abstract interpretation over the above language to a language that describes the visibility of the values computed by a program. The language for visibilities has the following syntax:
visibility t
::=
secret | public | void
We can define the following abstract interpretation of the language:
[Statement-Assign-Secret]
  Γ ⊎ {x ↦ secret} ⊢ p: void           Γ ⊢ e: secret
 Γ ⊢ secret x := e ; p : void 
[Statement-Assign-Public]
  Γ ⊎ {x ↦ public} ⊢ p: void           Γ ⊢ e: public
 Γ ⊢ public x := e ; p : void 
[Statement-Print]
  Γ ⊢ p: void           Γ ⊢ e: public  
 Γ ⊢ print e ; p : void 
[Expression-Variable]
 Γ(x) = τ 
 Γ ⊢ x: τ 
[Expression-Number-Secret]
  
 Γ ⊢ n: secret 
[Expression-Number-Public]
  
 Γ ⊢ n: public 
[Expression-Plus-Secret-Secret]
 Γ ⊢ e1: secret           Γ ⊢ e2: secret 
 Γ ⊢ e1 + e2: secret 
[Expression-Plus-Secret-Public]
 Γ ⊢ e1: secret           Γ ⊢ e2: public 
 Γ ⊢ e1 + e2: secret 
[Expression-Plus-Public-Secret]
 Γ ⊢ e1: public           Γ ⊢ e2: secret 
 Γ ⊢ e1 + e2: secret 
[Expression-Plus-Public-Public]
 Γ ⊢ e1: public           Γ ⊢ e2: public 
 Γ ⊢ e1 + e2: public 
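For instance, if Γ(y) = public, then a statement print y + 3 ; p can be typed whenever p can: y + 3 is public by [Expression-Variable], [Expression-Number-Public], and [Expression-Plus-Public-Public], so [Statement-Print] applies. If instead Γ(x) = secret, then print x + 1 ; p has no derivation: the expression x + 1 can only be typed as secret, so the premise Γ ⊢ e: public of [Statement-Print] can never be satisfied. This abstract interpretation therefore rejects programs that print (i.e., leak) secret values.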

[link]  7. Declarative (and Functional) Programming Language Paradigms

In this section we introduce the notion of a programming language paradigm, define several major programming language paradigms, and discuss which paradigms are supported by some widely used modern programming languages (in practice, many languages support more than one paradigm). We assume the reader is already at least somewhat familiar with the imperative and object-oriented programming paradigms; thus, we study in detail the declarative programming paradigm (including an important subset, functional programming).

[link]  7.1. Programming language paradigms

A programming language's programming paradigm is the collection of abstractions, tools, and techniques it makes available to a programmer. While there is some disagreement about what exactly constitutes a programming paradigm and how programming paradigms can be distinguished and categorized, there is wide agreement that at least a few distinct, widely-employed programming paradigms exist. We can enumerate these paradigms, and we can list the programming language features (i.e., abstractions) that can be used to identify and distinguish between them:
  • imperative programming:
    • mutable state and side effects:
      • indexed memory regions;
      • named, mutable variables;
      • ability to interact with other "external" devices (video card, network, etc.);
    • sequential execution of statements:
      • one operation per time step (control flow);
      • branching and loops (direction of control flow);
    • procedural programming:
      • definition and invocation of named procedures;
    • parallelism achieved with forking, threads, and message passing;
  • object-oriented programming:
    • data and methods organized into objects;
    • multiple instances of objects interacting via method calls:
      • event handling;
    • enforcement of encapsulation and modularity;
  • declarative programming:
    • static, immutable descriptions of data, models, or problems:
      • user-defined data types:
        • algebraic data types;
      • type checking/inference.
    • query sublanguage for examining/exploring static descriptions;
    • functional programming:
      • definition and use of higher-order functions;
      • pure functions over immutable data.
Mathematically, it is possible to identify which paradigms are supported by a programming language by examining its operational semantics. We have already seen several examples of an operational semantics for an imperative language in earlier sections.

As a result of both historical trends and of practical necessity, most widely-used programming languages support multiple programming paradigms:

language | imperative | procedural | object-oriented | declarative | functional | static type checking | garbage collection | side effects
machine languages | some support (indexed memory, branching) | no support | no support | no support | no support | none | no | yes
FORTRAN (initial release) | full support | no support | no support | no support | no support | very limited | no | yes
C | full support | full support | no support | no support | some support (function pointers) | very limited | no | yes
C++ | full support | full support | full support | class declarations, some type checking | function pointers and some extensions (e.g., STL) | limited | no | yes
Java | full support | full support | full support | type system, class declarations, abstract classes, interfaces, type checking, and type inference | generics, interfaces, objects as wrappers, anonymous classes | extensive | yes | yes
PHP | full support | full support | full support | no support | some support | none | yes (≥ 5.3) | yes
Python | full support | full support | full support | no support | full support | none | yes | yes
JavaScript | full support | full support | full support | no support | full support | none | yes | yes
Haskell | meta-support (monads) | meta-support (functions) | no support | full support | full support | extensive | yes | some (IO monads)
SQL | some support (insertion/deletion; sequences of statements in versions like MS-SQL) | some support (procedures in versions like MS-SQL) | no support | full support (definition of schemas and tables; ability to evaluate queries over tables) | some support (custom map and aggregation procedures in MS-SQL) | some | yes | some
Amazon Web Services APIs | some support | no support | no support | some support | some support | none | n/a | yes

While most languages support many paradigms in practice, there is often a historically predominant paradigm that distinguished a programming language at the time it was introduced, that is supported natively by the programming language (i.e., without special extensions, libraries, or idioms), and that is employed by the communities of programmers who use that language; this is the paradigm most closely identified with that language.

A note about the above chart: when it comes to domain-specific languages that have a very narrow purpose (such as the Amazon Web Services API) and are not meant to be general-purpose programming languages, support of a paradigm does not imply that the language supports that paradigm in general; instead, it means the language supports that paradigm insofar as it applies to that particular application domain.

[link]  7.2. Substitution and unification

In this course, we will focus on the declarative (and functional) programming language Haskell. The Haskell syntax and operational semantics natively support unification in multiple ways (specifically, pattern matching within expressions and polymorphism within the type system). Unification is a general concept that is applied in some form in all areas of mathematics and computer science. In this subsection, we will define unification.
Definition: Given a set of variables X and an abstract syntax A (i.e., a set of abstract syntax trees), a substitution σ is a mapping from variable names in X to abstract syntax trees in A. Given an abstract syntax tree a ∈ A, we denote the result of substituting all the variables in a according to σ using the notation σ(a).
You'll notice that we use the same mathematical object to represent both environments (a.k.a., contexts) and substitutions. Thus, we will use the same notation for representing manipulations involving substitutions.
  • In this course, we will usually denote a substitution using the symbol σ.
  • Given a variable name x ∈ X, we denote by σ(x) the abstract syntax tree that the substitution σ assigns to x.
  • An empty substitution (one that does not map any variable names to any abstract syntax trees) is denoted using σ0.
  • We use the notation dom(σ) to represent the set of variables that σ maps to abstract syntax trees.
  • Given two substitutions σ1 and σ2 such that dom(σ1) ∩ dom(σ2) = ∅ (i.e., the two substitutions do not overlap or conflict on any variables), we denote by σ1 ⊎ σ2 the substitution that can be obtained by combining the two individual substitutions. Thus, dom(σ1 ⊎ σ2) = dom(σ1) ∪ dom(σ2).
Example: Suppose we have the following abstract syntax (here, node is an infix operator for building tree nodes):
tree t
::=
leaf | x | t node t
variable x
::=
any valid variable
Suppose we are given the following substitution, which maps the variable u to the tree leaf and the variable v to the tree leaf node leaf:
σ
=
{ u ↦ leaf, v ↦ leaf node leaf }
Applying the above substitution to a tree that does not contain any variable nodes u or v has no effect:
σ(leaf)
=
leaf
Applying the substitution to a tree containing the variables u and/or v would result in those variables being replaced by the corresponding tree inside σ (we add parentheses to the result for legibility):
σ(u node v)
=
(leaf) node (leaf node leaf)
Definition: Given an abstract syntax A and two abstract syntax trees a, b ∈ A, we say that a substitution σ unifies a and b if the following equation holds:
σ(a)
=
σ(b)
That is, applying the substitution to both trees produces exactly the same tree. The process of calculating the smallest σ that guarantees the above equality is called unification.
For those familiar with the concept of a magma, unification corresponds to the process of solving equations on magmas (or, more generally, any algebraic structure without any reduction or cancellation laws).
Fact: Suppose we are given an abstract syntax A and two abstract syntax trees a, b ∈ A that contain no variables. If a = b, then it is true that for all σ (including σ0):
σ(a)
=
σ(b)
If a ≠ b, then it is true that for all σ:
σ(a)
≠
σ(b)
Example: Suppose we have the following abstract syntax:
m
::=
object | x | m ⊗ m | m ⊕ m | ( m )
variable x
::=
any valid variable
Unify the following two abstract syntax trees, if possible (i.e., solve the following equation):
(z ⊗ object) ⊕ x
=
(object ⊗ y) ⊕ (object ⊗ object)
We can proceed by breaking the problem down. Since the root nodes of both abstract syntax trees are ⊕, we can turn the above into two equations:
z ⊗ object
=
object ⊗ y
x
=
object ⊗ object
The first equation can be broken down further:
z
=
object
object
=
y
x
=
object ⊗ object
At this point, we can build up the substitution using the above information:
σ
=
{ x ↦ object ⊗ object, y ↦ object, z ↦ object }
Thus, σ as defined above unifies the two sides of the equation; the equation has a solution.
The equality algorithm on abstract syntax trees is a degenerate version of a unification algorithm in which substitutions must always remain empty. This algorithm is implemented natively within Python (we have been using it every time we checked whether two trees represented as nested dictionaries are equivalent using the == operator), and Haskell supports the automatic derivation of this algorithm for any data type (using the deriving Eq directive).
Algorithm (equality): The following algorithm can determine whether two trees are equivalent.
  1. equal(a, b): two abstract syntax trees a and b
    1. if both a and b are leaf nodes and are equivalent
      1. return true
    2. if both a and b have the same label and the same number of children
      1. in order from left to right, check each pair of corresponding children of a and b for equality
      2. return true if all the subtrees are equivalent
      3. return false otherwise
    3. in all other cases (e.g., if one tree is a leaf and the other is not, or the labels differ)
      1. return false
Algorithm (pattern matching unification): Whether unification can be computed efficiently, or at all, depends on the subset of the abstract syntax from which inputs are drawn (i.e., it depends on what restrictions are placed on the abstract syntax trees that may need to be unified). The following algorithm, which we will call pattern matching unification, is guaranteed to terminate quickly (i.e., polynomial time) as long as at least one side of the equation contains no variables. It is guaranteed to quickly find a solution if it exists, as long as all variables occur exactly once.
  1. unify(a, b): two abstract syntax trees a and b
    1. if both a and b are leaf nodes and are equivalent
      1. return the empty substitution σ0
    2. if a is a variable node representing a variable x
      1. return the substitution {x ↦ b}
    3. if b is a variable node representing a variable x
      1. return the substitution {x ↦ a}
    4. if both a and b have the same label and the same number of children
      1. in order from left to right, unify each pair of corresponding children of a and b
      2. as long as they do not overlap on any variables, combine the substitutions obtained above
      3. return the combined substitution
    5. in all other cases (including when the substitutions obtained for the children overlap on a variable)
      1. report that a and b cannot be unified
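Using Haskell (which is introduced in the subsections that follow), a minimal sketch of this pattern matching unification algorithm over a small abstract syntax of leaves, variables, and binary nodes might look as follows. The names Term, Subst, combine, and unify are introduced here only for illustration; substitutions are represented as lists of pairs, and Nothing indicates that unification fails.

data Term = TLeaf | TVar String | TNode Term Term deriving (Eq, Show)

type Subst = [(String, Term)]

-- Combine two substitutions, failing if their domains overlap.
combine :: Subst -> Subst -> Maybe Subst
combine s1 s2 =
  if null [x | (x, _) <- s1, x `elem` [y | (y, _) <- s2]]
    then Just (s1 ++ s2)
    else Nothing

-- Steps 1 through 4 of the algorithm above; the last case reports failure.
unify :: Term -> Term -> Maybe Subst
unify (TLeaf      ) (TLeaf      ) = Just []
unify (TVar x     ) (t          ) = Just [(x, t)]
unify (t          ) (TVar x     ) = Just [(x, t)]
unify (TNode a1 a2) (TNode b1 b2) =
  case (unify a1 b1, unify a2 b2) of
    (Just s1, Just s2) -> combine s1 s2
    _                  -> Nothing
unify (_          ) (_          ) = Nothing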
Example: The Python operational semantics (and interpreter) employ a pattern matching unification algorithm in some cases. For example, it is possible to assign to an arbitrarily deep tree of nested tuples.

>>> (a,(b,c),(d,(e,f),g)) = (1,(2,3),(4,(5,6),7))
>>> (a,b,c,d,e,f,g)
(1,2,3,4,5,6,7)
        

[link]  7.3. Declarative programming languages

Any programming language that allows users to provide a static, immutable description of data, a model, or a problem supports the declarative programming paradigm to at least some degree. For example, languages that allow users to specify types and relationships between those types (e.g., Haskell types, Java classes, and so on) support a declarative programming paradigm for types.

In this course we will focus on declarative languages that allow users to specify data types, relationships between data types, and pure mathematical functions over those data types. In particular, we will look at the Haskell programming language.

In order to define the operational semantics for a declarative language, we must slightly modify the way in which we define an operational semantics by extending the premises and conclusions with an additional parameter representing the data, model, or problem description. We will adopt the convention of calling this parameter M. Thus, an interpreter for a declarative programming language first builds data, a model, or a problem description by parsing and collecting into a single structure all the declarations in a program. It then allows the user to query that structure.

In the Haskell programming language, a module is a collection of declarations (that can occur in any order) of variables, functions, and types. For example:

module Example where

c = 5;

g (0) = 0;
g (x) = f(x) + f(x);

f (x) = 2 * x + 1;
      
A Haskell interpreter first assembles the module without performing any evaluation, and then allows users to make queries about that module. Queries take the form of expressions, and the response to a query is the evaluation of that expression with respect to the entire module (i.e., all the definitions in the module).

*> c + c
10
*> f(4)
9
      
Haskell natively supports lists and tuples, and the evaluation algorithm employs the pattern matching unification algorithm. Suppose we have the following module:

module AnotherExample where

f (-1) = 1;
f (0) = 0;
f (1) = -1;

(a,(b,(c,d),e)) = (1,(2,(3,4),5));

g([x,y,z]) = [z,y,x];
      
We can query the above module as follows:

*> f(-1)
1
*> d
4
*> g([1,2,3])
[3,2,1]
      

[link]  7.4. Algebraic data types

Some declarative languages natively support abstract mathematical, logical, and algebraic structures such as BNF notation, abstract syntaxes, logical formulas and predicates, and inference rules. Haskell natively supports user-defined algebraic data types. An algebraic data type is effectively a user-defined abstract syntax; the Haskell syntax for defining algebraic data types corresponds closely to BNF notation. In the module below, we define an algebraic data type for representing binary trees.

module Tree where

data Tree = Leaf | Node Tree Tree;
      
Most Haskell interpreters and compilers support infix notation in data type definitions as well.

data Tree = Leaf | Tree `Node` Tree;
      
Once the above module is parsed and built, the user is allowed to construct instances of the Tree algebraic data type. However, in order to actually display instances of the data type, it is necessary to ask the Haskell interpreter to derive a definition of a show function automatically from the definition of the algebraic data type (something that is done implicitly in, for example, Python). Haskell requires that this be requested explicitly because it also allows users to define their own variants of such functions.

module Tree where

data Tree = 
    Leaf 
  | Node Tree Tree 
  deriving Show;
      
We can now try building trees using this algebraic data type.

*> Leaf           -- We represented this using "Leaf" in Python.
Leaf

*> Node Leaf Leaf -- We represented this using {"Node":["Leaf", "Leaf"]} in Python.
Node Leaf Leaf

*> Leaf `Node` Leaf
Node Leaf Leaf

*> Leaf `Node` ((Leaf `Node` Leaf) `Node` (Leaf `Node` Leaf))
Node Leaf (Node (Node Leaf Leaf) (Node Leaf Leaf))
      
Haskell's native support for pattern matching unification makes it possible to write concise definitions for functions that are defined over trees.

-- Count the number of nodes in a tree that have exactly one non-leaf child.
-- The spaces between the ( ... ) are just for legibility.
nodesWithOneLeaf (Node Leaf Leaf) = 0;
nodesWithOneLeaf (Node Leaf t   ) = 1 + nodesWithOneLeaf(t);
nodesWithOneLeaf (Node t    Leaf) = 1 + nodesWithOneLeaf(t);
nodesWithOneLeaf (Node t1   t2  ) = nodesWithOneLeaf(t1) + nodesWithOneLeaf(t2);
nodesWithOneLeaf (Leaf          ) = 0;
      
Algebraic data types are very well-suited for representing the abstract syntaxes of languages, and pattern matching unification makes it possible to define concise implementations of the operational semantics of languages (as well as any other transformations over abstract syntax trees, including optimization algorithms, static analysis algorithms, and compilation algorithms).
Example: Suppose we want to define an abstract syntax and simple interpreter for formulas. We can use the declarations below to do so.

module Formula where

data Formula = 
    T
  | F
  | Not Formula
  | And Formula Formula
  | Or Formula Formula
  deriving Show;

vNot (T) = F;
vNot (F) = T;

vAnd (T, T) = T;
vAnd (T, F) = F;
vAnd (F, T) = F;
vAnd (F, F) = F;

vOr (T, T) = T;
vOr (T, F) = T;
vOr (F, T) = T;
vOr (F, F) = F;

eval (T        ) = T;
eval (F        ) = F;
eval (Not f    ) = vNot (eval f);
eval (And f1 f2) = vAnd (eval f1, eval f2);
eval (Or  f1 f2) = vOr (eval f1, eval f2);
      
Using wildcards (written using the underscore _) in patterns makes it possible to write even more concise definitions.

vAnd (T, T) = T;
vAnd _      = F;

vOr (F, F) = F;
vOr _      = T;

eval (Not f    ) = vNot (eval f);
eval (And f1 f2) = vAnd (eval f1, eval f2);
eval (Or  f1 f2) = vOr (eval f1, eval f2);
eval (f)         = f;
        
To check equality of algebraic data type instances, it is necessary to either implement or derive the equality function ==. To have the Haskell interpreter or compiler derive it automatically, you can use deriving Eq. These can be combined with other deriving directives, such as in the example below.

data Formula = 
    T
  | F
  | Not Formula
  | And Formula Formula
  | Or Formula Formula
  deriving (Eq, Show);
        

[link]  7.5. Type system of the typed declarative language Haskell

Haskell natively supports defining and using algebraic data types, tuples, lists, and functions. The Haskell type system provides a good way to organize and document these data structures, and governs how they can be combined. The abstract syntax for a portion of the Haskell type system is provided below.
Haskell type τ
::=
Bool
|
Int
|
Integer
|
Char
|
()
|
( τ , τ )
|
( τ , τ , τ )
|
( τ , ... , τ )
|
[ τ ]
|
τ -> τ
Haskell allows users to define their own algebraic data types as well as their own type synonyms. Every Haskell module has two distinct namespaces for identifiers, and these two namespaces are completely disjoint: there is a namespace for identifiers that correspond to variables, functions, and constructors, and there is a namespace for identifiers that correspond to types. The concrete syntax determines to which namespace a given identifier belongs, and there is never any ambiguity in the concrete syntax in this regard. Note that this is unlike Python, where types can appear in any expression within a program. The following example illustrates the distinct namespaces for expressions and types:

module Example where

f :: Integer -> Integer
f (x) = x + x

data Tree = Leaf | Node Tree Tree

type AnotherNameForInteger = Integer

type AnotherNameForTree = Tree

isLeaf :: AnotherNameForTree -> Bool
isLeaf (Leaf      ) = True
isLeaf (Node t1 t2) = False

Lists in Haskell have their own list types. For example, a list of integers has the type [Integer].

list :: [Integer]
list = [1,2,3]
      
Haskell lists are actually represented using an algebraic data type that is similar to the Tree data type that we have already seen. If we were to define our own algebraic data type for lists in Haskell, for example, we could do so as follows (the terms "nil" and "cons" are commonly used in the functional language community for the empty list and for the list node constructor, respectively):

data List = Nil | Cons Integer List

example :: List
example = Cons 1 (Cons 2 (Cons 3 Nil))
      
The built-in list type has its own special names for the two constructors: the empty list is denoted [], and the list node constructor is the operator (:), which can also be written as an infix operator :. The following three syntaxes are equivalent, and evaluate to the same list [1,2,3]:

example1 :: [Integer]
example1 = (:) 1 ((:) 2 ((:) 3 []))

example2 :: [Integer]
example2 =  1 : (2 : (3 : []))

example3 :: [Integer]
example3 =  [1,2,3]
      
There is also a syntax for defining lists that contain ranges of integers:

*> [1..4]
[1,2,3,4]
      
Haskell supports user-defined type synonyms. These are simply substituted and have no restrictive effects.

type Distance = Integer
      
There are some default type synonym definitions, such as String; strings in Haskell are just lists of characters:

type String = [Char]

example1 :: String
example1 = "abcdefg"

example2 :: String
example2 = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
      
Type synonyms can be assigned any valid type, no matter how complex it may be. In the below example, the type synonym GameStrategyAlgorithm is equivalent to the type [[(Integer, Integer)]] -> String (i.e., the type for functions that take lists of lists of integer pairs and return a string):

type GameState = [(Integer, Integer)]
type Game = [GameState]
type Strategy = String
type GameStrategyAlgorithm = Game -> Strategy
      

[link]  7.6. Defining and working with infinite data type instances using lazy evaluation

Since Haskell expressions are only evaluated when a query is made, a Haskell module can safely describe recursively defined, potentially infinitely large values. For example, we can define an infinite list containing the integer 1 in every position:

infinite :: [Integer]
infinite = [1] ++ infinite
      
Equivalently, we can use the (:) list node constructor:

infinite :: [Integer]
infinite = 1 : infinite
      
Haskell also allows infinite lists containing ranges to be declared using the list range notation:

*> take 10 [5..]
[5,6,7,8,9,10,11,12,13,14]
      
The Haskell operational semantics is based on pattern matching unification, but it doesn't work exactly like the operational semantics we implemented in Assignment #4. Haskell supports lazy evaluation. Intuitively, when trying to evaluate a function application of the form f(e), Haskell will only evaluate the expression e using substitutions until it is sufficiently deep to try all the patterns in the definition of f (e.g., if the patterns associated with f are all of depth at most 3, the expression e will be evaluated until the constructors up to depth 3 are known, and no further).

Haskell's lazy evaluation semantics makes it possible to perform queries on infinite values without running into an infinite recursion or infinite loop. For example, suppose we define the following function on lists, which takes only the first n elements of a list (note that a similar function, take, is already included in the Prelude; we use take_ for our examples to avoid a name collision).

take_ :: (Integer, [Integer]) -> [Integer]
take_ (0, xs    ) = []
take_ (n, (x:xs)) = x : take_ (n-1, xs)
      
Querying the infinite list is not a problem if we use take_.

*> take_ (10, infinite)
[1,1,1,1,1,1,1,1,1,1]
      
Using pattern matching unification and lazy evaluation, it is also possible to define algorithms that operate on infinite lists. For example, we can define a function that adds all the corresponding elements of two infinite lists. Notice that there is no case for the empty list in this definition because infinite lists are never empty.

addInfiniteLists :: ([Integer], [Integer]) -> [Integer]
addInfiniteLists (x:xs, y:ys) = (x + y) : addInfiniteLists (xs, ys)
      
We can now use the addInfiniteLists function to add two infinite lists. Suppose we have the following definitions:

ones = 1 : ones
twos = 2 : twos
      
We can then define an infinite list in which all entries are 3:

*> take 10 (addInfiniteLists (ones, twos))
[3,3,3,3,3,3,3,3,3,3]
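As another illustration (using the definitions above), the infinite list of natural numbers can be defined in terms of itself: every entry after the first is obtained by adding an element of ones to the previous entry. The name nats below is ours, used only for illustration.

nats :: [Integer]
nats = 0 : addInfiniteLists (nats, ones)

Querying it behaves as expected:

*> take 10 nats
[0,1,2,3,4,5,6,7,8,9]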
      

[link]  7.7. Functions and higher-order functions in a typed functional language

Haskell is a functional language; functional languages are distinguished by their ability to allow users to define and use higher-order functions: functions that can take other functions as arguments, and that can return functions as results. Many other languages have native support for higher-order functions, including JavaScript and Python; we say that these languages support the functional programming paradigm.

To start becoming familiar with the concept of a higher-order function, we can consider the following example:

f :: Integer -> Integer
f (x) = x + 1

g :: Integer -> (Integer -> Integer)
g (y) = f
      
Notice that g is a function that takes an integer as an argument. It then throws this argument away and returns the function f. Its type reflects this. Thus, the result of evaluating an expression such as g(4) is a function of type Integer -> Integer, and this function can be applied to another argument.

*> :t g(1)
g(1) :: Integer -> Integer
*> :t g(1)(5)
g(1)(5) :: Integer
*> g(1)(5)
6
      
What if we want to define a function like g that can be partially applied to only some of its arguments, but that still uses all of its arguments? We can do so by listing multiple, space-separated parameters in the definition of g.

g :: Integer -> (Integer -> Integer)
g (y) (x) = y + x
      
It is then possible to apply g to arguments one at a time, as before.

*> :t g(3)
g(3) :: Integer -> Integer
*> :t g(3)(4)
g(3)(4) :: Integer
*> g(3)(4)
7
      
Notice the relationship between application of functions to arguments, and the types of each intermediate result. Suppose we have the following definitions.

data A = A
data B = B
data C = C
data D = D
data E = E

f :: A -> (B -> (C -> (D -> E)))
f a b c d = E
      
Then we have the following types for each of the following expressions (spacing adjusted for legibility):

*> :t f
f             :: A -> (B -> (C -> (D -> E)))

*> :t f(A)
f(A)          :: B -> (C -> (D -> E))

*> :t f(A)(B)
f(A)(B)       :: C -> (D -> E)

*> :t f(A)(B)(C)
f(A)(B)(C)    :: D -> E

*> :t f(A)(B)(C)(D)
f(A)(B)(C)(D) :: E
      
Example: We want to write a higher-order function that takes any function f of type Integer -> Integer and returns a function f' that is its reflection across the x-axis (i.e., for all x, f(x) = -f'(x)).

reflect :: (Integer -> Integer) -> Integer -> Integer
reflect (f) (x) = - f(x) 
        
Notice that we could also write the type as (Integer -> Integer) -> (Integer -> Integer) because -> is right-associative. This makes it more obvious that reflect takes a function of type Integer -> Integer as an input and returns a function of type Integer -> Integer as an output.
Example: We want to write a higher-order function that takes any function f of type Integer -> Integer and returns an integer approximation of its definite integral starting at x = 0.

integral :: (Integer -> Integer) -> Integer -> Integer
integral (f) (x) = sum [ f(k) | k <- [0..x] ]
        
We could also write a higher-order function that takes any function f of type Double -> Double and returns an approximation of its definite integral starting at x = 0 for a given precision.

integral :: Double -> (Double -> Double) -> Double -> Double
integral (i) (f) (x) = sum [ f(k*i)*i | k <- [0..(x/i)] ]
        
Example: We want to write a higher-order function that takes any function f of type Double -> Double and returns an approximation of its derivative.

derivative :: Double -> (Double -> Double) -> Double -> Double
derivative (h) (f) (x) = (f(x + h) - f(x)) / h
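For instance, applying derivative to the squaring function at x = 3.0 with step size h = 0.001 yields a value close to the true derivative 6 (the name slope below is ours, and the exact result is subject to floating-point rounding):

slope :: Double
slope = derivative (0.001) (\x -> x * x) (3.0)   -- approximately 6.001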
        

[link]  7.8. Logic programming

Declarative languages are well-suited for describing logical models and then examining them by making queries. Declarative languages like Prolog and Datalog are designed specifically to support logic programming. In Haskell, it is possible to use functions that return boolean values to represent logical predicates, and boolean expressions to represent the formulas that define them.
Example: Suppose we want to model a simple access control system (e.g., for a file system). Notice that we can view functions like user() and owns() either as tables in a database (with each declaration for the function acting as a row in the table) or as logical predicates.

data Constant = Alice | Bob | X | Y | Z deriving (Eq, Show)

-- Define which constants are a file.
file (X) = True
file (Y) = True
file (Z) = True
file (_) = False

-- Define which constants are a user.
user (Alice) = True
user (Bob)   = True
user (_)     = False

-- Specify which users are administrators.
admin(Alice) = True
admin(Bob)   = False

-- Specify file ownership relation.
owns (Alice, X) = True
owns (Bob,   Y) = True
owns (Bob,   Z) = True
owns (_,     _) = False

-- Only owners and administrators can write files.
write (user', file') = 
     ( owns(user', file') || admin(user'))
  && user(user')
  && file(file')

-- Anyone can read files.
read (user, file) = True
        
In the above example, we needed to define predicates that distinguished different types of constants. However, this means that every call to file() and user() must take place when the program is running. We can make the code more efficient by instead declaring distinct types. Since type checking is performed only at compile time in Haskell, this will still ensure that there are no type errors, but it will be more efficient because fewer function invocations will occur when the program is running.

data User = Alice | Bob deriving (Eq, Show)
data File = X | Y | Z deriving (Eq, Show)

admin :: User -> Bool
admin (Alice) = True
admin (Bob)   = False

owns :: (User, File) -> Bool
owns (Alice, X) = True
owns (Bob,   Y)   = True
owns (Bob,   Z)   = True
owns (_,     _)   = False

write :: (User, File) -> Bool
write (user, file) = owns(user, file) || admin(user)

read :: (User, File) -> Bool
read (user, file) = True
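With these declarations, access control questions become ordinary queries. For example, Alice may write the file Y because she is an administrator, while Bob may not write the file X because he neither owns it nor is an administrator:

*> write (Alice, Y)
True
*> write (Bob, X)
False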
        

[link]  7.9. Data encapsulation using algebraic data types

Typed languages that support algebraic data types make it possible to implement and enforce a kind of data encapsulation. For example, suppose we want to represent distances in various units (such as Imperial and metric units) in our program. There are well-known examples of expensive software failures (such as the crash of the Mars Climate Orbiter) that occurred because one component assumed that numerical data was in Imperial units, while another component assumed it was metric units.

data Meters = Meters Integer
data Feet = Feet Integer
      
Given the above, it is now possible to enforce which type of measurement is supported by a function. For example:

data Action = KeepGoing | SlowDown

updateVelocity :: Feet -> Action
updateVelocity (Feet n) = if n < 100 then SlowDown else KeepGoing
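With these wrapper types, passing a Meters value to updateVelocity is rejected by the type checker, so any unit conversion must be written explicitly. One possible (approximate) conversion function, introduced here only as an illustration:

feetFromMeters :: Meters -> Feet
feetFromMeters (Meters n) = Feet ((n * 3281) `div` 1000) -- 1 meter is about 3.281 feet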
      

[link]  7.10. Predefined Haskell type classes, user-defined operators, and embedded languages

Haskell allows programmers to define their own functions for converting algebraic data type values to strings (e.g., for displaying them in the interactive interface), and it allows programmers to provide custom definitions for numeric literals (i.e., 0, 1, 2, and so on) and built-in operators (such as +, *, and so on).

instance Num Meters where
  fromInteger n = Meters n
  (Meters x) + (Meters y) = Meters (x + y)
      
The above definitions allow us to use the numeric literals and built-in operators to work with values of type Meters.

*> 1 + 2 + 3 :: Meters
Meters 6
      
It is also possible to provide a user-defined function for converting values of type Meters to strings.

instance Show Meters where
  show (Meters n) = show n ++ " meters"
      
Then, we have the following behavior:

*> 1 + 2 + 3 :: Meters
6 meters
      
The Haskell features that support algebraic data types, multiple ways to apply higher-order functions to arguments, user-defined infix operators, and user-defined definitions of built-in literals and functions can be used in concert to easily define new programming languages within Haskell itself. This makes it much easier to build parse trees directly within Haskell, thus obviating the need for a parser.
Definition: An embedded language is a subset of the concrete syntax of a programming language (the host language) that is treated as its own separate, self-contained language. Libraries can often be viewed as embedded languages.
Example: Suppose we define the following algebraic data types, corresponding to an abstract syntax for an imperative programming language. This definition, along with some of Haskell's other features, will allow us to turn this into an embedded language within Haskell. We can then use Haskell's concrete syntax to write the concrete syntax for this language, and Haskell's evaluation algorithm will also act as the parser for this language (with the Haskell evaluation algorithm producing an abstract syntax tree representation of whatever is written).

data Term =
    Number Integer
  | Plus Term Term
  deriving Show
      
data Formula = 
    T 
  | F 
  | Not Formula 
  | And Formula Formula
  | Or Formula Formula
  | Equal Term Term
  | LessThan Term Term
  deriving Show

data Statement =
    Print Term Statement
  | If Formula Statement Statement
  | End
  deriving Show
        
We also introduce some user-defined infix operators, and add user-specified definitions for some built-in Haskell operators:

(<<<) :: Term -> Term -> Formula
(<<<) t1 t2 = LessThan t1 t2

(===) :: Term -> Term -> Formula
(===) t1 t2 = Equal t1 t2

(&&&) :: Formula -> Formula -> Formula
(&&&) f1 f2 = And f1 f2

(|||) :: Formula -> Formula -> Formula
(|||) f1 f2 = Or f1 f2

instance Num Term where
  fromInteger n = Number n
  t1 + t2 = Plus t1 t2
        
We can now use all of the above, together with Haskell's low-precedence function application operator $, to easily write down abstract syntax trees.

parseTree =
   Print 5 $
   Print 6 $
   If ((5 <<< 6) &&& (6 <<< 7)) (
     Print 6 $
     Print 7 $
     End
   ) $
   End
        
Evaluating the definition of parseTree will yield the actual algebraic data type instance (i.e., a parse tree).

*> parseTree
Print (Number 5) (Print (Number 6) (
  If
    (And (LessThan (Number 5) (Number 6)) (LessThan (Number 6) (Number 7))) 
    (Print (Number 6) (Print (Number 7) End)) End)
  )
        

[link]  7.11. Additional useful Haskell language features and examples

Analogous to the ternary conditional operator ?: in Java/C/C++, Haskell provides a built-in if ... then ... else ... construct for writing conditional expressions:

*> if True then 1 else 2
1
*> if False && True then "Yes" else "No"
"No"
      
However, this is only a convenience; a user can easily implement his or her own syntax for this purpose. The support for pattern matching unification and lazy evaluation that Haskell's operational semantics provides is sufficient. Two example implementations are provided below.

if' :: (Bool, Integer, Integer) -> Integer
if' (True,  trueBranch, falseBranch) = trueBranch
if' (False, trueBranch, falseBranch) = falseBranch

if'' :: Bool -> Integer -> Integer -> Integer
if'' True  trueBranch _           = trueBranch
if'' False _          falseBranch = falseBranch
      
The if ... then ... else ... syntax (or function, as we have seen above) can be useful in certain situations.
Example: Suppose we are trying to represent an environment or substitution using a data structure in Haskell. We might do so by using a list of tuples, where each tuple is a pair consisting of a String and a value (e.g., an Integer). For example:

[("x",1), ("y", 2)] :: [(String, Integer)]
        
Then, we would need a function that can help us retrieve a value associated with a particular variable name.

lookup' :: String -> [(String, Integer)] -> Integer
lookup' x ((x',i) : rest) = if x == x' then i else lookup' x rest
        
In fact, Haskell's base library contains a similar function, lookup. However, since it returns a result that is wrapped in a Maybe constructor, we can unwrap it using, e.g.:

unwrap :: Maybe Integer -> Integer
unwrap (Just n) = n
        
We discuss Maybe in more detail in a subsequent section.
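For example, combining the library function lookup with the unwrap function above:

*> unwrap (lookup "y" [("x", 1), ("y", 2)])
2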
Example: Suppose we want to write a simple interpreter for formulas (similar to the one in this example), except we also want to support variables in the language. The data type corresponding to the abstract syntax might then look as follows:

data Formula =
    T
  | F
  | Variable String
  | Xor Formula Formula
  deriving Show
      
We can then write a simple interpreter that uses environments (represented as lists of pairs, as in the previous example) in the following way (notice that the logical operation xor can be represented with the inequality operator /= on booleans):

eval :: [(String, Bool)] -> Formula -> Bool
eval env (T         ) = True
eval env (F         ) = False
eval env (Variable x) = lookup' x env
eval env (Xor f1 f2 ) = eval env f1 /= eval env f2
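Note that the eval function above needs a version of lookup' for environments that map variable names to Bool values (the lookup' defined in the previous example was specific to Integer values); for example:

lookup' :: String -> [(String, Bool)] -> Bool
lookup' x ((x', b) : rest) = if x == x' then b else lookup' x rest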
        
We can then run the evaluation algorithm above on a given environment and expression containing variables:

*> eval [("x", True), ("y", False), ("z", True)] (Xor (Variable "x") (Variable "y"))
True
        
Example: Suppose we want to represent our own totally ordered set. For example, suppose we have the following data type for representing a discrete subset of the color spectrum:

data Color = Red | Green | Blue deriving (Eq, Show)
        
Since Red has the lowest frequency and Blue has the highest frequency, we can define an ordering on all instances of this data type by making the Color type a member of the Ord type class. Note that we could use wildcards (_) to make the definition more concise.

instance Ord Color where
  Red   <= Red   = True
  Red   <= Green = True
  Red   <= Blue  = True
  Green <= Green = True
  Green <= Blue  = True
  Blue  <= Blue  = True
  _     <= _     = False
        
It is now possible to use the built-in operators <, <=, >, and >= to compare colors, as well as library functions such as max :: Color -> Color -> Color, min :: Color -> Color -> Color, maximum :: [Color] -> Color, and minimum :: [Color] -> Color.
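For example (assuming Color also derives Eq and Show, as in the definition above):

*> max Red Green
Green
*> maximum [Green, Red, Blue]
Blue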
Example: Suppose we have a data type that allows us to build trees that contain integer constants in their leaves.

data IntegerTree = Node IntegerTree IntegerTree | Leaf Integer
        
We can define a recursive function that assembles all the integers in the leaves in a single list.

ints :: IntegerTree -> [Integer]
ints (Node t1 t2) = ints t1 ++ ints t2
ints (Leaf n    ) = [n]
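For example:

*> ints (Node (Leaf 1) (Node (Leaf 2) (Leaf 3)))
[1,2,3]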
        
Example: Suppose we want to write a function that composes two functions on integers. We might define a type synonym for such functions:

type Function = Integer -> Integer
        
We then know that one possible type for a composition function might be as follows (it takes two function arguments, and returns a new composed function as a result):

compose :: Function -> Function -> Function
        
However, how do we actually implement compose?

compose f g = ?
        
To understand how we can do so, we must first expand the type of compose; let's substitute the right-most Function synonym with its definition:

compose :: Function -> Function -> (Integer -> Integer)
        
Now recall that the -> type operator is right-associative; this means we can remove the right-most set of parentheses without changing the meaning of the type:

compose :: Function -> Function -> Integer -> Integer
        
It is now clear how to implement compose: it takes two Function arguments, and one Integer argument:

compose :: Function -> Function -> Integer -> Integer
compose f g x = f(g(x))
        
Note that this function is already available in the Haskell base library, and it is called (.). So we could also have defined compose as follows:

compose :: Function -> Function -> Function
compose f g = f . g
        
The above is an example of point-free programming, because we never explicitly gave names to the individual inputs (i.e., "points") on which the functions operate.

[link]  7.12. List comprehensions and pattern matching unification

Haskell list comprehension syntax can be combined with Haskell's built-in pattern matching unification capabilities.
Example: In Haskell, list comprehensions make it possible to build up a list by iterating over the elements of one or more other lists (possibly filtering entries using one or more formulas).

*> [x+1 | x <- [0..4]]
[1,2,3,4,5]

*> [(x,y) | x <- [1..3], y <- [1..2]]
[(1,1),(1,2),(2,1),(2,2),(3,1),(3,2)]

*> [(x,y) | x <- [1..3], y <- [1..2], x + y == 3, x >= 1]
[(1,2),(2,1)]
        
Example: In Haskell list comprehensions, the left-hand side of the membership operator <- can be any pattern. If a pattern appears on the left-hand side of <-, then each element in the list on the right-hand side will be checked against the pattern. Only those elements that unify with the pattern will be considered, and for each of them, the subsequent conditions in the comprehension, as well as the body of the comprehension, will be evaluated.
Suppose a module contains a data type definition for a simple tree data structure, and a list of trees.

data Tree =
    Node Tree Tree
  | Leaf
  deriving (Eq, Show)

example :: [Tree]
example = [
    Node Leaf Leaf,
    Node (Node Leaf Leaf) Leaf,
    Node Leaf (Node Leaf Leaf),
    Node Leaf (Node (Node Leaf Leaf) Leaf)
  ]
        
Then we can evaluate comprehensions such as the following. The first expression builds a list of the right subtrees of all the trees in example. The second expression builds a list containing the right subtrees that are not leaves. The last expression builds a list of subtree pairs for each non-leaf tree.

*> [r | Node _ r <- example]
[Leaf, Leaf, Node Leaf Leaf, Node (Node Leaf Leaf) Leaf]

*> [r | Node _ r <- example, r /= Leaf]
[Node Leaf Leaf, Node (Node Leaf Leaf) Leaf]

*> [(l,r) | Node l r <- example]
[(Leaf, Leaf), (Node Leaf Leaf, Leaf), (Leaf, Node Leaf Leaf), (Leaf, Node (Node Leaf Leaf) Leaf)]
        
We can also easily check whether a particular kind of tree is in the list (such as a non-leaf tree) by adding a value of type () for each matching entry and then checking the length of the list resulting from the comprehension, or by comparing it to the empty list.

*> length [() | Node _ _ <- example] > 0
True

*> [() | Node _ _ <- example] /= []
True
        

[link]  7.13. Modeling and exploring state spaces in a typed declarative/functional language

In many artificial intelligence and optimization problems, the collection of possible actions an algorithm, agent, or player can take can be modelled as a state space tree or graph (also sometimes called a decision tree or decision graph) which may or may not be infinite, depending on the length of the process or game being modeled. Algebraic data types and lazy evaluation make it easy to model such graphs.
Example: Suppose we want to model a Tic-tac-toe game, with the eventual goal of writing an AI algorithm that can play the game well. We can first settle on a representation for the board.

data Cell = E | X | O deriving (Eq, Show)
type Pos = (Integer, Integer)
data Board = Board [(Pos, Cell)] deriving Eq

instance Show Board where
  show (Board pcs) = show [c | (p,c) <- pcs]

start :: Board
start = Board [((r,c),E) | r <- [1..3], c <- [1..3]]
        
Notice that the starting board consists of all empty cells.

*> start
[E,E,E,E,E,E,E,E,E]
        
When a player moves, they place an X or O somewhere on the board. We can model this using a function that, given a move, transforms one board into another board. We use a list comprehension that updates only the cell with the corresponding position.

move :: ((Pos, Cell), Board) -> Board
move ((pos', cell'), Board pcs) = Board [(pos, if pos' == pos then cell' else cell) | (pos, cell) <- pcs]
        
Suppose we want to make a state space graph that captures all possible paths a Tic-tac-toe game can take from a given starting position, and given a player's current turn. Let us assume that turns can be modelled using an infinite list. For example, here is an infinite list for a game in which the X and O players take alternating turns.

turns :: [Cell]
turns = X:O:turns
        
We first define a data structure for representing the possible paths a game can take. The data structure is a tree graph in which each node represents a game state (i.e., a Board), and can have any number of child trees (including zero).

data DecisionTree = Move Board [DecisionTree] deriving (Eq, Show)
        
Now, we can define a function that builds a decision tree given a starting game state and the player turn sequence. Notice that nextMoves takes a Board and a Cell (to indicate which player is taking a turn), and uses a list comprehension to choose only the board cells that are empty. For each empty cell, it adds a new board state to the list in which that position is filled with the specified player's symbol.

nextMoves :: (Board, Cell) -> [Board]
nextMoves (Board pcs, cell) = [move ((p, cell), Board pcs) | (p, E) <- pcs]
        
Given a starting board and turn sequence, we can now build the tree of state spaces for the entire game.

tree :: Board -> [Cell] -> DecisionTree
tree b (turn:turns) = Move b [tree b' turns | b' <- nextMoves (b, turn)]
        
Suppose we have a function win :: Cell -> Board -> Bool that checks if a given board is already a win for a particular player. We can then add an Ord instance to compare boards from player X's perspective: a board b' is considered at least as good as a board b (i.e., b <= b') if b' is a win for X or b is already a win for O.

instance Ord Board where
  b <= b' = win X b' || win O b
        
It is now possible to use the built-in operators <, <=, >, and >= to compare boards, as well as library functions such as max :: Board -> Board -> Board and maximum :: [Board] -> Board to choose the "best" boards.
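The win function itself is left unspecified above; one possible sketch (assuming the Board representation defined earlier, and checking only the three rows, three columns, and two diagonals of the 3-by-3 board) is the following:

win :: Cell -> Board -> Bool
win cell (Board pcs) = any full (rows ++ cols ++ diags)
  where
    -- A line of positions is full if every position holds the given symbol.
    full ps = and [lookup p pcs == Just cell | p <- ps]
    rows    = [[(r, c) | c <- [1..3]] | r <- [1..3]]
    cols    = [[(r, c) | r <- [1..3]] | c <- [1..3]]
    diags   = [[(i, i) | i <- [1..3]], [(i, 4 - i) | i <- [1..3]]]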

[link]  7.14. Ad hoc and parametric polymorphism

Definition: A language supports ad hoc polymorphism if it allows programmers to specify multiple definitions for the same constant or variable (which could be a literal, an operator symbol, a function name, a type, and so on). An interpreter or compiler then chooses the appropriate definition to use based on the context (usually, the types of the argument expressions and/or the surrounding expressions and statements in the abstract syntax tree).

Common examples of language features that support ad hoc polymorphism include operator overloading in C++ and Python (operator overloading is not supported in Java), method overloading in Java subclasses, Java interfaces, and type classes in Haskell.
Definition: A language supports parametric polymorphism if it allows programmers to specify a single definition for a function or procedure that can then be applied to values of multiple types (typically, a large range of types, or even all types). Typically, the function or procedure will operate on a container type, or some other parameterized or generic type. Thus, languages that support parametric polymorphism usually allow programmers to define data structures that can be parameterized by other data structures.

Common examples of language features that support parametric polymorphism include Java generics and the Haskell type system (the latter uses an extension of the Hindley-Milner type inference algorithm that is also used by the programming language ML).
Note an important distinction between ad hoc polymorphism and parametric polymorphism: ad hoc polymorphism allows programmers to continue adding new definitions for a given variable or constant within the same scope, while parametric polymorphism only permits a single, albeit generic, definition for a function or procedure within any given scope.
Example: Suppose we define a function len for computing lengths of lists of integers.

len :: [Integer] -> Integer
len []         = 0
len ((:) x xs) = 1 + len xs
        
Unfortunately, we cannot use the above function to compute the length of a list of, e.g., Bool values because the type [Integer] does not unify with the type [Bool]. We could write an alternative type to address this.

len :: [Bool] -> Integer
len []         = 0
len ((:) x xs) = 1 + len xs
        
We can now use len on lists of type [Bool], but we can no longer use it on lists of type [Integer]. Notice that the only thing that has changed is the type of the function; the body is the same in both cases. Haskell allows us to take advantage of this by employing a type variable.

len :: [a] -> Integer
len []         = 0
len ((:) x xs) = 1 + len xs
        
With the above declaration, it is now possible to apply the len function to a list containing any type of element. This is because the type variable a unifies with any other type (including Bool and Integer), so the input type [a] will unify with both [Integer] and [Bool], as well as types such as [String -> String] (a list of functions that take String inputs and return String outputs).

*> len [True, False, True]
3
*> len [1, 2, 3, 4, 5]
5
*> len [\s -> s ++ ".", \t -> t ++ ";"]
2
        
Since len is a function that can be applied to different input types, but has only a single definition that works for all input types, it is an example of parametric polymorphism.
Example: We can define a Set data structure for holding sets of Integer values that has an underlying list implementation.

module IntegerSet where
  
data Set =
    Empty
  | Set [Integer]
  deriving Show   
 
insert :: Set -> Integer -> Set
insert (Set l) x = Set (x:l)
insert (Empty) x = Set [x]

contains :: Integer -> Set -> Bool
contains x (Empty) = False
contains x (Set l) = x `elem` l
        
However, what if we also want users to have the option of using a data structure for sets that uses an underlying binary search tree implementation? We can first define a class for set data structures that specifies the signatures of the functions that a set data structure must support. We can even define new functions that rely only on the signature.

module SetClass where

class IsSet s where
  empty :: s
  insert :: s -> Integer -> s
  contains :: Integer -> s -> Bool
  
  inserts :: s -> [Integer] -> s
  inserts s (x:xs) = inserts (insert s x) xs
  inserts s []     = s
        
This makes it possible to provide two different implementations that are both members of this class.

module IntegerSetList where

import SetClass
  
data IntegerSetList =
    Empty
  | Set [Integer]
  deriving Show   

instance IsSet IntegerSetList where
  empty = Empty

  insert (Set l) x = Set (x:l)
  insert (Empty) x = Set [x]

  contains x (Empty) = False
  contains x (Set l) = x `elem` l
        

module IntegerSetTree where

import SetClass

data IntegerSetTree =
    Leaf
  | Node Integer IntegerSetTree IntegerSetTree
  deriving Show   

instance IsSet IntegerSetTree where
  empty = Leaf

  insert (Leaf       ) x = Node x Leaf Leaf
  insert (Node y s s') x =
    if x < y then 
      Node y (insert s x) s' 
    else
      Node y s (insert s' x)

  contains x (Leaf       ) = False
  contains x (Node y s s') =
    if y == x then
      True
    else if x < y then
      contains x s
    else
      contains x s'
        
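Because client code can be written against the IsSet signatures alone, it works with either implementation. The following is a minimal sketch (the module name SetClient and the helper fromList are ours, not part of the modules above):

module SetClient where

import SetClass
import IntegerSetList
import IntegerSetTree

-- Build a set from a list of integers using only the IsSet interface.
fromList :: IsSet s => [Integer] -> s
fromList = inserts empty
        

*> contains 2 (fromList [1,2,3] :: IntegerSetList)
True
*> contains 2 (fromList [1,2,3] :: IntegerSetTree)
True
        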
The members of a Haskell type class need not be types; they can also be type constructors (i.e., functions in the type system that construct a type when given a type, such as type constructors of kind Type -> Type, and so on).
Example: We can define a Set data structure for holding sets of values of any type a.

module Set where
  
data Set a =
    Empty
  | Set [a]
  deriving Show   
 
insert :: Set a -> a -> Set a
insert (Set l) x = Set (x:l)
insert (Empty) x = Set [x]

contains :: Eq a => a -> Set a -> Bool
contains x (Empty) = False
contains x (Set l) = x `elem` l
        
However, what if we also want users to have the option of using a data structure for sets that uses an underlying binary search tree implementation? We can first define a class for set data structures that specifies the signatures of the functions that a set data structure must support. We can even define new functions that rely only on these signatures.

module SetClass where

class IsSet s where
  empty :: s a
  insert :: Ord a => s a -> a -> s a
  contains :: (Ord a, Eq a) => a -> s a -> Bool
  
  inserts :: Ord a => s a -> [a] -> s a
  inserts s (x:xs) = inserts (insert s x) xs
  inserts s []     = s
        
This makes it possible to provide two different implementations of polymorphic set data structures that are both members of this class. Notice that we added the Ord a constraint on the element type a in case we want to compare them (e.g., if we are using a binary search tree implementation of sets).

module SetList where

import SetClass
  
data SetList a =
    Empty
  | Set [a]
  deriving Show   

instance IsSet SetList where
  empty = Empty

  insert (Set l) x = Set (x:l)
  insert (Empty) x = Set [x]

  contains x (Empty) = False
  contains x (Set l) = x `elem` l
        

module SetTree where

import SetClass

data SetTree a =
    Leaf
  | Node a (SetTree a) (SetTree a)
  deriving Show   

instance IsSet SetTree where
  empty = Leaf

  insert (Leaf       ) x = Node x Leaf Leaf
  insert (Node y s s') x =
    if x < y then 
      Node y (insert s x) s' 
    else
      Node y s (insert s' x)

  contains x (Leaf       ) = False
  contains x (Node y s s') =
    if y == x then
      True
    else if x < y then
      contains x s
    else
      contains x s'
        
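As before, code written against the class works with either implementation, and now also with any element type that satisfies the constraints. A hypothetical GHCi session (assuming the SetClass, SetList, and SetTree modules above are loaded):

*> contains "b" (inserts (empty :: SetTree String) ["a", "b", "c"])
True
*> contains 2 (inserts (empty :: SetList Integer) [3, 1, 2])
True
        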
Example: Suppose we have a Set data structure for holding sets of values of any type a that has a corresponding size :: Set a -> Int function.

module Set where
  
data Set a =
    Empty
  | Set [a]
  deriving (Eq, Show)   
 
size :: Set a -> Int
size (Set l) = length l
size (Empty) = 0
        
We can make it possible to order sets by their size using the < operator by making the type Set a (for any possible a) a member of the Ord type class.

instance Eq a => Ord (Set a) where
  s < s' = size s < size s'
        
Note that the instance declaration is prepended with the condition that the type a must be in the Eq type class. This is because the definition of Set a contains a deriving Eq clause, which means that a type Set a can only be used in a module if the parameter type a is a member of the Eq type class (otherwise, Haskell would not be able to compare two sets for equality using the automatically derived definition of (==) generated by the deriving Eq clause).
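Note also that the instance above defines only the (<) method; a complete Ord instance would normally also define compare or (<=). Restricting ourselves to (<), a hypothetical GHCi session might look as follows:

*> Set [1,2,3] < Set [4,5]
False
*> (Empty :: Set Integer) < Set [1]
True
        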

[link]  7.15. Folds, Unfolds, and Algebraic Properties of Functional Programs

We have seen that the operational semantics of declarative and functional languages support an approach to problem solving that involves first defining a model of some system, and then querying that model to answer questions about that system or explore its properties. We can define this approach to programming with more consistency and mathematical rigor by identifying the mathematical transformations that are typically employed and composed when using this approach. In particular, most component operations fall into two categories:
  • unfold (or anamorphism) operations are used to define or construct a data value;
  • fold (or catamorphism) operations are used to reduce, simplify, break down, or transform a data value.
A few special cases of the composition of a fold followed by an unfold are commonly used, as well (these are commonly implemented using comprehensions when the data value being transformed is a set or list):
  • map (sometimes known as metamorphism) operations are used to modify all the components of a data value using a single, often local transformation that is applied to all the components across the entire data value;
  • filter operations are used to remove (or select) only some of the components of a data value.

The ability to describe an algorithm using a composition of unfold, fold, map, and filter operations imposes fewer restrictions on how that computation can be carried out (when compared to a description of the same algorithm using a sequential list of operations). A description using fold, map, and so on may still impose some hierarchical dependencies on the computation, but the operational semantics of such a description allows an implementation (or a compiled version) of the algorithm to perform operations in parallel unless such parallelization is explicitly forbidden by the algorithm itself through explicit sequential dependencies introduced by the programmer. In other words, the default interpretation of such an algorithm definition is inherently parallelizable.

Defining the above operations on algebraic data types (including polymorphic algebraic data types) is often a straightforward process (in fact, it can often be automated). The Haskell base library provides functions for all of the above operations on Haskell lists, and many library functions are implemented using these functions; the Haskell compiler is optimized to compile these functions into efficient machine code. Most of the above functions are also supported by languages such as Python.
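The examples below focus on fold operations; for completeness, here is a brief sketch of an unfold, using the unfoldr function from the Data.List module (the name countdown is ours). Instead of reducing a list to a value, it constructs a list from a seed value:

import Data.List (unfoldr)

-- Construct the list [n, n-1, ..., 1] from the seed value n.
countdown :: Integer -> [Integer]
countdown = unfoldr (\k -> if k <= 0 then Nothing else Just (k, k - 1))
        

*> countdown 5
[5,4,3,2,1]
        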
Example: We can begin by implementing a fold operation on Haskell Integer lists that does nothing.

fold :: [Integer] -> [Integer]
fold []     = []
fold (x:xs) = x : fold xs
        
We can modify the above to replace the two list constructors (i.e., (:) and []) in the input list with a binary operator and a base case (in this case, (+) and 0). This way, the fold function now computes the sum of the elements in the list.

fold :: [Integer] -> Integer
fold []     = 0
fold (x:xs) = x + fold xs
        
We can further generalize the above so that it takes the two replacements as arguments: a replacement for (:), which must be of type Integer -> Integer -> Integer, and a replacement for [], which must be of type Integer.

fold :: (Integer -> Integer -> Integer) -> Integer -> [Integer] -> Integer
fold f b []     = b
fold f b (x:xs) = f x (fold f b xs)
        
We can now use the above fold implementation to define many common functions on lists of integers.

sum :: [Integer] -> Integer
sum = fold (+) 0

product :: [Integer] -> Integer
product = fold (*) 1

maximum :: [Integer] -> Integer
maximum = fold max 0 -- Assumes non-negative integers.

minimum :: [Integer] -> Integer
minimum l = fold min (fold max 0 l) l
        
Example: The fold function for the default right-associative list implementation supported by Haskell (i.e., with the built-in constructors (:) and []) is called foldr.

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f base []     = base
foldr f base (x:xs) = f x (foldr f base xs)
        
The foldr function can be used to implement many of the usual library functions one might find useful when working with lists, and this is what is done in the Haskell libraries. Below are some examples (some of these are simplified versions of what is actually found in the libraries).

sum :: Num a => [a] -> a
sum xs = foldr (+) 0 xs

max :: Ord a => a -> a -> a
max x y = if x > y then x else y

maximum :: (Num a, Ord a) => [a] -> a
maximum xs = foldr max 0 xs -- Simplified; assumes non-negative entries.
        
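As two more illustrations (our own simplified sketches, not the actual library definitions), the length of a list and list concatenation can also be expressed using foldr:

length :: [a] -> Integer
length xs = foldr (\_ n -> n + 1) 0 xs

append :: [a] -> [a] -> [a]
append xs ys = foldr (:) ys xs
        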
Example: We can implement any map operation as a fold. Suppose we want to add the constant 1 to every element in an Integer list. For example, we want to achieve the following using a call to foldr:

*> [x + 1 | x <- [1,2,3,4,5]]
[2,3,4,5,6]
        
We can accomplish this by defining a modified list node constructor function that adds 1 to its first argument, then builds the list node:

addOneThenCons :: Integer -> [Integer] -> [Integer]
addOneThenCons x xs = (x + 1) : xs
        
We can now use the above together with foldr:

*> foldr addOneThenCons [] [1,2,3,4,5]
[2,3,4,5,6]
        
Using λ abstractions, we can avoid having to explicitly define addOneThenCons:

*> foldr (\x xs -> (x+1) : xs) [] [1,2,3,4,5]
[2,3,4,5,6]
        
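More generally, the map operation itself can be obtained as a fold by taking the function being applied as a parameter (a sketch of the idea, not the actual library definition of map):

map :: (a -> b) -> [a] -> [b]
map f = foldr (\x ys -> f x : ys) []
        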
Example: We can implement any filter operation as a fold. For example, suppose we want to only keep positive integers from an integer list:

*> [x | x <- [-3,-2,-1,0,1,2,3], x > 0]
[1,2,3]
        
We can accomplish this by defining a modified list node constructor function that only adds a new list node if the data inside it is a positive integer:

consIfPositive :: Integer -> [Integer] -> [Integer]
consIfPositive x xs = if x > 0 then x : xs else xs
        
We can now use the above together with foldr:

*> foldr consIfPositive [] [-3,-2,-1,0,1,2,3]
[1,2,3]
        
Using λ abstractions, we can avoid having to explicitly define consIfPositive:

*> foldr (\x xs -> if x > 0 then x : xs else xs) [] [-3,-2,-1,0,1,2,3]
[1,2,3]
        
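More generally, the filter operation itself can be obtained as a fold by taking the predicate as a parameter (a sketch of the idea, not the actual library definition of filter):

filter :: (a -> Bool) -> [a] -> [a]
filter p = foldr (\x ys -> if p x then x : ys else ys) []
        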
Example: We can implement an easily parallelizable version of quicksort using filter operations:

qsort :: [Integer] -> [Integer]
qsort []     = []
qsort (x:xs) = qsort [y | y <- xs, y < x] ++ [x] ++ qsort [y | y <- xs, y >= x]
        
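The same algorithm can be written using explicit filter operations instead of comprehensions (an equivalent sketch; the name qsort' is ours):

qsort' :: [Integer] -> [Integer]
qsort' []     = []
qsort' (x:xs) = qsort' (filter (< x) xs) ++ [x] ++ qsort' (filter (>= x) xs)
        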
Example: For associative binary operators, we can implement a fold operation that recursively splits the problem into two halves, over and over, until a base case is reached.

fold :: (a -> a -> a) -> a -> [a] -> a
fold f b []  = b
fold f b [x] = x
fold f b xs  =
  fold f b (take (length xs `div` 2) xs)
    `f` 
  fold f b (drop (length xs `div` 2) xs)
        
If each recursive invocation were run on a separate thread, processor, server, and so on, this would maximize the amount of parallelism that can be achieved in performing this fold operation.
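For example, a hypothetical GHCi session using this balanced fold (the results agree with the left-to-right folds above because (+) and (++) are associative and the supplied base cases are their identities):

*> fold (+) 0 [1,2,3,4,5,6,7,8]
36
*> fold (++) [] [[1],[2],[3],[4]]
[1,2,3,4]
        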
Example (filter, map, and fold operations in SQL-like languages): SQL-like languages (as well as the relational algebra on which they are based) can be viewed as supporting fold, map, and filter operations (and compositions thereof) on a particular data structure: a table. This allows databases to be distributed across multiple storage devices, and it makes it possible to simultaneously query different portions of a database, stored on different devices, using multiple computational devices (processors, virtual machines, servers, and so on) running in parallel.

For example, suppose an SQL table consists of some number of rows, and each row has an entry of the form (Name, Age):

+---------------+
|    People     |
+---------------+
| Name  |  Age  |
+-------+-------+
| Alice |  20   |
| Bob   |  17   |
| Carl  |  23   |
+-------+-------+
        
The following SQL query allows a user to retrieve only the Name entries in the table; this query amounts to a map operation:

> SELECT Name FROM People

+-------+
| Name  |
+-------+
| Alice |
| Bob   |
| Carl  |
+-------+
        
The following SQL query allows a user to retrieve only the entries within a certain Age range; this query amounts to a filter operation:

> SELECT * FROM People WHERE Age >= 18

+---------------+
| Name  |  Age  |
+-------+-------+
| Alice |  20   |
| Carl  |  23   |
+-------+-------+
        
The following SQL query allows a user to retrieve the sum of the Age fields of all the entries; this query amounts to a fold operation (it is often called an aggregation operation):

> SELECT SUM(Age) FROM People

+-------+
|  Age  |
+-------+
|  60   |
+-------+
        
We can easily simulate all of the above capabilities using another language that supports the declarative and functional programming paradigms, such as Haskell. Suppose we have the following definitions:

type Name = String
type Age = Integer
type Table = [(Name, Age)]

people = [("Alice", 20), ("Bob", 17), ("Carl", 23)]
        
We can then perform the following queries:

*> [name | (name, age) <- people]
["Alice", "Bob", "Carl"]

*> [(name, age) | (name, age) <- people, age >= 18]
[("Alice", 20), ("Carl", 23)]

*> foldr (+) 0 [age | (name, age) <- people]
60
        
Equivalently, we can use map, filter, foldr, and λ abstractions instead of list comprehensions (the Haskell compiler simply converts list comprehensions into some combination of these during compilation):

*> map (\(name, age) -> name) people
["Alice", "Bob", "Carl"]

*> map fst people
["Alice", "Bob", "Carl"]

*> filter (\(name, age) -> age >= 18) people
[("Alice", 20), ("Carl", 23)]

*> foldr (+) 0 (map snd people)
60
        
All of the above can also be done using Python list comprehensions:

>>> People = [("Alice", 20), ("Bob", 17), ("Carl", 23)]

>>> [name for (name, age) in People]
["Alice", "Bob", "Carl"]

>>> [(name, age) for (name, age) in People if age >= 18]
[('Alice', 20), ('Carl', 23)]

>>> sum([age for (name, age) in People])
60
        
Python also supports map, filter, reduce (similar to Haskell's fold functions, though it combines elements from the left, like foldl), and lambda abstractions:


>>> list(map((lambda person: person[0]), People))
['Alice', 'Bob', 'Carl']

>>> list(filter((lambda person: person[1] >= 18), People))
[('Alice', 20), ('Carl', 23)]

>>> from functools import reduce
>>> reduce(lambda x,y: x + y, [age for (name, age) in People])
60
        
Example (filter and map operations in JavaScript libraries): Many JavaScript libraries, such as jQuery/jQuery UI, node.js, d3, and others support an abstraction for manipulating web documents (which are often used as application components on the web today) that is organized around filter and map operations.

In jQuery, it is possible to select all the elements with a certain tag in an HTML document and then to specify a function that will be applied to each element selected. This corresponds to the composition of a filter and map operation.

For example, the following code selects all the li elements and updates the text content of each to indicate the item number corresponding to that element. Notice that the .text() function takes a function as an argument (the function supplied as an argument takes an index as an input and returns a string):

$("ul li").text(
  function( index ) {
    return "item number " + ( index + 1 );
  }
);
        
More generally, it's possible to update individual document elements using the .each() function.

[link]  7.16. Folds, Monads, and Algebraic Properties of Programs

In this section we introduce several consequences of viewing programs as a collection of functions (possibly assembled from building blocks such as folds and unfolds) that can be composed.
Example: Suppose we have the following definition for a polymorphic data type Tree a:

data Tree a =
    Node a (Tree a) (Tree a)
  | Leaf
  deriving Show
        
To determine how to put together a foldTree function for values of type Tree a, we first list all the constructors for Tree a (we could also get this information using the :t command in GHCi):

Node :: a -> Tree a -> Tree a -> Tree a
Leaf :: Tree a
        
A fold function will simply replace all instances of Node and Leaf in a tree with a different value or function (for example, let's call Node's replacement n and Leaf's replacement l); this means that the replacement values and functions must have types with the same structure (i.e., they must take the same number of arguments) as the constructors they replace. However, they will fold the tree into some new type of value b, so we replace every instance of Tree a with the new result type b:

n :: a -> b -> b -> b
l :: b
        
We can now write our foldTree function to take the above two arguments, and then the tree that is being folded as the last argument. Every Node will be replaced with n and every Leaf will be replaced with l. Notice that in the case of Node, we also recursively fold the trees, since we need to turn them into the result type b before we can apply n :: a -> b -> b -> b.

foldTree :: (a -> b -> b -> b) -> b -> Tree a -> b
foldTree n l (Node x t1 t2) = n x (foldTree n l t1) (foldTree n l t2)
foldTree n l Leaf           = l
        
Notice that a mapTree function is just a special case of foldTree in which n x t1 t2 = Node (f x) t1 t2 for some function f, and where l = Leaf, since we want to change each value of type a inside the tree without changing the structure of the tree itself. This also means we no longer need n and l as arguments; we only need f.

mapTree :: (a -> c) -> (Tree a -> Tree c)
mapTree f (Node x t1 t2) = Node (f x) (mapTree f t1) (mapTree f t2)
mapTree f Leaf           = Leaf
        
To reiterate, notice that the relationship between the two functions is mapTree f = foldTree (\x -> Node (f x)) Leaf (and we could have defined mapTree in this way):

mapTree f = foldTree (\x -> Node (f x)) Leaf
        
In the above, we are taking advantage of Haskell's support for partial application of functions: Node is applied to only its first argument f x, so Node (f x) is a function still waiting for two more arguments (the two subtrees). Alternatively but equivalently, we could have written the definition below:

mapTree f t = foldTree (\x t1 t2 -> Node (f x) t1 t2) Leaf t
        
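A brief hypothetical GHCi session illustrating both functions on a small tree:

*> let t = Node 1 (Node 2 Leaf Leaf) (Node 3 Leaf Leaf)
*> mapTree (*10) t
Node 10 (Node 20 Leaf Leaf) (Node 30 Leaf Leaf)
*> foldTree (\x l r -> x + l + r) 0 t
6
        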
Example: Common operations on abstract syntax trees (such as evaluation and execution) can also be represented as fold operations. Suppose we have the following implementation for an abstract syntax for formulas (which we have seen in previous examples):

data Formula = 
    T
  | F
  | Not Formula
  | And Formula Formula
  | Or Formula Formula
  deriving Show
        
We first list all the constructors for Formula (we could also get this information using the :t command in GHCi):

T :: Formula
F :: Formula
Not :: Formula -> Formula
And :: Formula -> Formula -> Formula
Or :: Formula -> Formula -> Formula
        
A fold function will simply replace all instances of each constructor in a tree with a different value or function (for example, let's call And's replacement a and T's replacement t); this means that the replacement values and functions must have types with the same structure (i.e., they must take the same number of arguments) as the constructors they replace. However, they will fold the formula into some new type of value b, so we replace every instance of Formula with the new result type b:

t :: b
f :: b
n :: b -> b
a :: b -> b -> b  
o :: b -> b -> b
        
We can now write our foldFormula function to take the above five arguments, and then the formula tree that is being folded as the last argument.

foldFormula :: b -> b -> (b -> b) -> (b -> b -> b) -> (b -> b -> b) -> Formula -> b
foldFormula t f n a o (T          ) = t
foldFormula t f n a o (F          ) = f
foldFormula t f n a o (Not formula) = n (foldFormula t f n a o formula) 
foldFormula t f n a o (And f1 f2  ) = a (foldFormula t f n a o f1) (foldFormula t f n a o f2)
foldFormula t f n a o (Or  f1 f2  ) = o (foldFormula t f n a o f1) (foldFormula t f n a o f2)
        
We can now implement two evaluation algorithms easily: one that evaluates the tree as a native Haskell Bool result, and one that evaluates the tree as a native Haskell Int result:

evalAsBool :: Formula -> Bool
evalAsBool = foldFormula True False not (&&) (||) 

evalAsInt :: Formula -> Int
evalAsInt = foldFormula 1 0 ((-) 1) (*) max
        
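For example, a hypothetical GHCi session:

*> evalAsBool (And T (Not F))
True
*> evalAsInt (And T (Not F))
1
*> evalAsInt (Or F F)
0
        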
Notice that the foldFormula function acts as a form of encapsulation for the Formula data structure. Suppose we want to change the data type definition to the following (using a terminology for formulas that corresponds to terminology from the study of logic):

data Formula = 
    Top
  | Bottom
  | Neg Formula
  | Conj Formula Formula
  | Disj Formula Formula
  deriving Show
        
We could then change the definition of foldFormula once, and would not need to change our implementations of evalAsBool and evalAsInt at all.

foldFormula :: b -> b -> (b -> b) -> (b -> b -> b) -> (b -> b -> b) -> Formula -> b
foldFormula t f n a o (Top         ) = t
foldFormula t f n a o (Bottom      ) = f
foldFormula t f n a o (Neg formula ) = n (foldFormula t f n a o formula) 
foldFormula t f n a o (Conj f1 f2  ) = a (foldFormula t f n a o f1) (foldFormula t f n a o f2)
foldFormula t f n a o (Disj  f1 f2 ) = o (foldFormula t f n a o f1) (foldFormula t f n a o f2)
        
Example: Suppose we are working with functions that may not always successfully return a result. We may want to document this in the type of each function. For example, suppose we have the following functions.

neg :: Integer -> Integer
neg n = -1 * n
        
log2 :: Integer -> Integer
log2 1 = 0
log2 2 = 1
log2 4 = 2
log2 _ = -1 -- Undefined in all other cases.
        
One problem with the definition of log2 is that the output -1 is ambiguous: it may be an actual result or an error. We can document that this result is exceptional by introducing a new parametrically polymorphic data type that allows us to return either a result or a constructor indicating that there is no result. In the Haskell Prelude, the Maybe a data type is defined for such situations (although we could create our own, as well).

data Maybe a = Just a | Nothing
        
We can now modify our implementation for log2.

log2 :: Integer -> Maybe Integer
log2 1 = Just 0
log2 2 = Just 1
log2 4 = Just 2
log2 _ = Nothing
        
However, we then run into a problem: we can no longer apply another function like neg directly to the result of log2 because Maybe Integer and Integer do not unify, so the Haskell type inference rule for function application is not satisfied. To address this, we can create a parametrically polymorphic map function for Maybe a.

mapMaybe :: (a -> b) -> (Maybe a -> Maybe b)
mapMaybe f (Just x ) = Just (f x)
mapMaybe f (Nothing) = Nothing
        
We can now apply neg to an input of type Maybe Integer.

*> (mapMaybe neg) (log2 (-2))
Nothing
*> (mapMaybe neg) (log2 4)
Just (-2)
        
Since mapMaybe can be described as "lifting" a function of type a -> b to a more powerful function Maybe a -> Maybe b, another common name for the map operation is lift.

liftMaybe :: (a -> b) -> (Maybe a -> Maybe b)
liftMaybe = mapMaybe
        
There is one more problem with this approach, however. What if we want to compose two functions that might both return results that are erroneous or undefined? We can see that we run into a problem.

*> :t (liftMaybe log2) (log2 (-2))
(liftMaybe log2) (log2 (-2)) :: Maybe (Maybe Integer)
        
However, we can address this by introducing another function that has a very natural definition, which we call a join operation.

joinMaybe :: Maybe (Maybe a) -> Maybe a
joinMaybe (Just (Just x)) = Just x
joinMaybe (Just Nothing ) = Nothing
joinMaybe (Nothing      ) = Nothing
        
We can now combine multiple operations and still obtain a result of type Maybe a.

*> joinMaybe $ (liftMaybe log2) (log2 (-2))
Nothing
        
When lift and join operations are defined for a container type, we sometimes call that type a monad. One application of this concept is in dealing with different kinds of errors in a large software application with many modular, interdependent components that can each return their own collection of erroneous or undefined outputs. It would not make sense to require that every component should handle the erroneous outputs of every other component (potentially leading to O(n²) different combinations to consider throughout the application).
A more tractable approach is to have each modular component provide a lift function for its particular error monad, and to have a dedicated module for all the join functions. This way, the problem of composing erroneous outputs is encapsulated and modularized in the definitions of the join operations, and every other component can write code under the assumption that it only needs to handle non-erroneous and well-defined inputs.
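The combination of a lift followed by a join is so common that it is usually packaged as a single operation, often called bind (written (>>=) in Haskell). A minimal sketch using the definitions above (the name bindMaybe is ours):

bindMaybe :: Maybe a -> (a -> Maybe b) -> Maybe b
bindMaybe m f = joinMaybe (liftMaybe f m)
        

*> bindMaybe (log2 4) log2
Just 1
*> bindMaybe (log2 (-2)) log2
Nothing
        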

[link]  Review 1. Programming Language Concepts

This section contains a collection of review problems going over all the course material. These problems are an accurate representation of the kinds of problems you may see on a quiz or exam.
Exercise: Suppose you are given the following type definitions for representing simple English sentences:

type Noun = String
type Adjective = String
type Verb = String
type Adverb = String

data NounPhrase = AN Adjective NounPhrase | N Noun
data VerbPhrase = AV Adverb VerbPhrase | V Verb
data Sentence = S NounPhrase VerbPhrase
      
  1. Determine the minimal substitution that solves the following equation under pattern-matching unification, or explain why no solution can exist (by demonstrating what base case causes an issue):
    S (AN a (N "cat")) (V b)
    =
    S (AN "yellow" (N "cat")) (V "runs")
  2. Determine the minimal substitution that solves the following equation under pattern-matching unification, or explain why no solution can exist (by demonstrating what base case causes an issue):
    S (AN a (AN b (N "cat"))) (V c)
    =
    S (AN "yellow" (N "cat")) (V "runs")
  3. Given the above data type definition, is it possible to write a single Haskell function pattern that will match any sentence in which "cat" is the subject?
Exercise: Adjust the following grammar definition so that it accepts exactly the same set of token sequences, but is not left-recursive:
number n ::= 0 | 1 | ... | 9
           | ( n )
           | n * n
Exercise: Write down a fold function for the following data type.

data Tree =
      Red Tree Tree 
    | Blue Tree Tree 
    | Green Integer
      
Exercise: Complete the following inference rules.
[???]

      _____ ⊢ _____ : TyBool
  ─────────────────────────────────
   Γ ⊢ print e ; _____ : _____

[???]

   Σ ⊎ {x ↦ v}, _____ ⇓ _____        _____, _____ ⇓ _____
  ───────────────────────────────────────────────────────────
   _____, x := e ; s _____
Exercise: Suppose you are given the following grammar definition and operational semantics:
actor r ::= foo | bar
action a ::= change r | is r

  is r ⇓ is r

  change foo ⇓ is bar

  change bar ⇓ is foo
According to the above operational semantics, to what should change foo evaluate?
Exercise: Answer the following questions by drawing diagrams (your diagrams may need to incorporate self-loops).
  1. Determine which of the following terms refer to data structures, and which refer to algorithms. Draw a flow chart incorporating all of the following components that demonstrates how they might interact in an actual implementation (there may be more than one correct answer, but the way they interact must be reasonable):
    • input file;
    • abstract syntax trees;
    • parser;
    • compiler;
    • type checker;
    • error message;
    • loop unroller;
    • machine instruction sequences.
  2. Draw a flow chart incorporating all of the following components that demonstrates how they might interact in an actual implementation:
    • exhaustive case generator;
    • abstract syntax trees;
    • compiler;
    • interpreter;
    • machine instruction sequences;
    • simulator;
    • outputs;
    • output comparison function.

[link]  Appendix A. Using gsubmit

In this course, you will submit your assignments using gsubmit. This section reproduces and extends some of the instructions already made available by the BU Computer Science Department.

[link]  A.1. Register for a CS account

You must obtain a CS account to use the csa machines maintained by the CS Dept. You will need to physically visit the undergraduate computing lab located at 730 Commonwealth Avenue, on the third floor in room 302.

[link]  A.2. Download SSH/SCP client software

You will need an SCP or SFTP client (such as WinSCP for Windows or CyberDuck for OS X) to copy files from your local computer to your csa home directory. If you are using Windows, you will also need an SSH client (such as PuTTY).

[link]  A.3. Submitting assignments using gsubmit

A typical workflow can be described as follows.
  1. You assemble your assignment solution file(s) on your own computer or device.
    [Diagram: hw1/hw1.py exists on your local device; your csa2/csa3 home directory and your gsubmit directory for CS 320 are still empty.]
  2. You log into csa2 or csa3 using an SCP or SSH client and create a directory for your submission in your CS account home directory. Note that in the examples below %> represents a terminal prompt, which may look different on your system.
    
    %> cd ~
    %> mkdir hw1
              
    [Diagram: hw1/hw1.py on your local device; an empty hw1 directory now exists in your csa2/csa3 home directory; your gsubmit directory for CS 320 is still empty.]
  3. If you have not already done so (e.g., if you were using an SSH client in the previous step), you log into csa2 or csa3 using an SCP client and copy your completed file(s) into that directory.
    [Diagram: hw1/hw1.py on your local device; hw1/hw1.py now also in your csa2/csa3 home directory; your gsubmit directory for CS 320 is still empty.]
  4. If you have not already done so, you log into csa2 or csa3 using an SSH client and run the gsubmit commands to copy the files from your CS account home directory to the gsubmit directories to which the course staff has access.
    
    %> cd ~
    %> gsubmit cs320 hw1
              
    [Diagram: hw1/hw1.py on your local device, in your csa2/csa3 home directory, and now also in your gsubmit directory for CS 320.]
  5. To view your submitted files, you can use the following command:
    
    %> gsubmit cs320 -ls
              
    To look at a file that has already been submitted, you can use:
    
    %> gsubmit cs320 -cat hw1/hw1.py
              
    After grades are posted (normally, this will be announced on the mailing list and in lecture), you can check your grade using:
    
    %> gsubmit cs320 -cat grade.hw1.txt
              

[link]  Appendix B. Python Reference

The Python programming language will be among the languages we use in this course. This language supports the object-oriented, imperative, and functional programming paradigms, has automatic memory management, and natively supports common high-level data structures such as lists and sets. Python is often used as an interpreted language, but it can also be compiled.

[link]  B.1. Obtaining Python

The latest version of Python 3 can be downloaded at: https://www.python.org/downloads/. In this course, we will require the use of Python 3, which has been installed on all the CS Department's undergraduate computing lab machines, as well as on csa2/csa3.

[link]  B.2. Assembling a Python module

The simplest Python program is a single file (called a module) with the file extension .py. For example, suppose the following is contained within a file called example.py:

# This is a comment in "example.py".
# Below is a Python statement.
print("Hello, world.")
      
Assuming Python is installed on your system, to run the above program from the command line you can use the following (you may need to use python3, python3.2, python3.3, etc. depending on the Python installation you're using). Note that in the examples below %> represents a terminal prompt, which may look different on your system.

%> python example.py
Hello, world.
      
If you run Python without an argument on the command line, you will enter Python's interactive prompt. You can then evaluate expressions and execute individual statements using this prompt; you can also load and execute a Python module file:

%> python
Python 3.2 ...
Type "help", "copyright", "credits" or "license" for more information.
>>> exec(open("example.py").read()) # Load "example.py" module.
Hello, world.
>>> x = "Hello." # Execute an assignment statement.
>>> print(x)     # Execute a "print" statement.
Hello.
>>> x            # Evaluate a string expression.
'Hello.'
>>> 1 + 2        # Evaluate a numerical expression.
3
      

[link]  B.3. Common data structures (i.e., Python expressions)

Python provides native support for several data structures that we will use throughout this course: integers, strings, lists, tuples, sets, and dictionaries (also known as finite maps). In this subsection, we present how instances of these data structures are represented in Python, as well as the most common operations and functions that can be applied to these data structure instances.
  • Booleans consist of two constants: True and False.
    • The usual logical operations are available using the operators and, or, and not.
    
    >>> True                                      # A boolean constant.
    True
    >>> False                                     # A boolean constant.
    False
    >>> True and False or True and (not False)    # A boolean expression.
    True
              
  • Integers are written as in most other programming languages (i.e., as a sequence of digits).
    • The usual arithmetic operations are available using the operators +, *, -, and /. The infix operator // represents integer division, and the infix operator ** represents exponentiation. Negative integers are prefixed with the negation operator -.
    • The usual relational operators ==, !=, <, >, <=, >= are available.
    • The int() function can convert a string that looks like an integer into an integer.
    
    >>> 123                                       # An integer constant.
    123
    >>> 1 * (2 + 3) // 4 - 5                      # An integer expression.
    -4
    >>> 4 * 5 >= 19                               # A boolean expression involving integers.
    True
    >>> int("123")                                # A string being converted into an integer
    123
              
  • Strings are delimited by either ' or " characters. Strings can be treated as lists of single-character strings. Another way to look at this is that there is no distinction between a character and a string: all characters are just strings of length 1. Multiline strings can be delimited using """ or ''' (i.e., three quotation mark characters at the beginning and end of the string literal).
    • The empty string is denoted using '' or "".
    • Two strings can be concatenated using +.
    • The function len() returns the length of a string.
    • Individual characters in a string can be accessed using the bracketed index notation (e.g., s[i]). These characters are also strings themselves.
    
    >>> 'Example.'                                # A string.
    'Example.'
    >>> "Example."                                # Alternate notation for a string.
    'Example.'
    >>> len("ABCD")                               # String length.
    4
    >>> "ABCD" + "EFG"                            # String concatenation.
    'ABCDEFG'
    >>> "ABCD"[2]                                 # Third character in the string.
    'C'
              
  • Lists are similar to arrays: they are ordered sequences of objects and/or values. The entries of a list can be of a mixture of different types, and lists containing one or more objects are delimited using [ and ], with the individual list entries separated by commas. Lists cannot be members of sets.
    • The empty list is denoted using [].
    • Two lists can be concatenated using +.
    • The function len() returns the length of a list.
    • Individual entries in a list can be accessed using the bracketed index notation (e.g., a[i]).
    • To check if a value is in a list, use the in relational operator.
    
    >>> [1,2,"A","B"]                             # A list.
    [1, 2, 'A', 'B']
    >>> [1, 2] + ['A','B']                        # Concatenating lists.
    [1, 2, 'A', 'B']
    >>> len([1,2,"A","B"] )                       # List length.
    4
    >>> [1,2,"A","B"][0]                          # First entry in the list.
    1
    >>> 1 in [1, 2]                               # List containment check.
    True
              
  • Tuples are similar to lists (they are ordered, and can contain objects of different types), except they are delimited by parentheses ( and ), with entries separated by commas. The main distinction between lists and tuples is that tuples are hashable (i.e., they can be members of sets).
    • The empty tuple is denoted using ().
    • A tuple containing a single object x is denoted using (x, ).
    • Two tuples can be concatenated using +.
    • A tuple can be turned into a list using the list() function.
    • A list can be turned into a tuple using the tuple() function.
    • The function len() returns the length of a tuple.
    • Individual entries in a tuple can be accessed using the bracketed index notation (e.g., t[i]).
    • To check if a value is in a tuple, use the in relational operator.
    
    >>> (1,2,"A","B")                             # A tuple.
    (1, 2, 'A', 'B')
    >>> (1,)                                      # Another tuple.
    (1,)
    >>> (1, 2) + ('A','B')                        # Concatenating tuples.
    (1, 2, 'A', 'B')
    >>> list((1, 2, 'A','B'))                     # A tuple being converted into a list.
    [1, 2, 'A', 'B']
    >>> tuple([1, 2, 'A','B'])                    # A list being converted into a tuple.
    (1, 2, 'A', 'B')
    >>> len((1,2,"A","B"))                        # Tuple length.
    4
    >>> (1,2,"A","B")[0]                          # First entry in the tuple.
    1
    >>> 1 in (1, 2)                               # Tuple containment check.
    True
              
  • Sets are unordered collections that cannot contain duplicates. They are a close approximation of mathematical sets. Sets cannot be members of sets.
    • The empty set is denoted using set().
    • The methods .union() and .intersection() correspond to the standard set operations.
    • A list or tuple can be turned into a set using the set() function.
    • A set can be turned into a list or a tuple using the list() or tuple() function, respectively.
    • The function len() returns the size of a set.
    • To access individual entries in a set, it is necessary to turn the set into a list or tuple.
    • To check if a value is in a set, use the in relational operator.
    
    >>> {1,2,"A","B"}                             # A set.
    {1, 2, 'A', 'B'}
    >>> ({1,2}.union({3,4})).intersection({4,5})  # Set operations.
    {4}
    >>> set([1, 2]).union(set(('A','B')))         # Converting a list and a tuple to sets.
    {'A', 1, 2, 'B'}
    >>> len({1,2,"A","B"})                        # Set size.
    4
    >>> 1 in {1,2,"A","B"}                        # Set containment check.
    True
              
  • Frozen sets are like sets, except they can be members of other sets. A set can be turned into a frozen set using the frozenset() function.
    
    >>> frozenset({1,2,3})                        # A frozen set.
    frozenset({1, 2, 3})
    >>> {frozenset({1,2}), frozenset({3,4})}      # Set of frozen sets.
    {frozenset({3, 4}), frozenset({1, 2})}
              
  • Dictionaries are unordered collections of associations between some set of keys and some set of values. Dictionaries are also known as finite maps.
    • The empty dictionary is denoted using {}.
    • The list of keys that the dictionary associates with values can be obtained using list(d.keys()).
    • The list of values that the dictionary contains can be obtained using list(d.values()).
    • The function len() returns the number of entries in the dictionary.
    • Individual entries in a dictionary can be accessed using the bracketed index notation (e.g., d[key]).
    
    >>> {"A":1, "B":2}                            # A dictionary.
    {'A': 1, 'B': 2}
    >>> list({"A":1, "B":2}.keys())               # Dictionary keys.
    ['A', 'B']
    >>> list({"A":1, "B":2}.values())             # Dictionary values.
    [1, 2]
    >>> len({"A":1, "B":2})                       # Dictionary size.
    2
    >>> {"A":1, "B":2}["A"]                       # Obtain a dictionary value using a key.
    1
              

[link]  B.4. Function, procedure, and method invocations

Python provides a variety of ways to supply parameter arguments when invoking functions, procedures, and methods.
  • Function calls and method/procedure invocations consist of the function, procedure, or method name followed by a parenthesized, comma-delimited list of arguments. For example, suppose a function or procedure example() is defined as follows:
    
    def example(x, y, z):
      print("Invoked.")
      return x + y + z
              
    To invoke the above definition, we can use one of the following techniques.
    • Passing arguments directly involves listing the comma-delimited arguments directly between parentheses.
      
      >>> example(1,2,3)
      Invoked.
      6
                    
    • The argument unpacking operator (also known as the *-operator, the scatter operator, or the splat operator) involves providing a list to the function, preceded by the * symbol; the arguments will be drawn from the elements in the list.
      
      >>> args = [1,2,3]
      >>> example(*args)
      Invoked.
      6
                    
    • The keyword argument unpacking operator (also known as the **-operator) involves providing a dictionary to the function, preceded by the ** symbol; each named parameter in the function definition will be looked up in the dictionary, and the value associated with that dictionary key will be used as the argument passed to that parameter.
      
      >>> args = {'z':3, 'x':1, 'y':2}
      >>> example(**args)
      Invoked.
      6
                    
  • Default parameter values can be specified in any definition. Suppose the following definition is provided.
    
    def example(x = 1, y = 2, z = 3):
      return x + y + z
              
    The behavior is then as follows: if an argument corresponding to a parameter is not supplied, the default value found in the definition is used. If an argument is supplied, the supplied argument value is used.
    
    >>> example(0, 0)
    3
    >>> example(0)
    5
    >>> example()
    6
              

[link]  B.5. Comprehensions

Python provides concise notations for defining data structures and performing logical computations. In particular, it supports a comprehension notation that can be used to build lists, sets, and dictionaries.
  • List comprehensions make it possible to construct a list by iterating over one or more other data structure instances (such as a list, tuple, set, or dictionary) and performing some operation on each element or combination of elements. The resulting list will contain the result of evaluating the body for every combination.
    
    >>> [ x for x in [1,2,3] ]
    [1, 2, 3]
    >>> [ 2 * x for x in {1,2,3} ]
    [2, 4, 6]
    >>> [ x + y for x in {1,2,3} for y in (1,2,3) ]
    [2, 3, 4, 3, 4, 5, 4, 5, 6]
              
    It is also possible to add conditions anywhere after the first for clause. This will filter which combinations are actually used to add a value to the resulting list.
    
    >>> [ x for x in {1,2,3} if x < 3 ]
    [1, 2]
    >>> [ x + y for x in {1,2,3} for y in (1,2,3) if x > 2 and y > 1 ]
    [5, 6]
              
  • Set comprehensions make it possible to construct a set by iterating over one or more other data structure instances (such as a list, tuple, set, or dictionary) and performing some operation on each element or combination of elements. The resulting set will contain the result of evaluating the body for every combination. Notice that the result will contain no duplicates because the result is a set.
    
    >>> { x for x in [1,2,3,1,2,3] }
    {1, 2, 3}
              
  • Dictionary comprehensions make it possible to construct a dictionary by iterating over one or more other data structure instances (such as a list, tuple, set, or dictionary) and performing some operation on each element or combination of elements. The resulting dictionary will contain the result of evaluating the body for every combination.
    
    >>> { key : 2 for key in ["A","B","C"] }
    {'A': 2, 'C': 2, 'B': 2}
              

[link]  B.6. Other useful built-in functions

The built-in function type() can be used to determine the type of a value. Below, we provide examples of how to check whether a given expression has one of the common Python types:

>>> type(True) == bool
True
>>> type(123) == int
True
>>> type("ABC") == str
True
>>> type([1,2,3]) == list
True
>>> type(("A",1,{1,2})) == tuple
True
>>> type({1,2,3}) == set
True
>>> type({"A":1, "B":2}) == dict
True
      

[link]  B.7. Common Python definition and control constructs (i.e., Python statements)

A Python program is a sequence of Python statements. Each statement is either a function definition, a variable assignment, a conditional statement (i.e., if, else, and/or elif), an iteration construct (i.e., a for or while loop), a return statement, or a break or continue statement.
  • Variable assignments make it possible to assign a value or object to a variable.
    
    x = 10
              
    It is also possible to assign a tuple (or any computation that produces a tuple) to a tuple of variables:
    
    (x, y) = (1, 2)
              
  • Function and procedure definitions consist of the def keyword, followed by the name of the function or procedure, and then by one or more arguments (delimited by parentheses and separated by commas).
    
    def example(a, b, c):
        return a + b + c
              
  • Conditional statements consist of one or more branches, each with its own boolean expression as the condition (with the exception of else). The body of each branch is an indented sequence of statements.
    
    def fibonacci(n):
        # Computes the nth Fibonacci number.
        if n <= 0:
            return 0
        elif n <= 2:
            return 1
        else:
            return fibonacci(n-1) + fibonacci(n-2)
              
  • Iteration constructs make it possible to repeat a sequence of statements over and over. The body of an iteration construct is an indented sequence of statements.
    • The while construct has a boolean expression as its condition (much like if). The body is executed over and over until the expression in the condition evaluates to False, or a break statement is encountered.
      
      def example1(n):
          # Takes an integer n and returns the sum of
          # the integers from 1 to n-1.
          i = 0
          sum = 0
          while i < n:
              sum = sum + i
              i = i + 1
          return sum
      
      def example2(n):
          # Takes an integer n and returns the sum of
          # the integers from 1 to n-1.
          i = 0
          sum = 0
          while True:
              sum = sum + i
              i = i + 1
              if i == n:
                  break
          return sum
                    
    • The for construct makes it possible to repeat a sequence of statements once for every object in a list, tuple, or set, or once for every key in a dictionary.
      
      def example3(n):
          # Takes an integer n and returns the sum of
          # the integers from 1 to n-1.
          sum = 0
          for i in range(0,n):
              sum = sum + i
          return sum
      
      def example4(d):
          # Takes a dictionary d that maps keys to
          # integers and returns the sum of the integers.
          sum = 0
          for key in d:
              sum = sum + d[key]
          return sum
                    

[link]  Appendix C. Haskell Reference

The Haskell programming language will be among the languages we use in this course. This language supports the functional programming paradigm, has automatic memory management, and natively supports algebraic data types. Haskell is both an interpreted and a compiled language.

[link]  C.1. Obtaining Haskell

The latest version of Haskell can be downloaded with the Haskell Platform: http://www.haskell.org/platform/. The Haskell Platform has been installed on all the CS Department's undergraduate computing lab machines, as well as on csa2/csa3.

[link]  C.2. Haskell modules and declarations

Haskell code is organized into modules. There is one named module per Haskell file, and the file name should match the module name (e.g., the file Example.hs should contain the Example module definition). The module body consists of a series of declarations, which can appear in any order; the only exception is that if there are multiple declarations for a function, they must all be grouped together.

module Example where

g (x) = f(x) + f(x);

f (0) = 0;
f (x) = f(x - 1) + 1;

-- This is a comment.
      
Each declaration in a Haskell module falls into one of two categories: an expression-level declaration, or a type-level declaration. Any declaration that defines variables or functions is an expression-level declaration; any declaration that defines a type is a type-level declaration.

-- Expression-level declaration (defines a new function called "f").
f(x) = 2 * x                      

-- Type-level declaration (defines a new data type called "Tree").
data Tree = Leaf | Node Tree Tree 
      

[link]  C.3. Common data structures (i.e., Haskell expressions)

The Haskell base library (sometimes also called the "Prelude", and imported implicitly and automatically into every module) has a large number of standard and conventional data type and function declarations. These declarations provide native support for a variety of data structures and operations that we will use throughout this course: integers, strings, lists, and tuples, among others. In this subsection, we present how instances of these data structures are represented in Haskell, as well as the most common operations and functions that can be applied to these data structure instances.
  • The Bool data type supports two constructors: True and False.
    • The usual logical operations are available using the operators &&, ||, and not.
    • The built-in syntax if ... then ... else ... is also available (similar to Python's ... if ... else ... and C/C++/Java's ternary operator ... ? ... : ...).
    
    *> True
    True
    *> False
    False
    *> True || False && True && (not False)
    True
    *> if True then "Yes" else "No"
    "Yes"
              
  • Integers are written as in most other programming languages (i.e., as a sequence of digits). Haskell supports a bounded-size type Int and an unbounded-size type Integer.
    • The usual arithmetic operations are available using the operators +, *, -, and ^ (the / operator is used with fractional types such as Double). The Prelude function div performs integer division. The Prelude functions min and max make it possible to compute the minimum and maximum of two integers, respectively, and the functions minimum and maximum make it possible to compute the minimum and maximum of a list of integers. Negative integers are prefixed with the negation operator -.
    • The usual relational operators ==, /= (not equal), <, >, <=, >= are available.
    
    *> 123
    123
    *> 1 * (2 + 3) `div` 4 - 5
    -4
    *> 4 * 5 >= 19
    True
    *> max 10 20
    20
    *> minimum [1,2,3]
    1
              
  • Strings are delimited by " characters (individual characters are always delimited using '). Strings can be treated as lists of characters.
    • The empty string is denoted using "".
    • Two strings can be concatenated using ++.
    • The Prelude function length returns the length of a string.
    
    *> "Example."        -- A string.
    "Example."
    *> length "ABCD"     -- String length.
    4
    *> "ABCD" ++ "EFG"   -- String concatenation.
    "ABCDEFG"
              
  • Lists are ordered sequences of expressions. The entries of a list must all be of the same type. Lists are delimited using [ and ], with the individual list entries separated by commas.
    • The empty list is denoted using [].
    • Elements can be added to the front of a list using :.
    • Two lists can be concatenated using ++.
    • The Prelude function length returns the length of a list.
    • To check if a value is in a list, use the Prelude function elem (note that this relies on the == operation being defined on the type of the elements in the list).
    
    *> [1,2,3]            -- A list.
    [1,2,3]
    *> 1 : [2,3]          -- Adding an element to the front.
    [1,2,3]
    *> [1,2,3] ++ [4,5]   -- Concatenating lists.
    [1,2,3,4,5]
    *> length [1,2,3,4]   -- List length.
    4
    *> 1 `elem` [1, 2]    -- List containment check.
    True
              
  • Tuples are ordered, and can contain expressions of different types. They are delimited by parentheses ( and ), with entries separated by commas.
    • The empty tuple is denoted using ().
    • There are no tuples containing a single element; (e) is equivalent to e for any Haskell expression e.
    
    *> (1,2,3)  -- A tuple.
    (1,2,3)