BU CAS CS 320 Fall 2014

NOTE: This page contains all the examples presented during the lectures, as well as all the homework assignments. Click here to go back to the main page with the course information and schedule.

1. Introduction, Background, and Motivation
- 1.1. Overview
- 1.2. Background and prerequisites
2. Defining Programming Languages
- 2.1. Sets of character strings
- 2.2. Regular expressions
- 2.3. Sets of token sequences
- 2.4. Language syntax and Backus-Naur Form (BNF) notation
3. Parsing
4. Semantics, Evaluation, and Interpretation
- 4.1. Formally defining an abstract syntax
- 4.2. Denotational semantics and operational semantics
- 4.3. Evaluation of expressions
- 4.4. Execution of sequences of statements
- 4.5. Interpreters
- 4.6. Assignment #2: More Parsing Algorithms and Interpreters
5. Compilation
Midterm: Programming Languages, Interpreters, and Compilers
6. Static Analysis and Abstract Interpretation
- 6.1. Monomorphic type systems and type checking
- 6.2. Abstract interpretation
7. Declarative (and Functional) Programming Language Paradigms
Review #1: Programming Language Concepts
Final: Programming Language Concepts
Appendix A. Using gsubmit
- A.1. Register for a CS account
- A.2. Download SSH/SCP client software
- A.3. Submitting assignments using gsubmit
Appendix B. Python
- B.1. Obtaining Python
- B.2. Assembling a Python module
- B.3. Common data structures (i.e., Python expressions)
- B.4. Function, procedure, and method invocations
- B.5. Comprehensions
- B.6. Other useful built-in functions
- B.7. Common Python definition and control constructs (i.e., Python statements)
Appendix C. Haskell
- C.1. Obtaining Haskell
- C.2. Haskell modules and declarations
- C.3. Common data structures (i.e., Haskell expressions)

[link] 1. Introduction, Background, and Motivation

[link] 1.1. Overview

A large variety of modern technology, from mobile devices and personal computers to datacenters and entire infrastructures, are programmable. These entities are controlled, operated, maintained, and monitored using a variety of interfaces and languages, which might take the form of graphical user interfaces, libraries, APIs, embedded languages, domain-specific languages, and general-purpose programming languages. Many of these interfaces and languages are defined and designed according to principles and conventions that have been developed over the last several decades. There are three major themes around which this course is organized. Each of the concepts, examples, and problems discussed in this course will relate to one or more of these themes:

Defining and working with programming languages: We will define and become familiar with the basic concepts that underlie the definition and design of common programming languages, and we will learn to identify and implement the common components found in static analysis tools, interpreters, and compilers.
Programming paradigms: You should now be familiar with the imperative and object-oriented programming paradigms. In this course, we will provide a high-level picture of the landscape of different programming paradigms, and will focus in particular on the declarative and functional programming paradigms. We will learn to identify problems that can be solved more efficiently, more concisely, and with greater generality using programming languages that support the functional and declarative paradigms. This includes gaining experience taking advantage of features such as comprehensions, algebraic data types, static type checking, and higher-order functions, and techniques such as recursion.
Languages as solutions: The purpose of a programming language is to provide a data representation and storage format, a communication protocol, or a control interface that can be used across time and space, between different humans and/or devices. Thus, as with any language, any programming language can be viewed as a solution to a representation, communication, or control problem. We will learn to identify when a programming language may be an appropriate solution for a problem, what trade-offs should be considered when deciding whether to design a point solution or a language, and what are some of the appropriate mathematical concepts and software tools for doing so.

Throughout this course, we use formal mathematical conventions and notations developed by the community of computer scientists and mathematicians to communicate with one another about the definitions and implementations of programming languages. Some of these notations are used throughout mathematics and computer science, while others are more obscure and are used primarily by the community of programming language theorists. Throughout your career, you will be required to learn a variety of new notations for a variety of new technologies. This course provides an opportunity to practice learning new notations quickly. This is a valuable skill to possess even if your area of expertise lies or will lie outside the topics covered in this course.

[link] 1.2. Background and prerequisites

Basic logic. It is expected that the reader is familiar with the basic concepts of mathematical logic, including formulas, predicates, relational operators, and terms.

Basic set theory. It is expected that the reader is familiar with the notion of a set (both finite and infinite), set size, and the set comprehension notation, e.g.,

{1,2,3,4} is a set

{ x | x ∈ ℤ, x > 0 }

{1, 2, 3, ...}

and the membership and subset relations between elements and sets, e.g.,

∈

{ 1, 2, 3 }

{2, 3}

⊂

{ 1, 2, 3 }

Programming using imperative and object-oriented language. It is expected that the reader is familiar with at least one imperative, object-oriented language, such as Java, C++, or Python.

[link] 2. Defining Programming Languages

In order to define and reason about programming languages, and in order to write automated tools and algorithms that can operate on programs written using programming languagues, we must be able to define formally (i.e., mathematically) what is a programming language and what is a program.

[link] 2.1. Sets of character strings

In computer science, one common way to mathematically model a formal language is to introduce a finite set of symbols (an alphabet). A language is then any subset of the set of all strings consisting of symbols from that alphabet.

[link]

Definition: An alphabet is a finite set A. We will call elements a ∈ A of the set characters. The typical alphabet we will use in this course is the set of 128 ASCII characters. However, any finite set can be an alphabet.

[link]

Definition: Given an alphabet A, a character string or string is any ordered finite sequence of characters from that alphabet. We will denote the empty string (containing no characters) using the symbol ε, and we will denote the length of a string s using the notation |s|.

[link]

Example: The set A = {A,B,C} is an alphabet. AAA, AB, B, CBA, and ε are all examples of strings built using the alphabet A.

[link]

Definition: Given an alphabet A, let U be the set of all finite strings that can be built using characters from A (including the empty string, which we will call ε). In other words, we can say:

{ s | s is a finite string of characters from alphabet A }

Any subset L ⊂ U is a language. That is, any set of character strings (whether the set is finite or infinite) is a language.

[link]

Example: The set A = {X, Y, Z} is an alphabet. The set of strings { ε, X, Y, Z } is one example of a language. The infinite set of strings { ε, X, XX, XXX, XXXX, ... } is also a language. The infinite set of strings { YZ, YZYZ, YZYZYZ, ... } is also a language.

[link]

Example: The set A = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is an alphabet. The following set of strings can be a language for representing positive integers between 1 and 9999 (inclusive):

{ s | s is a string of characters from A, s does not begin with 0, 1 ≤ |s| ≤ 4 }

[link] 2.2. Regular expressions

If languages are merely subsets of the set of all finite character strings that can be built using the characters in an alphabet, then we can always specify any finite subset (and thus language) by simply writing down every string in the language. However, what if we want to write down the definition of that subset using a shorter representation, or we want to specify an infinitely large subset of character strings (e.g., all strings consisting of one or more copies of the character a: {a, aa, aaa, ...})?

Regular expressions are a notation for precisely and concisely defining certain sets of character strings.

[link]

Definition: Given an alphabet A, a regular expression over A can contain any of the characters in A as well as any number of the following special characters:

the "or" symbol | consisting of single vertical bar;
the grouping symbols ( and ), which must always be balanced the same way that parentheses must be balanced in arithmetic expressions;
the + and * symbols.

The subset of character strings that a regular expression defines can be determined in the following way:

a regular expression r that contains no special characters defines a set of strings containin just one string: {r};
if a regular expression r₁ defines a set S₁ and a regular expression r₂ defines a set S₂, then the regular expression r₁ | r₂ defines the set of character strings S₁ ∪ S₂;
if a regular expression r defines a set S, then the regular expression (r) defines the set of character strings S (i.e., there is no difference);
if a regular expression r defines a set S, then the regular expression r+ defines the set of character strings S' that consists of all possible concatenations of any of the strings in S;
if a regular expression r defines a set S, then the regular expression r* defines the set of character strings S' that consists of all possible concatenations of any of the strings in S, as well as the empty string.

[link]

Example: Assume that our alphabet consists of all alphanumeric characters. For each of the following regular expressions, describe verbally what set of character strings they define:

abcd
(abcd)+
red | green | blue
(red | green | blue)+
(0+) | (1+)
(0 | 1)+

[link]

Example: Assume that our alphabet consists of all alphanumeric characters. For each of the following verbal descriptions of sets of character strings, find a regular expression that defines the described set:

the set of character strings that consist entirely of vowels;
the set of all integers (negative and positive);
the set of all arithmetic expressions involving binary numbers, addition, and subtraction;
the set of all words that begin with a vowel and end with a vowel;
the set of all numbers of even length.

[link]

Example: In practice, programming languages that provide libraries of functions and procedures for working with regular expressions also support other special characters. For example, Python regular expressions may contain some of the following special characters:

the $, $ special characters make it possible to include parentheses in expressions in a way that does not cause them to interpreted as regular expression grouping symbols;
the special symbol \s matches a single whitespace character;
the special symbol [0-9] matches a single numeric digit;
the special symbol [a-z] matches a single lowercase letter;
the special symbol [A-Z] matches a single uppercase letter;
the special symbol [a-zA-Z] matches a single letter;
the special symbol [a-zA-Z0-9] matches a single alphanumeric character.

[link]

Example: In the Python programming language, the re module provides functionality for automatically checking whether a string matches a particular regular expression. In order to check whether a string exactly matches a regular expression, it is necessary to wrap the regular expression in parentheses and then add special markers to ensure that the regular expression matches from the beginning of the string to the end of the string.

If a match succeeds, a match object is returned that contains additional information (e.g., the position of the match); otherwise, None is returned.

>>> import re

>>> re.match(r"^(a|b|c)$", "a") # Succeeds.
<_sre.SRE_Match object; span=(0, 1), match='a'>

>>> re.match(r"^(a|b|c)$", "def") # Fails.
None

>>> re.match(r"^((red|green|blue)+)$", "redgreenblueredblue")
<_sre.SRE_Match object; span=(0, 19), match='redgreenblueredblue'>

>>> re.match(r"^([a-zA-Z0-9]+)$", "redgreenblueredblue")
<_sre.SRE_Match object; span=(0, 19), match='redgreenblueredblue'>

>>> re.match(r"^([a-zA-Z0-9]+)$", "!@#$")
None

[link]

Fact: Regular expressions are not powerful enough to describe many common languages in which we may be interested. Examples of sets of character strings that cannot be defined using regular expression include:

the set of arithmetic expressions with balanced parentheses;
the set of all palindromes;
the set of all strings in which the same string is repeated exactly two times in a row.

For languages such as the above, a more powerful tool for describing sets of character strings is needed.

[link] 2.3. Sets of token sequences

Unlike human languagues, programming languages usually have a relatively small collection of symbol strings (e.g., commands or instructions) that are used to construct programs. Thus, we can adjust the definition of what constitutes a language to account for this.

[link]

Definition: Given an alphabet A, a token is a finite, non-empty (usually short) string of characters from that alphabet.

[link]

Definition: Given a set of tokens T, let U be the set of all finite sequences that can be built using tokens from T (including the empty sequence, which we will call ε). In other words, we can say:

{ s | s is a finite sequence of tokens from T }

Any subset L ⊂ U is a language. That is, any set of token sequences (whether the set is finite or infinite) is a language.

[link]

Example: Consider the following set of tokens:

{ true, false, or, and, not, (, ), , }

The set of token sequences that represent valid boolean formulas is a language:

{

or ( false , and ( true , false ) ),

and ( true , false ),

not ( false ),

true,

false,

... }

If a language is just a subset of the set of all possible token sequences, how do we mathematically specify interesting subsets?

[link] 2.4. Language syntax and Backus-Naur Form (BNF) notation

If a language is just a set of finite token strings, then defining a language amounts to defining this set. How can we define this set? By introducing a collection of rules or constraints governing how characters and/or tokens can be assembled to form strings or sequences in the language.

[link]

Definition: Given a token set T and a language L consisting of finite sequences of tokens from T, the syntax of L is the set of rules defining exactly which token sequences are in L. These rules are sometimes also called syntactic rules, a formal grammar, or simply a grammar.

There are many possible ways by which we could represent syntactic rules, and these rules can be classified, or stratified, according to their expressive power. A more extensive coverage of this topic is beyond the scope of this course, and is normally presented in courses on the theory of computation and automata. In this course, we will focus on two particular kinds of grammar: regular grammars and context-free grammars. The most common representation for such grammars is Backus-Naur Form, abbreviated henceforward as BNF.

[link]

Definition: A grammar definition consists of one or more productions (or production rules). Each production has the following form:

non-terminal

::=

terminal_or_non-terminal ... terminal_or_non-terminal

⋮

terminal_or_non-terminal ... terminal_or_non-terminal

The left-hand side (to the left of the ::= symbol) in each production is called a non-terminal. The right-hand side of each production is an unordered collection of choices separated by the | symbol. Each choice is a sequence of non-terminals (which must appear once on the left-hand side of a production) or terminals (a terminal is a token).

These production rules in a grammar's BNF representation can be viewed both as a way to construct an element (i.e., a token sequence that is a program) in the language, or as a way to break down a token sequence piece by piece until nothing is left.

[link]

Example: Let T be a token set:

{ true }

The following is a very simple programming language that contains only a single possible token sequence consisting of the single token true:

program

::=

true

In this case, the language is finite and small, so we can actually write it down as a set:

{ true }

[link]

Example: We can extend the language by adding another token:

{ true, false }

The following BNF grammar definition now contains two choices (each choice is a sequence consisting of a single terminal):

program

::=

true

false

This programming language now contains two token sequences:

{ true, false }

[link]

Example: We can extend the language definition further:

{ true, false, or, and, not, (, ), , }

The following BNF grammar definition now contains five choices (each choice is a sequence consisting of non-terminals and terminals):

program

::=

true

false

and ( program , program )

or ( program , program )

not ( program )

This programming language now contains infinitely many finite token sequences:

{

or ( false , and ( true , false ) ),

and ( true , false ),

not ( false ),

true,

false,

... }

[link]

Example: Let us consider another example: a language of positive integers.

{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }

We can define the following grammar:

number

::=

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

1 number | 2 number | 3 number | 4 number

5 number | 6 number | 7 number | 8 number | 9 number

However, the above does not allow us to have a 0 in any number with more than one digit. One way to fix this (there are many other ways) is to introduce more productions into the grammar:

nozero

::=

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

digit

::=

0 | nozero

digits

::=

digit | digit digits

number

::=

digit | nozero digits

Note that the grammar may contain multiple productions. Any non-terminal defined in a production can appear on the right-hand side of any of the productions.

[link]

Example: We can extend the language by adding a production for statements, and allowing a program to be a sequence of one or more statements.

{ true, false, or, and, not, (, ), ,, print, skip, ; }

The following is the new grammar definition for this language.

program

::=

statement

statement program

statement

::=

print formula ;

skip ;

formula

::=

true

false

not ( formula )

and ( formula , formula )

or ( formula , formula )

[link] 3. Parsing

Given a programming language definition, we want to have the ability to operate on programs written in that language using a computer. To do so, we need to convert the character string representations of programs in that programming language into instances of a data structure; each data structure instance would then be a representation of the program as data.

[link] 3.1. Concrete and abstract syntaxes

[link]

Definition: Given an alphabet, token set, and grammar definition (e.g., represented using BNF notation), we define the concrete syntax to be the set of all character strings that conform to the grammar definition. We call a particular character string that conforms to the grammar definition a concrete syntax instance.

[link]

Definition: For a particular programming language definition, we define the abstract syntax to be the set of all data structure instances that correspond to a character string that conforms to the grammar definition for that language. An instance of the abstract syntax is sometimes called a parse tree.

[link]

Example: Consider again the language that conforms to the following grammar:

program

::=

true | false | not ( program )

and ( program , program ) | or ( program , program )

The following is the character string of one possible program in the language. This character string is an instance of the concrete syntax of the language.

and (or (and (true, false), not(false)), or (true, false))

The above character string might be converted into a structured representation of the program within some host language (i.e., the programming language being used to operate on these programs: checking for errors, interpreting, or compiling the program). Below, we present one possible Python representation of the instance of the abstract syntax (i.e., the parse tree) corresponding to the concrete syntax instance above. This representation uses nested Python dictionaries to represent the parse tree, with strings being used to represent node labels and leaves.

{ "And": [
    { "Or": [
        { "And": ["True","False"]},
        { "Not": ["False"]}
      ]
    },
    { "Or": ["True","False"]}
  ]
}

[link] 3.2. Lexing (a.k.a., tokenizing) and parsing

[link]

Definition: A lexer or tokenizer is an algorithm that converts an instance of the concrete syntax of a language (i.e., a character string that conforms to the grammar definition for the language) into a sequence of tokens.

[link]

Example: Consider again the language that conforms to the following grammar:

program

::=

true | false | not ( program )

and ( program , program ) | or ( program , program )

The following Python implementation of a tokenizing algorithm for this language uses regular expressions to split a character string into a sequence of individual tokens from the token set.

import re
def tokenize(s):
    # Use a regular expression to split the string into
    # tokens or sequences of zero or more spaces.
    tokens = [t for t in re.split(r"(\s+|true|false|and|or|not|,|$|$)", s)]

    # Throw out the spaces and return the result.
    return [t for t in tokens if not t.isspace() and not t == ""]

Below is an example input and output.

>>> tokenize("and (or (and (true, false), not(false)), or (true, false))")

['and', '(', 'or', '(', 'and', '(', 'true', ',', 'false', ')', ',', 'not',
'(', 'false', ')', ')', ',', 'or', '(', 'true', ',', 'false', ')', ')']

[link]

Definition: A parser is an algorithm that converts a token sequence into an instance of the abstract syntax (i.e., a parse tree).

The tokenizer and parser are then composed to transform a character string into a parse tree.

character string
(concrete syntax)

⇒

lexer/
tokenizer

⇒

token
sequence

⇒

parser

⇒

parse tree
(abstract syntax)

Often, the tokenizer and parser are together called a parser. In situations where this can cause confusion, we will refer to the actual process that converts token sequences into parse trees as the parsing algorithm.

character string
(concrete syntax)

⇒

lexer/
tokenizer

⇒

token
sequence

⇒

parsing
algorithm

parser

⇒

parse tree
(abstract syntax)

The BNF representation of a grammar can be converted into a parsing algorithm that turns a token sequence into an abstract syntax data structure instance (i.e., a parse tree). How easily this can be done depends on the properties of the grammar.

[link]

Fact: Given a BNF representation of a grammar, if for every production in the grammar, each choice begins with a unique terminal (i.e., a terminal that is not the first terminal in any other choice within that production), then we say the grammar is in LL(1) form, and we can implement a predictive recursive descent parsing algorithm to parse any token sequence that conforms to this grammar (note that these are only sufficient conditions for the grammar to be in LL(1) form; less restrictive conditions also exist).

A predictive recursive descent parser can effectively run in linear time; it decomposes the token sequence from left to right while assembling a parse tree.

[link]

Example: Consider again the language that conforms to the following grammar:

program

::=

true | false | not ( program )

and ( program , program ) | or ( program , program )

The following Python implementation of a predictive recursive descent parsing algorithm for this language builds a parse tree using the nested dictionary representation seen in a previous example. This recursive algorithm takes a single argument: a sequence of tokens. It returns two results: a parse tree, and the remainder of the token sequence. Note that the order in which the choices in the production are being handled is not determined by the order of the choices in the production. The choices in a production are unordered; any parser implementation that captures all the possible choices conforms to the grammar definition.

def parse(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])

    if tokens[0] == 'false':
        return ('False', tokens[1:])

    if tokens[0] == 'not' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ')':
          return ({'Not':[e1]}, tokens[1:])

    if tokens[0] == 'or' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ',':
            (e2, tokens) = parse(tokens[1:])
            if tokens[0] == ')':
                return ({'Or':[e1,e2]}, tokens[1:])

    if tokens[0] == 'and' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ',':
            (e2, tokens) = parse(tokens[1:])
            if tokens[0] == ')':
                return ({'And':[e1,e2]}, tokens[1:])

Below is an example input and output. Notice that no tokens are left in the token sequence once the result is returned.

>>> import json
>>> (tree, tokens) =\
      parse(tokenize(\
        "and (or (and (true, false), not(false)), or (true, false))"\
        )\
      )
>>> print(json.dumps(tree, sort_keys=True, indent=2))
{
  "And": [
    {
      "Or": [
        {
          "And": [
            "True",
            "False"
          ]
        },
        {
          "Not": [
            "False"
          ]
        }
      ]
    },
    {
      "Or": [
        "True",
        "False"
      ]
    }
  ]
}
>>> print(tokens)
[]

[link]

Fact: If we relax the conditions on the grammar definition that make it LL(1) by dropping the requirement that the first terminal in each sequence within a production must be unique, we can no longer use predictive recursive descent parsing algorithm to parse a language corresponding to this grammar because the first terminal in each sequence within a production no longer uniquely determines which choice within a production should be used to continue parsing a token sequence. However, as long as every sequence within every production starts with a terminal, we can implement a backtracking recursive descent parsing algorithm to parse token sequence in the language.

[link]

Example: In the previus example, the grammar only allowed prefix logical operators. Suppose we wanted to parse token sequences for a grammar with infix operators.

program

::=

true | false | not ( program )

( program and program ) | ( program or program )

It is no longer possible to implement a predictive recursive descent parser. We must instead employ backtracking, and we must also keep track of whether we have consumed all the tokens.

def parse(tmp, top = True):
    tokens = tmp[0:]
    if tokens[0] == 'true':
        tokens = tokens[1:]
        if not top or len(tokens) == 0:
            return ('True', tokens)

    tokens = tmp[0:]
    if tokens[0] == 'false':
        tokens = tokens[1:]
        if not top or len(tokens) == 0:
            return ('False', tokens)

    tokens = tmp[0:]
    if tokens[0] == 'not' and tokens[1] == '(':
        tokens = tokens[2:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == ')':
                tokens = tokens[1:]
                if not top or len(tokens) == 0:
                    return ({'Not':[e1]}, tokens)

    tokens = tmp[0:]
    if tokens[0] == '(':
        tokens = tokens[1:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'or':
                tokens = tokens[1:]
                r = parse(tokens, False)
                if not r is None:
                    (e2, tokens) = r
                    if tokens[0] == ')':
                        tokens = tokens[1:]
                        if not top or len(tokens) == 0:
                            return ({'Or':[e1,e2]}, tokens)

    tokens = tmp[0:]
    if tokens[0] == '(':
        tokens = tokens[1:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'and':
                tokens = tokens[1:]
                r = parse(tokens, False)
                if not r is None:
                    (e2, tokens) = r
                    if tokens[0] == ')':
                        tokens = tokens[1:]
                        if not top or len(tokens) == 0:
                            return ({'And':[e1,e2]}, tokens)

The above code is far too repetitive. However, we can take the parts that repeat and turn them into a loop body that loops over all the possible choices in the production.

def parse(tmp, top = True):
    seqs = [\
        ('True', ['true']), \
        ('False', ['false']), \
        ('Not', ['not', '(', parse, ')']), \
        ('And', ['(', parse, 'and', parse, ')']), \
        ('Or', ['(', parse, 'or', parse, ')']) \
        ]

    # Try each choice sequence.
    for (label, seq) in seqs:
        tokens = tmp[0:]
        ss = [] # To store matched terminals.
        es = [] # To collect parse trees from recursive calls.

        # Walk through the sequence and either
        # match terminals to tokens or make
        # recursive calls depending on whether
        # the sequence entry is a terminal or
        # parsing function.
        for x in seq:
            if type(x) == type(""): # Terminal.

                if tokens[0] == x: # Does terminal match token?
                    tokens = tokens[1:]
                    ss = ss + [x]
                else:
                    break # Terminal did not match token.

            else: # Parsing function.

                # Call parsing function recursively
                r = x(tokens, False)
                if not r is None:
                    (e, tokens) = r
                    es = es + [e]

        # Check that we got either a matched token
        # or a parse tree for each sequence entry.
        if len(ss) + len(es) == len(seq):
            if not top or len(tokens) == 0:
                return ({label:es} if len(es) > 0 else label, tokens)

[link] 3.3. More parsing examples and building parsers for other classes of grammar

[link]

Exercise: Consider the following grammar definition:

command

::=

start

suspend

wake

terminate

reboot

if condition then command

if condition then command else command

repeat command

while condition then command

condition

::=

power low

temperature high

temperature very low

user input

Suppose we want to implement a parser for the above grammar. A partial implementation of a parser for the above grammar (what has been presented in lecture so far) is provided below.

def command(tokens, top = True):
    seqs = {\
       ("Start", "start"),\
       ("Suspend", "suspend"),\
       ("Wake", "wake"),\
       ("Terminate", "terminate"),\
       ("Reboot", "reboot")\
   }
    for (key, value) in seqs:
        if tokens[0] == value:
            tokens = tokens[1:]
            if not top or len(tokens) == 0:
                return (key, tokens)

    if tokens[0] == 'repeat':
        r = command(tokens[1:], False)
        if not r is None:
            (e, tokens) = r
            if not top or len(tokens) == 0:
                return ({"Repeat": [e]}, tokens)

    if tokens[0] == 'while':
        r = condition(tokens[1:], False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'then':
                 r = command(tokens[1:], False)
                 if not r is None:
                     (e2, tokens) = r
                     if not top or len(tokens) == 0:
                         return ({"While": [e1,e2]}, tokens)

def condition(tokens, top = True):
    seqs = [\
       ("PowerLow", ["power", "low"]),\
       ("TempHigh", ["temperature", "high"]),\
       ("UserInput", ["user", "input"])\
   ]
    for (key, seq) in seqs:
        if tokens[0] == seq[0]:
            if tokens[1] == seq[1]:
                tokens = tokens[2:]
                if not top or len(tokens) == 0:
                    return (key, tokens)

It is possible to replace the condition() function with a generic parser for base cases that contain an arbitrary number of terminals in each sequence. Any call to condition(...) can then be replaced with a call to parseBaseCases(seqsCondition, ...).

seqsCondition = [\
       ("PowerLow", ["power", "low"]),\
       ("TempHigh", ["temperature", "high"]),\
       ("TempVeryLow", ["temperature", "very", "low"]),\
       ("UserInput", ["user", "input"])\
       ]

def parseBaseCases(seqs, tokens, top = True):
    for (key, seq) in seqs:
         # Check if token sequence matches sequence.
        i = 0
        for terminal in seq:
            if terminal == tokens[i]:
                pass
            else:
                break
            i = i + 1

        # Check if the previous loop succeeded.
        if i == len(seq):
            tokens = tokens[len(seq):]
            if not top or len(tokens) == 0:
                return (key, tokens)

[link]

Example: Consider the following grammar definition:

formula

::=

true

false

not formula

( formula )

formula and formula

formula or formula

Implementing a naive recursive descent parser, predictive or backtracking, would not work for this grammar. Consider what would happen if we ran the following code on any input:

def parse(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])

    if tokens[0] == 'false':
        return ('False', tokens[1:])

    # ...

    # Recursive call, but no tokens consumed.
    (e1, tokens) = parse(tokens)
      if tokens[0] == 'and':
        (e2, tokens) = parse(tokens[1:])
        return ({'And':[e1,e2]}, tokens[1:])

    # ...

The above code never terminates, and after a large number of recursive calls are made, the Python interpreter returns an error indicating it is out of stack space. To get around this problem, one option is to perform left-recursion elimination on the grammar so that a recursive call never occurs first for any of the choices.

formula

::=

left and formula

left or formula

left

::=

true

false

not formula

( formula )

The above is usually acceptable if the operator, such as and, is commutative or right-associative. However, if the operator is left-associative, the above strategy would not necessarily lead to a correct parse tree. Can you explain why? In such a scenario, other techniques would need to be employed. If the operator is indeed associative or right-associative, however, the parser implementation could then look something like the following:

def formula(tokens):
    (e1, tokens) = left(tokens)
      if tokens[0] == 'and':
        (e2, tokens) = formula(tokens[1:])
        return ({'And':[e1,e2]}, tokens[1:])

    # ...

def left(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])

    if tokens[0] == 'false':
        return ('False', tokens[1:])

    # ...

Note that performing left factoring does not change the definition of the concrete syntax, the definition of the abstract syntax of a language (i.e., the set of parse trees), or the meaning of a language. Left factoring is merely a strategy for converting a grammar definition into a definition that is easier to implement using a recursive descent parser; it is an implementation strategy, not a definition. Thus, the resulting parse trees should contain no record or indication that left-recursion elimination was performed on the grammar before the parser was implemented.

[link] 3.4. Assignment #1: Grammars, Lexing, and Parsing

In this assignment you will be implementing lexing and parsing algorithms using Python. You must submit a single Python source file named hw1/hw1.py. Please follow the gsubmit directions.

Your solutions to each of the problem parts below will be graded on their correctness, concision, and mathematical legibility. The different problems and problem parts rely on the lecture notes and on each other; carefully consider whether you can use functions from the lecture notes, or functions you define in one part within subsequent parts.

A testing script with many test cases is available for download: hw1-tests.py. You should be able to place it in the same directory as hw1.py and run it. Feel free to modify or extend it as you see fit.

Implement a function tokenize() that takes two arguments: a list of terminals and a concrete syntax character string to tokenize. It should build a regular expression, use that regular expression to convert the second string into a token sequence (represented in Python as a list of strings), and then return that token sequence. You should not assume the tokens will always be separated by spaces in the concrete syntax.

>>> tokenize(["red", "blue", "#"], "red#red blue# red#blue")
['red', '#', 'red', 'blue', '#', 'red', '#', 'blue']

Implement a predictive recursive descent parsing algorithm directions() that takes a token sequence (represented in Python as a list of strings) and parses that token sequence into a parse tree according to the following grammar definition:

directions

::=

forward ; directions

reverse ; directions

left turn ; directions

right turn ; directions

stop ;

Each node in the parse tree should be labelled "Forward", "Reverse", "LeftTurn", "RightTurn", or "Stop". An example input and output are provided below. Note that the output below omits the token list. It is fine if your implementation of directions() returns an empty token list along with the parse tree.

>>> directions(tokenize(\
               ["forward","reverse","left","right","turn","stop",";"],\
               "forward; forward; forward; right turn; reverse; forward; stop;"\
              ))
{'Forward': [{'Forward': [{'Forward': [{'RightTurn': [{'Reverse': [{'Forward': ['Stop']}]}]}]}]}]}

For the sake of legibility, we reproduce the output above with indentation using the json module. Your implementation does not need to format its results in this way and can produce it in the above format.

{ "Forward": [
    { "Forward": [
        { "Forward": [
            { "RightTurn": [
                { "Reverse": [
                    { "Forward": ["Stop"]}
                   ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

In this problem you will implement a complete parser for a programming language with the following grammar definition (note that in this language, variables are always marked by an initial @ character and numeric literals are always marked by an initial # character):

program

::=

print formula ; program

print term ; program

assign @ variable := term ; program

end ;

formula

::=

true

false

not ( formula )

and ( formula , formula )

or ( formula , formula )

equal ( term , term )

less than ( term , term )

greater than ( term , term )

term

::=

plus ( term , term )

mult ( term , term )

log ( term )

@ variable

# number

variable

::=

non-empty character strings containing only alphabetical characters

number

::=

[1-9][0-9]*

In addition to a function corresponding to the tokenization algorithm, the parser implementation will consist of five interdependent Python functions: program(), formula(), term(), variable(), and number(). Each function corresponds to one of the productions. Below is an implementation of number():

import re
def number(tokens):
    if re.match(r"^([1-9][0-9]*)$", tokens[0]):
        return ({"Number": [int(tokens[0])]}, tokens[1:])

You should examine each of the other productions to determine whether a predictive recursive descent parser or a backtracking recursive descent parser is required. You should also make calls within each function to one or more of the other functions as needed.

Your algorithm must parse all programs in this programming language correctly. However, it need not return useful error messages if the input provided does not conform to the grammar definition.

Implement a function variable() that takes a single token string as an argument and, if the token sequence conforms to the grammar definition, returns a tuple containing two values: an instance of the abstract syntax (i.e., parse tree) with the root node label "Variable", and the remaining, unparsed suffix of the input token sequence.
Implement a function term() that takes a single token string as an argument and, if the token sequence conforms to the grammar definition, returns a tuple containing two values: an instance of the abstract syntax (i.e., parse tree) with an appropriate root node label ("Plus", "Mult", "Log", "Variable", or "Number"), and the remaining, unparsed suffix of the input token sequence.

>>> term(['log', '(', 'mult', '(', '#', '2', ',', '#', '3', ')', ')'])
{ "Log": [
    { "Mult": [
        { "Number": [2]},
        { "Number": [3]}
      ]
    }
  ]
}
Implement a function formula() that takes a single token string as an argument and, if the token sequence conforms to the grammar definition, returns a tuple containing two values: an instance of the abstract syntax (i.e., parse tree) with an appropriate root node label ("True", "False", "Not", "And", "Or", "Equal", "LessThan", or "GreaterThan"), and the remaining, unparsed suffix of the input token sequence.
Implement a function program() that takes a single token string as an argument and, if the token sequence conforms to the grammar definition, returns a tuple containing two values: an instance of the abstract syntax (i.e., parse tree) with an appropriate root node label ("Print", "Assign", or "End"), and the remaining, unparsed suffix of the input token sequence.
Implement a function complete() that takes a single string as an argument. If the input string conforms to the grammar of the programming language, the function should return a parse tree corresponding to that input string.

Extend or modify your parser from the previous problem to support the following grammar:

program

::=

print formula ; program | print term ; program

assign @ variable := term ; program

end ;

formula

::=

true | false

not ( formula ) | and ( formula , formula ) | or ( formula , formula )

equal ( term , term ) | less than ( term , term ) | greater than ( term , term )

( formula && formula ) | ( formula || formula )

( term == term ) | ( term < term ) | ( term > term )

term

::=

plus ( term , term ) | mult ( term , term ) | log ( term ) | @ variable | # number

( term + term ) | ( term * term )

variable

::=

non-empty character strings containing only alphabetical characters

number

::=

positive integers or negative integers preceded by −

The language that conforms to the above grammar is a strict superset of the language in the previous problem, so your solution should behave the same way as before on the subset of the language in the previous problem. You may use the same parse tree labels for both the prefix and infix notation for each construct and operator (&& corresponds to and, || corresponds to or, == corresponds to equal, < corresponds to less than, > corresponds to greater than, + corresponds to plus, and * corresponds to mult).

[link] 4. Semantics, Evaluation, and Interpretation

[link] 4.1. Formally defining an abstract syntax

While the abstract syntax of a programming language is the set of data structure instances that represent programs, it is also useful to model the abstract syntax as a mathematical object in its own right. This makes it possible to define formally (i.e., mathematically, independently of any implementation language, platform, operating system, and so on) the meaning of a language, and how programs can be run. It also makes it possible to formally define analyses on programs, as well as properties of transformations over programs.

Typically, an abstract syntax definition will closely match a concrete syntax definition, except that there is no need to specify the token set, and redundant syntactic constructs and syntactic sugar will be eliminated. For example, parentheses are a syntactic convention that is not necessary if one is working with parse trees because parse trees are already grouped implicitly due to the tree structure of the abstract syntax instance.

[link]

Example: The following is an example of an abstract syntax definition.

formula

::=

true | false | not formula | formula and formula | formula or formula

Notice the omission of the parentheses. Also, there is no need to be concerned with the fixity (i.e., infix vs. prefix) of binary operators, since this definition is not being used to implement a parsing algorithm.

[link] 4.2. Denotational semantics and operational semantics

The abstract syntax of a programming language is a set of symbolic objects (i.e., the abstract syntax instances, such as programs) that have no meaning unless a meaning is assigned to them. There are two ways in which we can assign meaning to these objects. We can choose to assign a mathematical object to each abstract syntax instance, or we can define a collection of deterministic transformations that specify how we can convert each abstract syntax instance into another abstract syntax instance. Roughly speaking, assigning a mathematical object to each program tells us what it means, while specifying symbolic converion rules tells us how to run the program.

[link]

Definition: The denotational semantics of an abstract syntax is a mapping from the set of abstract syntax instances A to some mathematical set of objects D, which is often called a semantic domain or just domain. The mapping from A to D itself is often denoted using the circumfix Oxford double bracket notation [[ ... ]], and the definition of a denotational semantics of A (i.e., the definition of this mapping [[ ... ]]) is often specified using a collection of inference rules.

[link]

Definition: Let A be an abstract syntax of a programming language. The operational semantics of an abstract syntax is a set of rules that specify how each abstract syntax instance a ∈ A can be transformed some kind of object that represents the result of performing the computation described by A.

There are distinct kinds of operational semantics, such as small-step semantics and big-step semantics (also known as natural semantics). In this course, the operational semantics we will be using is closest to big-step semantics, with some simplifications. We adopt this particular approach to defining operational semantics because it corresponds more closely to a functional, recursive implementation of an algorithm for interpreting programs.

The operational semantics for a programming language represents a contract, a set of constraints, or a set of requirements that an algorithm that implementing an interpreter or compiler of that language must respect in order to be considered correct. However, whoever builds an implementation of an interpreter or compiler for a language has full freedom and flexibility in how they choose to implement the interpreter in all other aspects as long as it conforms to the operational semantics. This is what makes it possible to introduce optimizations into interpreters and compilers (such as optimizations to improve performance or reduce use of memory) while preserving the correctness of their behavior. The operational semantics for a programming language is defined using a collection of inference rules.

[link]

Definition: An inference rule is a notation used within mathematics and computer science to define relationships between mathematical facts and formulas. Each inference rule consists of a horizontal line with zero or more logical formulas above the line and one logial formula below the line. The logical formulas above the line are called premises, and the formula below the line is called the conclusion.

[Name-of-Inference-Rule]

premise premise

conclusion

[Example]

sun is out sky is clear

it is not raining

An inference rule can be interpreted as a portion of a larger algorithm. The premises specify the recursive calls, or calls to other functions, that may need to be made, and the results that are obtained from those invocations. The conclusion specifies what inputs can be handled by that inference rule, and what outputs should be returned given those inputs and the premises.

[Algorithm-Case]

input₁ ⇓ output₁ input₂ ⇓ output₂

input₀ ⇓ output₀

Note that in the above, input₁ and input₂ may depend on input₀, and output₀ may depend on output₁ and output₂. In other words, one could rewrite an inference rule in the following way using natural language:

[Algorithm-Case]

invoking this or another algorithm with input₁ yields output₁

given input₀, and if premises above are true, then output output₀

[link] 4.3. Evaluation of expressions

The abstract syntax, or a subset of the abstract syntax, of a programming language is considered to be a set of expressions if the language's operational semantics do not impose any restrictions on the order in which a computation can operate on the expression to produce a result, called a value. This is possible because expressions usually represent operations with no side effects, such as emitting output to a screen, reading or writing files, looking at a clock, controlling a device, and so on.

[link]

Definition: Let A be an abstract syntax of a programming language, and let V be some subset of A that we will call the value set. This set will represent the possible meanings of parse trees in A, and it will represent the possible results of evaluating parse trees in A. Values that can occur directly within abstract syntax trees of the language (e.g., numeric and string literals, constructors, and so on) are usually called constants.

[link]

Definition: An evaluation algorithm converts any abstract syntax tree that represents an expression into an abstract syntax tree that represents a value.

expressions
(abstract syntax)

⇒

evaluation
algorithm

⇒

values
(abstract syntax)

[link]

Example: Define the abstract syntax according to the following grammar, with A consisting of all formula abstract syntax instances, and V = {true, false} consisting of all value abstract syntax instances:

formula

::=

value | not formula | formula and formula | formula or formula

value

::=

true | false

The following is a definition of an operational semantics for this language.

[True]

true ⇓ true

[False]

false ⇓ false

[Not-True]

f ⇓ true

not f ⇓ false

[Not-False]

f ⇓ false

not f ⇓ true

[And-True-True]

f₁ ⇓ true f₂ ⇓ true

f₁ and f₂ ⇓ true

[And-True-False]

f₁ ⇓ true f₂ ⇓ false

f₁ and f₂ ⇓ false

[And-False-True]

f₁ ⇓ false f₂ ⇓ true

f₁ and f₂ ⇓ false

[And-False-False]

f₁ ⇓ false f₂ ⇓ false

f₁ and f₂ ⇓ false

[Or-True-True]

f₁ ⇓ true f₂ ⇓ true

f₁ or f₂ ⇓ true

[Or-True-False]

f₁ ⇓ true f₂ ⇓ false

f₁ or f₂ ⇓ true

[Or-False-True]

f₁ ⇓ false f₂ ⇓ true

f₁ or f₂ ⇓ true

[Or-False-False]

f₁ ⇓ false f₂ ⇓ false

f₁ or f₂ ⇓ false

[link]

Example: The rules in the above example are numerous, and having this many rules in a definition becomes impractical (especially with more complex operators and language constructs). To address this, we can define a meta-language on the set of values V = {true, false}.

¬ true

false

¬ false

true

true ∧ true

true

true ∧ false

false

false ∧ true

false

false ∧ false

false

true ∨ true

true

true ∨ false

true

false ∨ true

true

false ∨ false

false

The definition then becomes more concise.

[True]

true ⇓ true

[False]

false ⇓ false

[Not]

f ⇓ v

not f ⇓ ¬ v

[And]

f₁ ⇓ v₁ f₂ ⇓ v₂

f₁ and f₂ ⇓ v₁ ∧ v₂

[Or]

f₁ ⇓ v₁ f₂ ⇓ v₂

f₁ or f₂ ⇓ v₁ ∨ v₂

We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this simple language. The functions vnot, vand, and vor correspond to ¬, ∧, and ∨, respectively, in the operational semantics.

def vnot(v):
    if v == 'True':  return 'False'
    if v == 'False': return 'True'

def vand(v1, v2):
    if v1 == 'True'  and v2 == 'True':  return 'True'
    if v1 == 'True'  and v2 == 'False': return 'False'
    if v1 == 'False' and v2 == 'True':  return 'False'
    if v1 == 'False' and v2 == 'False': return 'False'

def vor(v1, v2):
    if v1 == 'True'  and v2 == 'True':  return 'True'
    if v1 == 'True'  and v2 == 'False': return 'True'
    if v1 == 'False' and v2 == 'True':  return 'True'
    if v1 == 'False' and v2 == 'False': return 'False'

Node = dict
Leaf = str

def evaluate(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Not':
                f = children[0]
                v = evaluate(f)
                return vnot(v)
            elif label == 'And':
                # Notice that we can make the recursive calls
                # below in any order we like; the order of the
                # calls is not specified by the operational
                # semantics.
                f1 = children[0]
                v1 = evaluate(f1)
                f2 = children[1]
                v2 = evaluate(f2)
                return vand(v1, v2)
            elif label == 'Or':
                f2 = children[1]
                v2 = evaluate(f2)
                f1 = children[0]
                v1 = evaluate(f1)
                return vor(v1, v2)
    elif type(e) == Leaf:
        if e == 'True':
            return 'True'
        if e == 'False':
            return 'False'

In fact, we could convert the above so that it returns a Python value directly, rather than an abstract syntax tree corresponding to a value.

def evaluate(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Not':
                f = children[0]
                v = evaluate(f)
                return not v # Use the Python not operator.
            elif label == 'And':
                f1 = children[0]
                v1 = evaluate(f1)
                f2 = children[1]
                v2 = evaluate(f2)
                return v1 and v2 # Use the Python and operator.
            elif label == 'Or':
                f2 = children[1]
                v2 = evaluate(f2)
                f1 = children[0]
                v1 = evaluate(f1)
                return v1 or v2 # Use the Python or operator.
    elif type(e) == Leaf:
        if e == 'True':
            return True # Use the Python True constant.
        if e == 'False':
            return False # Use the Python False constant.

[link] 4.4. Execution of sequences of statements

The abstract syntax, or a subset of the abstract syntax, of a programming language is considered to be a set of statements if the operational semantics for that syntax impose a sequential ordering on the computation they represent. This usually corresponds to the notion of a single point of control flow traversing the program's abstract syntax as the program is executed (which is a good model of a computer architecture that contains only a single processor that can perform one instruction at a time).

Languages that have statements and a notion of control flow are called imperative. Languages that have only expressions and no statements can be called pure or functional, although that terminology will often have additional meanings in such a context that we will go over in subsequent sections.

[link]

Definition: Let A be an abstract syntax of a programming language, and let V be some subset of A. Let O be a set of possible outputs (e.g., a text terminal or file).

In this course, we will denote a possible output in O using the symbol o.
We denote the empty output using o₀.
We denote an output o preceded by the value v as v;o (that is, v is the first value in the output v;o, and o is the rest).
Given two outputs o₁ and o₂, o₁;o₂ represents two outputs in sequence: o₁ followed by o₂.

[link]

Definition: An execution algorithm converts any abstract syntax tree that represents a sequence of statements into an abstract syntax tree that represents an output.

statement sequence
(abstract syntax)

⇒

execution
algorithm

⇒

output
(record of side effects)

Thus, we can represent an output in Python using, for example, a list of values.

[link]

Example: We present an operational semantics for a simple programming language with statements.

program

::=

print formula ; program | end ;

formula

::=

value | not formula | formula and formula | formula or formula

value

::=

true | false

Recall that O represents the set of outputs (e.g., to the terminal), with o₀ representing the empty output and v;o representing a terminal o with the value v as its first output.

[Statement-Print]

p ⇓ o f ⇓ v

print f ; p ⇓ v;o

[Statement-End]

end ; ⇓ o₀

[Formula-True]

true ⇓ true

[Formula-False]

false ⇓ false

[Formula-Not]

f ⇓ v

not f ⇓ ¬ v

[Formula-And]

f₁ ⇓ v₁ f₂ ⇓ v₂

f₁ and f₂ ⇓ v₁ ∧ v₂

[Formula-Or]

f₁ ⇓ v₁ f₂ ⇓ v₂

f₁ or f₂ ⇓ v₁ ∨ v₂

We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this simple language. We represent outputs in O as Python lists of values, because lists preserve order. Note that we refer to the implementation of evaluate() from a previous example.

def execute(s):
    if type(s) == Leaf:
        if s == 'End':
            return []
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(f) # Implemented elsewhere.
                o = execute(p)
                return [v] + o

As before, we can actually use the Python print() function and turn this into an interpreter that uses the Python runtime system directly to interpret the program. However, we must be careful about where we put the call to print(): it must be invoked in a way that conforms to a pre-order traversal of the syntax tree.

def execute(s):
    if type(s) == Leaf:
        if s == 'End':
            return []
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(f)
                print(v) # Must appear before recursive call.
                o = execute(p)
                return [v] + o

In an operational semantics for a language with variables and variable assignment, a data structure called an environment is usually used to represent what variables are in scope, as well as the values assigned to variables. We introduce a mathematical definition of an environment so that we can employ it within our operational semantics definitions.

[link]

Definition: Given a programming language definition with a set of values V and a set of possible variable names X (as defined by the concrete syntax), an environment or context is a mapping from a subset of the variables X to the set of values V.

In this course, we will usually denote an environment using the symbol Σ.
Given a variable name x ∈ X, we denote by Σ(x) the value that the environment Σ assigns to x.
An empty environment (one that does not map any variable names to any values) is denoted using Σ₀.
Given an environment Σ, a variable x ∈ X, and a value v ∈ V, we denote by Σ ⊎ {x ↦ v} the extension of the environment Σ with a new mapping from x to v (i.e., if Σ' = Σ ⊎ {x ↦ v} then Σ'(x) = v).

We can update our diagram for an execution algorithm.

environment

statement sequence
(abstract syntax)

⇒

execution
algorithm

⇒

output
(record of side effects)

Within implementations of interpreters and compilers, environments will often be represented with finite maps, associative lists, dictionary, or stacks, depending on the structure and semantics of the language.

[link]

Example: We present an operational semantics for a simple programming language with variables and variable assignment.

program

::=

print formula ; program | assign variable := formula ; program | end ;

formula

::=

value | variable | not formula | formula and formula | formula or formula

variable

::=

any valid variable name as defined by the concrete syntax

value

::=

true | false

Notice the use of an environment Σ, also known as a context, that maps variable names to values.

[Statement-Assign]

Σ ⊎ {x ↦ v}, p ⇓ o Σ, f ⇓ v

Σ, assign x := f ; p ⇓ o

[Statement-Print]

Σ, p ⇓ o Σ, f ⇓ v

Σ, print f ; p ⇓ v;o

[Statement-End]

Σ, end ; ⇓ o₀

[Formula-True]

Σ, true ⇓ true

[Formula-False]

Σ, false ⇓ false

[Formula-Not]

Σ, f ⇓ v

Σ, not f ⇓ ¬ v

[Formula-And]

Σ, f₁ ⇓ v₁ Σ, f₂ ⇓ v₂

Σ, f₁ and f₂ ⇓ v₁ ∧ v₂

[Formula-Or]

Σ, f₁ ⇓ v₁ Σ, f₂ ⇓ v₂

Σ, f₁ or f₂ ⇓ v₁ ∨ v₂

[Formula-Variable]

Σ(x) = v

Σ, x ⇓ v

We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this language. We represent environments as dictionaries in which variable names are labels, and these labels map to values in the language. The functions vnot, vand, and vor are the same as those in a previous example.

def evaluate(env, e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Not':
                f = children[0]
                v = evaluate(env, f)
                return vnot(v)
            elif label == 'And':
                f1 = children[0]
                v1 = evaluate(env, f1)
                f2 = children[1]
                v2 = evaluate(env, f2)
                return vand(v1, v2)
            elif label == 'Or':
                f2 = children[1]
                v2 = evaluate(env, f2)
                f1 = children[0]
                v1 = evaluate(env, f1)
                return vor(v1, v2)
            elif label == 'Variable':
                x = children[0]
                if x in env:
                    return env[x]
                else:
                    print(x + " is unbound.")
                    exit()
    elif type(e) == Leaf:
        if e == 'True':
            return 'True'
        if e == 'False':
            return 'False'

def execute(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return (env, [])
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(env, f)
                o = execute(env, p)
                return [v] + o
            if label == 'Assign':
                children = s[label]
                x = children[0]['Variable'][0]
                f = children[1]
                p = children[2]
                v = evaluate(env, f)
                env[x] = v
                o = execute(env, p)
                return o

If an abstract syntax parse tree node for a statement can have two or more children that are also statement sequences, and if the operational semantics of the language require that statements must be executed consecutively (i.e., not in parallel), then the execution algorithm must also return a new copy of the environment, modified to reflect the variable assignments that might have occurred in the statement sequence.

environment

statement sequence
(abstract syntax)

⇒

execution
algorithm

⇒

environment

output
(record of side effects)

[link]

Example: We present an operational semantics for an extension of a language in a previous example.

program

::=

print formula ; program

assign variable := formula ; program

repeat twice { program } program

end ;

⋮

An example of a program in this language is presented below:

print true;
repeat twice {
print false;
}
print true;

The operational semantics inference rules are the same as those in a previous example. However, we must modify the inference rules for statements. In particular, the predicate ⇓ that defines execution of statements must also specify the new environment that exists after a statement is executed. This is to capture the fact that a program may have variable assignments in a block of code that may be executed multiple times (or, in other languages with branching constructs, may be executed conditionally).

[Statement-Assign]

Σ₁ ⊎ {x ↦ v}, p ⇓ Σ₂, o Σ₁, f ⇓ v

Σ₁, assign x := f ; p ⇓ Σ₂, o

[Statement-Print]

Σ₁, p ⇓ Σ₂, o Σ₁, f ⇓ v

Σ₁, print f ; p ⇓ Σ₂, v;o

[Statement-Repeat-Twice]

Σ₁, p₁ ⇓ Σ₂, o₁ Σ₂, p₁ ⇓ Σ₃, o₂ Σ₃, p₂ ⇓ Σ₄, o₃

Σ₁, repeat twice { p₁ } p₂ ⇓ Σ₄, o₁;o₂;o₃

[Statement-End]

Σ, end ; ⇓ Σ, o₀

We can convert the above operational semantics inference rules into a Python implementation of an interpreter for this language. The implementation of evaluate() is the same as the one in a previous example, while the implementation of execute() from that example must be modified so that it also returns an updated version of the environment data structure.

def execute(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return (env, [])
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                children = s[label]
                f = children[0]
                p = children[1]
                v = evaluate(env, f)
                (env, o) = execute(env, p)
                return (env, [v] + o)
            if label == 'Assign':
                children = s[label]
                x = children[0]['Variable'][0]
                f = children[1]
                p = children[2]
                v = evaluate(env, f)
                env[x] = v
                (env, o) = execute(env, p)
                return (env, o)
            if label == 'RepeatTwice':
                children = s[label]
                body = children[0]
                rest = children[1]
                env1 = env
                (env2, o1) = execute(env1, body)
                (env3, o2) = execute(env2, body)
                (env4, o3) = execute(env3, rest)
                return (env4, o1 + o2 + o3)

[link] 4.5. Interpreters

An interpreter is an algorithm that invokes evaluation and execution algorithms as required by the operational semantics for a language.

program
(abstract syntax)

⇒

evaluation
algorithm

execution
algorithm

⇒

output and/or
value(s)

Often, the term interpreter will refer to the entire toolchain of components that make it possible to convert the character string (i.e., concrete syntax) representation of a program in the source language into value(s) and/or a record of the side effects of a program. In these cases, we will refer to the individual component that performs interpretation as the interpretation algorithm or interpretation component.

character string
(concrete syntax)

⇒

parser

⇒

abstract
syntax

⇒

interpretation
algorithm

interpreter

⇒

output and/or
value(s)

[link] 4.6. Assignment #2: More Parsing Algorithms and Interpreters

In this assignment you will implement a full, working interpreter for a small imperative programming language. You must submit two Python source file named hw2/parse.py and hw2/interpret.py. Please follow the gsubmit directions and remember to put your files in the hw2 directory.

Your solutions to each of the problem parts below will be graded on their correctness, concision, and mathematical legibility. The different problems and problem parts rely on the lecture notes and on each other; carefully consider whether you can use functions from the lecture notes, or functions you define in one part within subsequent parts. You may want to import the following library functions in your code:

from math import log, floor

A testing script with several test cases is available for download: hw2-tests.py. Feel free to modify or extend it as you see fit.

In this problem you will implement a parser for a simple programming language. All the functions you define in this problem should appear in the file parse.py. You may adapt the techniques and/or code from Assignment #1 where appropriate to complete this problem. Recall that all parsers take a sequence of tokens (and possibly a flag) as their input, and return a tuple containing a parse tree and a sequence of remaining tokens.

Include in your code the parsers variable() and number() that can parse token sequences that conform to the following two productions. The functions should return a string or integer, respectively, without wrapping the result in a dictionary. If the first token in the token sequence supplied to a function does not correspond to the production, the function should return None.
variable
::=
any alphanumeric string beginning with a lowercase alphabetical letter
number
::=
any valid integer that is positive or negative
Implement a parser formula() that can parse token sequences that conform to the following production. You may need to perform left-recursion elimination on a production before you can implement a working parser. You should use the abstract syntax tree node labels "Xor", "Not", "Parens", "Variable", "True", and "False". If the token sequence is not in the language defined by the production, the function should return None (this will be tested, and is necessary to receive full credit).
formula
::=
formula xor formula

|
not ( formula )

|
( formula )

|
variable

|
true

|
false
Implement a parser term() that can parse token sequences that conform to the following productions. These productions ensure that the traditional order of operations for addition and multiplication is preserved in a parse tree. However, you may need to perform left-recursion elimination on a production before you can implement a working parser. You should use the abstract syntax tree node labels "Plus", "Mult", "Log", "Parens", "Variable", and "Number". If the token sequence is not in the language defined by the production, the function should return None (this will be tested, and is necessary to receive full credit).
term
::=
factor + term

|
factor
factor
::=
factor * factor

|
log ( term )

|
( term )

|
variable

|
number

Implement a parser program() that can parse token sequences that conform to the following production. You may want to implement a helper parser function expression(). You should use the abstract syntax tree node labels "Print", "Assign", "If", "While", and "End". Note that the last choice in the production defining program is meaningful: it means that an empty token sequence is a valid program. The parse tree corresponding to an empty program should be "End". If the token sequence is not in the language defined by the production, the function should return None (this will be tested, and is necessary to receive full credit).

program

::=

print expression ; program

assign variable := expression ; program

if expression { program } program

while expression { program } program

expression

::=

formula

term

Recall that the last program subtree in each of the choices in the production defining program represent the rest of the program. For example, in the choice "if expression { program } program", the first program subtree is the body of the if statement, while the second program subtree is the rest of the program. For another example, the following is one possible program in this language:

if true {
print true;
}
print false;

In this problem you will implement an interpreter for a simple programming language. All the functions you define in this problem should appear in the file interpret.py. Your interpreter must conform to the operational semantics for this language, as presented in each of the problem parts below.

Implement a function evalTerm(env, t) that takes an environment env and a parse tree t as its two arguments. The function should return the value that corresponds to the evaluation of the parse tree t, as determined by the operational semantics below.
[Term-Number]

Σ, n ⇓ n
[Term-Variable]
Σ(x) = v
Σ, x ⇓ v
[Term-Parens]
Σ, t ⇓ v
Σ, ( t ) ⇓ v
[Term-Log]
Σ, t ⇓ v
Σ, log ( t ) ⇓ ⌊ log₂ (v) ⌋
[Term-Plus]
Σ, t₁ ⇓ v₁           Σ, t₂ ⇓ v₂
Σ, t₁ + t₂ ⇓ v₁ + v₂
[Term-Mult]
Σ, t₁ ⇓ v₁           Σ, t₂ ⇓ v₂
Σ, t₁ * t₂ ⇓ v₁ ⋅ v₂
Implement a function evalFormula(env, f) that takes an environment env and a parse tree f as its two arguments. The function should return the value that corresponds to the evaluation of the parse tree f, as determined by the operational semantics below (let ⊕ represent the logical exclusive or operation, which can be implemented with the Python built-in relational operator !=).
[Formula-True]

Σ, true ⇓ true
[Formula-False]

Σ, false ⇓ false
[Formula-Variable]
Σ(x) = v
Σ, x ⇓ v
[Formula-Parens]
Σ, f ⇓ v
Σ, ( f ) ⇓ v
[Formula-Not]
Σ, f ⇓ v
Σ, not f ⇓ ¬ v
[Formula-Xor]
Σ, f₁ ⇓ v₁           Σ, f₂ ⇓ v₂
Σ, f₁ xor f₂ ⇓ v₁ ⊕ v₂

Implement a function execProgram(env, s) that takes an environment env and a parse tree s as its two arguments. The function should return a tuple containing the output (represented as a list of values) and an updated environment that represent the result of the execution of the program s as determined by the operational semantics below. Note that end ; corresponds to the empty token sequence and is used as a placeholder to make the notation unambiguous.

[Statement-Print]

Σ₁, p ⇓ Σ₂, o Σ₁, e ⇓ v

Σ₁, print e ; p ⇓ Σ₂, v;o

[Statement-Assign]

Σ₁ ⊎ {x ↦ v}, p ⇓ Σ₂, o Σ₁, e ⇓ v

Σ₁, assign x := e ; p ⇓ Σ₂, o

[Statement-If-False]

Σ₁, p₂ ⇓ Σ₂, o₁ Σ₁, e ⇓ false

Σ₁, if e { p₁ } p₂ ⇓ Σ₂, o₁

[Statement-If-True]

Σ₁, p₁ ⇓ Σ₂, o₁ Σ₂, p₂ ⇓ Σ₃, o₂ Σ₁, e ⇓ true

Σ₁, if e { p₁ } p₂ ⇓ Σ₃, o₁;o₂

[Statement-While-False]

Σ₁, p₂ ⇓ Σ₂, o₁ Σ₁, e ⇓ false

Σ₁, while e { p₁ } p₂ ⇓ Σ₂, o₁

[Statement-While-True]

Σ₁, p₁ ⇓ Σ₂, o₁ Σ₂, while e { p₁ } p₂ ⇓ Σ₃, o₂ Σ₁, e ⇓ true

Σ₁, while e { p₁ } p₂ ⇓ Σ₃, o₁;o₂

[Statement-End]

Σ, end ; ⇓ Σ, o₀

Implement a function interpret() that takes a single string as its argument. The function should tokenize the string and parse it to generate a parse tree. If parsing fails, it should return None; otherwise, it should execute the parse tree to obtain an output. It should then return the output. Your implementation only needs to interpret entire programs; you do not need to interpret stand-alone terms or formulas.

Extra credit: Implement one or more of the following extensions and optimizations (listed in order of increasing difficulty).
1. Add the infix relational operators == and < to the language. You will need to extend both the parsing and evaluation algorithms.
2. Optimize your implementation of execProgram() not to evaluate the expression in an assignment statement if the variable being assigned never appears in the rest of the program. You may want to implement a helper function that checks if a variable appears in a parse tree corresponding to a program.

[link] 5. Compilation

Compilation involves translating or transforming one programming language (sometimes called the source language) into another (often called the target language). It is often the case that the target language is a more "low-level" language that more closely corresponds to the capabilities of a physical (i.e., electronic) device, such as the central processor (CPU), memory, and other devices that constitute a modern computer. However, today, compilation could also refer to any transformation from one programming language to another. For example, it might be necessary to compile PHP to ASP, which are both server-side web application scripting languages, in order to migrate software from one operating system to another.

[link] 5.1. History, background, and context

One way to summarize a very thin slice of the history of general-purpose computers and their relationship with programming languages is to consider how the increasingly complex structure of a computer both enabled and required increasingly abstract programming languages and programming language features.

The three major categories of concept involved are the program, the computing device, and the physical laws upon which the computing device is based. As long as the physical laws being used are predictable (i.e., consistent across time and space), we can build computing devices that behave the same way even if they are operated in different places or at different times. We can then interpret what that device does in some meaningful way (e.g., as arithmetic operations). To perform different kinds of tasks using the device, we would need some kind of way to change its initial configuration (i.e., to "program" it).

program (e.g., input/configuration, instructions, etc.)

computing device (e.g., difference engine, modern CPU, quantum computer, etc.)

physical laws (mechanics, electricity, chemistry, biology, quantum mechanics, etc.)

Until the 1940s, all artifacts that could be considered computing devices (e.g., an abacus, Babbage's difference engine, etc.) were single-purpose machines that could only perform the same operation (but possibly on different inputs, which corresponded to different initial configurations of the device). In the early 1940s, a number of computers were built (such as the ENIAC and Colossus) that could be physically reconfigured to perform different operations. Arguably, the "program" was the machine's physical configuration. In the late 1940s, computers were developed (or improved, as in the case of ENIAC) that could operate according to a program stored in a memory device. At this point, both the program and the input were stored in a single, unified memory component; the program told the central processing unit (CPU) what operations to perform, and what modifications to make to memory. This characteristic is part of what's known as the Von Neumann architecture.

memory

input

program

central processing unit (CPU)

Subsequent improvements to computing device components in the 1950s and onward primarily served to make the components smaller, cheaper, faster, and more reliable while still conforming to the Von Neumann architecture. However, the programs controlling these devices continued to become more and more sophisticated.

Machine languages were developed in the 1940s (particularly in the very late 1940s) and the 1950s. These consisted of small collections of instructions that corresponded to operations that could be performed by the CPU. A program was then a list of machine language instructions, and the abstraction this supplied to the programmer was that of a block of memory indexed by an integer range, and a single processor that could perform a sequence of operations on memory. The diagram below provides a stylized depiction of the view from a user's or programmer's perspective.

machine program

machine languauge

CPU

memory

hardware

In part because using machine languages to specify the desired behavior of computers was time-consuming and error-prone, more human-friendly and legible languages were developed in the mid and late 1950s, such as FORTRAN, LISP, and COBOL. These programs were mechanically transformed into machine language programs using a compiler, which was itself a program. For example, after a human wrote a FORTRAN program, it could be compiled once into a machine language program by running the compiler once. The machine language program this produced could then be run any number of times or copied and reused later.

FORTRAN program

FORTRAN compiler

machine program

machine languauge

CPU

memory

hardware

Special-purpose applications could be written using either a machine language or a high-level language.

application

high-level program

compiler

machine program

machine languauge

CPU

memory

hardware

Once it became possible for non-experts and multiple users to use a computer with a variety of different components beyond the memory and CPU (due to advances in many areas, including reductions in cost and the availability of better interface devices and special-purpose components), operating systems were developed to make it possible to manage multiple applications that might run by different users and might employ different components. However, all the applications would need to be either written using a machine language, or written in a language that could be compiled into a machine language. The operating system would then manage how these machine programs could use the various components and resources available to the computer (including the resource of time).

user A

application

user B

application

high-level
program

compiler

machine
program

machine languauge

operating system

CPU

memory

input device

output device

other device

hardware

One could collapse this into a relatively simple diagram by omitting the portions of the diagram that do not represent components that either appear to be running to the user, or are actually running on the device.

application

operating system

hardware

The above diagram might represent a single computer in a single physical location running a single operating system and a few applications. Today, this picture is much more complicated. While many devices are still captured by the above diagram, in many other cases the actual abstractions provided are much more rich and diverse, and they reflect the diverse ecosystem of capabilities and components available to a user, programmer, or application.

app.

web app. platform

browser

app.

API

web service

virtual machine

dist. platform

web server

virtualization layer

controller

host operating system

node

server

hardware (web/cloud server)

Today, the picture is much more complicated. Many applications running on a device actually provide capabilities that are assembled from a collection of underlying layers of potentially interdependent components. Every component in every layer must provide an interface and a collection of abstractions that capture its capabilities and communicate them to users (and expose them to the components on the layers above). Effectively, every component in every layer has a programming language, and assembling multilayered applications may require many compilation steps at various layers.

[link] 5.2. Transformations between abstract syntaxes

[link]

Definition: Given two abstract syntaxes A and B (i.e., both A and B are sets of parse trees that conform to two distinct grammars), an algorithm that always terminates and maps parse trees in A to parse trees in B is called a compilation algorithm, and it is said to compile programs in the language corresponding to A into programs in the language corresponding to B. Note that while a compiler must always terminate on all inputs, it is allowed to return an error on some inputs.

Note the similarities and distinctions between an interpreter and a compiler. Both interpreters and compilers map some set of abstract syntax trees to some other set of abstract syntax trees (in the case of interpreters, the resulting abstract syntax trees are usually values that are simpler than programs). However, an interpreter might not terminate (e.g., if the program it is interpreting does not terminate when it is evaluated and/or executed), while a compiler must terminate. Furthermore, an interpreter must keep track of both values and side effects (such as output), while a compiler is simply transforming one program into another program without concerning itself with what the program might do when evaluated/executed.

source language
abstract syntax

⇒

compiler

⇒

target language
abstract syntax

Often, the term compiler will refer to the entire toolchain of components that make it possible to convert the character string (i.e., concrete syntax) representation of a program in the source language into the concrete syntax representation of the program in the target language. In these cases, we will refer to the individual component that performs compilation as the compilation algorithm or compilation component.

source language
concrete syntax

libraries

⇓

parser

linker

⇒

target language
concrete syntax

⇓

⇑

source language
abstract syntax

⇒

compilation
algorithm

⇒

target language
abstract syntax

Furthermore, a compilation algorithm will often consist of multiple stages. Each stage may perform some specific task, such as simplifying the features used in a program (e.g., eliminating procedures and replacing them with labels and branching statements, or eliminating variable names and replacing them with explicit memory addresses), or it may perform optimizations (e.g., elimination of unreachable code blocks). The intermediate representation (IR) between each stage might be a distinct abstract syntax data structure.

source
language
abstract
syntax

⇒

comp.
alg. A

⇒

IR
#1

⇒

comp.
alg. B

⇒

IR
#2

⇒

comp.
alg. C

⇒

target
language
abstract
syntax

[link] 5.3. Machine languages and bytecodes

[link]

Definition: Given a model of a computer that consists of a single central processing unit (CPU) that can perform some basic operations, and a block of memory (accessible to the CPU) that is organized using some kind of sequential and/or hierarchical addressing scheme (e.g., positive integers in some range represent memory locations), a machine language, assembly language, or machine code is an imperative programming language that consists of instructions that can directly be executed by a single CPU.

Note that in modern systems and networks, machine code might not be executed by a CPU directly. Instead, a program might be running on an operating system that might itself be running on a virtual machine, which is being simulated by a program that runs on a physical computer but emulates the behavior of a (perhaps very different) CPU and memory.

[link]

Definition: A bytecode is a low-level programming language that is similar to a machine language (i.e., the instructions in the language are primarily for manipulating memory and performing basic arithmetic and logical operations) but is not platform-dependent (i.e., dependent on a particular CPU model), and itself needs to be either interpreted (e.g., by a run-time system or virtual machine) or compiled into a machine language.

Note that there is a distinction between machine code and bytecode, though this distinction is increasingly harder to discern in practice with the advent of widespread virtualization. Usually, machine code is said to be executed when it is run on its native CPU, and simulated if it is being executed by some software application that emulates the behavior of a CPU; bytecode is typically executed by an interpreter, which might be called a run-time system or a virtual machine (depending on the features and capabilities of the interpreter, some of which will be discussed in more detail in subsequent sections). Examples of bytecodes include the Java bytecode (interpreted by the Java Virtual Machine, or JVM) and the Python bytecode.

[link]

Example: Consider the following target language, which has many characteristics of a machine language or bytecode. A program consists of an ordered sequence of instructions that can manipulate a block of memory addresses.

program

::=

instruction program

instruction

::=

label label

branch label address

branch label

set address number

copy address address

add

label

::=

any valid label

address

::=

any positive integer or zero

number

::=

any positive integer or zero

We can implement a simple Python simulator (or virtual machine, if the above is a bytecode) for this language.

# Simulate a machine running the program represented by the string s.
def simulate(s):
    instructions = [l.split(" ") for l in s.split("\n")]
    mem = {0: 0}
    control = 0
    while control < len(instructions):
        inst = instructions[control]
        if inst[0] == 'label':
            pass
        # For two arguments, test the memory address in the second
        # argument; given one argument, branch unconditionally.
        if inst[0] == 'branch':
            if len(inst) == 2 or mem[int(inst[2])] != 0:
                control = instructions.index(['label', inst[1]])
                continue
        if inst[0] == 'set':
            mem[int(inst[1])] = int(inst[2])
        if inst[0] == 'copy':
            mem[int(inst[2])] = mem[int(inst[1])]
        if inst[0] == 'add':
            mem[0] = mem[1] + mem[2]
        control = control + 1
    return mem[0]

A program in this language might look as follows:

>>> p = '''
  label start
  set 1 2
  set 2 3
  branch skip 1
  set 2 40
  label skip
  add
  '''
>>> simulate(p)
5

[link]

Example: In this example, the target language will be a slightly more complex machine language than the one presented in a previous example. A program consists of an ordered sequence of instructions that can manipulate a block of memory addresses. The abstract syntax for the language is presented below.

program

::=

instruction program

instruction

::=

label label

goto label

branch label address

jump address

set address number

copy

add

label

::=

any alphanumeric character string

address

::=

any integer

number

::=

any integer

When a machine is running a program, the state of the machine consists of a single block of memory (indexed by integers: both negative and positive) and a single integer representing the machine's position in the sequence of instructions (this integer represents which instruction in the program currently has control, i.e., which instruction is being executed by the CPU). The memory has eight special memory locations, which are defined below:

memory address 0: the address of the result of an add operation;
memory address 1: the address of the first input to an add operation;
memory address 2: the address of the second input to an add operation;
memory address 3: the address containing the "from" address for a copy operation;
memory address 4: the address containing the "to" address for a copy operation;
memory address 5: the output buffer (set to −1 after every step);
memory address 6: the address of the program's control index.

The meaning of each instruction is described below:

label l: there is no effect and control is passed to the next instruction;
goto l: control immediately moves to the program instruction label l;
branch l a: control immediately moves to the program instruction label l if memory address a contain a non-zero integer;
jump a: if the integer i is stored in memory address a, control moves immediately to the ith instruction in the program;
set a n: memory address a is set to the integer n;
copy: if a is the integer stored in memory address 3, and b is the integer stored in memory address 4, then the contents of memory address a are copied to memory address b;
add: the contents of memory location 1 and memory location 2 are added and stored in memory location 0.

We can implement a simple Python simulator for this language.

As an example of how this language might be used, suppose we want to write a program in this language to copy the contents of memory address 10 to memory address 20. The machine language program presented below would have this behavior:

set 3 10
set 4 20
copy

In fact, we can write a Python program that can generate an instance of the above for any two addresses. Notice that the copy() function below can be viewed as a macro that generates a specific machine language program.

def copy(frm, to):
   return [\
      'set 3 ' + str(frm),\
      'set 4 ' + str(to),\
      'copy'\
   ]

Suppose we have another macro increment() that takes a single argument, a memory address, and generates a machine program that increments the value at the address specified memory address. We could then implement the following macro, which will set an entire region of memory to a particular constant.

def setAllToInRegion(constant, startAddr, endAddr):
  # Memory location 8 contains the constant.
  # Memory location 9 contains the counter.
  return [\
        'set 8 ' + str(constant),\
        'set 9 ' + str(startAddr),\
        'label loop'\
      ] \
      # Set second argument (destination) of copy.
      + copy(9,4)\
      + [\
        # Set first argument (source) for copy.
        'set 3 8',\
        'copy'\
      ]\
      # Increment the counter.
      + increment(9) \
      # If counter is endAddr, break out of loop;
      # otherwise, go back to the beginning of the loop.
      # We check by adding -endAddr to the counter.
      # First argument of add.
      + copy(9,1) \
      + [\
        # Second argument of add.
        'set 2 -' + str(endAddr),\
        'add',\
        'branch loop 0'\
      ]

Running setAllToInRegion() on specific inputs generates a machine language program for a particular region.

>>> setAllToInRegion(3, 10, 100)
['set 8 3',
'set 9 10',
'label loop',
'set 3 9',
'set 4 4',
'copy',
'set 3 8',
'copy',
'# increment() done as homework.',
'set 3 9',
'set 4 1',
'copy',
'set 2 -100',
'add',
'branch loop 0']

[link] 5.4. Compiling expressions to a machine language or bytecode

In order to compile a high-level language to a machine language that only provides the CPU and indexed memory abstractions, it is necessary to transform the high-level language's parse tree into a series of instructions that manipulate memory in a way that appears identical to the behavior of the high-level program when it is being interpreted.

Since machine language programs describe how to manipulate memory, it is a good idea to plan how the machine program will utilize the memory available to it to store intermediate results it needs to compute while the program is running.

[link]

Definition: The region of memory that a machine program uses to store values (i.e., results of computing expressions) is called the heap.

As a machine program continues running and computing more results, it grows the heap by adding values to new memory locations. To keep track of this at the time of compilation, the compiler might maintain a counter that stores the current "top" of the heap. When a machine program is running, it may itself maintain a heap pointer to keep track of the last item stored on the heap, primarily so that it knows where to look for more space if it wants to store more values in memory.

[link]

Example: Suppose we are starting with the following source language:

formula

::=

true | false | not formula | formula and formula | formula or formula

How can we implement a compiler that converts formulas in the source language into sequences of instructions in the target language in the previous example?

We can approach this problem by constructing a function compileFormula() that takes as its input the parse tree, and a variable heap that represents the highest address in memory that contains a computed value (in our case, 1 can represent true and 0 can represent false). The function returns a list of machine language instructions that computes that value, as well as the memory address where that value would be stored if anyone actually executed (or simulated) the machine language program.

def compileFormula(f, heap = 3, fresh = 0):
    if type(f) == Leaf:
        if f == 'True':
            # Find a new memory address on the heap.
            heap = heap + 1
            # Generate instruction to store the integer representing True on the heap.
            inst = 'set ' + str(heap) + ' 1'
            # Return the instruction list and top of the heap.
            return ([inst], heap, fresh)
        if f == 'False':
            # Find a new memory address on the heap.
            heap = heap + 1
            # Generate instruction to store the integer representing False on the heap.
            inst = 'set ' + str(heap) + ' 0'
            # Return the instruction list and top of the heap.
            return ([inst], heap, fresh)
    if type(f) == Node:
        for label in f:
            children = f[label]
            if label == 'Not':
                # Compile the subtree f to obtain the list of
                # instructions that computes the value represented
                # by f.
                f = children[0]
                (insts, heap, fresh) = compileFormula(f, heap, fresh)
                # Generate more instructions to change the memory
                # location in accordance with the definition of the
                # Not operation.
                instsNot = \
                   ["branch setZero" + str(fresh) + " " + str(heap),\
                    "set " + str(heap) + " 1",\
                    "branch finish" + str(fresh),\
                    "label setZero" + str(fresh),\
                    "set " + str(heap) + " 0",\
                    "label finish" + str(fresh)\
                   ]
                return (insts + instsNot, heap, fresh + 1)
            if label == 'Or':
                # Compile the two subtrees and get the instructions
                # lists as well as the addresses in which the results
                # of computing the two subtrees would be stored if someone
                # were to run those machine instructions.
                f1 = children[0]
                f2 = children[1]
                (insts1, heap2, fresh1) = compileFormula(f1, heap, fresh)
                (insts2, heap3, fresh2) = compileFormula(f2, heap2, fresh1)
                # Increment the heap counter so we store the
                # result of computing Or in a new location.
                heap4 = heap3 + 1
                # Add instructions that compute the result of the
                # Or operation.
                instsOr = \
                   ["copy " + str(heap2) + " 1",\
                    "copy " + str(heap3) + " 2",\
                    "add",\
                    "branch setOne" + str(fresh2) + " 0",\
                    "branch finish" + str(fresh2),\
                    "label setOne" + str(fresh2),\
                    "set 0 1",\
                    "label finish" + str(fresh2),\
                    "copy 0 " + str(heap4)\
                   ]
                return (insts1 + insts2 + instsOr, heap4, fresh2 + 1)

[link] 5.5. Compiling statement sequences and procedures to a machine language or bytecode

[link]

Definition: The region of memory that a machine program uses to store information that makes it possible to simulate calls to named procedures is called the stack.

In its simplest form, the stack is simply a region of memory that holds integers corresponding to program locations. The top of the stack always corresponds to the program location to which control must move once the procedure that is currently being executed (or simulated) reaches the end of its body.

[link]

Example: Suppose we are starting with the following source language:

program

::=

variable := formula ; program

print formula ; program

procedure variable { program } program

call variable ; program

formula

::=

How can we implement a compiler that converts formulas in the source language into sequences of instructions in the target language in a previous example?

As an example, suppose we begin with the following program.

procedure printProc {
  print true;
  print false;
}
call printProc;
print false;
call printProc;
print true;

If we want to compile the above program, we will need to ensure that the machine language program passes control to the block of machine code representing the procedure body, and then passes it back when the body finishes running.

However, we must compile the procedure into a single block of machine code. How will that block know whether to come back to the point in the machine language program corresponding to the first call printProc; statement, or to the point corresponding to the second call printProc; statement?

We will need to make sure that the call printProc; are compiled into machine code that stores the index of the current instruction at each call point at a particular memory location (in this example, suppose it is memory address 7). Then, the block of machine code into which the procedure body is compiled can always consult the same memory location to determine to which point it should pass control back after the machine code for the procedure body finishes executing. Because control should not go back to the goto instruction but beyond it, it may be necessary to increment the value representing the program location before using the jump instruction. We provide machine language pseudocode below that does just that.

# The procedure.
goto printProcEnd   # Skip procedure body.
label printProc
set 5 1             # print true;
set 5 0             # print false;

# Return control to the last calling point.
< increment integer stored in address 7 by 2 >
jump 7
label printProcEnd

# The first call.
< copy integer from address 6 to address 7 >
goto printProc

set 5 0             # print false;

# The second call.
< copy integer from address 6 to address 7 >
goto printProc

set 5 1             # print true;

The above approach might work if we never call a procedure from inside another procedure. However, it will not work if a call instruction appears inside the body of a procedure because that call instruction will be compiled into machine language instructions that overwrite the contents of memory address 7, making it impossible to determine the program location to which control must return after the outermost procedure body finishes. To address this, it is possible to have the machine language program maintain a stack starting at memory location 7 that will keep track of all the program locations to which control must return after each nested procedure call finishes. We do so in the machine language pseudocode below.

# Set stack pointer to top of stack at address 8.
set 7 8

# The procedure.
goto printProcEnd   # Skip procedure body.
label printProc
set 5 1             # print true;
set 5 0             # print false;

# Return control to the last calling point.
< copy integer from top of stack
  (i.e., integer/address contained in address 7) to address 1 >

< set address 2 to some constant
  (to skip the calling machine instructions) >

< add contents of address 1 and address 2 >
jump 0
label printProcEnd

# The first call.
< increment integer in address 7 (the stack pointer) >
< copy integer from address 6 to top of stack
  (i.e., address/integer specified in address 7) >

goto printProc

set 5 0             # print false;

# The second call.
< increment integer in address 7 (the stack pointer) >
< copy integer from address 6 to top of stack
  (i.e., address/integer specified in address 7) >

goto printProc

set 5 1             # print true;

[link] 5.6. Assignment #3: Compilation and Review of Interpretation

In this assignment you will practice implementing interpretation and compilation algorithms using Python. You must submit four Python source files:

hw3/parse.py (there is no need to modify this file);
hw3/interpret.py;
hw3/machine.py;
hw3/compile.py.

Please follow the gsubmit directions and remember to put your files in the hw3 directory.

Your solutions to each of the problem parts below will be graded on their correctness, concision, and mathematical legibility. The different problems and problem parts rely on the lecture notes and on each other; carefully consider whether you can use functions from the lecture notes, or functions you define in one part within subsequent parts.

A testing script with several test cases is available for download: hw3-tests.py. You should be able to place it in the same directory with the other assignment files and run it. Feel free to modify or extend it as you see fit.

In this problem, you will implement an interpreter for a high-level language. All the functions you define should be included in the file hw3/interpret.py. The abstract syntax for the language is presented below:

term

::=

number | variable | term + term

formula

::=

expression

::=

term | formula

program

::=

print expression ; program

variable := expression ; program

if expression { program } program

while expression { program } program

procedure variable { program } program

call variable ; program

The relevant portions of the operational semantics are provided below:

[Term-Variable]

Σ(x) = t Σ, t ⇓ v

Σ, x ⇓ v

[Formula-Variable]

Σ(x) = f Σ, f ⇓ v

Σ, x ⇓ v

[Formula-Or-Short]

Σ, f₁ ⇓ true

Σ, f₁ or f₂ ⇓ true

[Formula-Or]

Σ, f₁ ⇓ v₁ Σ, f₂ ⇓ v₂

Σ, f₁ or f₂ ⇓ v₁ ∨ v₂

[Formula-And-Short]

Σ, f₁ ⇓ false

Σ, f₁ and f₂ ⇓ false

[Formula-And]

Σ, f₁ ⇓ v₁ Σ, f₂ ⇓ v₂

Σ, f₁ and f₂ ⇓ v₁ ∧ v₂

[Statement-Assign]

Σ₁ ⊎ {x ↦ e}, p ⇓ Σ₂, o

Σ₁, assign x := e ; p ⇓ Σ₂, o

[Statement-Procedure]

Σ₁ ⊎ {x ↦ p₁}, p₂ ⇓ Σ₂, o

Σ₁, procedure x { p₁ } p₂ ⇓ Σ₂, o

[Statement-Call]

Σ₁(x) = p₁ Σ₁, p₁ ⇓ Σ₂, o₁ Σ₂, p₂ ⇓ Σ₃, o₂

Σ₁, call x ; p₂ ⇓ Σ₃, o₁;o₂

The remaining rules [Term-Number], [Term-Plus], [Formula-True], [Formula-False], [Formula-Not], [Statement-Print], [Statement-If-False], [Statement-If-True], [Statement-While-False], [Statement-While-True], and [Statement-End] are exactly the same as those in Assignment #2. Notice the subtle differences between the above rules for [Formula-Variable], [Term-Variable], and [Statement-Assign] and the versions of these in Assignment #2: in the above rules, expressions are only evaluated at the time a variable is encountered, and not at the time of assignment. Also notice that the environment now also contains mappings from procedure names to procedure bodies.

Implement a function evalTerm(env, t) that takes an environment env and a parse tree t as its two arguments. The function should return the value that corresponds to the evaluation of the parse tree t, as determined by the operational semantics.
Implement a function evalFormula(env, f) that takes an environment env and a parse tree f as its two arguments. The function should return the value that corresponds to the evaluation of the parse tree t, as determined by the operational semantics.
Implement a function execProgram(env, s) that takes an environment env and a parse tree s as its two arguments. The function should return a tuple containing an updated environment and the output (represented as a list of values) that represent the result of the execution of the program s as determined by the operational semantics.

In this problem you will implement several helper functions for building sequences of instructions in a machine language. The file hw3/machine.py already contains a simulator for the machine language with which you will be working; the machine language is defined in detail in a previous example in the lecture notes. Although any correct implementation is acceptable, it is suggested that you follow the conventions below:
- use negative memory addresses, starting at -1, for the stack;
- use memory address 7 to store the memory address of the top of the stack;
- use memory addresses 8 and higher for the heap (i.e., results of computations).
All the functions you define should be included in the file hw3/machine.py.
1. Implement a function increment(addr) that takes a single integer argument addr. The function should return a list of instructions (in which each instruction is represented as a string in the list). These instructions should correspond to a machine language program that increments by 1 the integer stored in the memory location addr and cleans up any memory addresses it used in the process by setting them back to 0.
2. Implement a function decrement(addr) that takes a single integer argument addr. The function should return a sequence of instructions (represented as a Python list of strings). These instructions should correspond to a machine language program that decrements by 1 the integer stored in the memory location addr and cleans up any memory addresses it used in the process by setting them back to 0.
3. Implement a function call(name) that takes a single argument: name is a string corresponding to the name of a procedure. The function should return a sequence of instructions that performs the following operations:
  - update the integer stored in the memory address that contains the address of the top of the call stack (i.e., decrement it, since the stack is in the part of memory indexed using negative integers);
  - store the current program location at the top of the call stack;
  - increment the value at the top of the call stack so it refers to the location in the program to which control should return after the end of the procedure being invoked;
  - goto the procedure body that corresponds to the procedure name supplied;
  - update the integer stored in the memory address that contains the address of the top of the call stack (i.e., increment it, since the stack is in the part of memory indexed using negative integers).
  The third step above is crucial: failing to specify the correct return location in the program can lead to an infinite loop.
4. Implement a function procedure(name, body) that takes two arguments: name is a string corresponding to the name of a procedure, and body is a sequence of machine language instructions (represented as a Python list of strings). The function should return a sequence of instructions that includes:
  - a goto instruction so that the procedure body is skipped by default if instructions are being executed sequentially;
  - a label identifying the start of the procedure body;
  - the procedure body
  - instructions to jump back to the machine language program location that invoked the procedure;
  - a label identifying the end of the procedure body.
In this problem you will implement a compiler; your solutions for this problem should appear in the file hw3/compile.py. The source language of the compiler will be the language for which you implemented an interpreter in Problem #1. The target language will be the machine language with which you worked in Problem #2.
1. Implement a function compileTerm(env, t, heap) that takes three arguments: env is a mapping from variables to memory addresses, t is a term parse tree, and heap is the memory address of the current top of the heap. The function should return a tuple (insts, addr, heap) in which insts is a sequence of machine language instructions (represented as a Python list of strings) that perform the computation represented by the parse tree, addr is the address of the result, and heap is an integer representing the memory of the top of the heap after the computation is performed.
2. Implement a function compileFormula(env, f, heap). The requirements for this function are the same as those for compileTerm(env, t, heap), except that it must handle formula parse trees.
3. Implement a function compileProgram(env, s, heap) that takes three arguments: env is a mapping from variables to memory addresses, s is a program parse tree, and heap is the memory address of the current top of the heap. The function should return a tuple (env, insts, heap) in which env is an updated environment, insts is a sequence of machine language instructions (represented as a Python list of strings) that perform the computation represented by the parse tree, and heap is an integer representing the memory of the top of the heap after the computation is performed.
4. Implement a function compile(s) that takes a single string s that is a concrete syntax representation of a program in the source programming language and returns its compiled form: a sequence of instructions in the target machine language.

[link] 5.7. Common compiler optimizations and register allocation

Optimization algorithms are typically transformations of the abstract syntax of a language. In a compiler, this abstract syntax is usually that of a particular intermediate representation used within the compiler, but optimizations might occur at any stage in the compilation process.

source
language
abstract
syntax

⇒
⇐

optimization
algorithm #1

⇓

compilation
algorithm A

⇓

intermediate
representation
(IR) #1

⇒
⇐

optimization
algorithm #2

⇓

compilation
algoritm B

⇓

intermediate
representation
(IR) #2

⇒
⇐

optimization
algorithm #3

⇓

compilation
algorithm C

⇓

target
language
abstract
syntax

⇒
⇐

optimization
algorithm #4

[link]

Definition: Loop unrolling is an optimization that involves duplicating loop bodies a constant number of times, typically to reduce the number of conditional branch instructions that must occur if the number of loop iterations is known at compile time. Depending on the particular language and compiler, the loop unrolling optimization could be applied to high-level language abstract syntax trees, intermediate representation abstract syntax trees, and sometimes to machine language programs.

[link]

Example: Suppose we are compiling a program written in a high-level language that looks like the following.

i = 0;
while (i < 1000000) {
i++;
}

One option is to compile the above program into the following sequence of machine language instructions (assume that the variable i corresponds to memory address 123).

label whileLoop
< increment integer in address 123 >
< compare integer in address 123 to integer 1000000 and store result in address 200 >
branch whileLoop 200

If we execute the above machine language program, we will execute 1,000,000 comparison operations. We can reduce the cost of the comparison operations in this case by duplicating the body of the loop some number of times.

i = 0;
while (i < 1000000) {
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
  i++;
}

Notice that in the above program, we have reduced the number of comparison operations by a factor of 10. However, to achieve this we had to increase the size of the machine language program by a factor of 10.

label whileLoop
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< increment integer in address 123 >
< compare integer in address 123 to integer 1000000 and store result in address 200 >
branch whileLoop 200

[link]

Definition: Constant folding involves replacing subtrees (typically expressions) within an abstract syntax tree with values (the value should normally correspond to the result of evaluating the expression).

[link]

Example: Suppose we have the following abstract syntax tree.

{'Print': [
  {'Plus': [
    {'Mult':[{'Number':[3]}, {'Number':[4]}]},
    {'Variable':['x']}
  ]}
]}

An optimization algorithm that performs constant folding might traverse the above tree, recognize that the subtree {'Mult':[{'Number':[3]}, {'Number':[4]}]} can be evaluated in a finite amount of time, and then replace it with the value to which it evaluates. The overall tree would then look as follows.

{'Print': [
  {'Plus': [
    {'Number':[12]},
    {'Variable':['x']}
  ]}
]}

[link]

Definition: An abstract syntax tree is in single static assignment or SSA form if every variable appears exactly once on the left-hand side of an assignment statement.

An abstract syntax tree can be converted into SSA form by converting every variable on the left-hand side of an assignment statement so that it is unique, and then replacing all occurrences of that variable with the version of that variable that appears in the last assignment above the occurrence.

[link]

Example: Suppose we have the following program in a high-level programming language.

a := 2;
b := 5;
print a + b;
f := true and false;
print f;
a := 7;
g := true;
h := false;
b := 0;
print b;
print h;
print g;
print a;

If we converted the abstract syntax tree that represents the above program into SSA form, we might obtain an abstract syntax tree that represents the following program.

a_0 := 2;
b_0 := 5;
print a_0 + b_0;
f_0 := true and false;
print f_0;
a_1 := 7;
g_0 := true;
h_0 := false;
b_1 := 9;
print b_1;
print h_0;
print g_0;
print a_1;

Notice that each variable in the original program has been associated with an index so that no variable in the new program ever appears on the left-hand of an assignment more than once.

[link]

Definition: Register allocation involves assigning a CPU register to each variable within the abstract syntax tree representing a program (typically an intermediate representation in single static assignment form). This usually requires building an interference graph of the variables within the program, and then solving the graph coloring problem where each register represents a color.

[link]

Example: When compiling list or set comprehensions in languages such as Haskell and Python, the operational semantics of comprehensions allows for parallelization of the individual steps in a way that iterative constructs do not allow. This is because in a comprehension, each "iteration" or "step" of the comprehension is evaluated under the same environment. In an imperative looping statement, the modified environment emitted by one iteration is fed into the execution of the next iteration, so the effects of each iteration accumulate over time. This requires executing each iteration one after the other, and keeping track of the changing environment.

Thus, it is possible to compile a program so that the individual "iterations" or "steps" of a comprehension are evaluated on different processors, or even different computers, and then reassembled into the final results. This is not necessarily possible with a looping statement.

You can see this in the two different operational semantics rules below (the first for an imperative looping statement, the second for a comprehension expression).

[Stmt-For]

Σ₀ ⊎ {x ↦ 1}, p ⇓ Σ₁ Σ₁ ⊎ {x ↦ 2}, p ⇓ Σ₂ Σ₂ ⊎ {x ↦ 3}, p ⇓ Σ₃

Σ₀, for x in {1,2,3}: p ⇓ Σ₃

[Exp-Comprehension]

Σ ⊎ {x ↦ 1}, p ⇓ v₁ Σ ⊎ {x ↦ 2}, p ⇓ v₂ Σ ⊎ {x ↦ 3}, p ⇓ v₃

Σ, [ e for x in {1,2,3} ] ⇓ [v₁, v₂, v₃]

The following diagram illustrates how the various optimization and compilation algorithms described in this section might fit together within the architecture of a compiler.

high-level
source
language
abstract
syntax

⇒
⇐

loop
unrolling

⇓

SSA conversion

⇓

IR in SSA form

⇒
⇐

constant
folding

⇓

compilation

⇓

low-level IR

⇒
⇐

⇓

instruction
selection

⇓

low-level
target
machine
language
abstract
syntax

[link] 5.8. Correctness of compilation algorithms

Compilers represent a unique kind of software application because their implementation can affect the quality of all the software applications that they compile. This means that compiler flaws can be particularly expensive; as a result, there may be many practical justifications for investing heavily in verifying that a compiler, or components thereof, behave correctly. However, to determine whether a compiler behaves correctly, it is first necessary to define what constitutes correct behavior.

One way to define correct behavior is to appeal to the operational semantics of the source and target languages.

source language
abstract syntax

⇒

compilation
algorithm

⇒

target language
abstract syntax

⇓

source language
operational semantics
(interpreter)

target language
operational semantics
(simulator)

⇓

source language
abstract syntax
for values/outputs

⇒

simple
conversion
algorithm for
values/outputs

⇒

target language
abstract syntax
for values/outputs

[link]

Definition: Suppose we are given a source and target language (including the definitions of the abstract syntax and the operational semantics for each language, as well as corresponding interpretation/simulation functions interpret() and simulate()), a simple conversion algorithm convert() between the sets of values and outputs in each language, and a compilation algorithm compile().

The compilation algorithm compile() is correct (or behaves correctly) if for all possible input programs p in the source language, the following equation is true:

simulate(compile(p))

convert(interpret(p))

Notice that the above definition of correctness is functional but perhaps not practical. If the operational semantics of one or both languages does not capture running time, security (e.g., stack overflows), or other concerns, the definition may be inadequate. We will discuss these issues in more detail in a subsequent section.

If the number of possible programs in the source language is infinite, it is impossible to check that the equation in the above definition holds for all possible programs through testing alone, though testing can provide some confidence. Nevertheless, it may be possible to mathematically prove that the equation holds without doing any testing at all.

Software testing and verification is a broad area of research and practice that spans multiple (sometimes disparate) disciplines and communities. In this subsection, we will focus on three specific categories of testing and verification that are useful for determining the correctness of compiler implementations. These three categories represent only a small (but important) subset of the approaches currently being developed and used.

[link]

Definition: There are at least three possible approaches to verifying that a compiler's behavior conforms to its specification. These include:

suites of individual test cases;
bounded exhaustive testing;
formal (usually inductive) proof of correctness over all inputs.

[link]

Example: Bounded exhaustive testing involves defining a metric on inputs (e.g., a mapping from inputs to an integer), and then testing all inputs for which the matric is below a certain integer bound. For example, let us consider a simple language for formulas:

formula

::=

true | false | not formula| formula and formula | formula or formula

We can define a metric on the set of abstract syntax trees that conform to the above definition: the depth of the tree.

Node = dict
Leaf = str

def metric(f):
    if type(f) == Leaf:
        return 1
    if type(f) == Node:
        for label in f:
            return 1 + max([metric(child) for child in f[label]])

We can then generate all trees whose metric falls within a specified bound.

def formulas(n):
    if n <= 0:
        []
    elif n == 1:
        return ['True', 'False']
    else:
        fs = formulas(n-1)
        fsN = []
        fsN += [{'Not':[f]} for f in fs]
        fsN += [{'And':[f1,f2]} for f1 in fs for f2 in fs]
        fsN += [{'Or':[f1,f2]} for f1 in fs for f2 in fs]
        return fs + fsN

It is then possible to use the above code to generate test cases automatically, and to check the equation specifying compiler correctness against each test case.

[link] Midterm. Programming Languages, Interpreters, and Compilers

This material is no longer available.

[link] 6. Static Analysis and Abstract Interpretation

Most compilers and interpreters perform basic error checking and static analysis before running or compiling a program. This serves a variety of purposes, some of which include:

catching errors in the program before incurring the cost (in time, power, etc.) or actually running it;
returning more detailed and user-friendly error messages;
determining information about the program that is required for the optimization and compilation algorithms (e.g., detecting opportunities for constant folding);
improving performance of interpreted or compiled programs by eliminating run-time error checking;
eliminating invalid test cases when doing bounded exhaustive testing.

Static analysis is related to a broader concept: abstract interpretation. It is possible to analyze programs along a variety of other dimensions that may of interest, including, for example, the program's range of outputs, the program's running time, the monetary cost of running the program, the program's power consumption, and so on.

Both static analysis and abstract interpretation algorithms are conventionally expected to terminate in a finite, and typically relatively fast, amount of time (e.g., it should be possible to run a new static analysis or abstract interpretation each and every time the program is compiled, even during an interactive programming and/or testing session).

In this section we present a notation for defining static analysis and abstract interpretation algorithms. This notation is very similar to the one we introduced for defining the operational semantics of programming languages.

source
language
concrete
syntax

target
language
abstract
syntax

⇓

⇑

parser

⇒

source
language
abstract
syntax

⇒

static
analysis
algorithms

⇒

abstract
syntax
or IR

⇒

compilation
algorithms

⇓

syntax
errors

other errors
(unbound variables,
type errors, etc.)

[link] 6.1. Monomorphic type systems and type checking

A type checking algorithm is a specific kind of static analysis algorithm. It represents a "low-resolution" evaluation of the abstract syntax tree of a program to determine the type that can be assigned to every subtree in that program.

[link]

Example: Suppose we are working with a programming language with the following abstract syntax.

expression

::=

true | false | number

not expression

expression and expression

expression or expression

expression + expression

The abstrac syntax for types is defined as follows:

type

::=

integer

boolean

We can define a set of type inference rules for this language. These rules constitute the type system for the language.

[Formula-True]

⊢ true: boolean

[Formula-False]

⊢ false: boolean

[Formula-Not]

⊢ e: boolean

⊢ not e: boolean

[Formula-And]

⊢ e₁: boolean ⊢ e₂: boolean

⊢ e₁ and e₂: boolean

[Formula-Or]

⊢ e₁: boolean ⊢ e₂: boolean

⊢ e₁ or e₂: boolean

[Term-Number]

⊢ number: integer

[Term-Plus]

⊢ e₁: integer ⊢ e₂: integer

⊢ e₁ + e₂: integer

We can convert the above inference rules into a type checking algorithm; the process is almost identical to the process used for converting an operational semantics into an interpreter implementation.

def typeCheck(e):
    if type(e) == Leaf:
        if e == 'True' or e == 'False':
            return 'boolean'
    if type(e) == Node:
        for label in e:
            if label == 'Number':
                 return 'integer'
            if label == 'Not':
                  [e1] = e[label]
                  t1 = typeCheck(e1)
                  if t1 != 'boolean':
                     print("not must be applied to a boolean")
                     return None
                  return 'boolean'

            # ...

            if label == 'Plus':
                [e1,e2] = e[label]
                t1 = typeCheck(e1)
                t2 = typeCheck(e2)
                if t1 != 'integer' or t2 != 'integer':
                    print("+ requires integer arguments")
                    return None
                return 'integer'

[link]

Example: Suppose we are working with a programming language with the following abstract syntax.

program

::=

variable := expression ; program

expression

::=

variable

constant

not expression

expression and expression

expression or expression

expression * expression

expression + expression

constant

::=

true | false | number

number

::=

any valid integer

variable

::=

any valid variable

type

::=

integer

boolean

void

We can define a set of type inference rules for this language. These rules constitute the type system for the language.

[Statement-Assign]

Γ ⊎ {x ↦ τ} ⊢ p: void Γ ⊢ e: τ

Γ ⊢ x := e ; p : void

[Statement-End]

Γ ⊢ end: void

[Variable]

Γ(x) = τ

Γ ⊢ x: τ

[Formula-True]

Γ ⊢ true: boolean

[Formula-False]

Γ ⊢ false: boolean

[Formula-Not]

Γ ⊢ e: boolean

Γ ⊢ not e: boolean

[Formula-And]

Γ ⊢ e₁: boolean Γ ⊢ e₂: boolean

Γ ⊢ e₁ and e₂: boolean

[Formula-Or]

Γ ⊢ e₁: boolean Γ ⊢ e₂: boolean

Γ ⊢ e₁ or e₂: boolean

[Term-Number]

Γ ⊢ number: integer

[Term-Plus]

Γ ⊢ e₁: integer Γ ⊢ e₂: integer

Γ ⊢ e₁ + e₂: integer

[Term-Mult]

Γ ⊢ e₁: integer Γ ⊢ e₂: integer

Γ ⊢ e₁ * e₂: integer

[link]

Example: Suppose we are working with a programming language that supports the definition of functions that take a single integer argument.

program p

::=

function f ( int x ) { return e ; } p

x := e ; p

expression e

::=

x | n | e + e | f ( e )

number n

::=

any valid integer

variable f, x

::=

any valid variable

type τ

::=

int

int → int

void

We can define a set of type inference rules for this language. These rules constitute the type system for the language.

[Statement-Assign]

Γ ⊎ {x ↦ int} ⊢ p: void Γ ⊢ e: int

Γ ⊢ x := e ; p : void

[Statement-Function]

Γ ⊎ {f ↦ int → int} ⊢ p: void Γ ⊎ {x ↦ int} ⊢ e: int

Γ ⊢ function f ( int x ) { return e ; } p : void

[Statement-End]

Γ ⊢ end: void

[Term-Variable]

Γ(x) = int

Γ ⊢ x: int

[Term-Number]

Γ ⊢ n: int

[Term-Plus]

Γ ⊢ e₁: int Γ ⊢ e₂: int

Γ ⊢ e₁ + e₂: int

[Term-Apply]

Γ(f) = int → int Γ ⊢ e: int

Γ ⊢ f ( e ): int

Before implementing a type checking algorithm, suppose we want to implement an interpreter for this language. Below is an example of how it might be implemented.

def evaluate(env, e):
   if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Number':
                n = children[0]
                return n
            elif label == 'Plus':
                [e1,e2] = children
                return evaluate(env, e1) + evaluate(env, e2)
            elif label == 'Variable':
                x = children[0]
                value = env[x]
                return value
            elif label == 'Apply':
                 [f, eArg] = children
                 f = f['Variable'][0] # Unpack.
                 x = env[f]['Function'][2]['Variable'][0]
                 eBody = env[f]['Function'][3] # Body.
                 vArg = evaluate(env, eArg)
                 envF = env.copy()
                 envF[x] = vArg
                 vResult = evaluate(envF, eBody)
                 return vResult

def execute(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return (env, [])
    elif type(s) == Node:
        for label in s:
            if label == 'Assign':
                [var, e, p] = s[label]
                x = var['Variable'][0]
                env[x] = evaluate(env, e)
                (env, o) = execute(env, p)
                return (env, o)
            if label == 'Function':
                [f, ty, x, e, p] = s[label]
                f = f['Variable'][0] # Unpack.
                env[f] = s
                (env, o) = execute(env, p)
                return (env, o)

A type checking algorithm implementation is presented below.

def tyExpr(env, e):
   if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Number':
                return 'TyInt'
            elif label == 'Variable':
                x = children[0]
                return env[x]
            elif label == 'Apply':
                 [f, eArg] = children
                 f = f['Variable'][0] # Unpack.
                 tyArg = tyExpr(env, eArg)
                 tyFunArg = env[f]['Arrow'][0]
                 if tyArg == tyFunArg:
                     return env[f]['Arrow'][1]

def tyProg(env, s):
   if type(s) == Leaf:
        if s == 'End':
            return 'Void'
   elif type(s) == Node:
        for label in s:
            if label == 'Assign':
                [x,e,p] = s[label]
                x = x['Variable'][0] # Unpack.
                tExpr = tyExpr(env, e)
                env[x] = tExpr
                tProg = tyProg(env, p)
                if tExpr == 'TyInt' and tProg == 'Void':
                    return 'Void'
            if label == 'Function':
                [f, tyArg, x, e, p] = s[label]
                name = f['Variable'][0]
                x = x['Variable'][0]
                envF = env.copy()
                envF[x] = 'TyInt'
                tBody = tyExpr(envF, e)
                env[name] = {'Arrow':[tyArg,tBody]}
                tProg = tyProg(env, p)
                return tProg

[link]

Example: Suppose we are working with a programming language that supports the definition of functions that take a single string argument. The type system of this language supports explicit tracking of the sizes of strings.

program p

::=

function f ( string[k] x ) { return e ; } p

x := e ; p

expression e

::=

x | s | e + e | f ( e )

string s

::=

any valid string literal

variable f, x

::=

any valid variable

type τ

::=

string[k]

string[k] → string[k]

void

type size k

::=

any valid integer

k + k

We can define a set of type inference rules for this language. These rules constitute the type system for the language.

[Statement-Assign]

Γ ⊎ {x ↦ string[k]} ⊢ p: void Γ ⊢ e: string[k]

Γ ⊢ x := e ; p : void

[Statement-Function]

Γ ⊎ {f ↦ string[k₁] → string[k₂]} ⊢ p: void Γ ⊎ {x ↦ string[k₁]} ⊢ e: string[k₂]

Γ ⊢ function f ( string[k₁] x ) { return e ; } p : void

[Statement-End]

Γ ⊢ end: void

[Term-Variable]

Γ(x) = string[k]

Γ ⊢ x: string[k]

[Term-Number]

|s| = k

Γ ⊢ s: string[k]

[Term-Concat]

Γ ⊢ e₁: string[k₁] Γ ⊢ e₂: string[k₂]

Γ ⊢ e₁ + e₂: string[k₁+k₂]

[Term-Apply]

Γ(f) = string[k₁] → string[k₂] Γ ⊢ e: string[k₁]

Γ ⊢ f ( e ): string[k₂]

[link] 6.2. Abstract interpretation

An abstract interpretation algorithm is a specific kind of interpreter for a programming language: it must always terminate (and, typically, it must run quickly), and it typically provides information about the given program along some dimension of interest (i.e., not the value(s) it produces).

[link]

Example: Suppose we are working with a simple general-purpose programming language with the following abstract syntax.

program p

::=

print e ; p

for x to n { p } p

procedure x { p } p

call x ; p

expression e

::=

n | e * e

number n

::=

any valid integer

variable x

::=

any valid variable

We want to define an abstract interpretation over the above language to a language that describes running times of programs. The language for running times has the following syntax:

running time t

::=

n | t + t | t * t | ( t )

We can define the following abstract interpretation of the language:

[Expression-Number]

⊢ n : 1

[Expression-Mult]

⊢ e₁: t₁ ⊢ e₁: t₂

⊢ e₁ * e₂: t₁ + t₂

[Statement-Print]

Γ ⊢ p: t₁ ⊢ e: t₂

Γ ⊢ print e ; p : t₁ + t₂

[Statement-For]

Γ ⊢ p₁: t₁ Γ ⊢ p₂: t₂

Γ ⊢ for x to n { p₁ } p₂ : (n * t₁) + t₂

[Statement-Procedure]

Γ ⊢ p₁: t₁ Γ ⊎ {x ↦ t₁} ⊢ p₂: t₂

Γ ⊢ procedure { p₁ } p₂ : t₂

[Statement-Call]

Γ(x) = t₂ Γ ⊢ p: t₁

Γ ⊢ call x ; p : t₁ + t₂

[Statement-End]

Γ ⊢ end : 0

We can implement the above inference rules as an abstract interpretation algorithm.

def timeExpr(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Number':
                return 1
            elif label == 'Mult':
                [e1,e2] = children
                return timeExpr(e1) + timeExpr(e2)

def timeProg(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return 0
    elif type(s) == Node:
        for label in s:
            if label == 'Print':
                [e,p] = s[label]
                return timeExpr(e) + timeProg(env, p)
            if label == 'For':
                [var, num, p1, p2] = s[label]
                n = num['Number'][0]
                return n * timeProg(env, p1) + timeProg(env, p2)

                # ...

[link]

Example: Suppose we are working with a simple programming language for reserving cloud computing resources with the following abstract syntax.

program p

::=

reserve d ; p

for x to n { p } p

procedure x { p } p

call x ; p

duration d

::=

n minutes

n hours

n days

number n

::=

any valid integer

fixed-point real number r

::=

any valid real number

variable x

::=

any valid variable

We want to define an abstract interpretation over the above language to a language that describes the monetary (i.e., dollar) cost of running a program. The language for costs has the following syntax:

cost c

::=

$r | c + c | n * c | ( c )

Suppose that a single reserve d ; statement reserves a distinct computing resources for the duration specified, and that the cost of any resource is always $0.60 per hour. We can define the following abstract interpretation of the language:

[Duration-Minutes]

⊢ n minutes : n ⋅ $0.01

[Duration-Hours]

⊢ n hours : n ⋅ $0.60

[Duration-Days]

⊢ n days : n ⋅ $14.40

[Statement-Reserve]

⊢ d: c₁ Γ ⊢ p: c₂

Γ ⊢ reserve d ; p : c₁ + c₂

[Statement-For]

Γ ⊢ p₁: c₁ Γ ⊢ p₂: c₂

Γ ⊢ for x to n { p₁ } p₂ : (n * c₁) + c₂

[Statement-Procedure]

Γ ⊢ p₁: c₁ Γ ⊎ {x ↦ c₁} ⊢ p₂: c₂

Γ ⊢ procedure x { p₁ } p₂ : c₂

[Statement-Call]

Γ(x) = c₂ Γ ⊢ p: c₁

Γ ⊢ call x ; p : c₁ + c₂

[Statement-End]

Γ ⊢ end : $0

We can implement the above inference rules as an abstract interpretation algorithm.

def costDuration(e):
    if type(e) == Node:
        for label in e:
            children = e[label]
            if label == 'Minutes':
                [num] = children
                n = num['Number'][0]
                return n * 0.01
            if label == 'Hours':
                [num] = children
                n = num['Number'][0]
                return n * 0.60
            if label == 'Days':
                [num] = children
                n = num['Number'][0]
                return n * 14.40

def costProg(env, s):
    if type(s) == Leaf:
        if s == 'End':
            return 0
    elif type(s) == Node:
        for label in s:
            if label == 'Reserve':
                [d,p] = s[label]
                 return costDuration(d) + costProg(env, p)
            if label == 'For':
                [var, num, p1, p2] = s[label]
                n = num['Number'][0]
                return n * costProg(env, p1) + costProg(env, p2)

            # ...

[link] 7. Declarative (and Functional) Programming Language Paradigms

In this section we introduce the notion of a programming language paradigm, define several major programming language paradigms, and discuss which paradigms are supported by some widely used modern programming languages (in practice, many languages support more than one paradigm). We assume the reader is already at least somewhat familiar with the imperative and object-oriented programming paradigms; thus, we study in detail the declarative programming paradigm (including an important subset, functional programming).

[link] 7.1. Programming language paradigms

A programming language's programming paradigm is the collection of abstractions, tools, and techniques it makes available to a programmer. While there is some disagreement about what exactly constitutes a programming paradigm and how programming paradigms can be distinguished and categorized, there is wide agreement that at least a few distinct, widely-employed programming paradigms exist. We can enumerate these paradigms, and we can list the programming language features (i.e., abstractions) that can be used to identify and distinguish between them:

imperative programming:
- mutable state and side effects:
  - indexed memory regions;
  - named, mutable variables;
  - ability to interact with other "external" devices (video card, network, etc.);
- sequential execution of statements:
  - one operation per time step (control flow);
  - branching and loops (direction of control flow);
- procedural programming:
  - definition and invocation of named procedures;
- parallelism achieved with forking, threads, and message passing;
object-oriented programming:
- data and methods organized into objects;
- multiple instances of objects interacting via method calls:
  - event handling;
- enforcement of encapsulation and modularity;
declarative programming:
- static, immutable descriptions of data, models, or problems:
  - user-defined data types:
    - algebraic data types;
  - type checking/inference.
- query sublanguage for examining/exploring static descriptions;
- functional programming:
  - referential transparency;
  - support for pattern matching/unification;
  - functions as values (first-class functions and higher-order functions).

Mathematically, it is possible to identify which paradigms are supported by a programming language by examining its operational semantics. We have already seen many examples of an operational semantics for an imperative language (such as this one and this one).

As a result of both historical trends and of practical necessity, most widely-used programming languages support multiple programming paradigms:

	imperative	procedural	object- oriented	declarative	functional	static type checking	garbage collection	side effects
machine languages	some support (indexed memory, branching)	no support	no support	no support	no support	none	no	yes
FORTRAN (initial release)	full support	no support	no support	no support	no support	very limited	no	yes
C	full support	full support	no support	no support	some support (function pointers)	very limited	no	yes
C++	full support	full support	full support	class declarations, some type checking	function pointers and some extensions (e.g., STL)	limited	no	yes
Java	full support	full support	full support	type system, class declarations, abstract classes, interfaces, type checking, and type inference	generics, interfaces, objects as wrappers, anonymous classes	extensive	yes	yes
PHP	full support	full support	full support	no support	some support	none	yes (≥ 5.3)	yes
Python	full support	full support	full support	no support	full support	none	yes	yes
JavaScript	full support	full support	full support	no support	full support	none	yes	yes
Haskell	meta-support (monads)	meta-support (functions)	no support	full support	full support	extensive	yes	some (IO monads)
SQL	some support (insertion/ deletion; sequences of statements in versions like MS-SQL)	some support (procedures in versions like MS-SQL)	no support	full support (definition of schemas and tables; ability to evaluate queries over tables)	some support (custom map and aggregation procedures in MS-SQL)	some	yes	some
Amazon Web Services APIs	some support	no support	no support	some support	some support	none	n/a	yes

While most languagues support many paradigms in practice, there is often a historically predominant paradigm that distinguished that programming language at the time it was introduced, that is supported natively by the programming language (i.e., without special extensions, libraries, or idioms), and that is employed by the communities of programmers who use that language. In the above, the paradigm most closely identified with each language is indicated with a dark green outline.

A note about the above chart: when it comes to domain-specific languages that have a very narrow purpose (such as the Amazon Web Services API) and are not meant to be general-purpose programming languages, support of a paradigm does not imply that the language supports that paradigm in general; instead, it means the language supports that paradigm insofar as it applies to that particular application domain.

[link] 7.2. Substitution and unification

In this course, we will focus on the declarative (and functional) programming language Haskell. The Haskell syntax and operational semantics natively support unification in multiple ways (specifically, pattern matching within expressions and polymorphism within the type system). Unification is a general concept that is applied in some form in all areas of mathematics and computer science. In this subsection, we will define unification.

[link]

Definition: Given a set of variables X and an abstract syntax A (i.e., a set of abstract syntax trees), a substitution σ is a mapping from variables names in X to abstract syntax trees in A. Given an abstract syntax tree a ∈ A, we denote the result of substituting all the variables in a according to σ using the notation σ(a).

You'll notice that we use the same mathematical object to represent both environments (a.k.a., contexts) and substitutions. Thus, we will use the same notation for representing manipulations involving substitutions.

In this course, we will usually denote a substitution using the symbol σ.
Given a variable name x ∈ X, we denote by σ(x) the abstract syntax tree that the environment σ assigns to x.
An empty substitution (one that does not map any variable names to any abstract syntax trees) is denoted using σ₀.
We use the notation dom(σ) to represent the set of variables that σ maps to abstract syntax trees.
Given two substitutions σ₁ and σ₂ such that dom(σ₁) ∩ dom(σ₂) = ∅ (i.e., the two substitutions do not overlap or conflict on any variables), we denote by σ₁ ⊎ σ₂ the substitution that can be obtained by combining the two individual substitutions. Thus, dom(σ₁ ⊎ σ₂) = dom(σ₁) ∪ dom(σ₂).

[link]

Example: Suppose we have the following abstract syntax (here, node is an infix operator for building tree nodes):

tree t

::=

leaf | x | t node t

variable x

::=

any valid variable

Suppose we are given the following substitution, which maps the variable u to the tree leaf and the variable v to the tree leaf node leaf:

{ u ↦ leaf, v ↦ leaf node leaf }

Applying the above substitution to a tree that does not contain any variable nodes u or v has no effect:

σ(leaf)

leaf

Applying the substitution to a tree containing the variables u and/or v would result in those variables being replaced by the corresponding tree inside σ (we add parentheses to the result for legibility):

σ(u node v)

(leaf) node (leaf node leaf)

[link]

Definition: Given an abstract syntax A and two abstract syntax trees a, b ∈ A, we say that a substitution σ unifies a and b if the following equation holds:

σ(a)

σ(b)

That is, applying the substitution to both trees produces exactly the same tree. The process of calculating the smallest σ that guarantees the above equality is called unification.

For those familiar with the concept of a magma, unification corresponds to the process of solving equations on magmas (or, more generally, any algebraic structure without any reduction or cancellation laws).

[link]

Fact: Suppose we are given an abstract syntax A and two abstract syntax trees a, b ∈ A that contain no variables. If a = b, then it is true that for all σ (including σ₀):

σ(a)

σ(b)

If a ≠ b, then it is true that for all σ:

σ(a)

≠

σ(b)

[link]

Example: Suppose we have the following abstract syntax:

::=

object | x | m ⊕ m | m ⊗ m | ( m )

variable x

::=

any valid variable

Unify the following two abstract syntax trees, if possible (i.e., solve the following equation):

(z ⊗ object) ⊕ x

(object ⊗ y) ⊕ (object ⊗ object)

We can proceed by breaking the problem down. Since the root nodes of both abstract syntax trees are ⊕, we can turn the above into two equations:

z ⊗ object

object ⊗ y

object ⊗ object

The first equation can be broken down further:

object

object ⊗ object

At this point, we can build up the substitution using the above information:

{ x ↦ object ⊗ object, y ↦ object, z ↦ object }

Thus, σ as defined above unifies the two sides of the equation; the equation has a solution.

The equality algorithm on abstract syntax trees is a degenerate version of a unification algorithm in which substitutions must always remain empty. This algorithm is implemented natively within Python (we have been using it every time we checked whether two trees represented as nested dictionaries are equivalent using the == operator), and Haskell supports the automatic derivation of this algorithm for any data type (using the deriving Eq directive).

[link]

Algorithm (equality): The following algorithm can determine whether two trees are equivalent.

equal(a, b): two abstract syntax trees a and b
1. if both a and b are leaf nodes and are equivalent
  1. return true
2. if both a and b have the same label and the same number of children
  1. in order from left to right, check each pair of corresponding children of a and b for equality
  2. return true if all the subtrees are equivalent
  3. return false otherwise

[link]

Algorithm (pattern matching unification): Whether unification can be computed efficiently, or at all, depends on the subset of the abstract syntax from which inputs are drawn (i.e., it depends on what restrictions are placed on the abstract syntax trees that may need to be unified). The following algorithm, which we will call pattern matching unification, is guaranteed to terminate quickly (i.e., polynomial time) as long as at least one side of the equation contains no variables. It is guaranteed to quickly find a solution if it exists, as long as all variables occur exactly once.

unify(a, b): two abstract syntax trees a and b
1. if both a and b are leaf nodes and are equivalent
  1. return the empty substitution σ₀
2. if a is a variable node representing a variable x
  1. return the substitution {x ↦ b}
3. if b is a variable node representing a variable x
  1. return the substitution {x ↦ a}
4. if both a and b have the same label and the same number of children
  1. in order from left to right, unify each pair of corresponding children of a and b
  2. as long as they do not overlap on any variables, combine the substitutions obtained above
  3. return the combined substitution

[link]

Example: The Python operational semantics (and interpreter) employ a pattern matching unification algorithm in some cases. For example, it is possible to assign to arbitrarily deep tree of nested tuples.

>>> (a,(b,c),(d,(e,f),g)) = (1,(2,3),(4,(5,6),7))
>>> (a,b,c,d,e,f,g)
(1,2,3,4,5,6,7)

[link] 7.3. Declarative programming languages

Any programming language that allows users to provide a static, immutable description of data, a model, or a problem supports the declarative programming paradigm to at least some degree. For example, languages that allow users to specify types and relationships between those types (e.g., Haskell types, Java classes, and so on) support a declarative programming paradigm for types.

In this course we will focus on declarative languages that allow users to specify data types, relationships between data types, and pure mathematical functions over those data types. In particular, we will look at the Haskell programming language.

In order to define the operational semantics for a declarative language, we must slightly modify the way in which we define an operational semantics by extending the premises and conclusions with an additional parameter representing the data, model, or problem description. We will adopt the convention of calling this parameter M. Thus, an intepreter for a declarative programming language first builds data, a model, or a problem description by parsing and collecting into a single structure all the declarations in a program. It then allows the user to query that structure.

In the Haskell programming language, a module is a collection of declarations (that can occur in any order) of variables, functions, and types. For example:

module Example where

c = 5;

g (0) = 0;
g (x) = f(x) + f(x);

f (x) = 2 * x + 1;

A Haskell interpreter first assembles the module without performing any evaluation, and then allows users to make queries about that module. Queries take the form of expressions, and the response to a query is the evaluation of that expression with respect to the entire module (i.e., all the definitions in the module).

*> c + c
10
*> f(4)
9

Haskell natively supports lists and tuples, and the evaluation algorithm employs the pattern matching unification algorithm. Suppose we have the following module:

module AnotherExample where

f (-1) = 1;
f (0) = 0;
f (1) = -1;

(a,(b,(c,d),e)) = (1,(2,(3,4),5));

g([x,y,z]) = [z,y,x];

We can query the above module as follows:

*> f(-1)
1
*> d
4
*> g([1,2,3])
[3,2,1]

[link] 7.4. Algebraic data types

Some declarative languages natively support abstract mathematical, logical, and algebraic structures such as BNF notation, abstract syntaxes, logical formulas and predicates, and inference rules. Haskell natively supports user-defined algebraic data types. An algebraic data type is effectively a user-defined abstract syntax; the Haskell syntax for defining algebraic data types corresponds closely to BNF notation. In the module below, we define an algebraic data type for representing binary trees.

module Tree where

data Tree = Leaf | Node Tree Tree;

Most Haskell interpreters and compilers support infix notation in data type definitions as well.

data Tree = Leaf | Tree `Node` Tree;

Once the above module is parsed and built, the user is allowed to construct instances of the Tree algebraic data type. However, in order to actually display instances of the data type, it is necessary to ask the Haskell interpreter to derive a definition of a show function automatically from the definition of the algebraic data type (something that is done implicitly in, for example, Python). Haskell requires that this be requested explicitly because it also allows users to define their own variants of such functions.

module Tree where

data Tree =
    Leaf
  | Node Tree Tree
  deriving Show;

We can now try building trees using this algebraic data type.

*> Leaf -- We represented this using "Leaf" in Python.
Leaf

*> Node Leaf Leaf -- We represented this using {"Node":["Leaf", "Leaf"]} in Python.
Node Leaf Leaf

*> Leaf `Node` Leaf
Node Leaf Leaf

*> Leaf `Node` ((Leaf `Node` Leaf) `Node` (Leaf `Node` Leaf))
Node Leaf (Node (Node Leaf Leaf) (Node Leaf Leaf))

Haskell's native support for pattern matching unification makes it possible to write concise definitions for functions that are defined over trees.

-- Count the number of nodes in a tree that have exactly one non-leaf child.
-- The spaces between the ( ... ) are just for legibility.
nodesWithOneLeaf (Node Leaf Leaf) = 0;
nodesWithOneLeaf (Node Leaf t   ) = 1 + nodesWithOneLeaf(t);
nodesWithOneLeaf (Node t    Leaf) = 1 + nodesWithOneLeaf(t);
nodesWithOneLeaf (Node t1   t2  ) = nodesWithOneLeaf(t1) + nodesWithOneLeaf(t2);
nodesWithOneLeaf (Leaf          ) = 0;

Algebraic data types are very well-suited for represented abstract syntaxes of languages, and pattern matching unification makes it possible to define concise implementations of the operational semantics of languages (as well as any other transformations over abstract syntax trees, including optimization algorithms, static analysis algorithms, and compilation algorithms).

[link]

Example: Suppose we want to define an abstract syntax and simple interpreter for formulas. We can use the declarations below to do so.

module Formula where

data Formula =
    T
  | F
  | Not Formula
  | And Formula Formula
  | Or Formula Formula
  deriving Show;

vNot (T) = F;
vNot (F) = T;

vAnd (T, T) = T;
vAnd (T, F) = F;
vAnd (F, T) = F;
vAnd (F, F) = F;

vOr (T, T) = T;
vOr (T, F) = T;
vOr (F, T) = T;
vOr (F, F) = F;

eval (T        ) = T;
eval (F        ) = F;
eval (Not f    ) = vNot (eval f);
eval (And f1 f2) = vAnd (eval f1, eval f2);
eval (Or  f1 f2) = vOr (eval f1, eval f2);

Using wildcards (written using the underscore _) in patterns makes it possible to write even more concise definitions.

vAnd (T, T) = T;
vAnd _      = F;

vOr (F, F) = F;
vOr _      = T;

eval (Not f    ) = vNot (eval f);
eval (And f1 f2) = vAnd (eval f1, eval f2);
eval (Or  f1 f2) = vOr (eval f1, eval f2);
eval (f)         = f;

To check equality of algebraic data type instances, it is necessary to either implement or derive the equality function ==. To have the Haskell interpreter or compiler derive it automatically, you can use deriving Eq. These can be combined with other deriving directives, such as in the example below.

data Formula =
    T
  | F
  | Not Formula
  | And Formula Formula
  | Or Formula Formula
  deriving (Eq, Show);

[link] 7.5. Assignment #4: Declarative Programming Languages

In this assignment you will implement an interpreter using Python for a declarative language, and you will practice using the declarative programming language Haskell. You must submit four files:

hw4/parse.py (there is no need to modify this file);
hw4/interpret.py;
hw4/Tree.hs;
hw4/Term.hs.

Please follow the gsubmit directions and remember to put your files in the hw4 directory.

Your solutions to each of the problem parts below will be graded on their correctness, concision, and mathematical legibility. The different problems and problem parts rely on the lecture notes and on each other; carefully consider whether you can use functions from the lecture notes, or functions you define in one part within subsequent parts.

A testing script with several test cases is available for download: hw4-tests.py. You should be able to place it in the same directory with the other assignment files and run it. Feel free to modify or extend it as you see fit.

In this problem you will implement helper functions for performing substitution and unification. Your solutions should be included in the file hw4/interpret.py.
1. Implement a function subst(s, a) that takes two arguments: a substitution s (represented as a Python dictionary), and an abstract syntax tree a. The function should return a new abstract syntax tree in which every variable in the tree that is in the domain of s has been substituted according to the substitution s. You may assume that variables are always represented using a subtree of the form {"Variable":[ ... ]}, as in all previous examples and assignments.
2. Implement a function unify(a, b) that takes two arguments: two syntax trees a and b. The function should return the smallest substitution s that satisfies the following equation:
  
  subst(s, a) == subst(s, b)
  
  You should implement the pattern matching unification algorithm for this assignment.

In this problem, you will implement an interactive interpreter for a declarative language (a small, untyped version of a subset of Haskell). All the functions you define should be included in the file hw4/interpret.py. The abstract syntax for the language is presented below; a complete parser that parses the concrete syntax for this language into the abstract syntax below can be found in hw4/parse.py. Notice that in the concrete syntax, constructor names always begin with an uppercase letter, while variable names always begin with a lowercase letter.

declaration d

::=

f ( p ) = e ; d

pattern p

::=

c p p

( p )

expression e

::=

c e e

( e )

e + e

f ( e )

number n

::=

0|[1-9][0-9]*

constructor c

::=

[A-Z][A-Za-z]*

variable x, f

::=

[a-z][A-Za-z]*

The relevant portions of the operational semantics are provided below:

Implement a function build(m, d) that takes a finite map m (i.e., a Python dictionary mapping names to lists of (pattern, expression) tuples) and a declaration parse tree d. The function should return the finite map m that represents the module definition that is assembled according to the operational semantics presented below:

[Declaration-Function-First]

f ∉ dom(M₁) M₁ ⊎ {f ↦ {(p, e)}} d ⇓ M₂

M₁, f ( p ) = e ; d ⇓ M₂

[Declaration-Function-More]

f ∈ dom(M₁) M₁ ⊎ {f ↦ (M₁(f) ⊎ {(p, e)})} d ⇓ M₂

M₁, f ( p ) = e ; d ⇓ M₂

[Declaration-End]

M, end ⇓ M

The abstract syntax for values is defined as follows:

value v

::=

c v v

Implement a function evaluate(m, env, e) that takes a module m, an environment env, and an expression abstract syntax tree e as its three arguments. The function should return the value that corresponds to the evaluation of the abstract syntax tree e, as determined by the operational semantics presented below. Note that the [Expression-Apply] requires using a unification algorithm to obtain a substitution σ that unifies p and v₁.

[Expression-Constructor-Args]

M, Σ, e₁ ⇓ v₁ M, Σ, e₂ ⇓ v₂

M, Σ, c e₁ e₂ ⇓ c v₁ v₂

[Expression-Constructor-No-Args]

M, Σ, c ⇓ c

[Expression-Number]

M, Σ, n ⇓ n

[Expression-Variable]

Σ(x) = v

M, Σ, x ⇓ v

[Expression-Plus]

M, Σ, e₁ ⇓ n₁ M, Σ, e₂ ⇓ n₂

M, Σ, e₁ + e₂ ⇓ n₁ + n₂

[Expression-Apply]

M, Σ, e₁ ⇓ v₁ (p, e₂) ∈ M(f) σ(p) = σ(v₁) M, Σ ⊎ σ, e₂ ⇓ v₂

M, Σ , f ( e₁ ) ⇓ v₂

Once build() and evaluate() are implemented, you can use the interact() function in hw4/interpret.py to query modules.

In this problem you will practice using algebraic data types in the Haskell programming language. All your definitions should be included in the file hw4/Tree.hs. You will be working with the following data type definition, which has already been included in the file:

data Tree = Leaf | Twig | Branch Tree Tree Tree

You may not import any additional modules or libraries.
1. Modify the data type definition for Tree in Tree.hs so that the functions (==) and show are automatically generated for this data type.
2. Implement a function twigs :: Tree -> Integer that returns an integer representing the total number of Twig values in the input tree.
  
  *> twigs (Branch (Branch Twig Leaf Leaf) Twig Twig)
  3
3. Implement a function branches :: Tree -> Integer that returns an integer representing the total number of branches in the input tree.
  
  *> branches (Branch Leaf (Branch Twig Leaf Leaf) Leaf)
  2
4. Implement a function height :: Tree -> Integer that returns an integer representing the height of the input tree. Trees consisting of a Leaf are defined to have height 0, while trees consisting of a Twig are defined to have height 1.
  
  *> height (Branch (Branch Leaf Leaf Twig) (Branch Leaf Leaf Leaf) Leaf)
  3
5. A tree is perfect if all the leaves of the tree are at the same depth and it has no twigs. Define a function perfect :: Tree -> Bool that returns True if the input tree is perfect and False otherwise.
  
  *> perfect (Branch (Branch Leaf Leaf Leaf) Leaf Leaf)
  False
  *> perfect (Branch Twig Twig Twig)
  False
  *> perfect (Branch (Branch Leaf Leaf Leaf) (Branch Leaf Leaf Leaf) (Branch Leaf Leaf Leaf))
  True
6. A tree is degenerate if all the branches are part of a single path. Equivalently, a tree is degenerate if all nodes have at least two non-node children. Define a function degenerate :: Tree -> Bool that returns True if the tree supplied is degenerate, and False otherwise.
  
  *> degenerate (Branch (Branch Twig Leaf Leaf) Twig Leaf)
  True
  *> degenerate (Branch (Branch Leaf Leaf Leaf) Leaf (Branch Leaf Leaf Leaf))
  False
7. Define a tree infinite :: Tree that has no leaves and no twigs.
In this problem you will implement a very simple interpreter using the Haskell programming language. All your definitions should be included in the file hw4/Term.hs. Implement the evaluate :: Term -> Integer function so that it evaluates abstract syntax trees of the type Term to obtain values (i.e., integers).

*> evaluate (Plus (Number 1) (Number 2))
3
*> evaluate (Mult (Number 4) (Abs (Plus (Number (-1)) (Number (-2)))))
12

[link] 7.6. Type system of the typed declarative language Haskell

Haskell natively supports defining and using algebraic data types, tuples, lists, and functions. The Haskell type system provides a good way to organize and document these data structures, and governs how they can be combined. The abstract syntax for a portion of the Haskell type system is provided below.

Haskell type τ

::=

Bool

Int

Integer

Char

()

( τ , τ )

( τ , τ , τ )

⋮

( τ , ... , τ )

[ τ ]

τ -> τ

⋮

Haskell allows users to define their own algebraic data types as well as their own type synonyms. Every Haskell module has two distinct namespaces for identifiers, and these two namespaces are completely disjoint: there is a namespace for identifiers that correspond to variable, functions, and constructors, and there is a namespace for identifiers that correspond to types. The concrete syntax determines to which namespace a given variable belongs, and there is never any ambiguity in the concrete syntax in this regard. Note that this is unlike Python, where types can appear in any expression within a program. The following example illustrates the distinct namespaces for expressions and types:

module Example where

f :: Integer -> Integer
f (x) = x + x

data Tree = Leaf | Node Tree Tree

type AnotherNameForInteger = Integer

type AnotherNameForTree = Tree

isLeaf :: AnotherNameForTree -> Bool
isLeaf (Leaf ) = True
isLeaf (Node t1 t2) = False

Haskell lists have a list type in Haskell. For example, a list of integers has the type [Integer].

list :: [Integer]
list = [1,2,3]

Haskell lists are actually represented using an algebraic data type that is similar to the Tree data type that we have already seen. If we were to define our own algebraic data type for lists in Haskell, for example, we could do so as follows (the terms "nil" and "cons" are commonly used in the functional language community for the empty list and for the list node constructor, respectively):

data List = Nil | Cons Integer List

example :: List
example = Cons 1 (Cons 2 (Cons 3 Nil))

The built-in list type has its own special names for the two constructors: the empty list is denoted [], and the list node constructor is the operator (:), which can also be written as an infix operator :. The following three syntaxes are equivalent, and evaluate to the same list [1,2,3]:

example1 :: [Integer]
example1 = (:) 1 ((:) 2 ((:) 3 []))

example2 :: [Integer]
example2 =  1 : (2 : (3 : []))

example3 :: [Integer]
example3 =  [1,2,3]

There is also a syntax for defining lists that contain ranges of integers:

*> [1..4]
[1,2,3,4]

Haskell supports user-defined type synonyms. These are simply substituted and have no restrictive effects.

type Distance = Integer

There are some default type synonym definitions, such as String; strings in Haskell are just lists of characters:

type String = [Char]

example1 :: String
example1 = "abcdefg"

example2 :: String
example2 = [a, b, c, d, e, f, g]

Type synonyms can be assigned any valid type, no matter how complex it may be. In the below example, the type synonym GameStrategyAlgorithm is equivalent to the type [[(Integer, Integer)]] -> String (i.e., the type for functions that take lists of lists of integer pairs and return a string):

type GameState = [(Integer, Integer)]
type Game = [GameState]
type Strategy = String
type GameStrategyAlgorithm = Game -> Strategy

[link] 7.7. Defining and working with infinite data type instances using lazy evaluation

Since Haskell expressions are only evaluated when a query is made, a Haskell module can safely describe recursively defined, potentially infinitely large values. For example, we can define an infinite list containing the integer 1 in every position:

infinite :: [Integer]
infinite = [1] ++ infinite

Equivalently, we can use the (:) list node constructor:

infinite :: [Integer]
infinite = 1 : infinite

Haskell also allows infinite lists containing ranges to be declared using the list range notation:

*> take 10 [5..]
[5,6,7,8,9,10,11,12,13,14]

The Haskell operational semantics is based on pattern matching unification, but it doesn't work exactly like the operational semantics we implemented in Assignment #4. Haskell supports lazy evaluation. Intuitively, when trying to evaluate a function application of the form f(e), Haskell will only evaluate the expression e using substitutions until it is sufficiently deep to try all the patterns in the definition of f (e.g., if the patterns associated with f are all of depth at most 3, the expression e will be evaluated until the constructors up to depth 3 are known, and no further).

Haskell lazy evaluation operational semantics make it possible to perform queries on infinite values without running into an infinite recursion or infinite loop. For example, suppose we define the following function on lists, which takes only the first n elements of a list (note that this function is already included in the Prelude).

take :: Integer -> [Integer] -> [Integer]
take 0 xs = []
take n (x:xs) = x : take (n-1) xs

Querying the infinite list is not a problem if we use take.

*> take 10 infinite
[1,1,1,1,1,1,1,1,1,1]

Using pattern matching unification and lazy evaluation, it is also possible to define algorithm that can operate on infinite lists. For example, we can define a function that adds all the corresponding elements of two infinite lists. Notice that there is no case for the empty list in this definition because infinite lists are never empty.

addInfiniteLists :: ([Integer], [Integer]) -> [Integer]
addInfiniteLists (x:xs, y:ys) = (x + y) : addInfiniteLists (xs, ys)

We can now use the addInfiniteLists function to add two infinite lists. Suppose we have the following definitions:

ones = 1 : ones
twos = 2 : twos

We can then define an infinite list in which all entries are 3:

*> take 10 (addInfiniteLists (ones, twos))
[3,3,3,3,3,3,3,3,3,3]

[link] 7.8. Logic programming

Declarative languages are well-suited for describing logical models and then examining them by making queries. Declarative languages like Prolog and Datalog are designed specifically to support logic programming. In Haskell, it is possible to use functions that return boolean values to represent logical predicates, and boolean expressions to represent the formulas that define them.

[link]

Example: Suppose we want to model a simple access control system (e.g., for a file system). Notice that we can view functions like user() and owns() either as tables in a database (with each declaration for the function acting as a row in the table) or as logical predicates.

data Constant = Alice | Bob | X | Y | Z deriving (Eq, Show)

-- Define which constants are a file.
file (X) = True
file (Y) = True
file (Z) = True
file (_) = False

-- Define which constants are a user.
user (Alice) = True
user (Bob)   = True
user (_)     = False

-- Specify which users are administrators.
admin(Alice) = True
admin(Bob)   = False

-- Specify file ownership relation.
owns (Alice, X) = True
owns (Bob,   Y) = True
owns (Bob,   Z) = True
owns (_,     _) = False

-- Only owners and administrators can write files.
write (user', file') =
     ( owns(user', file') || admin(user'))
  && user(user')
  && file(file')

-- Anyone can read files.
read (user, file) = True

In the above example, we needed to define predicates that distinguished different types of constants. However, this means that every call to file() and user() must take place when the program is running. We can make the code more efficient by instead declaring distinct types. Since type checking is performed only at compile time in Haskell, this will still ensure that there are no type errors, but it will be more efficient because fewer function invocations will occur when the program is running.

data User = Alice | Bob deriving (Eq, Show)
data File = X | Y | Z deriving (Eq, Show)

admin :: User -> Bool
admin (Alice) = True
admin (Bob)   = False

owns :: (User, File) -> Bool
owns (Alice, X) = True
owns (Bob,   Y)   = True
owns (Bob,   Z)   = True
owns (_,     _)   = False

write :: (User, File) -> Bool
write (user, file) = owns(user, file) || admin(user)

read :: (User, File) -> Bool
read (user, file) = True

[link] 7.9. Functions and higher-order functions in a typed functional language

Haskell is a functional language; functional languages are distinguished by their ability to allow users to define and use higher-order function: functions that can take other function as arguments, and that can return functions as results. Many other languages have native support for higher-order functions, including JavaScript and Python; we say that these languages support the functional programming paradigm.

To start become familiar with the concept of a higher-order function, we can consider the following example:

f :: Integer -> Integer
f (x) = x + 1

g :: Integer -> (Integer -> Integer)
g (y) = f

Notice that g is a function that takes an integer as an argument. It then throws this argument away and returns the function f. Its type reflects this. Thus, the result of evaluating an expression such as g(4) is a function of type Integer -> Integer, and this function can be applied to another argument.

*> :t g(1)
g(1) :: Integer -> Integer
*> :t g(1)(5)
g(1)(5) :: Integer
*> g(1)(5)
6

What if we want to define a function like g that can be partially applied to only some of its arguments, but that still uses all of its arguments? We can do so by listing multiple, space-separated parameters in the definition of g.

g :: Integer -> (Integer -> Integer)
g (y) (x) = y + x

It is then possible to apply g to arguments one at a time, as before.

*> :t g(3)
g(3) :: Integer -> Integer
*> :t g(3)(4)
g(3)(4) :: Integer
*> g(3)(4)
7

Notice the relationship between application of functions to arguments, and the types of each intermediate result. Suppose we have the following definitions.

data A = A
data B = B
data C = C
data D = D
data E = E

f :: A -> (B -> (C -> (D -> E)))
f a b c d = E

Then we have the following types for each of the following expressions (spacing adjusted for legibility):

*> :t f
f             :: A -> (B -> (C -> (D -> E)))

*> :t f(A)
f(A)          :: B -> (C -> (D -> E))

*> :t f(A)(B)
f(A)(B)       :: C -> (D -> E)

*> :t f(A)(B)(C)
f(A)(B)(C)    :: D -> E

*> :t f(A)(B)(C)(D)
f(A)(B)(C)(D) :: E

[link]

Example: We want to write a higher-order function that takes any function f of type Integer -> Integer and returns a function f' that is its reflection across the x-axis (i.e., for all x, f(x) = -f'(x)).

reflect :: (Integer -> Integer) -> Integer -> Integer
reflect (f) (x) = - f(x)

Notice that we could also write the type as (Integer -> Integer) -> (Integer -> Integer) because -> is right-associative. This makes it more obvious that reflect takes a function of type Integer -> Integer as an input and returns a function of type Integer -> Integer as an output.

[link]

Example: We want to write a higher-order function that takes any function f of type Integer -> Integer and returns an integer approximation of its definite integral starting at x = 0.

integral :: (Integer -> Integer) -> Integer -> Integer
integral (f) (x) = sum [ f(k) | k <- [0..x] ]

We could also write a higher-order function that takes any function f of type Double -> Integer and returns an approximation of its definite integral starting at x = 0 for a given precision.

integral :: Double -> (Double -> Double) -> Double -> Double
integral (i) (f) (x) = sum [ f(k*i)*i | k <- [0..(x/i)] ]

[link]

Example: We want to write a higher-order function that takes any function f of type Double -> Double and returns an approximation of its derivative.

derivative :: Double -> (Double -> Double) -> Double -> Double
derivative (h) (f) (x) = (f(x + h) - f(x)) / h

[link] 7.10. Data encapsulation using algebraic data types

Typed languages that support algebraic data types make it possible to implement and enforce a kind of data encapsulation. For example, suppose we want to represent distances in various units (such as Imperial and metric units) in our program. There are well-known examples of expensive software failures (such as the crash of the Mars Climate Orbiter) that occurred because one component assumed that numerical data was in Imperial units, while another component assumed it was metric units.

data Meters = Meters Integer
data Feet = Feet Integer

Given the above, it is now possible to enforce which type of measurement is supported by a function. For example:

data Action = KeepGoing | SlowDown

updateVelocity :: Feet -> Action
updateVelocity (Feet n) = if n < 100 then SlowDown else KeepGoing

[link] 7.11. Predefined Haskell type classes, user-defined operators, and embedded languages

Haskell allows programmers to define their own functions for converting algebraic data type values to strings (e.g., for displaying them in the interactive interface), and it allows programmers to provide custom definitions for numeric literals (i.e., 0, 1, 2, and so on) and built-in operators (such as +, *, and so on).

instance Num Meters where
  fromInteger n = Meters n
  (Meters x) + (Meters y) = Meters (x + y)

The above definitions allow us to use the numeric liters and built-in operators to work with values of type Meters.

*> 1 + 2 + 3 :: Meters
Meters 6

It is also possible to provide a user-defined function for converting values of type Meters to strings.

instance Show Meters where
show (Meters n) = show n ++ " meters"

Then, we have the following behavior:

*> 1 + 2 + 3 :: Meters
6 meters

The Haskell features that support algebraic data types, multiple ways to apply higher-order functions to arguments, user-defined infix operators, and user-defined definitions of built-in literals and functions can be used in concert to easily define new programming languages within Haskell itself. This makes it much easier to build parse trees directly within Haskell, thus obviating the need for a parser.

[link]

Definition: An embedded language is a subset of the concrete syntax of a programming language (the host language) that is treated as its own separate, self-contained language. Libraries can often be viewed as embedded languages.

[link]

Example: Suppose we define the following algebraic data types, corresponding to an abstract syntax for an imperative programming language. This definition, along with some of Haskell's other features, will allow us to turn this into an embedded language within Haskell. We can then use Haskell's concrete syntax to write the concrete syntax for this language, and Haskell's evaluation algorithm will also act as the parser for this language (with the Haskell evaluation algorithm producing an abstract syntax tree representation of whatever is written).

We also introduce some user-defined infix operators, and add user-specified definitions for some built-in Haskell operators:

(<<<) :: Term -> Term -> Formula
(<<<) t1 t2 = LessThan t1 t2

(===) :: Term -> Term -> Formula
(===) t1 t2 = Equal t1 t2

(&&&) :: Formula -> Formula -> Formula
(&&&) f1 f2 = And f1 f2

(|||) :: Formula -> Formula -> Formula
(|||) f1 f2 = Or f1 f2

instance Num Term where
  fromInteger n = Number n
  t1 + t2 = Plus t1 t2

We can now use all of the above together with the $ syntax to easily write down abstract syntax trees.

parseTree =
   Print 5 $
   Print 6 $
   If ((5 <<< 6) &&& (6 <<< 7)) (
     Print 6 $
     Print 7 $
     End
   ) $
   End

Evaluating the definition of parseTree will yield the actual algebraic data type instance (i.e., a parse tree).

*> parseTree
Print (Number 5) (Print (Number 6) (
  If
    (And (LessThan (Number 5) (Number 6)) (LessThan (Number 6) (Number 7)))
    (Print (Number 6) (Print (Number 7) End)) End)
  )

[link] 7.12. Additional useful Haskell language features and examples

Like Java/C/C++, Haskell provides a built-in ternary operator if ... then ... else ... for writing conditional expressions:

*> if True then 1 else 2
1
*> if False && True then "Yes" else "No"
"No"

However, this is only a convenience; a user can easily implement his or her own syntax for this purpose. The support for pattern matching unification and lazy evaluation that Haskell's operational semantics provides is sufficient. Two example implementations are provided below.

if' :: (Bool, Integer, Integer) -> Integer
if' (True,  trueBranch, falseBranch) = trueBranch
if' (False, trueBranch, falseBranch) = falseBranch

if'' :: Bool -> Integer -> Integer -> Integer
if'' True  trueBranch _           = trueBranch
if'' False _          falseBranch = falseBranch

The if ... then ... else ... syntax (or function, as we have seen above) can be useful in certain situations.

[link]

Example: Suppose we are trying to represent an environment or substitution using a data structure in Haskell. We might do so by using a list of tuples, where each tuple is a pair consisting of a String and a value (e.g., an Integer). For example:

[("x",1), ("y", 2)] :: [(String, Integer)]

Then, we would need a function that can help us retrieve a value associated with a particular variable name.

lookup' :: String -> [(String, Integer)] -> Integer
lookup' x ((x',i) : rest) = if x == x' then i else lookup' x rest

In fact, Haskell's base library contains a similar function, lookup. However, since it returns a result that is wrapped in a Maybe constructor, we can unwrap it using, e.g.:

unwrap :: Maybe Integer -> Integer
unwrap (Just n) = n

[link]

Example: Suppose we want to represent our own totally ordered set. For example, suppose we have the following data type for representing a discrete subset of the color spectrum:

data Color = Red | Green | Blue

Since Red has the lowest frequency and Blue has the highest frequency, we can define an ordering on all instances of this data type by making the Color type a member of the Ord type class. Note that we could use wildcards (_) to make the definition more concise.

instance Ord Color where
  Red   <= Red   = True
  Red   <= Green = True
  Red   <= Blue  = True
  Green <= Green = True
  Green <= Blue  = True
  Blue  <= Blue  = True
  _     <= _     = False

It is now possible to use the built-in operators <, <=, >, and >= to compare colors, as well as library functions such as max :: Color -> Color -> Color, min :: Color -> Color -> Color, maximum :: [Color] -> Color, and minimum :: [Color] -> Color.

[link]

Example: Suppose we have a data type that allows us to build trees that contain integer constants in their leaves.

data IntegerTree = Node IntegerTree IntegerTree | Leaf Integer

We can define a recursive function that assembles all the integers in the leaves in a single list.

ints :: IntegerTree -> [Integer]
ints (Node t1 t2) = ints t1 ++ ints t2
ints (Leaf n ) = [n]

[link]

Example: Suppose we want to write a function that composes two functions on integers. We might define a type synonym for such functions:

type Function = Integer -> Integer

We then know that one possible type for a composition function might be as follows (it takes two function arguments, and returns a new composed function as a result):

compose :: Function -> Function -> Function

However, how do we actually implement compose?

compose f g = ?

To understand how we can do so, we must first expand the type of compose; let's substitute one the right-most Function synonym with its definition:

compose :: Function -> Function -> (Integer -> Integer)

Now recall that the -> type operator is right-associative; this means we can remove the right-most set of parentheses without changing the meaning of the type:

compose :: Function -> Function -> Integer -> Integer

It is now clear how to implement compose: it takes two Function arguments, and one Integer argument:

compose :: Function -> Function -> Integer -> Integer
compose f g x = f(g(x))

Note that this function is already available in the Haskell base library, and it is called (.). So we could also have defined compose as follows:

compose :: Function -> Function -> Function
compose f g = f . g

The above is an example of point-free programming, because we never explicitly gave names to the individual inputs (i.e., "points") on which the functions operate.

[link] 7.13. List comprehensions and pattern matching unification

Haskell list comprehension syntax can be combined with Haskell's built-in pattern matching unification capabilities.

[link]

Example: In Haskell, list comprehensions make it possible to build up a list by iterating over the elements of one or more other lists (possibly filtering entries using one or more formulas).

*> [x+1 | x <- [0..4]]
[1,2,3,4,5]

*> [(x,y) | x <- [1..3], y <- [1..2]]
[(1,1),(1,2),(2,1),(2,2),(3,1),(3,2)]

*> [(x,y) | x <- [1..3], y <- [1..2], x + y == 3, x >= 1]
[(1,2),(2,1)]

[link]

Example: In Haskell list comprehensions, the left-hand side of the membership operator &;lt- can be any pattern. If a pattern appears on the left-hand side of &;lt-, then each element in the list on the right-hand side will be checked against the pattern. Only those elements that unify with the pattern will be considered, and for each of them, the subsequent conditions in the comprehension, as well as the body of the comprehension, will be evaluated.

Suppose a module contains a data type definition for a simple tree data structure, and a list of trees.

data Tree =
    Node Tree Tree
  | Leaf
  deriving Show

example :: [Tree]
example = [
    Node Leaf Leaf,
    Node (Node Leaf Leaf) Leaf,
    Node Leaf (Node Leaf Leaf),
    Node Leaf (Node (Node Leaf Leaf) Leaf)
  ]

Then we can evaluate comprehensions such as the following. The first expression builds a list of the right subtrees of all the trees in example. The second expression builds a list containing the right subtrees that are not leaves. The last expression builds a list of subtree pairs for each non-leaf tree.

*> [r | Node _ r <- example]
[Leaf, Leaf, Node Leaf Leaf, Node (Node Leaf Leaf) Leaf]

*> [r | Node _ r <- example, r /= Leaf]
[Node Leaf Leaf, Node (Node Leaf Leaf) Leaf]

*> [(l,r) | Node l r <- example]
[(Leaf, Leaf), (Node Leaf Leaf, Leaf), (Leaf, Node Leaf Leaf), (Leaf, Node (Node Leaf Leaf) Leaf)]

We can also easily check whether a particular kind of tree is in the list (such as a non-leaf tree) by adding a value of type () for each matching entry and then checking the length of the list resulting from the comprehension, or by comparing it to the empty list.

*> length [() | Node _ _ <- example] > 0
True

*> [() | Node _ _ <- example] /= []
True

[link] 7.14. Modeling and exploring state spaces in a typed declarative/functional language

In many artificial intelligence and optimization problems, the collection of possible actions an algorithm, agent, or player can take can be modelled as a state space tree or graph (also sometimes called a decision tree or decision graph) which may or may not be infinite, depending on the length of the process or game being modeled. Algebraic data types and lazy evaluation make it easy to model such graphs.

[link]

Example: Suppose we want to model a Tic-tac-toe game, with the eventual goal of writing an AI algorithm that can play the game well. We can first settle on a representation for the board.

data Cell = E | X | O deriving (Eq, Show)
type Pos = (Integer, Integer)
data Board = Board [(Pos, Cell)] deriving Eq

instance Show Board where
show (Board pcs) = show [c | (p,c) <- pcs]

start :: Board
start = Board [((r,c),E) | r <- [1..3], c <- [1..3]]

Notice that the starting board consists of all empty cells.

*> start
[E,E,E,E,E,E,E,E,E]

When a player moves, they place an X or O somewhere on the board. We can model this using a function that, given a move, transforms one board into another board. We use a list comprehension that updates only the cell with the corresponding position.

move :: ((Pos, Cell), Board) -> Board
move ((pos', cell'), Board pcs) = Board [(pos, if pos' == pos then cell' else cell) | (pos, cell) <- pcs]

Suppose we want to make a state space graph that captures all possible paths a Tic-tac-toe game can take from a given starting position, and given a player's current turn. Let us assume that turns can be modelled using an infinite list. For example, here is an infinite list for game in which the X and O players take alternating turns.

turns :: [Cell]
turns = X:O:turns

We first define a data structure for representing the possible paths a game can take. The data structure is a tree graph in which each node represents a game state (i.e., a Board), and can have any number of child trees (including zero).

data DecisionTree = Move Board [DecisionTree] deriving (Eq, Show)

Now, we can define a function that builds a decision tree given a starting game state and the player turn sequence. Notice that nextMoves takes a Board and a Cell (to indicate which player is taking a turn), and uses a list comprehension to choose only the board cells that are empty. For each empty cell, it adds a new board state to the list in which that position is filled with the specified player's symbol.

nextMoves :: (Board, Cell) -> [Board]
nextMoves (Board pcs, cell) = [move ((p, cell), Board pcs) | (p, E) <- pcs]

Given a starting board and turn sequence, we can now build the tree of state spaces for the entire game.

tree :: Board -> [Cell] -> DecisionTree
tree b (turn:turns) = Move b [tree b' turns | b' <- nextMoves (b, turn)]

Suppose we have a function win :: Cell -> Board -> Bool that checks if a given board is already a win for a particular player. We can then add an Ord instance to compare boards from player X's perspective: a board is "greater" (or "better") than another board if either the board is a win for X, or the other board is a loss for O.

instance Ord Board where
b <= b' = win b' X || win b O

It is now possible to use the built-in operators <, <=, >, and >= to compare boards, as well as library functions such as max :: Board -> Board -> Board and maximum :: [Board] -> Board to choose the "best" boards.

[link] 7.15. Ad Hoc and Parametric Polymorphism

[link]

Definition: A language supports ad hoc polymorphism if it allows programmers to specify multiple definitions for the same constant or variable (which could be a literal, an operator symbol, a function name, a type, and so on). An interpreter or compiler then choose the appropriate definition to use based on the context (usually, the types of the argument expressions and/or the surrounding expressions and statements in the abstract syntax tree).

Common examples of language features that support ad hoc polymorphism include operator overloading in C++, C and Python (operator overloading is not supported in Java), method overloading in Java subclasses, Java interfaces, and type classes in Haskell.

[link]

Definition: A language supports parametric polymorphism if it allows programmers to specify a single definition for a function or procedure that can then be applied to values of multiple types (typically, a large range of types, or even all types). Typically, the function or procedure will operate on a container type, or some other parameterized or generic type. Thus, languages that support parametric polymorphism usually allow programmers to define data structures that can be parameterized by other data structures.

Common examples of language features that support parametric polymorphism include Java generics and the Haskell type system (the latter uses an extension of the Hindley-Milner type inference algorithm that is also used by the programming language ML).

Note an important distinction between ad hoc polymorphism and parametric polymorphism: ad hoc polymorphism allows programmers to continue adding new definitions for a given variable or constant within the same scope, while parametric polymorphism only permits a single, albeit generic, definition for a function or procedure within any given scope.

[link]

Example: Suppose we define a function len for computing lengths of lists of integers.

len :: [Integer] -> Integer
len [] = 0
len ((:) x xs) = 1 + len xs

Unfortunately, we cannot use the above function to compute the length of a list of, e.g., Bool values because the type [Integer] does not unify with the type [Bool]. We could write an alternative type to address this.

len :: [Bool] -> Integer
len [] = 0
len ((:) x xs) = 1 + len xs

We can now use len on lists of type [Bool], but we can no longer use it on lists of type [Integer]. Notice that the only thing that has changed is the type of the function; the body is the same in both cases. Haskell allows us to take advantage of this by employing a type variable.

len :: [a] -> Integer
len [] = 0
len ((:) x xs) = 1 + len xs

With the above declaration, it is now possible to apply the len function to a list containing any type of element. This is because the type variable a unifies with any other type (including Bool and Integer), so the input type [a] will unify with both [Integer] and [Bool], as well as types such as [String -> String] (a list of functions that take String inputs and return String outputs).

*> len [True, False, True]
3
*> len [1, 2, 3, 4, 5]
5
*> len [\s -> s ++ ".", \t -> t ++ ";"]
2

Since len is a function that can be applied to different input types, but has only a single definition that works for all input types, it is an example of parametric polymorphism.

[link]

Example: We can define a Set data structure for holding sets of Integer values that has an underlying list implementation.

module IntegerSet where

data Set =
    Empty
  | Set [Integer]
  deriving Show

insert :: Set -> Integer -> Set
insert (Set l) x = Set (x:l)
insert (Empty) x = Set [x]

contains :: Integer -> Set -> Bool
contains x (Empty) = False
contains x (Set l) = x `elem` l

However, what if we also want user to have the option of using a data structure for sets that uses an underlying binary search tree implementation? We can first define a class for set data structures that specifies the signatures of the functions that a set data structure must support. We can even define new functions that rely only on the signature.

module SetClass where

class IsSet s where
  empty :: s
  insert :: s -> Integer -> s
  contains :: Integer -> s -> Bool

  inserts :: s -> [Integer] -> s
  inserts s (x:xs) = inserts (insert s x) xs
  inserts s []     = s

This makes it possible to provide two different implementations that are both members of this class.

module IntegerSetList where

import SetClass

data IntegerSetList =
    Empty
  | Set [Integer]
  deriving Show

instance IsSet IntegerSetList where
  empty = Empty

  insert (Set l) x = Set (x:l)
  insert (Empty) x = Set [x]

  contains x (Empty) = False
  contains x (Set l) = x `elem` l

module IntegerSetTree where

import SetClass

data IntegerSetTree =
    Leaf
  | Node Integer IntegerSetTree IntegerSetTree
  deriving Show

instance IsSet IntegerSetTree where
  empty = Leaf

  insert (Leaf       ) x = Node x Leaf Leaf
  insert (Node y s s') x =
    if x < y then
      Node y (insert s x) s'
    else
      Node y s (insert s' x)

  contains x (Leaf       ) = False
  contains x (Node y s s') =
    if y == x then
      True
    else if x < y then
      contains x s
    else
      contains x s'

Haskell type classes need not contain types; they can also contain type constructors (i.e., functions in the type system that can be used to construct a type given a type, i.e., type constructors with kind Type -> Type, and so on).

[link]

Example: We can define a Set data structure for holding sets of values of any type a.

module Set where

data Set a =
    Empty
  | Set [a]
  deriving Show

insert :: Set a -> a -> Set a
insert (Set l) x = Set (x:l)
insert (Empty) x = Set [x]

contains :: Eq a => a -> Set a -> Bool
contains x (Empty) = False
contains x (Set l) = x `elem` l

module SetClass where

class IsSet s where
  empty :: s a
  insert :: Ord a => s a -> a -> s a
  contains :: (Ord a, Eq a) => a -> s a -> Bool

  inserts :: Ord a => s a -> [a] -> s a
  inserts s (x:xs) = inserts (insert s x) xs
  inserts s []     = s

This makes it possible to provide two different implementations of polymorphic set data structures that are both members of this class. Notice that we added the Ord a constraint on the element type a in case we want to compare them (e.g., if we are using a binary search tree implementation of sets).

module SetList where

import SetClass

data SetList a =
    Empty
  | Set [a]
  deriving Show

instance IsSet SetList where
  empty = Empty

  insert (Set l) x = Set (x:l)
  insert (Empty) x = Set [x]

  contains x (Empty) = False
  contains x (Set l) = x `elem` l

module SetTree where

import SetClass

data SetTree a =
    Leaf
  | Node a (SetTree a) (SetTree a)
  deriving Show

instance IsSet SetTree where
  empty = Leaf

  insert (Leaf       ) x = Node x Leaf Leaf
  insert (Node y s s') x =
    if x < y then
      Node y (insert s x) s'
    else
      Node y s (insert s' x)

  contains x (Leaf       ) = False
  contains x (Node y s s') =
    if y == x then
      True
    else if x < y then
      contains x s
    else
      contains x s'

[link]

Example: Suppose we have a Set data structure for holding sets of values of any type a that has a corresponding size :: Set a -> Int function.

module Set where

data Set a =
    Empty
  | Set [a]
  deriving (Eq, Show)

size :: Set a -> Int
size (Set l) = length l
size (Empty) = 0

We can make it possible to order sets by their size using the < operator by making the type Set a (for any possible a) a member of the Ord type class.

instance Eq a => Ord (Set a) where
s < s' = size s < size s'

Note that the instance declaration is prepended with the condition that the type a must be in the Eq type class. This is because the definition of Set a contains a deriving Eq clause, which means that a type Set a can only be used in a module if the parameter type a is a member of the Eq type class (otherwise, Haskell would not be able to compare two sets for equality using the automatically derived definition of (==) generated by the deriving Eq clause).

[link] 7.16. Folds, Unfolds, and Algebraic Properties of Functional Programs

We have seen that the operational semantics of declarative and functional languages support an approach to problem solving that involves first defining a model of some system, and then querying that model to answer questions about that system or explore its properties. We can define this approach to programming with more consistency and mathematical rigor by identifying the mathematical transformations that are typically employed and composed when using this approach. In particular, most component operations fall into two categories:

unfold (or anamorphism) operations are used to define or construct a data value;
fold (or catamorphism) operations are used to reduce, simplify, break down, or transform a data value.

A few special cases of the composition of a fold followed by an unfold are commonly used, as well (these are commonly implemented using comprehensions when the data value being transformed is a set or list):

map (sometimes known as metamorphism) operations are used to modify all the components of a data value using a single, often local transformation that is applied to all the components across the entire data value;
filter operations are used to remove (or select) only some of the components of a data value.

The ability to describe an algorithm using a composition of unfold, fold, map, and filter operations imposes fewer restrictions on how that computation can be carried out (when compared to a description of the same algorithm using a sequential list of operations). A description using fold, map, and so on may still impose some hierarchical dependencies on the computation, but the operational semantics of such a description allows an implementation (or a compiled version) of the algorithm to perform operations in parallel unless such parallelization is explicitly forbidden by the algorithm itself through explicit sequential dependencies introduced by the programmer. In other words, the default interpretation of such an algorithm definition is inherently parallelizable.

Defining the above operations on algebraic data types (including polymorphic algebraic data types) is often a straightforward process (in fact, it can often be automated). The Haskell base library provides functions for all of the above operations on Haskell lists, and many library functions are implemented using these functions; the Haskell compiler is optimized to compile these functions into efficient machine code. Most of the above functions are also supported by languages such as Python.

[link]

Example: We can begin by implementing a fold operation on Haskell Integer lists that does nothing.

fold :: [Integer] -> [Integer]
fold [] = []
fold (x:xs) = x : fold xs

We can modify the above to replace the two list constructors (i.e., (:) and []) in the input list with a binary operator and a base case (in this case, (+) and 0). This way, the fold function now computes the sum of the elements in the list.

fold :: [Integer] -> Integer
fold [] = 0
fold (x:xs) = x + fold xs

We can further generalize the above so that it takes the two replacement functions as arguments. The two functions are a replacement for (:), so it must be of type Integer -> Integer -> Integer, and a replacement for [], which must be of type Integer.

fold :: (Integer -> Integer -> Integer) -> Integer -> [Integer] -> Integer
fold f b [] = b
fold f b (x:xs) = f x (fold f b xs)

We can now use the above fold implementation to define many common functions on lists of integers.

sum :: [Integer] -> Integer
sum = fold (+) 0

product :: [Integer] -> Integer
product = fold (*) 1

maximum :: [Integer] -> Integer
maximum = fold max 0 -- Assumes non-negative integers.

minimum :: [Integer] -> Integer
minimum l = fold min (fold max l) l

[link]

Example: The fold function for the default right-associative list implementation supported by Haskell (i.e., with the built-in constructors (:) and []) is called foldr.

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f base [] = base
foldr f base (x:xs) = f x (foldr f base xs)

The foldr function can be used to implement many of the usual library functions one might find useful when working with lists, and this is what is done in the Haskell libraries. Below are some examples (some of these are simplified versions of what is actually found in the libraries).

sum :: Num a -> [a] -> a
sum xs = foldr (+) 0 xs

max :: Ord a => a -> a -> a
max x y = if x > y then x else y

maximum :: Ord a => [a] -> a
maximum xs = foldr max 0 xs

[link]

Example: We can implement any map operation as a fold. Suppose we want to add the constant 1 to every element in an Integer list. For example, we want to achieve the following using a call to foldr:

*> [x + 1 | x <- [1,2,3,4,5]]
[2,3,4,5,6]

We can accomplish this by defining a modified list node constructor function that adds 1 to its first argument, then builds the list node:

addOneThenCons :: Integer -> [Integer] -> [Integer]
addOneThenCons x xs = (x + 1) : xs

We can now use the above together with foldr:

*> foldr addOneThenCons [] [1,2,3,4,5]
[2,3,4,5,6]

Using λ abstractions, we can avoid having to explicitly define addOneThenCons:

*> foldr (\x xs -> (x+1) : xs) [] [1,2,3,4,5]
[2,3,4,5,6]

[link]

Example: We can implement any filter operation as a fold. For example, suppose we want to only keep positive integers from an integer list:

*> [x | x <- [-3,-2,-1,0,1,2,3], x > 0]
[1,2,3]

We can accomplish this by defining a modified list node constructor function that only adds a new list node if the data inside it is a positive integer:

consIfPositive :: Integer -> [Integer] -> [Integer]
consIfPositive x xs = if x > 0 then x : xs else xs

We can now use the above together with foldr:

*> foldr consIfPositive [] [-3,-2,-1,0,1,2,3]
[1,2,3]

Using λ abstractions, we can avoid having to explicitly define consIfPositive:

*> foldr (\x xs -> if x > 0 then x : xs else xs) [] [-3,-2,-1,0,1,2,3]
[1,2,3]

[link]

Example: We can implement an easily parallelizable version of quicksort using filter operations:

qsort :: [Integer] -> [Integer]
qsort [] = []
qsort (x:xs) = qsort [y | y <- xs, y < x] ++ [x] ++ qsort [y | y <- xs, y >= x]

[link]

Example: We can implement a fold operation that works for associative binary operators that recursively splits the problem into two parts over and over until a base case is reached.

fold :: (a -> a -> a) -> a -> [a] -> a
fold f b []  = b
fold f b [x] = x
fold f b xs  =
  fold f b (take (length xs `div` 2) xs)
    `f`
  fold f b (drop (length xs `div` 2) xs)

If each recursive invocation were run on a separate thread, processor, server, and so on, this would maximum the amount of parallelization that can be achieved in performing this fold operation.

[link]

Example (filter, map, and fold operations in SQL-like languages): SQL-like languages (as well as the relational algebra on which they are based) can be viewed as supporting fold, map, and filter operations (and compositions thereof) on a particular data structure: a table. This allows databases to be distributed across multiple storage devices, and it makes it possible to simultaneously query different portions of a database, stored on different devices, using multiple computational devices (processors, virtual machines, servers, and so on) running in parallel.

For example, suppose an SQL table consists of some number of rows, and each row has an entry of the form (Name, Age):

+---------------+
|    People     |
+---------------+
| Name  |  Age  |
+-------+-------+
| Alice |  20   |
| Bob   |  17   |
| Carl  |  23   |
+-------+-------+

The following SQL query allows a user to retrieve only the Name entries in the table; this query amounts to a map operation:

> SELECT Name FROM People

+-------+
| Name  |
+-------+
| Alice |
| Bob   |
| Carl  |
+-------+

The following SQL query allows a user to retrieve only the entries within a certain Age range; this query amounts to a filter operation:

> SELECT * FROM People WHERE Age >= 18

+---------------+
| Name  |  Age  |
+-------+-------+
| Alice |  20   |
| Carl  |  23   |
+-------+-------+

The following SQL query allows a user to retrieve the sum of the Age fields of all the entries; this query amounts to fold operation (it is often called an aggregation operation):

> SELECT SUM(Age) FROM People

+-------+
|  Age  |
+-------+
|  60   |
+-------+

We can easily simulate all of the above capabilities using another language that supports the declarative and functional programming paradigms, such as Haskell. Suppose we have the following definitions:

type Name = String
type Age = Integer
data Table = [(Name, Age)]

people = [("Alice", 20), ("Bob", 17), ("Carl", 23)]

We can then perform the following queries:

*> [name | (name, age) <- people]
["Alice", "Bob", "Carl"]

*> [(name, age) | (name, age) <- people, age >= 18]
[("Alice", 20), ("Carl", 23)]

*> foldr (+) 0 [age | (name, age) <- people]
60

Equivalently, we can use map, filter, foldr, and λ abstractions instead of list comprehensions (the Haskell compiler simply converts list comprehensions into some combination of these during compilation):

*> map (\(name, age) -> name) people
["Alice", "Bob", "Carl"]

*> map fst people
["Alice", "Bob", "Carl"]

*> filter (\(name, age) -> age >= 18) people]
[("Alice", 20), ("Carl", 23)]

*> foldr (+) 0 (map fst people)
60

All of the above can also be done using Python list comprehensions:

>>> People = [("Alice", 20), ("Bob", 17), ("Carl", 23)]

>>> [name for (name, age) in People]
["Alice", "Bob", "Carl"]

>>> [(name, age) for (name, age) in People if age >= 18]
[('Alice', 20), ('Carl', 23)]

>>> sum([age for (name, age) in People])
60

Python also supports map, filter, reduce (similar to Haskell's foldr), and lambda abstractions:

>>> list(map((lambda person: person[0]), People))
['Alice', 'Bob', 'Carl']

>>> list(filter((lambda person: person[1] >= 18), People))
[('Alice', 20), ('Carl', 23)]

>>> from functools import reduce
>>> reduce(lambda x,y: x + y, [age for (name, age) in People])
60

[link]

Example (filter and map operations in JavaScript libraries): Many JavaScript libraries, such as jQuery/jQuery UI, node.js, d3, and others support an abstraction for manipulating web documents (which are often used as application components on the web today) that is organized around filter and map operations.

In jQuery, it is possible to select all the elements with a certain tag in an HTML document and then to specify a function that will be applied to each element selected. This corresponds to the composition of a filter and map operation.

For example, the following code selects all the li elements and updates the text content of each to indicate the item number corresponding to that element. Notice that the .text() function takes a function as an argument (the function supplied as an argument takes an index as an input and returns a string):

$("ul li").text(
  function( index ) {
    return "item number " + ( index + 1 );
  }
);

More generally, it's possible to update individual document elements using the .each() function.

[link] 7.17. Assignment #5: Embedded Languages and State Space Search

In this assignment you will practice using the declarative and functional programming language Haskell by implementing a small embedded language, and by building a library of algorithms for solving an optimization problem. You must submit two files:

hw5/Database.hs;
hw5/Allocation.hs.

Please follow the gsubmit directions and remember to put your files in the hw5 directory.

Your solutions to each of the problem parts below will be graded on their correctness, concision, and mathematical legibility. The different problems and problem parts rely on the lecture notes and on each other; carefully consider whether you can use functions from the lecture notes, or functions you define in one part within subsequent parts.

A testing script with several test cases is available for download: hw5-tests.hs. You should be able to place it in the same directory with the other assignment files and load it. Feel free to modify or extend it as you see fit.

In this problem you will implement an interpreter for a small embedded programming language. Your solutions should be included in the file hw5/Database.hs.
The module Database already contains data type definitions below for representing the syntax of the embedded language.

type Column = String
data User = User String deriving (Eq, Show)
data Table = Table String deriving (Eq, Show)
data Command =
    Add User
  | Create Table
  | Allow (User, Table)
  | Insert (Table, [(Column, Integer)])
  deriving (Eq, Show)

A program in this language will consist of a list of elements of values (i.e., a value of type [Command]) such as the following example.

example :: [Command]
example = [
    Add (User "Alice"),
    Add (User "Bob"),
    Create (Table "Revenue"),
    Insert (Table "Revenue", [("Day", 1), ("Amount", 2400)]),
    Insert (Table "Revenue", [("Day", 2), ("Amount", 1700)]),
    Insert (Table "Revenue", [("Day", 3), ("Amount", 3100)]),
    Allow (User "Alice", Table "Revenue")
  ]
1. Implement a function select :: [Command] -> User -> Table -> Column -> Maybe [Integer] that takes four arguments: a list of commands describing the current state of the database, the user evaluating the query, the table on which the query is being evaluated, and the name of the column the user wants to select. Hint: use list comprehensions with pattern matching.
  If the specified user has permission to query the table, the function should return a list of integers corresponding to the values in the specified column of the specified table (wrapped with the Just constructor). If the user or table do not exist, or if the specified user does not have permission to query the specified table, the function should return Nothing.
  
  *Database> select example (User "Alice") (Table "Revenue") "Day"
  Just [1,2,3]
  
  *Database> select example (User "Alice") (Table "Revenue") "Amount"
  Just [2400,1700,3100]
  
  *Database> select example (User "Bob") (Table "Revenue") "Amount"
  Nothing
2. Implement a function aggregate :: [Command] -> User -> Table -> Column -> Operator -> Maybe Integer that takes five arguments: a list of commands describing the current state of the database, the user evaluating the query, the table on which the query is being evaluated, the name of the column to aggregate, and an operator (of type Integer -> Integer -> Integer) to use when aggregating the values in the specified column. Hint: use foldr.
  If the specified user has permission to query the table, the function should return the aggregate (with respect to the supplied aggregator function) of the integers corresponding to the values in the specified column of the specified table (wrapped with the Just constructor). If the user or table do not exist, or if the specified user does not have permission to query the specified table, the function should return Nothing.
  
  *Database> aggregate example (User "Alice") (Table "Revenue") "Amount" (+)
  Just 7200
3. Implement a function validate :: [Command] -> Bool that checks that every command in the list refers to a table that has already been created and a user that has already been added earlier in the list of commands. If every command is valid, it should return True; otherwise, it should return False. Hint: write a helper function and call it on a reversed list of commands, then check each command against the rest of the list.
In this problem, you will implement a small library for representing and working with a state space graph for an optimization problem. All the functions you define should be included in the file hw5/Allocation.hs.

The optimization problem we are addressing is defined as follows: given a list of items, each of a certain integer size, put each item into one of two bins so that once all items have been allocated, the sum of the sizes of the items in the first bin is as close as possible to the sum of the items in the second bin (note that this optimization problem is NP-complete and is equivalent to the subset sum problem, so no efficient solution for the problem is believed to exist).
1. We will represent an allocation of items (either in the midst of running an optimization algorithm or within a final result) using the data type Alloc:
  
  data Alloc = Alloc Bin Bin deriving (Eq, Show)
  
  We will represent the graph of possible states that an algorithm can traverse given a starting allocation and a list of tasks using the Graph data type:
  
  data Graph =
      Branch Alloc Graph Graph
    | Finish Alloc
    deriving (Eq, Show)
  
  Define a function graph :: Alloc -> [Item] -> Graph that takes a starting allocation and a list of items, and returns the full graph of all possible state space traversals available to an algorithm given the list of items. An example is provided below (spacing has been adjusted for legibility):
  
  *> graph (Alloc 0 0) [1,2]
  Branch (Alloc 0 0)
    (Branch (Alloc 1 0)
      (Finish (Alloc 3 0))
      (Finish (Alloc 1 2))
    )
    (Branch (Alloc 0 1)
      (Finish (Alloc 2 1))
      (Finish (Alloc 0 3))
    )
  
  The definition must work even if the supplied list of items is infinite.
2. Define a function alloc :: Graph -> Alloc that makes it easy to retrieve the allocation within the root node of a particular graph of type Graph.
  
  *> alloc (Finish (Alloc 0 0))
  Alloc 0 0
3. One allocation a :: Alloc is better than another allocation b :: Alloc if the difference between the two bin totals in a is less than the difference between the two bin totals in b. Add an instance declaration for the Alloc type so that it is possible to use the built-in Haskell infix operators < and <= to compare two allocations according to their differences. You may use the Haskell library function abs :: Integer -> Integer to compute the absolute value of an integer. If you're getting a stack overflow when testing <, make sure you also define <= explicitly.
  
  *> (Alloc 1 4) < (Alloc 5 10)
  True
  *> (Alloc 10 4) < (Alloc 1 3)
  False
4. One graph g :: Graph is better than another graph g' :: Graph if the difference between the two bin totals in the root node of g is less than the difference between the two bin totals in the root node of g'. Add an instance declaration for the Graph type so that it is possible to use the built-in Haskell infix operator < to compare two graphs.
  
  *> (Finish (Alloc 1 4)) < (Finish (Alloc 5 10))
  True
5. Define a function final :: Graph -> [Alloc] that returns an aggregate list of the allocations within all the leaf nodes of the supplied state space graph.
6. Define a function depth :: Integer -> Graph -> [Alloc] that returns the allocations within the state space graph nodes that are at depth exactly n within the state space graph (the root node is at depth 0).
In this problem, you will implement a small library of algorithms for solving the optimization problem we defined above. All the functions you define should be included in the file hw5/Allocation.hs.

A strategy is an algorithm that can traverse the state space graph; given an existing graph, it chooses some descendant node in the graph and returns the subgraph for which that descendant node is the root.

type Strategy = Graph -> Graph
1. Define a strategy greedy :: Strategy that takes a graph as an input. It should choose and return the better child of the two children of the root of the graph.
2. Define a strategy patient :: Integer -> Strategy that takes an integer argument n and a graph as its two inputs, and chooses and returns the best descendant of the supplied graph that can be found at depth n (an allocation is best if it is better than all others at that depth). If n is 0, the function should simply return the graph supplied to it as an input.
3. Define a strategy optimal :: Strategy that takes a state space graph, and returns the leaf node in the state space graph that has the best allocation. Hint: you can define this function in two lines; look carefully over the functions already available to you.
4. Define a metastrategy metaCompose :: Strategy -> Strategy -> Strategy that takes two strategies as arguments. It should apply the first strategy to obtain a new subgraph, and then it should apply the second strategy to that subgraph and return the result.
  
  *> (metaCompose greedy greedy) (graph (Alloc 0 0) [1,2,3])
  Branch (Alloc 2 1) (Finish (Alloc 5 1)) (Finish (Alloc 2 4))
  *> (metaCompose (patient 2) greedy) (graph (Alloc 0 0) [1,2,3,4])
  Branch (Alloc 2 4) (Finish (Alloc 6 4)) (Finish (Alloc 2 8))
5. Define a metastrategy metaRepeat :: Integer -> Strategy -> Strategy that repeats a given strategy a specified number of times. Your definition of metaRepeat should take an Integer and a Graph argument. If the integer is 0, it should simply return the graph unchanged. If the integer is positive, it should apply the strategy the specified number of times to the graph and then return the result.
6. Define a metastrategy metaGreedy :: Strategy -> Strategy -> Strategy that takes two strategies as its inputs. It should apply the two strategies to its input graph and should return the better of the two results that it obtains.
7. Consider the following strategy:
  
  impatient :: Integer -> Strategy
  impatient n g = (metaRepeat n greedy) g
  
  Describe one way in which impatient is superior to patient, and one way in which it is inferior.
Extra credit: Define a metastrategy fit :: Graph -> [Strategy] -> Strategy that takes a graph and a list of strategies as inputs; this metastrategy should choose the strategy in the list of strategies that has the best performance on the given graph, and return it. Hint: define a way to compare strategies and use minimumBy.

[link] Review #1. Programming Language Concepts

This section contains a collection of review problems going over all the course material. These problems are an accurate representation of the kinds of problems you may see on an in-class exam.

Below is a breakdown of what concepts you should udnerstand and small-scale problems you should be able to solve.

mathematical definitions involved in the definition of a programming language
- strings
- token sequences
- concrete syntax
  - BNF notation
    - terminals
    - non-terminals
    - cases
    - productions
    - non-recursive, recursive, left-recursive
- abstract syntax
  - variables
  - expressions
  - values/constants
  - statements
  - outputs/output sequences
- type system
  - type inference rules
    - typical type inference rule for function application
- operational semantics
  - inference rules
- abstract interpretation (i.e., restricted/abstract operational semantics measuring some dimensions of interest, often used in static analysis)
  - cost
  - memory
  - running time
software components (data structures, and algorithms) involved in an implementation of a programming language
- data structure
  - string
  - token sequence
  - abstract syntax tree
  - intermediate representation abstrac syntax tree
  - machine language instruction sequence
  - bytecode instruction sequence
  - heap
  - stack
  - environment/context
  - substitution
- algorithms and software components
  - tokenizer
  - parser
    - predictive recursive descent parser
    - backtracking recursive descent parser
  - static analysis algorithms (always terminate)
    - abstract interpreter (cost, size, memory, time, etc.)
    - type checker
  - optimization algorithms
    - constant folding (i.e., evaluation of subexpressions with no variables)
    - loop unrolling
  - interpreter
    - evaluation algorithm
    - execution algorithm
  - unification
    - substitution algorithm
    - pattern matching unification algorithm
  - compiler (always terminates)
  - run-time system/virtual machine
  - garbage collector
  - testing/validation
    - bounded exhaustive testing input generator
    - validation algorithm (checks that two components have the same input-output behavior)
imperative and procedural programming langauge concepts
- control flow
- procedures
  - how these are implemented by an interpreter
  - how these are supported by a compiler to a machine language (stack)
declarative and functional programming language concepts and techniques
- how to solve equations involving data values using pattern matching unification
- models/modules and queries
- how can pattern matching unification...
  - support matching of function argument patterns and input values
- algebraic data types
  - constructors (base cases and inductive constructors with child trees)
  - defining non-recursive and recursive functions that operate on algebraic data types
- kinds of polymorphism
  - ad hoc polymorphism
  - parametric polymorphism
- higher order functions
  - types of higher-order functions
  - how to define and apply higher-order functions
  - how to compose functions
- infinite data structures and applications
  - infinite lists, trees, decision trees, state space graphs
  - traversing/searching a state space graph
- folds, maps, and filters
  - using list comprehensions
  - folds on lists and trees
  - defining fold functions
  - using folds to perform aggregate computations
  - using functions with no names (lambda abstractions) in folds
miscellaneous tasks
- does a parser implement a BNF definition of a grammar?
- does an interpreter implementation conform to an operational semantics?
- does a type checker implementation conform to a type system definition?
- apply an operational semantics or run an interpreter on a given program
- apply type inference rules or run a type checker on a given program
- explain the purpose of a definition or algorithm, or how multiple definitions and algorithms can fit together

[link]

Exercise: Suppose you are given the following type definitions for representing simple English sentences:

type Noun = String
type Adjective = String
type Verb = String
type Adverb = String

data NounPhrase = AN Adjective NounPhrase | N Noun
data VerbPhrase = AV Adverb VerbPhrase | V Verb
data Sentence = S NounPhrase VerbPhrase

Determine the minimal substitution that solves the following equation under pattern-matching unification, or explain why no solution can exist (by demonstrating what base case causes an issue):
S (AN a (N "cat")) (V b)
=
S (AN "yellow" (N "cat")) (V "runs")

The minimal substitution is:
{a ↦ "yellow", b ↦ "runs"}
Determine the minimal substitution that solves the following equation under pattern-matching unification, or explain why no solution can exist (by demonstrating what base case causes an issue):
S (AN a (AN b (N "cat"))) (V c)
=
S (AN "yellow" (N "cat")) (V "runs")

There is no substitution because once unification reaches a recursive invocation for the following equation, it will fail:
AN b (N "cat")
=
N "cat"
Since the constructor AN can never be unified with the constructor N, there is no substitution that can be applied to both sides of the original equation to make both sides equivalent.
Given the above data type definition, is it possible to write a single Haskell function pattern that will match any sentence in which "cat" is the subject?
No, this is impossible because the pattern would need to match a tree of arbitrary depth, since the N leaf constructor can only occur at the botton of a NounPhrase tree. In other words, the pattern would need to look something like the following:
f (AN _ (AN _ ( ... (AN _ (N "cat"))))) = ...
Nothing like the above is supported by Haskell.

[link]

Exercise: Adjust the following grammar definition so that it accepts exactly the same set of token sequences, but is not left-recursive:

number n

::=

0 | 1 | ... | 9

− ( n )

n * n

We introduce a new production, and we nove to it all the cases in the production for n that begin with a terminal. We then replace all those cases in the production for n with a single case corresponding to the non-terminal for the new production, and we replace all the left-recursive non-terminals with the new non-terminal.

number n

::=

l * n

left l

::=

0 | 1 | ... | 9

− ( n )

[link]

Exercise: Suppose you are given the following grammar definition and operational semantics:

actor r

::=

foo | bar

action a

::=

change r | is r

is r ⇓ is r

change foo ⇓ is bar

change bar ⇓ is foo

According to the above operational semantics, to what should change foo evaluate?

According to the second inference rule, change foo should evaluate to is bar. Notice that the above inference rules could be implemented as an evaluation algorithm, e.g. in Haskell:

data Actor = Foo | Bar
data Action = Change Actor | Is Actor

eval (Is r ) = Is r
eval (Change Foo) = Is Bar
eval (Change Bar) = Is Foo

[link]

Exercise: Answer the following questions by drawing diagrams (your diagrams may need to incorporate self-loops).

Determine which of the following terms refer to data structures, and which terms refer to algorithms. Draw a flow chart incorporating all of the above components that demonstrates how they might interact in an actual implementation (there may be more than one correct answer, but the way they interact must be reasonable):

input file;
abstract syntax trees;
parser;
compiler;
type checker;
error message;
loop unroller;
machine instruction sequences.

input
file

loop
unroller

machine
instruction
sequences

⇓

⇑ ⇓

⇑

parser

⇒

abstract
syntax
trees

⇒

type
checker

⇒

abstract
syntax
trees

⇒

compiler

⇓

error
messages

Draw a flow chart incorporating all of the below components that demonstrates how they might interact in an actual implementation:

exhaustive case generator;
abstract syntax trees;
compiler;
interpreter;
machine instruction sequences;
simulator;
outputs;
output comparison function.

exhaustive
case
generator

⇓

abstract
syntax
trees

⇓

interpreter

compiler

⇒

machine
instruction
sequences

⇓

outputs

⇐

simulator

⇓

output
comparison
function

[link] Final. Programming Language Concepts

This material is no longer available.

[link] Appendix A. Using gsubmit

In this course, you will submit your assignments using gsubmit. This section reproduces and extends some of the instructions already made available by the BU Computer Science Department.

[link] A.1. Register for a CS account

You must obtain a CS account to use the csa machines maintained by the CS Dept. You will need to physically visit the undergraduate computing lab located at 730 Commonwealth Avenue, on the third floor in room 302.

[link] A.2. Download SSH/SCP client software

You will need an SCP or SFTP client (such as WinSCP for Windows or CyberDuck for OS X) to copy files from your local computer to your csa home directory. If you are using Windows, you will also need an SSH client (such as PuTTY).

[link] A.3. Submitting assignments using gsubmit

A typical workflow can be described as follows.

You assemble your assignment solution file(s) on your own computer or device.

local
device

hw1

hw1.py

your csa2/csa3
home directory

your gsubmit
directory for CS 320

You log into csa2 or csa3 using an SCP or SSH client and create a directory for your submission in your CS account home directory. Note that in the examples below %> represents a terminal prompt, which may look different on your system.

%> cd ~
%> mkdir hw1

local
device

hw1

hw1.py

your csa2/csa3
home directory

hw1

your gsubmit
directory for CS 320

If you have not already done so (e.g., if you were using an SSH client in the previous step), you log into csa2 or csa3 using an SCP client and copy your completed file(s) into that directory.

local
device

hw1

hw1.py

⇒

your csa2/csa3
home directory

hw1

hw1.py

your gsubmit
directory for CS 320

If you have not already done so, you log into csa2 or csa3 using an SSH client and run the gsubmit commands to copy the files from your CS account home directory to the gsubmit directories to which the course staff has access.

%> cd ~
%> gsubmit cs320 hw1

local
device

hw1

hw1.py

⇒

your csa2/csa3
home directory

hw1

hw1.py

⇒

your gsubmit
directory for CS 320

hw1

hw1.py

To view your submitted files, you can use the following command:

%> gsubmit cs320 -ls

To look at a file that has already been submitted, you can use:

%> gsubmit cs320 -cat hw1/hw1.py

After grades are posted (normally, this will be announced on the mailing list and in lecture), you can check your grade using:

%> gsubmit cs320 -cat grade.hw1.txt

[link] Appendix B. Python

The Python programming language will be among the languages we use in this course. This language supports the object-oriented, imperative, and functional programming paradigms, has automatic memory managememt, and natively supports common high-level data structures such as lists and sets. Python is often used as an interpreted language, but it can also be compiled.

[link] B.1. Obtaining Python

The latest version of Python 3 can be downloaded at: https://www.python.org/downloads/. In this course, we will require the use if Python 3, which has been installed on all the CS Department's undergraduate computing lab machines, as well as on csa2/csa3.

[link] B.2. Assembling a Python module

The simplest Python program is a single file (called a module) with the file extension .py. For example, suppose the following is contained within a file called example.py:

# This is a comment in example.py.
# Below is a Python statement.
print("Hello, world.")

Assuming Python is installed on your system, to run the above program from the command line you can use the following (you may need to use python3, python3.2, python3.3, etc. depending on the Python installation you're using). Note that in the examples below %> represents a terminal prompt, which may look different on your system.

%> python example.py
Hello, world.

If you run Python without an argument on the command line, you will enter Python's interactive prompt. You can then evaluate expressions and execute individual statements using this prompt; you can also load and execute a Python module file:

%> python
Python 3.2 ...
Type "help", "copyright", "credits" or "license" for more information.
>>> exec(open("example.py").read()) # Load example.py module.
Hello, world.
>>> x = "Hello." # Execute an assignment statement.
>>> print(x)     # Execute a print statement.
Hello.
>>> x            # Evaluate a string expression.
'Hello.'
>>> 1 + 2        # Evaluate a numerical expression.
3

[link] B.3. Common data structures (i.e., Python expressions)

Python provides native support for several data structures that we will use throughout this course: integers, strings, lists, tuples, sets, and dictionaries (also known as finite maps). In this subsection, we present how instances of these data structures are represented in Python, as well as the most common operations and functions that can be applied to these data structure instances.

Booleans consist of two constants: True and False.
- The usual logical operations are available using the operators and, or, and not.
>>> True                                      # A boolean constant.
True
>>> False                                     # A boolean constant.
False
>>> True and False or True and (not False)    # A boolean expression.
True
Integers are written as in most other programming languages (i.e., as a sequence of digits).
- The usual arithmetic operations are available using the operators +, *, -, and /. The infix operator // represents integer division, and the infix operators ** represents exponentiation. Negative integers are prefixed with the negation operator -.
- The usual relational operators ==, !=, <, >, <=, >= are available.
- The int() function can convert a string that looks like an integer into an integer.
>>> 123                                       # An integer constant.
123
>>> 1 * (2 + 3) // 4 - 5                      # An integer expression.
-4
>>> 4 * 5 >= 19                               # A boolean expression involving integers.
True
>>> int("123")                                # A string being converted into an integer
123
Strings are delimited by either ' or " characters. Strings can be treated as lists of single-character strings. Another way to look at this is that there is no distinction between a character and a string: all characters are just strings of length 1. Multiline strings can be delimited using """ or ''' (i.e., three quotation mark characters at the beginning and end of the string literal).
- The empty string is denoted using '' or "".
- Two strings can be concatenated using +.
- The function len() returns the length of a string.
- Individual characters in a string can be accessed using the bracketed index notation (e.g., s[i]). These characters are also strings themselves.
>>> 'Example.'                                # A string.
'Example.'
>>> "Example."                                # Alternate notation for a string.
'Example.'
>>> len("ABCD")                               # String length.
4
>>> "ABCD" + "EFG"                            # String concatenation.
'ABCDEFG'
>>> "ABCD"[2]                                 # Third character in the string.
'C'
Lists are similar to arrays: they are ordered sequences of objects and/or values. The entries of a list can be of a mixture of different types, and lists containing one or more objects are delimited using [ and ], with the individual list entries separated by commas. Lists cannot be members of sets.
- The empty list is denoted using [].
- Two lists can be concatenated using +.
- The function len() returns the length of a list.
- Individual entries in a list can be accessed using the bracketed index notation (e.g., a[i]).
- To check if a value is in a list, use the in relational operator.
>>> [1,2,"A","B"]                             # A list.
[1, 2, 'A', 'B']
>>> [1, 2] + ['A','B']                        # Concatenating lists.
[1, 2, 'A', 'B']
>>> len([1,2,"A","B"] )                       # List length.
4
>>> [1,2,"A","B"][0]                          # First entry in the list.
1
>>> 1 in [1, 2]                               # List containment check.
True
Tuples are similar to lists (they are ordered, and can contain objects of different types), except they are delimited by parentheses ( and ), with entries separated by commas. The main distinction between lists and tuples is that tuples are hashable (i.e., they can be members of sets).
- The empty tuple is denoted using ().
- A tuple containing a single object x is denoted using (x, ).
- Two tuples can be concatenated using +.
- A tuple can be turned into a list using the list() function.
- A list can be turned into a tuple using the tuple() function.
- The function len() returns the length of a tuple.
- Individual entries in a tuple can be accessed using the bracketed index notation (e.g., t[i]).
- To check if a value is in a tuple, use the in relational operator.
>>> (1,2,"A","B")                             # A tuple.
(1, 2, 'A', 'B')
>>> (1,)                                      # Another tuple.
(1,)
>>> (1, 2) + ('A','B')                        # Concatenating tuples.
(1, 2, 'A', 'B')
>>> list((1, 2, 'A','B'))                     # A tuple being converted into a list.
[1, 2, 'A', 'B']
>>> tuple([1, 2, 'A','B'])                    # A list being converted into a tuple.
(1, 2, 'A', 'B')
>>> len((1,2,"A","B"))                        # Tuple length.
4
>>> (1,2,"A","B")[0]                          # First entry in the tuple.
1
>>> 1 in (1, 2)                               # Tuple containment check.
True
Sets are unordered sequences that cannot contain duplicates. They are a close approximation of mathematical sets. Sets cannot be members of sets.
- The empty set is denoted using set().
- The methods .union() and .intersect correspond to the standard set operations.
- A list or tuple can be turned into a set using the set() function.
- A set can be turned into a list or tuple using the list() or list() function, respectively.
- The function len() returns the size of a set.
- To access individual entries in a set, it is necessary to turn the set into a list or tuple.
- To check if a value is in a set, use the in relational operator.
>>> {1,2,"A","B"}                             # A set.
{1, 2, 'A', 'B'}
>>> ({1,2}.union({3,4})).intersection({4,5})  # Set operations.
{4}
>>> set([1, 2]).union(set(('A','B')))         # Converting a list and a tuple to sets.
{'A', 1, 2, 'B'}
>>> len({1,2,"A","B"})                        # Set size.
4
>>> 1 in {1,2,"A","B"}                        # Tuple containment check.
True
Frozen sets are like sets, except they can be members of other sets. A set can be turned into a frozen set using the frozenset() function.

>>> frozenset({1,2,3})                        # A frozen set.
frozenset({1, 2, 3})
>>> {frozenset({1,2}), frozenset({3,4})}      # Set of frozen sets.
{frozenset({3, 4}), frozenset({1, 2})}
Dictionaries are unordered collections of associations between some set of keys and some set of values. Dictionaries are also known as finite maps.
- The empty dictionary is denoted using {}.
- The list of keys that the dictionary associates with values can be obtained using list(d.keys()).
- The list of values that the dictionary contains can be obtained using list(d.values()).
- The function len() returns the number of entries in the dictionary.
- Individual entries in a dictionary can be accessed using the bracketed index notation (e.g., d[key]).
>>> {"A":1, "B":2}                            # A dictionary.
{'A': 1, 'B': 2}
>>> list({"A":1, "B":2}.keys())               # Dictionary keys.
['A', 'B']
>>> list({"A":1, "B":2}.values())             # Dictionary values.
[1, 2]
>>> len({"A":1, "B":2})                       # Dictionary size.
2
>>> {"A":1, "B":2}["A"]                       # Obtain a dictionary value using a key.
1

[link] B.4. Function, procedure, and method invocations

Python provides a variety of ways to supply parameter arguments when invoking functions, procedures, and methods.

Function calls and method/procedure invocations consist of the function, procedure, or method name followed by a parenthesized, comma-delimited list of arguments. For example, suppose a function or procedure example() is defined as follows:

def example(x, y, z):
  print("Invoked.")
  return x + y + z

To invoke the above definition, we can use one of the following techniques.
- Passing arguments directly involves listing the comma-delimited arguments directly between parentheses.
  
  >>> example(1,2,3)
  Invoked.
  6
- The argument unpacking operator (also known as the *-operator, the scatter operator, or the splat operator) involves providing a list to the function, preceded by the * symbol; the arguments will be drawn from the elements in the list.
  
  >>> args = [1,2,3]
  >>> example(*args)
  Invoked.
  6
- The keyword argument unpacking operator (also known as the **-operator) involves providing a dictionary to the function, preceded by the ** symbol; each named paramter in the function definition will be looked up in the dictionary, and the value associated with that dictionary key will be used as the argument passed to that parameter.
  
  >>> args = {'z':3, 'x':1, 'y':2}
  >>> example(**args)
  Invoked.
  6
Default parameter values can be specified in any definition. Suppose the following definition is provided.

def example(x = 1, y = 2, z = 3):
  return x + y + z

The behavior is then as follows: if an argument corresponding to a parameter is not supplied, the default value found in the definition is used. If an argument is supplied, the supplied argument value is used.

>>> example(0, 0)
3
>>> example(0)
5
>>> example()
6

[link] B.5. Comprehensions

Python provides concise notations for defining data structures and performing logical computations. In particular, it support a comprehension notation that can be used to build lists, tuples, sets, and dictionaries.

List comprehensions make it possible to construct a list by iterating over one or more other data structure instances (such as a list, tuple, set, or dictionary) and performing some operation on each element or combination of elements. The resulting list will contain the result of evaluating the body for every combination.

>>> [ x for x in [1,2,3] ]
[1, 2, 3]
>>> [ 2 * x for x in {1,2,3} ]
[2, 4, 6]
>>> [ x + y for x in {1,2,3} for y in (1,2,3) ]
[2, 3, 4, 3, 4, 5, 4, 5, 6]

It is also possible to add conditions anywhere after the first for clause. This will filter which combinations are actually used to add a value to the resulting list.

>>> [ x for x in {1,2,3} if x < 3 ]
[1, 2]
>>> [ x + y for x in {1,2,3} for y in (1,2,3) if x > 2 and y > 1 ]
[5, 6]
Set comprehensions make it possible to construct a set by iterating over one or more other data structure instances (such as a list, tuple, set, or dictionary) and performing some operation on each element or combination of elements. The resulting list will contain the result of evaluating the body for every combination. Notice that the result will contain no duplicates because the result is a set.

>>> { x for x in [1,2,3,1,2,3] }
{1, 2, 3}
Dictionary comprehensions make it possible to construct a dictionary by iterating over one or more other data structure instances (such as a list, tuple, set, or dictionary) and performing some operation on each element or combination of elements. The resulting dictionary will contain the result of evaluating the body for every combination.

>>> { key : 2 for key in ["A","B","C"] }
{'A': 2, 'C': 2, 'B': 2}

[link] B.6. Other useful built-in functions

The built-in function type() can be used to determine the type of a value. Below, we provide examples of how to check whether a given expression has one of the common Python types:

>>> type(True) == @bool
True
>>> type(123) == int
True
>>> type("ABC") == str
True
>>> type([1,2,3]) == list
True
>>> type(("A",1,{1,2})) == tuple
True
>>> type({1,2,3}) == set
True
>>> type({"A":1, "B":2}) == dict
True

[link] B.7. Common Python definition and control constructs (i.e., Python statements)

A Python program is a sequence of Python statements. Each statement is either a function definition, a variable assignment, a conditional statement (i.e., if, else, and/or elif), an iteration construct (i.e., a for or while loop), a return statement, or a break or continue statement.

Variable assignments make it possible to assign a value or object to a variable.

x = 10

It is also possible to assign a tuple (or any computation that produces a tuple) to another tuple:

(x, y) = (1, 2)
Function and procedure definitions consist of the def keyword, followed by the name of the function or procedure, and then by one or more arguments (delimited by parentheses and separated by commas).

def example(a, b, c):
return a + b + c
Conditional statements consist of one or more branches, each with its own boolean expression as the condition (with the exception of else). The body of each branch is an indented sequence of statements.

def fibonacci(n):
    # Computes the nth Fibonacci number.
    if n <= 0:
        return 0
    elif n <= 2:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
Iteration constructs make it possible to repeat a sequence of statements over and over. The body of an iteration construct is an indented sequence of statements.
- The while construct has a boolean expression as its condition (much like if). The body is executed over and over until the expression in the condition evaluates to False, or a break statement is encountered.
  
  def example1(n):
      # Takes an integer n and returns the sum of
      # the integers from 1 to n-1.
      i = 0
      sum = 0
      while i < n:
          sum = sum + i
          i = i + 1
      return sum
  
  def example2(n):
      # Takes an integer n and returns the sum of
      # the integers from 1 to n-1.
      i = 0
      sum = 0
      while True:
          sum = sum + i
          i = i + 1
          if i == n:
              break
      return sum
- The for construct makes it possible to repeat a sequence of statements once for every object in a list, tuple, or set, or once for every key in a dictionary.
  
  def example3(n):
      # Takes an integer n and returns the sum of
      # the integers from 1 to n-1.
      sum = 0
      for i in range(0,n):
          sum = sum + i
      return sum
  
  def example4(d):
      # Takes a dictionary d that maps keys to
      # integers and returns the sum of the integers.
      sum = 0
      for key in d:
          sum = sum + d[key]
      return sum

[link] Appendix C. Haskell

The Haskell programming language will be among the languages we use in this course. This language supports the functional programming paradigm, has automatic memory managememt, and natively supports algebraic data types. Haskell is both an interpreted and compiled language.

[link] C.1. Obtaining Haskell

The latest version of Haskell can be downloaded with the Haskell Platform: http://www.haskell.org/platform/. The Haskell Platform has been installed on all the CS Department's undergraduate computing lab machines, as well as on csa2/csa3.

[link] C.2. Haskell modules and declarations

Haskell code is organized into modules. There is one named module per Haskell file, and the file name should match the module name (e.g., the file Example.hs should contain the Example module definition). The module body consists of a series of declarations, which can appear in any order; the only exception is that if there are multiple declarations for a function, they must all be grouped together.

module Example where

g (x) = f(x) + f(x);

f (0) = 0;
f (x) = f(x - 1) + 1;

-- This is a comment.

Each declaration in a Haskell module falls into one of two categories: an expression-level declaration, or a type-level declaration. Any declaration that defines variables or functions is an expression-level declaration; any declaration that defines a type is a type-level declaration.

-- Expression-level declaration (defines a new function called "f").
f(x) = 2 * x

-- Type-level declaration (defines a new data type called "Tree".
data Tree = Leaf | Node Tree Tree

[link] C.3. Common data structures (i.e., Haskell expressions)

The Haskell base library (sometimes also called the "Prelude", and imported implicitly and automatically into every module) has a large number of standard and conventional data type and function declarations. These declarations provide native support for a variety of data structures and operations that we will use throughout this course: integers, strings, lists, and tuples, among others. In this subsection, we present how instances of these data structures are represented in Haskell, as well as the most common operations and functions that can be applied to these data structure instances.

The Bool data type supports two constructors: True and False.
- The usual logical operations are available using the operators &&, ||, and not.
- The built-in syntax if ... then ... else ... is also available (similar to Python's ... if ... else ... and C/C++/Java's ternary operator ... ? ... : ....
*> True
True
*> False
False
*> True || False && True && (not False)
True
*> if True then "Yes" else "No"
"Yes"
Integers are written as in most other programming languages (i.e., as a sequence of digits). Haskell supports a bounded-size type Int and an unbounded-size type Integer.
- The usual arithmetic operations are available using the operators +, *, -, /, and ^. The Prelude function div represents integer division. The Prelude functions min and max make it possible to compute the minimum and maximum of two integers, respectively, and the function minimum and maximum make it possible to compute the minimum and maximum of a list of integers. Negative integers are prefixed with the negation operator -.
- The usual relational operators ==, /= (not equal), <, >, <=, >= are available.
*> 123
123
*> 1 * (2 + 3) `div` 4 - 5
-4
*> 4 * 5 >= 19
True
*> max 10 20
20
*> minimum [1,2,3]
1
Strings are delimited by " characters (individual characters are always delimited using '). Strings can be treated as lists of characters.
- The empty string is denoted using "".
- Two strings can be concatenated using ++.
- The Preludefunction length returns the length of a string.
*> "Example."        -- A string.
"Example."
*> length "ABCD"     -- String length.
4
*> "ABCD" ++ "EFG"   -- String concatenation.
"ABCDEFG"
Lists are ordered sequences of expressions. The entries of a list must all be of the same type. Lists are delimited using [ and ], with the individual list entries separated by commas.
- The empty list is denoted using [].
- Elements can be added to the front of a list using :.
- Two lists can be concatenated using ++.
- The Prelude function length returns the length of a list.
- To check if a value is in a list, use the Prelude function elem (note that this relies on the == operation being defined on the type of the elements in the list).
*> [1,2,3]            -- A list.
[1,2,3]
*> 1 : [2,3]          -- Adding an element to the front.
[1,2,3]
*> [1,2,3] ++ [4,5]   -- Concatenating lists.
[1,2,3,4,5]
*> length [1,2,3,4]   -- List length.
4
*> 1 `elem` [1, 2]    -- List containment check.
True
Tuples are ordered, and can contain expressions of different types. They are delimited by parentheses ( and ), with entries separated by commas.
- The empty tuple is denoted using ().
- There are no tuples containing a single element; (e) is equivalent to e for any Haskell expression e.
*> (1,2,3) -- A tuple.
(1,2,3)