Data Mechanicsfor Pervasive Systems and Urban Applications

1. Introduction, Background, and Motivation

1.1. Overview

With over half of the world's population living in cities [#], and given the possibility that cities are a more efficient [#] way to organize and distribute resources and activities, the rigorous study of cities as systems that can be modeled mathematically is a compelling proposition. With the advent of pervasive infrastructures of sensors and computational resources in urban environments (smart cities, software-defined cities, and the Internet of Things), there is a potential to inform and validate such models using actual data, and to exploit them using the interoperability of the pervasive computational infrastructure that is available.
In this course, we introduce and define the novel term data mechanics to refer specifically to the study of a particular aspect of large, instrumented systems such as cities: how data can flow through institutions and computational infrastructures, how these flows can interact, integrate, and cleave, and how they can inform decisions and operations (in both online and offline regimes). We choose the term mechanics specifically because it connotes the study of mechanics (e.g., classical mechanics) in physics, where universal laws that govern the behavior and interactions between many forms of matter are studied. We also choose this term to emphasize that, often, the data involved will be tied to the physical environment: geospatial and temporal information is a common and integral part of data sets produced and consumed by instrumented environments.

1.2. Data Mechanics Repository and Platform

This course is somewhat unusual in that there are specific, concrete software development goals towards which the staff and students are working. In particular, one goal of the course is to construct a new general-purpose service platform for collecting, integrating, analyzing, and employing (both in real time and offline) real data sets derived from urban sensors and services (particularly those in the city of Boston). We will consider several real data sets and data feeds, including: We will also secure additional data sets from the above sources as well as a few others. In particular, some nationwide data repositories may be worth considering: For the purposes of student projects, there are no limits on other sources of data or computation (e.g., Twitter, Mechanical Turk, and so on) that can be employed in conjunction with some of the above.
Because this course involves the construction of a relatively large software application infrastructure, we will have the opportunity to introduce and practice a variety of standard software development and software engineering concepts and techniques. This includes source control, collaboration, documentation, modularity and encapsulation, testing and validation, inversion of control, and others. Students will also need to become familiar with how to use web service APIs used by government organizations (e.g., Socrata) to make queries and retrieve data.
The overall architecture of the service platform will have at least the following:
  • a database/storage backend ("repository") that houses:
    • original and derived data sets, with annotations that include:
      • from where, when, and by what algorithm it was retrieved
      • using what integration algorithms it was derived
    • algorithms for data retrieval or integration, with references that include:
      • when it was written and by whom
      • in what data sets it is derived
      • from what component algorithms it is composed (if it is such)
  • a web service ("platform") with an application program interface (API) for running analysis and optimization algorithms:
    • a defined language for defining analysis and optimization algorithms over the data stored in the repository
    • an interface for submitting and running algorithms
  • other features for data and result visualization, simulation using partial data, etc.

1.3. Mathematical Modeling, Analysis Algorithms, and Optimization Techniques

There are a variety of problems that it may be possible to address using the data in the repository and the capabilities of the platform. This course will cover a variety of useful online and offline optimization topics in a mathematically rigorous but potentially application-specific way, including:
  • a defined language for defining analysis and optimization algorithms over the data stored in the repository,
  • dual decomposition,
  • online optimization.
The goal is to apply some of these techniques to the data sets(including integrated or derived data sets) and solve practical problems. Some of the problems raised in discussions with the City of Boston DoIT and MassDOT teams are:
  • characterizing intersections and coming up with a metric that incorporates:
    • intersection throughput (people per unit time),
    • modes of transportation (public transport, biking, walking),
    • intersection safety (vehicle speed, accidents, and so on),
    • intersection organization (no left turns, and so on);
  • characterizing streets and deriving metrics for:
    • number of parking spaces,
    • probability of an accident (e.g., using different modes of transportation),
    • senior and handicapped mobility;
  • characterizing neighborhoods:
    • economic condition and gentrification,
    • senior and handicapped accessibility;
  • how to allocate resources to optimize some of the metrics above:
    • where to perform repairs,
    • how to improve housing affordability,
    • where to place bike racks of ride sharing stations,
    • senior and handicapped accessibility;
  • answering immediate questions relevant to an individual (e.g., in an app):
    • is a building accessible,
    • is there a place to park (or will there likely be at some other time),
    • is a neighborhood safe (or is it becoming less or more safe),
    • is a neighborhood affordable (or is it becoming less or more affordable),
    • are healthcare services nearby.
  • integrating public transportation data with other data (e.g., Waze):
    • why are buses on a certain route late,
    • performance and problems during unexpected events (e.g., snow).
The above list is far from complete, and we will update it as the course progresses. Students are encouraged to discuss project ideas with faculty and one another.

2. Modeling Data and Data Transformations

To study rigorously the ways data can behave and interact within an infrastructure that generates, transforms, and consumes data (e.g., to make decisions), it is necessary to define formally what data and data transformations are. One traditional, widely used, and widely accepted model for data is the relational model: any data set is a relation (i.e., a subset of a product of sets), and transformations on data are functions between relations. While the relational model is sufficient to define any transformation on data sets, the MapReduce paradigm is one modern framework for defining transformations between data sets.
In modern contexts and paradigms, these models can be useful when studying relatively small collections of individual, curated data sets that do not change dramatically in the short term. However, these alone are not sufficient in a context in which data sets are overwhelmingly multitudinous, varying in structure, and continuously being generated, integrated, and transformed. One complementary discipline that can provide useful tools for dealing with numerous interdependent data sets is that of data provenance or data lineage. The provenance of a data set (or subset) is a formal record of its origin, which can include how the data was generated, from what data or other sources it was created or derived, what process was used or was responsible for creating or deriving it, and other such information. This information can be invaluable for a variety of reasons beyond the obvious ones (i.e., the origin of the data), such as:
  • the same data be generated again or reproduced if an error occurs,
  • a data set can be updated from the original source if the source has been updated or changed,
  • the source of an inconsistency or aberration of data can be investigated,
  • any of the above could be applied to a subset because recomputing or investigating the entire data set would be prohibitively costly.

2.1. Relational Data and the MapReduce Paradigm

The relational model for data can be expressed in a variety of ways: a data set is a relation on sets, a logical predicate governing terms, a collection of tuples or records with fields and values, a table of rows with labelled columns, and so on. Mathematically, they are all equivalent. In this course, we will adopt a particular model because it is well-suited for the tools and paradigms we will employ, and because it allows for one fairly clean mathematical integration of the study of relational data and data provenance.
A data set (also known as a store or database) is a multiset R: a collection (possibly with duplicates) of tuples of the form (x1,...,xn) taken from the set product X1 × ... × Xn. Typically, some distinguished set (e.g., the left-most in the set product) will be a set of keys, so that every tuple contains a key. Whether a set is a key or not often depends on the particular paradigm and context, however.
A data transformation T: A B is a mapping from one space of data sets A to another space of data sets B. Notice that an individual data set S (i.e., a relation or, equivalently, a set of tuples) is just an element of A.
Some of the typical building blocks for data transformations in the relational model are:
  • union and difference (intersection can also be defined in terms of these),
  • projection (sometimes generalized into extended projection),
  • selection (filtering),
  • renaming,
  • Cartesian product,
  • variants of join operations (many can be constructed using the above, but other variants have been added as extensions),
  • aggregation (an extension).
One common operation on relations that is not possible to express in traditional formulations but found in some relational database systems is the transitive closure of a data set. Normally, this requires an iterative process consisting of a sequence of join operations.
We can model and illustrate the transformations that constitute the MapReduce paradigm using Python. Note that selection and projection can be implemented directly using Python comprehensions, but we define wrappers below for purposes of illustration.

def union(R, S):
    return R + S

def difference(R, S):
    return [t for t in R if t not in S]

def intersect(R, S):
    return [t for t in R if t in S]

def project(R, p):
    return [p(t) for t in R]

def select(R, s):
    return [t for t in R if s(t)]
def product(R, S):
    return [(t,u) for t in R for u in S]

def aggregate(R, f):
    keys = {r[0] for r in R}
    return [(key, f([v for (k,v) in R if k == key])) for key in keys]
We consider a few simple examples that illustrate how transformations can be constructed within the relational model. We start by showing how select can be used with a predicate to filter a data set.

>>> def red(t): return t == 'tomato'
>>> select(['banana', 'tomato'], red)
Suppose we have two data sets and want to join them on a common field. The below sequence illustrates how that can be accomplished by building up the necessary expression out of simple parts.

>>> X = [('Alice', 22), ('Bob', 19)]
>>> Y = [('Alice', 'F'), ('Bob', 'M')]

>>> product(X,Y)
[(('Alice', 'F'), ('Alice', 22)), (('Alice', 'F'), ('Bob', 19)), (('Bob', 'M'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> select(product(X,Y), lambda t: t[0][0] == t[1][0])
[(('Alice', 'F'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> project(select(product(X,Y), lambda t: t[0][0] == t[1][0]), lambda t: (t[0][0], t[0][1], t[1][1]))
[('Alice', 'F', 22), ('Bob', 'M', 19)]
Finally, the sequence below illustrates how we can compute an aggregate value for each unique key in a data set (such as computing the total age by gender).

>>> X = [('Alice', 'F', 22), ('Bob', 'M', 19), ('Carl', 'M', 25), ('Eve', 'F', 27)]

>>> project(X, lambda t: (t[1], t[2]))
[('F', 22), ('M', 19), ('M', 25), ('F', 27)]

>>> aggregate(project(X, lambda t: (t[1], t[2])), sum)
[('F', 49), ('M', 44)]
The following sequence explains what is happening inside the aggregate function for a particular key.

>>> Y = project(X, lambda t: (t[1], t[2]))
>>> keys = {t[0] for t in Y}
>>> keys
{'F', 'M'}
>>> [v for (k,v) in Y if k == 'F']
[22, 27]
>>> sum([v for (k,v) in Y if k == 'F'])
>>> ('F', sum([v for (k,v) in Y if k == 'F']))
('F', 49)
Consider the following line of Python code that operates on a data set D containing some voting results broken down by voter, state where the voter participated in voting, and the candidate they chose:

R = sum([1 for (person, state, candidate) in D if state == "Massachusetts" and candidate == "Trump"])
We can identify which building blocks for transformations available in the relational model are being used in the above code; we can also draw a corresponding flow diagram that shows how they fit together.
!table([ ['r:select``D', 'r:project``...', 'r:agg sum``...', 'R'] ])
We can also rewrite the above using the building blocks for transformations in the relational model.

R = aggregate(project(select(D, lambda psc: if psc[1] == "Massachusetts" and psc[2] == "Trump"), lambda psc: 1), sum)
Notice that the order of the operations is important. If we tried to do the projection before the selection, the information that we are using to perform the selection would be gone after the projection. On the other hand, two selection (one immediately followed by another) can be done in any order. This is one reason why it makes sense to call the paradigm a relational algebra: there are algebraic laws that govern how the operations can be arranged.
In the MapReduce paradigm, a smaller set of building blocks inspired by the functional programming paradigm (supported by languages such as ML and Haskell) exist for defining transformations between data sets. Beyond adapting (with some modification) the map and reduce (a.k.a., "fold") functions from functional programming, the contribution of the MapReduce paradigm is the improvement in the performance of these operations on very large distributed data sets. Because of the elegance of the small set of building blocks, and because of the scalability advantages under appropriate circumstances, it is worth studying the paradigm's two building blocks for data transformations: map and reduce:
  • a map operation will apply some user-specified computation to every tuple in the data set, producing one or more new tuples,
  • a reduce operation will apply some user-specified aggregation computation to every set of tuples having the same key, producing a single result.
Notice that there is little restriction on the user-specified code other than the requirement that it be stateless in the sense that communication and coordination between parallel executions of the code is impossible. It is possible to express all the building blocks of the relational model using the building blocks of the MapReduce paradigm.
We can model and illustrate the two basic transformations that constitute the MapReduce paradigm in a concise way using Python.

def map(f, R):
    return [t for (k,v) in R for t in f(k,v)]
def reduce(f, R):
    keys = {k for (k,v) in R}
    return [f(k1, [v for (k2,v) in R if k1 == k2]) for k1 in keys]
A map operation applies some function f to every key-value tuple and produces zero or more new key-value tuples. A reduce operation collects all values under the same key and performs an aggregate operation f on those values. Notice that the operation can be applied to any subset of the tuples in any order, so it is often necessary to use an operation that is associated and commutative.
At the language design level, the relational model and the MapReduce paradigm are arguably complementary and simply represent different trade-offs; they can be used in conjunction on data sets represented as relations. Likewise, implementations of the two represent different performance trade-offs. Still, contexts in which they are used can also differ due to historical reasons or due to conventions and community standards.
We can use the Python implementations of the map and reduce operations in an example above to implement some common transformations in the relational model. For this example, we assume that the first field in each tuple in the data set is a unique key (in general, we assume there is a unique key field if we are working with the MapReduce paradigm). We illustrate how projection and aggregation can be implemented in the code below.

R = [('Alice', ('F', 23)), ('Bob', ('M', 19)), ('Carl', ('M', 22))]

# Projection keeps only gender and age.
# The original key (the name) is discarded.
X = map(lambda k,v: [(v[0], v[1])], R)

# Aggregation by the new key (i.e., gender).
Y = reduce(lambda k,vs: (k, sum(vs)), X) 
Selection is not straightforward to implement because this is not a common use case for MapReduce workflows. However, it is possible by treating each tuple in the input data set as its own key and then keeping on the keys at the end.

R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]

X = map(lambda k,v: [((k,v), (k,v))] if v > 20 else [], R) # Selection.
Y = reduce(lambda k,vs: k, X) # Keep same tuples (use tuples as unique keys).
We can also perform a simple join operation, although we also need to "combine" the two collections of data. This particular join operation is also simple because each tuple in the input is only joined with one other tuple. Join operations were also not originally envisioned as a common use case for MapReduce workflows.

R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
S = [('Alice', 'F'), ('Bob', 'M'), ('Carl', 'M')]

X =   map(lambda k,v: [(k, ('Age', v))], R)\
    + map(lambda k,v: [(k, ('Gender', v))], S)
Y = reduce(\
        lambda k,vs:\
          (k,(vs[0][1], vs[1][1]) if vs[0][0] == 'Age' else (vs[1][1],vs[0][1])),\
Suppose you have a data set containing tuples of the form (name, gender, age). You want to produce a result of the form (gender, total) where total is the sum of the age values of all tuples with the corresponding gender. The code below illustrates using Python how this can be done in the MapReduce paradigm.

INPUT = [('Alice', ('F', 19)),\
         ('Bob', ('M', 23)),\
         ('Carl', ('M', 20)),\
         ('Eve', ('F', 27))]
TEMP = map(lambda k,v: [(v[0], v[1])], INPUT)
OUTPUT = reduce(lambda k,vs: (k, sum(vs)), TEMP)
We provide an equivalent MapReduce paradigm implementation of the algorithm using MongoDB.

db.INPUT.insert({_id:"Alice", gender:"F", age:19});
db.INPUT.insert({_id:"Bob", gender:"M", age:23});
db.INPUT.insert({_id:"Carl", gender:"M", age:20});
db.INPUT.insert({_id:"Eve", gender:"F", age:27});

  function() {
    emit(this.gender, {age:this.age});
  function(k, vs) {
    var total = 0;
    for (var i = 0; i < vs.length; i++)
        total += vs[i].age;
    return {age:total};
  {out: "OUTPUT"}
Suppose we have a data set containing tuples of the form (name, city, age). We want to produce a result of the form (city, range) where range is defined as the difference between the oldest and youngest person in each city.
  • We can construct a sequence of transformations that will yield this result using the basic building blocks available in the relational model (projections, selections, aggregations, and so on).
  • Alternatively, we can use the MapReduce paradigm to construct a single-pass (one map operation and one reduce operation) MapReduce computation that computes the desired result. Hint: emit two copies of each entry (one for the computation of the maximum and one for the computation of the minimum).
One approach in the relational paradigm is to aggregate the minimum and maximum age for each city, negate the minimum ages, and aggregate once more to get the ranges.

NCA = [('Alice', 'Boston', 23), ('Bob', 'Boston', 19), ('Carl', 'Seattle', 25)]
CA = [(c,a) for (n,c,a) in NCA]
MIN = aggregate(CA, min)
MIN_NEG = [(c,-1*a) for (c,a) in MIN]
MAX = aggregate(CA, max)
RESULT = aggregate(union(MIN_NEG, MAX), sum)
Below is a flow diagram that represents the above transformations.
!table([ [null , null, 'r:proj``MIN', 'rd:union``MIN_NEG', null , null], ['r:proj``NCA', 'ru:aggmin`rd:agg max``CA', null , null , 'r:agg sum``...', 'RESULT'], [null , null, 'rru:union``MAX', null , null , null] ])
Using the MapReduce paradigm, we may prefer to follow the paradigm's two-stage breakdown of a computation by first converting every single entry in the input data set into an approximation of the range. For example, given only the record ('Alice', ('Boston', 23)), in the mapping stage we might estimate the range as ('Boston', (23, 23, 0)) where the second and third entries are the minimum and maximum "known so far" given the limited information (a single data point). Then, in the reduce stage, we would combine these estimates.

NCA = [('Alice', ('Boston', 23)), ('Bob', ('Boston', 19)), ('Carl', ('Seattle', 25))]
I = map(lambda k, v: [(v[0], (v[1], v[1], 0))], NCA)

def reducer(k, vs):
    age_lo = min([l for (l,h,r) in vs])
    age_hi = max([h for (l,h,r) in vs])
    age_ran = age_hi - age_lo 
    return (k, (age_lo, age_hi, age_ran))
RESULT = reduce(reducer, I)
Consider the following algorithm implemented as a collection of transformations drawn from the relational paradigm. The input data set consists of tuples of the form (intersection, date, time, cars). Each tuple represents a single accident took place at the specified intersection, occurred on the specified date and time, and involved the specified number of cars.

D = [('Commonwealth Ave./Mass Ave.', '2016-11-02', '11:34:00', 3), ...]
M = project(D, lambda idtc: [(idtc[0], idtc[3])])
R = aggregate(M, max)
H = [(i,d,t) for ((i,m), (j,d,t,c)) in product(R, D) if i==j and m==c]
A data flow diagram representing the transformations is provided below.
!table([ ['d:prod`r:proj``D', 'r:agg max``M', 'lld:prod``R'], ['r:select``...', 'r:proj``...', 'H'] ])
Complete the following tasks. To simplify the exercise, you may assume that there is at most one accident involving each possible quantity of cars (e.g., there is only one accident that involved five cars).
  • Write a description of what the algorithm computes. What does the output data set represent?
  • Provide an implementation of this algorithm using the MapReduce paradigm. You only need one map and one reduce operation.

2.2. Composing Transformations into Algorithms

Whether we are using the relation model or the MapReduce paradigm, the available building blocks can be used to assemble fairly complex transformations on data sets. Each transformation can be written either using the concrete syntax of a particular programming language that implements the paradigm, or as a data flow diagram that describes how starting and intermediate data sets are combined to derive new data sets over the course of the algorithm's operation.
We can use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the k-means clustering algorithm.

def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2

def plus(args):
    p = [0,0]
    for (x,y) in args:
        p[0] += x
        p[1] += y
    return tuple(p)

def scale(p, c):
    (x,y) = p
    return (x/c, y/c)

M = [(13,1), (2,12)]
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

OLD = []
while OLD != M:
    OLD = M

    MPD = [(m, p, dist(m,p)) for (m, p) in product(M, P)]
    PDs = [(p, dist(m,p)) for (m, p, d) in MPD]
    PD = aggregate(PDs, min)
    MP = [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
    MT = aggregate(MP, plus)

    M1 = [(m, 1) for (m, _) in MP]
    MC = aggregate(M1, sum)

    M = [scale(t,c) for ((m,t),(m2,c)) in product(MT, MC) if m == m2]
Below is a flow diagram describing the overall organization of the computation (nodes are data sets and edges are transformations). Note that the nodes labeled "..." are intermediate results that are implicit in the comprehension notation. For example, [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2] first filters product(MPD, PD) using a selection criteria if p==p2 and d==d2 and then performs a projection from tuples of the form ((m,p,d), (p2,d2)) to tuples of the form (m, p) to obtain the result.
!table([ ['rd:prod``P', null, null, 'd:agg min``PDs'], [null, 'r:proj``...', 'dr:prod`ru:proj``MPD', 'd:prod``PD'], ['ru:prod``M', null, null, 'd:selection``...'], ['u:proj``...', 'ld:prod``MC', 'l:agg sum``M1', 'd:proj``...'], ['u:selection``...', 'l:prod``MT', null, 'lu:proj`ll:agg plus``MP'] ])
We can use building blocks drawn from the MapReduce paradigm that are available in MongoDB to construct an implementation of the k-means clustering algorithm. This implementation illustrates many of the idiosyncrasies of the MapReduce abstraction made available by MongoDB.{ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);

function flatten(A) {
  db[A].find().forEach(function(a) { db[A].update({_id: a._id}, a.value); });

function prod(A, B, AB) {
  db[A].find().forEach(function(a) {
    db[B].find().forEach(function(b) {
      db[AB].insert({left:a, right:b});

function union(A, B, AB) {
  db[A].find().forEach(function(a) {
  db[B].find().forEach(function(b) {

function hash_means(M, HASH) {
    function() { emit("hash", {hash: this.x + this.y}); },
    function(k, vs) {
      var hash = 0;
      vs.forEach(function(v) {
        hash += v.hash;
      return {hash: hash};
    {out: HASH}

// We'll only perform a single product operation. Using map-reduce, we can perform
// argmax and argmin more easily. We can also use map-reduce to compare progress.


var iterations = 0;
do {
  // Compute an initial hash of the means in order to have a baseline
  // against which to compare when deciding whether to loop again.
  hash_means("M", "HASHOLD");

  prod("M", "P", "MP");
  // At this point, entries in MP have the form
  // {_id:..., left:{x:13,y:1}, right:{x:4,y:5}}.

  // For each point, find the distance to the closest mean. The output after
  // flattening has entries of the form {_id:{x:?, y:?}, m:{x:?, y:?}, d:?}
  // where the identifier is the point.
    function() {
      var point = {x:this.right.x, y:this.right.y};
      var mean = {x:this.left.x, y:this.left.y};
      emit(point, {m:mean, d:dist(point, mean)});
    function(point, vs) {
      // Each entry in vs is of the form {m:{x:?, y:?}, d:?}.
      // We return the one that is closest to point.
      var j = 0;
      vs.forEach(function(v, i) {
        if (v.d < vs[j].d)
          j = i;
      return vs[j]; // Has form {m:{x:?, y:?}, d:?}.
    {out: "MPs"}
  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, value:{m:{x:4, y:5}, d:1}}.

  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, m:{x:4, y:5}, d:1}.

  // For each mean (i.e., key), compute the average of all the points that were
  // "assigned" to that mean (because it was the closest mean to that point).
    function() {
      // The key is the mean and the value is the point together with its counter.
      var point = this._id;
      var point_with_count = {x:point.x, y:point.y, c:1};
      var mean = this.m;
      emit(mean, point_with_count);
    function(key, vs) {
      // Remember that the reduce operations will be applied to the values for each key
      // in some arbitrary order, so our aggregation operation must be commutative (in
      // this case, it is vector addition).
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;
      return {x:x, y:y, c:c};
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
  // At this point, entries in M have the form
  // {_id:{x:2, y:3}, value:{x:4, y:5}}.

  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, x:4, y:5}. The identifier
  // value does not matter as long as it is unique.

  // Compute the hash of the new set of means.
  hash_means("M", "HASHNEW");

  // Extract the two hashes in order to compare them in the loop condition.
  var hashold = db.HASHOLD.find({}).limit(1).toArray()[0].value.hash;
  var hashnew = db.HASHNEW.find({}).limit(1).toArray()[0].value.hash;
} while (hashold != hashnew);
If we are certain that our k-means algorithm will have a relatively small (e.g., constant) number of means, we can take advantage of this by only tracking the means in a local variable and using .updateMany() to distribute the means to all the points at the beginning of each iteration. This leads to a much more concise (and, for a small number of means, efficient) implementation of the algorithm than what is presented in a previous example. In particular, it is no longer necessary to encode a production operation within MongoDB.{ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);


var means = [{x:13,y:1}, {x:2,y:12}];
do {
  db.P.updateMany({}, {$set: {means: means}}); // Add a field to every object.
    function() {
      var closest = this.means[0];
      for (var i = 0; i < this.means.length; i++)
        if (dist(this.means[i], this) < dist(closest, this))
          closest = this.means[i];
      emit(closest, {x:this.x, y:this.y, c:1});
    function(key, vs) {
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;
      return {x:x, y:y, c:c};
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
  means = db.M.find().toArray().map(function(r) { return {x:r.value.x, y:r.value.y}; });
} while (true);
We do not deal with the issue of convergence in the above example; an equality function on JSON/BSON objects (i.e., the list of means) would need to be defined to implement the loop termination condition. Below, we illustrate how the above implementation can be written within Python using PyMongo.

import pymongo
import bson.code

client = pymongo.MongoClient()
db = client.local{'_id':'dist', 'value': bson.code.Code("""
    function(u, v) {
        return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);


means = [{'x':13,'y':1}, {'x':2,'y':12}]

while True:
    db.P.update_many({}, {'$set': {'means': means}})

    mapper = bson.code.Code("""
        function() {
            var closest = this.means[0];
            for (var i = 0; i < this.means.length; i++)
                if (dist(this.means[i], this) < dist(closest, this))
                    closest = this.means[i];
            emit(closest, {x:this.x, y:this.y, c:1});
    reducer = bson.code.Code("""
        function(key, vs) {
            var x = 0, y = 0, c = 0;
            vs.forEach(function(v, i) {
                x += v.x;
                y += v.y;
                c += v.c;
            return {x:x, y:y, c:c};
    finalizer = bson.code.Code("""
        function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; }
    db.P.map_reduce(mapper, reducer, "M", finalize = finalizer)

    means = [{'x':t['value']['x'], 'y':t['value']['y']} for t in db.M.find()]
We can also use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the Floyd-Warshall shortest path algorithm.

N = ['a','b','c','d','e','f']
E = [('a','b'),('b','c'),('a','c'),('c','d'),('d','e'),('e','f'),('b','f')]

oo = float('inf') # This represents infinite distance.

P = product(N,N)
I = [((x,y),oo if x != y else 0) for (x,y) in P] # Zero distance to self, infinite distance to others.
D = [((x,y),1) for (x,y) in E] # Edge-connected nodes are one apart.

OUTPUT = aggregate(union(I,D), min)
STEP = []
while sorted(STEP) != sorted(OUTPUT):
    P = product(STEP, STEP) # All pairs of edges.
    NEW = union(STEP,[((x,v),k+m) for (((x,y),k),((u,v),m)) in P if u == y]) # Add distances of connected edge pairs.
    OUTPUT = aggregate(NEW, min) # Keep only shortest node-node distance entries.


2.3. Data Provenance

Data provenance is an overloaded term that refers, in various contexts and communities, to the source, origin, or lifecycle of a particular unit of data (which could be an individual data point, a subset of a data set, or an entire data set). In this course, we will use the term primarily to refer to dependency relationships between data sets (or relationships between individual entries in those data sets) that may be derived from one another (usually over time). As an area of study within computer science, data provenance (also called data lineage) is arguably still being delineated and developed by the community. However, some community standards for general-purpose representations of data provenance have been established (e.g., PROV [#]).
While the research literature explores various ways of categorizing approaches to data provenance, there are two dimensions that can be used to classify provenance techniques (surveyed in more detail in the literature [#]):
  • from where the data was generated (e.g., what data sets or individual data entries) and how it was generated (e.g., what algorithms were used);
  • whether the lineage is tracked at a fine granularity (e.g., per data entry such as a row in a data set) or a coarse granularity (e.g., per data set).
coarse fine
how transformation execution path through transformation algorithm
from where source data sets specific entries in source data sets
When data lineage is being tracked at a fine granularity, there are at least two approaches one can use to determine provenance of a single entry within a data set produced using a specific transformation. One approach is to track the provenance of every entry within the transformation itself (sometimes called tracing); another approach is to determine the provenance of an entry after the transformation has already been performed (e.g., by another party and/or at some point in the past). The latter scenario is more likely if storage of per-entry provenance meta-data is an issue, or if the transformations are black boxes that cannot be inspected, modified, or extended before they are applied to input data sets.
For a transformation that may combine input data set entries in some way, a large number of entries in the input data set can influence an individual entry in the output data set. For such transformations, finding the per-entry provenance for an entry in the output data set can be non-trivial. Without additional information about the transformation, the conservative assumption to make is that all entries in the input data set may contribute to every entry in the output data set.
In some cases, we may know more about a transformation. Perhaps its definition is broken down into building blocks found in the relational model (e.g., selections and projections), or it can be defined using map and reduce operations in the MapReduce paradigm. In such cases, studying the data lineage of individual data items it may be possible to derive standard techniques [#] for determining data lineage given the components that make up the transformation.
For relational transformations that simply add or remove data set entries without changing their internal structure (i.e., schema) or data content (i.e., values), computing the per-entry provenance is relatively straightforward. In particular: for union, difference, intersection, selection, and product transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set (i.e., the fine-grained "where" provenance of an individual entry in the output) can be determined using a linear search through the input data set (or sets) for that entry.
For projection transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set can be determined by applying the transformation to each entry in the input data set. Whenever this yields an output equivalent to the target, we know that the provenance for that output entry includes that input entry.
The above facts imply that for several relational transformation building blocks, per-entry provenance of output data set entries can be reconstructed efficiently. Note that by induction this also implies that any transformation composed of these building blocks is also amenable to efficient reconstruction of the provenance of any output data set entry.
If a transformation might combine subsets of the input entries to compute individual output entries (but we have no additional information about the transformation), then with regard to per-entry provenance a worst case scenario may apply.
Suppose we know nothing about the internal structure of a transformation, but we want to determine what entries in the input data set influence a particular entry in the output data set. In this case, even running the entire transformation a second time (knowing the target entry) may provide no additional information about which subset of input data set entries produced that output data set entry. In this worst-case scenario, it would be necessary to run the transformation on all 2n subcollections of the input data set (for an input data set of size n), and to check for each one of those 2n outputs whether the particular entry of interest from the output data set was generated.
However, it is possible that a transformation that combines input entries into output entries falls into one of a small number of categories that make it possible to compute per-entry provenance more efficiently.
A transformation is context-free if any two entries in the input data set that contribute to the same output entry always do so (regardless of what other input entries are present in the input data set). If a transformation is context-free, then the provenance information for a given output data set entry can be computed in quadratic time. First, we can partition the input data set using the entries of the output data set (i.e., create one bucket of input data set entries for each output data set entry). Then, we can determine which partition contributed to the output data set entry of interest using linear search.
A transformation is key-preserving if any two entries under the same keys in the input data set always contribute to an entry with the same output key in the output data set. If a transformation is key-preserving, then the provenance information for a given entry is easy to determine by building up a map from input data set keys to output data set keys, and then performing a linear search over the input data set.
Any relational aggregation (that is, a key-based aggregation operation as defined in the relational model that produces an output data set in which all keys are unique) is always key-preserving and context-free.
The above example completes our understanding of the complexity of reconstructing per-entry provenance for all the building blocks in the relational model.
transformation f per-entry provenance
reconstruction approach
complexity for input data
set with n entries
linear search over input data set
to find entry
O(n) entry equality checks
projection application of transformation once
to each input data set entry
O(n) executions of f and
O(n) entry equality checks
aggregation application of transformation once
to each input data set entry and
construction of key-to-key map
O(n) executions of f and
O(n log n) to build and use
key-to-key map
A transformation f that takes a data set of points as its input, runs the k-means algorithm on those points, and returns a data set of means is not context-free. To see why, consider its behavior on following inputs for k = 2:
f([(0,0), (2,2)])
[(0,0), (2,2)]
f([(0,0), (2,2), (100,100)])
[(1,1), (100,100)]
Notice that in the first case, the input entries (0,0) and (2,2) each produce their own output entry. However, the introduction of a new point (100,100) results in (0,0) and (2,2) both contributing to the same output entry (1,1).
A transformation can be context-free but not key-preserving. For example, suppose a transformation aggregates some vectors by key but then discards the key via a projection:
[(j, (0,2)), (j, (2,0)), (k, (0,3)), (k, (3,0))]
[(j, (2,2)), (k, (3,3))]
[(2,2), (3,3)]
The above is context-free because there is no way that other input entries can have any influence over the way existing entries aggregate by key. However, they key j might map to any numeric value key in the output (depending on what the input entries are):
[(j, (0,0))]
[(j, (0,0)), (j, (1,1))]
Any key-preserving transformation is context-free. To see this, suppose that there is some transformation f that is key-preserving but not context-free. This would imply that there is some set of inputs (i, a), (j, b), and (k, c) for which the definition of context-free is not satisfied, such as a case in which two input entries each map to their own distinct output entries when on their own but map to the same output entry when the third input entry (k, c) is introduced:
f([(i, a), (j, b)])
[(l, d)]
f([(i, a), (j, b), (k, c)])
[(l, d), (m, e), (n, f)]
However, the above would imply that under different conditions, the keys i and j might either both map to a key l or might map to two different keys l and m. But this would imply that entries with the key j map to entries with either key l or key m. This is not key-preserving and contradicts our assumptions, so our premise must have been impossible.
The diagram below illustrates the relationships between the properties discussed above. Notice that whether a transformation is context-free and key-preserving is used to determine whether a transformation that combines input entries might have properties that allow us to compute per-entry provenance more efficiently. There is no need to check whether other transformations (such as projections and products) have these properties because per-entry provenance is already efficiently computable in those cases.
data set transformations
transformations that combine multiple
input entries into output entries
(example: k-means)
context-free transformations
(example: aggregation by key that drops the key)
key-preserving transformations
(example: relational aggregation by key)
other transformations
(examples: selections, projections)
We provide explicit algorithms for computing per-entry provenance in three of the cases outlined above.
If we introduce a function for computing powersets of entries in a data set (e.g., using a Python recipe for such a function), we can define an inefficient Python algorithm that can correctly determine per-entry provenance in the general case (i.e., not knowing any additional information about a transformation f).

from itertools import combinations, chain

def powerset(iterable):
    "powerset([1,2,3]) -> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def general_prov(f, X, y):
    for Y in reversed(powerset(X)): # From largest to smallest subset.
        if list(f(list(Y))) == [y]:
            return Y
The function below determines the per-entry provenance for a context-free transformation. For this example, we assume that all data set entries are of the form (key, value).

def context_free_prov(f, X, y):

    # Build the partitions of the input data set.
    partitions = []
    for (x_key, x_val) in X:
        found = False
        for i in range(0,len(partitions)):
            if len(f(partitions[i] + [(x_key, x_val)])) == 1:
                partitions[i].append((x_key, x_val))
                found = True
        # Create a new partition if adding to any other
        # partition increases the size of the output data set.
        if found == False:
            partitions.append([(x_key, x_val)])

    # Find the corresponding partition.
    for partition in partitions:
        if y in f(partition):
             return partition
The function below determines the per-entry provenance for a key-preserving transformation. For this example, we assume that all data set entries are of the form (key, value).

def key_preserving_prov(f, X, y):
    keymap = {}
    # Build up the map from input data set
    # keys to output data set keys.
    for (x_key, x_val) in X:
        (y_key, y_val) = f([(x_key, x_val)])[0]
        keymap[x_key] = y_key
    (y_key, y_val) = y

    # Collect all the tuples that contribute
    # to the target result.
    pre_image = set()
    for (x_key, x_val) in X:
        if keymap[x_key] == y_key:
             pre_image.add((x_key, x_val))
    return pre_image
For each of the following informal descriptions of data transformations, determine whether computing the per-entry provenance of a row in the output data set is easy (e.g., linear, O(n log n), or quadratic time) or difficult (i.e., cannot be done in polynomial time), assuming that the transformation's implementation is a black box (except for the provided description).
  • Given a data set containing entries of the form (name, age), the transformation produces a list of all entries corresponding to individuals who are of a legal voting age.
  • Given a data set containing entries of the form (product, (department, price, date)) describing sales that take place in a department store, the transformation computes the total revenue for each department to produce entries of the form (department, revenue). Assume that each product appears in exactly one department.
  • Given a data set containing entries of the form (product, (department, price, date)) describing sales that take place in a department store, the transformation computes the total revenue from each department for each possible date to produce entries of the form ((date, department), revenue).
  • Given a data set specifying the geographical locations of a collection of radio receivers in a bounded region, determine an optimal placement of 100 radio towers within that region that minimizes the aggregate of the distances between each tower-receiver pair.
Given a data set containing entries of the form (restaurant, popularity, location) that describe restaurants within the various neighborhoods in a city, the transformation uses an unknown machine learning algorithm to group the restaurants by similarity of cuisine into categories (each restaurant is assigned to exactly one cuisine category) and produces a data set with entries of the form (cuisine, count) specifying the number of restaurants that specialize in each type of cuisine. If you have no additional information, how difficult will it be to determine the provenance of an individual entry in the output?
Tracking provenance at a coarse granularity (per data set) is relatively inexpensive in most applications (since the number of data sets is usually orders of magnitude less than the number of individual data set entries), and standards such as PROV provide authors of transformations a way to represent provenance information.
Suppose Alice and Bob want to assemble a script that retrieves some information from an online resource and stores it in a database. We can use the prov Python package to programmatically assemble a course granularity provenance record that conforms to the PROV standard and describes a particular execution of this script.
We first create the document object and define the namespaces we will use for symbols in the document.

doc = prov.model.ProvDocument()
doc.add_namespace('alg', '') # The scripts in / format.
doc.add_namespace('dat', '') # The data sets in / format.
doc.add_namespace('ont', '')
doc.add_namespace('log', '') # The event log.
doc.add_namespace('bdp', '')
We then use an agent to represent the script, en entity to represent a resource, and an activity to repsent this invocation of the script. We also associate this current activity with the script, and we and indicate that his activity used the entity representing the resource.

this_script = doc.agent('alg:example', {prov.model.PROV_TYPE:prov.model.PROV['SoftwareAgent'], 'ont:Extension':'py'})
resource = doc.entity('bdp:wc8w-nujj', 
    {'prov:label':'311, Service Requests', 
    prov.model.PROV_TYPE:'ont:DataResource', 'ont:Extension':'json'}
this_run = doc.activity(
    'log:a'+str(uuid.uuid4()), startTime, endTime, 
    {prov.model.PROV_TYPE:'ont:Retrieval', 'ont:Query':'?type=Animal+Found&$select=type,latitude,longitude,OPEN_DT'}
doc.wasAssociatedWith(this_run, this_script)
doc.used(this_run, resource, startTime)
Finally, we define an entity for the data obtained and indicate to what agent it was attributed, by what actvitity it was generated, and from what entity it was derived.

found = doc.entity('dat:found', {prov.model.PROV_LABEL:'Animals Found', prov.model.PROV_TYPE:'ont:DataSet'})
doc.wasAttributedTo(found, this_script)
doc.wasGeneratedBy(found, this_run, endTime)
doc.wasDerivedFrom(found, resource, this_run, this_run, this_run)
We can view our completed provenance record in, for example, JSON form using doc.serialize().

3. Systems, Models, and Algorithms

When using mathematics to solve problems involving real-world entities and phenomenon, we can introduce the notions of a system (i.e., an organized collection of interdependent components and their possible states) that has distinct system states (which can capture spatial and temporal information about the system components, their behaviors, and their relationships). A mathematical model captures only some parts of the system and the possible system states, abstracting the other details away, and it does so using some particular symbolic language or representation.
Four common abstract mathematical objects used to represent models of systems are particularly relevant given our application domain of interest:
  • graphs consisting of a collection of nodes and edges (possibly weighted and directed) between those nodes,
  • vector spaces consisting of sets of vectors,
  • systems of equations and inequalities (or, equivalently, sets of constraints or set of logical formulas), and
  • functions (whether continuous or discrete).
In many cases, these mathematical objects can be used to represent each other (e.g., a graph can be represented as a function or as a system of equations), but each has its own characteristics that make it well-suited for modeling certain kinds of systems. Very often, two or more of these are combined when modeling a system.
Most data sets describing a real-world system are using one or more of the above languages for modeling that system. Identifying which of these is being used can inform our decisions about what techniques we can use to answer questions about that system or to solve problems involving that system. It can also tell us how we can convert to a different language or representation in order to convert, combine, or derive new data sets.

3.1. Systems, Models, and Metrics

We can provide abstract definitions for systems, models, and metrics. These then allow us to characterize satisfaction, search, and optimization problems in a general and consistent way.
A system is a set S of distinct system states (possibly infinite). Often, each possible system state s S is called a model of the system.
One mathematical object that is commonly used to represent systems is a vector space where individual vectors are the system states (i.e., models) of that system.
A constraint set is a set of logical formulas (written using a formal language that may contain terms such as integers, vector, matrices, sets, and so on), usually used to identify a subset of a system (i.e., a subset of the possible systems states or models). A system state or model satisfies a constraint or constraint set if the logical formulas are true for that model. The subset of all states that satisfy the constraints is sometimes called the feasible region.
When using vector spaces, it is often possible to represent constraints using a matrix equation, e.g.:
In the above, M might be a matrix from ℝn × m, x might be a variable vector in ℝm, and b might be a vector in ℝn.
Given a system S and a set of constraints, a constraint satisfaction problem is the problem of finding a model (or set of models) in S that satisfies that set of constraints.
A metric over a system S is a function f: S ℝ that maps system states to real numbers.
Given a system S and metric f, an optimization problem is the problem of finding s S that maximizes or minimizes f:
argmaxs S  f(s)
argmins S  f(s)
Computing the argmax or argmin operation in the relational or MapReduce paradigms can be non-trivial, but it is still possible. For example, in the relational paradigm it is possible to compute it by combining two projections, an aggregation, and a selection.

def argmax(X, f):
    Y = [f(x) for x in X] # Projection.
    y_max = aggregate(Y, max)
    XF = [(x, f(x)) for x in X] # Projection.
    xs = [x for (x,y) in XF if y == y_max] # Selection.
    return xs[0]
Consider a state space S = {0,1}4 with states of the form x = (x1, x2, x3, x4) and the following set of constraints:
x1 > x2
  • What is the size of the state space?
  • Find a state in S that solves this constraint satisfaction problem.
  • Suppose that this problem involves maximizing the objective function f(x) = x1 + x2 + x3 + x4; what state in S solves this optimization problem?
Suppose we want to create an approximate, discrete version of the k-means algorithm. We can do so by treating the outcome of the k-means algorithm as the solution to an optimization problem. In order to make this concrete, we will work with the following set of points and Euclidean distances (as in a previous example).

P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2
For k = 1, we can create a state space S that consists of all possible positions for the mean within a particular bounded region. The metric would be the sum of the distances from all the points in P to the mean m. The solution would be the one that minimizes the metric.

S = list(product(range(0,20), range(0,20))) # Possible locations for one mean.

def metric(m):
    return sum([dist(m, p) for p in P])

o = min(S, key=metric)
For k = 2, we would need to create a state space that represents all possible pairs of means. We would also need to adjust the metric function to consider in the sum only the distances from each point to the closest (i.e., minimum-distance) mean to that point.

V = list(product(range(0,20), range(0,20))) # Possible locations for one mean.
S = list(product(V, V)) # Possible pairs of means.

def metric(M):
    return sum([min([dist(m, p) for m in M]) for p in P])

o = min(S, key=metric)
Consider the following problem: given a data set describing a set of 10,000 spectators and their heights, determine if it is possible to seat them in a stadium of 10,000 seats so that no one blocks anyone else's view. Determine whether this is a constraint satisfaction problem or an optimization problem, and identify the state space, constraint set, and (if applicable) metric.
For each of the following scenarios, determine whether it is an instance of a constraint satisfaction problem or an instance of an optimization problem. If it is an optimization problem, identify the objective function.
  • Find the lowest-cost route between two destinations on a transportation network.
  • Given a social network where any pair of individual members can be "connected", determine if the whole network is connected (i.e., there exists a path from every member to every other member).
  • Given a social network where any pair of individual members can be "connected", find the smallest group of individuals that must be removed in order to make the network disconnected.

3.2. Linear systems, satisfiability modulo theories, and linear programming

We often say a system or subset of a system is linear if it can be modeled using a Euclidean vector space and can be defined using a collection of linear constraints of the form:
a1x1 + ... + anxn  ♦  c
Where (x1, ..., xn) n is a vector, the ai are coefficients in ℝ, c in ℝ is a constant, and ♦ is a relation operator such as =, ≤, <, ≥, or >. Note that the constraints contain no higher-order xi terms such as xi2.
There are three common techniques for solving constraint satisfaction and optimization problems involving linear systems:
  • finding an exact or approximate solution (e.g., using some combination of matrix inversion, QR factorization, and ordinary least squares),
  • checking if a solution exists (and finding a witness system state or model proving this), and
  • finding a solution that minimizes or maximizes some (linear) metric over the vector space.
The third item (solving an optimization problem) is arguably a generalization of the first approach if we remove the requirement that the metric must be linear (since the first approach minimizes the Euclidean distance metric, which is quadratic and thus non-linear).
Suppose we want to use an SMT solver to solve a flow maximization problem for a flow graph (the variable on each represents the flow amount, and the parenthesized number is the maximum possible flow on that edge):
!table([ [null, null, ('dd:x4 (6)`dr:x5 (3)``•'), null, null], [('r:x1 (?)``•'), ('ur:x2 (7)`dr:x3 (8)``•'), null, ('r:x7 (5)``•'), ('•')], [null, null, ('ur:x6 (4)``•'), null, null], ])
Here, the system is the set of possible states of the graph in terms of the amount of material flowing along each edge (we model the flow that goes along each edge in the graph as a variable ranging over ℝ, and the flow in the overall graph containing seven edges as a vector in ℝ7).

import z3

(x1,x2,x3,x4,x5,x6,x7) = [z3.Real('x'+str(i)) for i in range(1,8)]

S = z3.Solver()

# Only allow non-negative flows.
for x in (x1,x2,x3,x4,x5,x6,x7):
    S.add(x >= 0)
# Edge capacity constraints.
S.add(x2 <= 7, x3 <= 8, x4 <= 6)
S.add(x5 <= 3, x6 <= 4, x7 <= 5)

# Constraints derived from graph topology.
S.add(x1 == x2+x3, x2 == x4+x5, x3+x4 == x6, x5+x6 == x7)

S.add(x1 > 0) # We want a positive flow.

If we are allowed to vary only the variable for incoming edge, we can use the SMT solver together with binary search to find the maximum flow through the graph.

flow = 0
for i in range(5, -1, -1):
    S.add(x1 >= (2**i + flow))
    if str(S.check()) != 'unsat':
        flow += 2**i

S.add(x1 >= flow)
We could alternatively build up the constraint set from a starting matrix.

def dot(xs, ys):
    return sum([x*y for (x,y) in zip(xs, ys)])

x = [x1,x2,x3,x4,x5,x6,x7]

M = [
  [ 0,-1, 0, 0, 0, 0, 0],
  [ 0, 0,-1, 0, 0, 0, 0],
  [ 0, 0, 0,-1, 0, 0, 0],
  [ 0, 0, 0, 0,-1, 0, 0],
  [ 0, 0, 0, 0, 0,-1, 0],
  [ 0, 0, 0, 0, 0, 0,-1],

  [ 1, 0, 0, 0, 0, 0, 0],
  [ 0, 1, 0, 0, 0, 0, 0],
  [ 0, 0, 1, 0, 0, 0, 0],
  [ 0, 0, 0, 1, 0, 0, 0],
  [ 0, 0, 0, 0, 1, 0, 0],
  [ 0, 0, 0, 0, 0, 1, 0],
  [ 0, 0, 0, 0, 0, 0, 1],

  [-1, 1, 1, 0, 0, 0, 0],
  [ 0,-1, 0, 1, 1, 0, 0],
  [ 0, 0,-1,-1, 0, 1, 0],
  [ 0, 0, 0, 0,-1,-1, 1]

b = [-7,-8,-6,-3,-4,-5,
      0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0]

S = z3.Solver()

for i in range(len(M)):
    S.add(b[i] <= dot(M[i], x))

Suppose we want to use linear programming to solve the problem described in the example above.

from scipy.optimize import linprog

M = [
  [ 1,-1,-1, 0, 0, 0, 0],
  [ 0, 1, 0,-1,-1, 0, 0],
  [ 0, 0, 1, 1, 0,-1, 0],
  [ 0, 0, 0, 0, 1, 1,-1],
b = [0, 0, 0, 0]
c = [-1, 0, 0, 0, 0, 0, 0]
bounds = [(0, None), (0, 7), (0, 8), (0, 6), (0, 3), (0, 4), (0, 5)]

result = linprog(c, A_ub = M, b_ub = b, bounds=bounds, options={"disp": True})
Note that this technique produces a potentially different solution from the one in the example above (though both are optimal).
Suppose that you want to distribute two types of infrastructure improvements across the 20 neighborhoods of a city: road repairs and electric grid updates. Road repairs cost $100,000 per mile and electric grid updates cost $50,000 per mile. You have access to a data set I that contains tuples of the form (neighborhood, road-miles, grid-miles) describing the quantity of each type of infrastructure in each neighborhood.
You want to distribute the cost of improvements equitably such that every neighborhood receives exactly the same amount of funding, and every neighborhood is able to use all that funding it receives. If your overall budget is capped at $10,000,000, is this possible?
One approach is to fix some constant c and create two variables xi and yi and three constraints for every entry (i, ri, gi) in I:
xi ⋅ 100,000 + yi ⋅ 50,000 ≥ c
The above constraints ensure that at least c funding is used in the neighborhood, but also that the amount of infrastructure repairs does not exceed the potential needs of the neighborhood. Thus, the state space consists of 20 ⋅ 2 dimensions (i.e., the two vectors x and y having one component for every neighborhood).
Note that if we wanted to check whether it is possible to disburse all the available funding equitably, we could set c = 10,000,000/20. Alternatively, if we wanted to maximize c, we could try larger and larger values of c until it cannot be increased without causing the problem to become unsolvable. Alternatively, we could treat c as another variable and define the objective function that must be maximized as f(x, y, c) = c.

3.3. Graph and Spatial Problems as Constraint Satisfaction and Optimization Problems

We have already seen how we can define some basic algorithms that have natural spatial interpretations (basic clustering and shortest) using data transformations assembled from building blocks drawn from the relational and the MapReduce paradigms. These algorithms employed two different representations: a vector space (clustering) and a graph (shortest paths). In this section, we review how algorithms such as these can be viewed as solutions to constraint satisfaction and/or optimization problems.
Suppose we have a graph consisting of a set of nodes N = {n1, n2, ...} and a set of edges E of the form (n1, n2). The minimum spanning tree of this graph is the smallest subset TE such that in the subgraph consisting of the edges in T, every node can be reached from every other node (i.e., by traversing a path along the edges in T).
The problem of finding a minimum spanning tree can be broken down into a constraint satisfaction problem:
  • the set of system states is the collection of edge subsets (i.e., all 2|E| possible subsets);
  • the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another) and the requirement that the subset T must be a tree (or, equivalent, that it must have at most |N| − 1 edges).
This problem can also be broken down into an optimization problem that minimizes a metric:
  • the set of system states is the collection of edge subsets (i.e., all 2|E| possible subsets);
  • the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another);
  • the metric is a function that computes the size of T (i.e., the number of edges in the spanning subtree).

3.4. Decomposition Techniques

General-purpose techniques for solving constraint satisfaction and optimization problems can sometimes be inefficient (potentially having a running time that is exponential in the number of state space dimensions and/or the number of constraints). This can be especially costly when state space dimensions or constraints are derived from large data sets having hundreds, thousands, or hundreds of thousands of entries.
In such scenarios, it may be possible to apply general-purpose decomposition techniques for solving constraint satisfaction and optimization problems in a scalable way. Usually, these rely on certain assumptions about the properties of the constraints and/or the objective function.
Suppose we have a constraint satisfaction problem with an n-dimensional state space S containing system states of the form (x1, ...,xn) and a set of constraints C. If the set of constraints C can be separated into two disjoint subsets such that C = C1C2 where C1 and C2 share no variables, then the problem can be decomposed into two independent subproblems.
Suppose that a constraint satisfaction problem with state space S and constraint set C can be decomposed into two independent subproblems where C = C1C2. If s1 S is a solution to the subproblem corresponding to C1 and s2 S is a solution to the subproblem corresponding to C2, then we can combine s1 and s2 to obtain a state that satisfies all the constraints in C (using a linear-time scan through the representation of s1 and s2).
Suppose we have a constraint satisfaction problem with an (n+m)-dimensional state space S containing system states of the form (x1, ...,xn, y1, ..., ym) and a set of constraints C = CxCxyCy where:
  • all constraints in Cx only contain variables from the list x1, ...,xn;
  • all constraints in Cy only contain variables from the list y1, ...,ym;
  • every constraint in Cxy has at least one variable from x1, ...,xn and at least one from y1, ...,yn.
Then the constraints contained in Cxy are called the complicating constraints and the variables found in those constraints are called the complicating variables.
Suppose we have a constraint satisfaction problem with states of the form (x1, ...,xn, y1, ..., ym) {0,1}n+m and a constraint set C = CxCxyCy. Suppose also that x1 and y1 are the only complicating variables and x1y1 is the only complicating constraint. Then we can consider four different subsets of the state space:
  • all states in which x1 = 0 and y1 = 0;
  • all states in which x1 = 1 and y1 = 0;
  • all states in which x1 = 0 and y1 = 1;
  • all states in which x1 = 1 and y1 = 1.
Note that for each of the four scenarios, we can easily extend the two subproblems corresponding to Cx and Cy with one additional constraint (such as x1 = 1 for Cx and y1 = 0 for Cy). We can then eliminate the complicating constraint (which is redundant at that point), allowing us to decompose the new problem into two independent subproblems. If we find a solution to both subproblems in any of these four cases, then we have found a solution to the overall original problem corresponding to C.
This approach can lead to significant performance gains if the algorithmic complexity of solving the constraint satisfaction problem is worse than linear. For example, suppose that there are k constraints and solving the problem takes exponential time in the number of constraints (such as 2k). If we can split the problem into two independent subproblems (each of size k/2), then we have can compute the overall running time needed to solve all four pairs of independent subproblems:
4 ⋅ 2 ⋅ 2k/2
2k/2 + 3
The running time 2k/2 + 3 is substantially faster than 2k for large k. In fact, 2k/2 + 3 ≈ √(2k).
In addition, note that the third scenario where x1 = 0 and y1 = 1 does not satisfy the complicated constraint x1y1, so we need not consider it at all. Thus, the running time would actually be 3 ⋅ 2 ⋅ 2k/2 if we avoid this case.
Suppose we have an optimization problem with an (n+m)-dimensional state space S containing system states of the form (x1, ...,xn, y1, ..., ym), a set of constraints C = CxCy, and an objective function f(x1, ...,xn, y1, ..., ym) = g(x1, ...,xn) + h(y1, ...,ym). Note that there are no complicating constraints or variables.
In this scenario, we can decompose the problem into two independent optimization problems (with objective functions g and h, respectively). We can then find a solution s1 that optimizes g under constraints Cx and a solution s2 that optimizes h under constraints Cy. Finally, we can combine these two solutions s1 and s2 into a solution for the overall problem that optimizes f. As in the corresponding fact about constraint satisfaction problems, we can then use a linear-time scan through the representations of s1 and s2 to build this overall solution.

4. Statistical Analysis

Given a data set some of the optimization techniques we have studied can be used to find a best-fit model for that data. For example, given a data set of points in ℝ2 representing some collection of observations measured along two dimensions, it is possible to find an ordinary least squares "best fit" linear function (using algebra, optimization, gradient descent, and so on). However, just because we have found a "best fit" model for a data set from some given space of functions does not mean that the model is actually a good one for that particular data set. Perhaps the relationship between the two dimensions is not linear, or even completely non-existent. How can we determine if this is the case?
In this section, we will present some statistical methods that can help us decide whether a linear model is actually appropriate for a given two-dimensional data set, and to quantify how appropriate it might be. We will review the mathematical foundations underlying the methods, how they can be applied, and how they can be realized as data transformations in the paradigms we have studied in earlier sections.

4.1. Review of Facts about Projections from Linear Algebra

We review a few useful definitions and facts about vectors in ℝn from linear algebra. These provide a mathematical foundation for defining the statistical constructs discussed in this section.
A vector is an ordered, finite tuple of real numbers (x1, ..., xn) n. Note that we will use individual variables to denote an entire vector (e.g., x n), and that same variable with a subscript index (e.g., xi) to denote individual components of that vector; thus, we have that x = (x1, ..., xn).
For a vector x n and a scalar s ℝ, we can scale x by computing sx = (sx1, ..., sxn). Note that for a scalar s ℝ, we often use the shorthand notation x/s to mean (1/s) ⋅ x.
For a vector x n the Euclidean length or norm of x is ||x|| = √(x12 + ... + xn2). We can also obtain a normalized unit vector of length one that points in the same direction as x by computing x/||x||.
Given two vectors x n and y n, the dot product of the two vectors is xy = x1y1 + ... + xnyn.
We can compute the projection of a vector x n onto another vector y n by computing (x ⋅ (y/||y||)) ⋅ (y/||y||).
Given two vectors (5,7) 2 and (3,0) 2, the normalized version of (3,0) is (1/3) ⋅ (3,0) = (1,0). The projection of (5,7) onto (3,0) is then:
((5,7) ⋅ ((1/3) ⋅ (3,0))) ⋅ ((1/3) ⋅ (3,0)
((5,7) ⋅ (1,0)) ⋅ (1,0)
(5 ⋅ 1 + 7 ⋅ 0) ⋅ (1,0)
5 ⋅ (1,0)
Theorem (Cauchy-Schwarz inequality):
Given two vectors x n and y n, the following inequality holds:
|xy| / (||x|| ⋅ ||y||)
Another way of writing the above is:
|  (x/||x||) ⋅ (y/||y||)  |
The intuitive interpretation is as follows: if we try to find the projection of a unit vector x/||x|| onto another unit vector y/||y||, the resulting vector has length at most 1.

4.2. Defining Mean and Standard Deviation using Concepts in Linear Algebra

In this section, we will assume that we are working with a data set of the form [(x1, y1), ..., (xn, yn)] that contains n (not necessarily distinct) tuples and where for 1 ≤ in we have (xi, yi) 2. For the purposes of analysis, we will also often use two projections of this data set as vectors in their own right: (x1, ...,xn) n and (y1, ...,yn) n. We will also use the shorthand notation x = (x1, ...,xn) and y = (y1, ...,yn).
Definition (arithmetic mean):
We define the arithmetic mean μ(x) of a one-dimensional or one-column data set x n where x = (x1, ..., xn) as:
(1/n) ⋅ (x1 + ... + xn)
For x n, μ(x) is the quantity that minimizes the following objective function:
argmina i=1n (xi - a)2)
To prove the above, assume that the true mean is some specific value m. Next, we choose some arbitrary a in the formula and proceed algebraically from there by multiplying the terms under the summation and then using the distributive property to factor out the terms without an index subscript:
Σi=1n (xi - a)2
Σi=1n ((xim) + (ma))2
Σi=1n ( (xim)2 + 2 ⋅ ((xim) ⋅ (ma)) + (ma)2 )
Σi=1n (xim)2 + 2 ⋅ (ma) ⋅ (Σi=1n (xim)) + Σi=1n (ma)2
Σi=1n (xim)2 + 2 ⋅ (ma) ⋅ (Σi=1n xi − Σi=1n m)) + Σi=1n (ma)2
In the above, notice that Σi=1n xi = Σi=1n m because m is the true mean, so the second term is just 2 ⋅ (ma) ⋅ 0 = 0. Thus, we can eliminate it. Furthermore, we have Σi=1n (ma)2 = n ⋅ (ma)2. Applying all these substitutions, we have:
Σi=1n (xi - a)2
i=1n (xim)2 ) + n ⋅ (ma)2
Notice that setting a = m minimizes the above because the first term does not contain a at all, and the second term becomes 0 exactly when a = m. Thus, the a that minimizes the objective Σi=1n (xi - a)2 is exactly the mean.
For a given vector x n, suppose we use μ(x) to create a vector with n entries (μ(x), μ(x), ..., μ(x)) in which every entry is μ(x). We can call this vector xμ.
The vector xμ minimizes a particular objective function:
argminv is on the diagonal in ℝn ||x - v||
Another way of thinking about this is that xμ as defined above is the projection of the vector x onto the diagonal line in ℝn. This means that xμ is also the point on the diagonal in ℝn that is closest to x.
Notice that the geometric, vector-based interpretation of the mean preserves an important property of the mean of a data set x: rearranging the elements of x should not change the mean. In fact, this is guaranteed because the dimensions are interchangeable: for any permuted version of x (let us call it x'), the distance of x' from the diagonal will always be the same as the distance of x from the diagonal because we can simply relabel the dimensions to bring x' back to its original form x.
Suppose that our data consists of only two points in one dimension: x1 = 10 and x2 = 2. We can compute:
(1/2) ⋅ (10 + 2)
(6, 6)
Notice that (6, 6) is on the diagonal in ℝ2. Notice also that the vector xxμ = (10, 2) - (6, 6) = (4, -4) has a slope of  − 1 (it is orthogonal to the diagonal). This is consistent with xμ being a projection of x onto the diagonal.
Definition (standard deviation):
We define the standard deviation σ(x) of a one-dimensional or one-column data set x n where x = (x1, ..., xn) as:
√(   (1/n) ⋅ ((x1 − μ(x))2 + ... + (xn − μ(x))2)   )
For x n, σ(x) can be rewritten in terms of the norm (i.e., length) ||x − μ(x)|| of the vector x − μ(x):
(1/√(n)) ⋅ √(   ((x1 − μ(x))2 + ... + (xn − μ(x))2)   )
(1/√(n)) ⋅ ||xxμ||
In other words, we can view the standard deviation of a data vector x as a normalized distance between x and the diagonal in ℝn.
We might ask why the extra scalar term 1/√(n) is necessary in the above. After all, the term ||xxμ|| already measures the distance between the vector x and the diagonal. Furthermore, the individual xi terms must represent some units specific to the domain from which the data is drawn (e.g., temperature, height, velocity, and so on), while n and √(n) have no units relevant to the domain of the data (they are just quantities associated with how many data points were collected).
The reason is that when the number of components in the vector x changes, the number of dimensions in the space from which the vector is drawn also changes. But the number of dimensions influences the lengths of vectors (if the individual vector components are about the same): the more dimensions there are, the longer a vector can become because the distances from all dimensions contribute to the length of a vector.
Consider the lengths of the following vectors (each drawn from a vector space having its own distinct number of dimensions):
||(1, 1)||
||(1, 1, 1)||
||(1, 1, 1, 1)||
||(1, 1, 1, 1, 1)||
||(1, 1, 1, 1, 1, ..., 1)||
In general, we can see that the following holds for a vector of length n in which components are all the same value s:
||(s, s, s, ..., s)||
s ⋅ √(n)
The above example shows that if we want to compare the standard deviations of two distinct data sets that are of different sizes, we need to account for the fact that the vectors are drawn from different vector spaces (in terms of the number of dimensions). The scalar term 1/√(n) in the standard deviation formula accounts for this difference and makes comparisons possible.
One practical issue that can be a concern when defining statistical analysis workflows (e.g., using the relational or MapReduce paradigms) over large data sets is how optimizations based on algebraic laws may interact with the representation size of numerical values within the language, library, or framework being used.
Suppose that we have a large data set x 1000consisting of 1000 entries: x = (x1, ..., x1000). Assume that 995 of these entries are approximately 0.1 while the other five are all larger (i.e., let us assume they are all 2). Furthermore, suppose that we have a representation size for floating point numbers such that the smallest floating point number we can represent is 0.001 (anything else is cut off and thus rounded to 0).
If we want to compute the mean, we might decide to first scale every entry in the vector x by dividing it by n to obtain the vector (x1/n, ..., x1000/n). However, notice that this will lead to 0.1/1000 = 0.0001 as the true mathematical answer for most of the data. Due to rounding errors, 995 of the data points will be treated as if they were 0. Thus, we will have:
(2 + 2 + 2 + 2 + 2) / 1000
But notice that the true mean is quite different; the error we see above is an entire order of magnitude (and grows as the number of data points grows):
(2 + 2 + 2 + 2 + 2 + 0.1 ⋅ 995) / 1000
(10 + 99.5) / 1000
One intuitive response might be to avoid the optimization and perform the summation of all components of x before performing division by n. But this may again lead to an issue when there are too many data points, as the total may exceed the maximum representation size of a number!
Issues such as the above are less common in contemporary tool chains because many libraries use representations for numerical data structures that allow arbitrary precision and size. If such a library is not available, however, we must ensure that our optimizations do not introduce such errors. In the specific scenario above, the solution might be to split the data up into smaller chunks, perform the optimization over those small chunks, and then aggregate the results of those individual computations (though this would only work for associative operations such as summation).

4.3. Covariance and Correlation

Given a data set that has two-dimensional entries of the form (xi, yi), it is possible to compare the two dimensions by comparing their standard deviation vectors.
Definition (covariance):
For x n and y n, we define the covariance between x and y as:
cov(x, y)
(1/n) ⋅ ((x1 - μ(x)) ⋅ (y1 - μ(y)) + ... + (xn - μ(x)) ⋅ (yn - μ(y)))
(1/n) ⋅ (Σi=1n (xi - μ(x)) ⋅ (yi - μ(y)))
(1/n) ⋅ (xxμ) ⋅ (yyμ)
((1/√(n)) ⋅ (xxμ)) ⋅ ((1/√(n)) ⋅ (yyμ))
Notice that the last line in the above indicates that we can view the covariance as the dot product of the vectors (xxμ) and (yyμ) (each of which represent how x and y deviate from their respective means).
Definition (correlation coefficient):
For x n and y n, we define the correlation coefficient (also known as the Pearson product-moment correlation coefficient) between x and y as:
ρ(x, y)
cov(x, y) / (σ(x) ⋅ σ(y))
For x n and y n, it must be that -1 ≤ ρ(x, y) ≤ 1. We can see this by noticing that the formula for ρ(x, y) appears in the Cauchy-Schwarz inequality:
ρ(x, y)
cov(x, y) / (σ(x) ⋅ σ(y))
((1/n) ⋅ (xxμ) ⋅ (yyμ)) / ((1/√(n)) ⋅ ||xxμ|| ⋅ (1/√(n)) ⋅ ||yyμ||)
((1/n) ⋅ (xxμ) ⋅ (yyμ)) / ((1/n) ⋅ ||xxμ|| ⋅ ||yyμ||)
((xxμ) ⋅ (yyμ)) / (||xxμ|| ⋅ ||yyμ||)
Notice that the only difference between the above and the Cauchy-Schwarz inequality is that in the above, the numerator is not guaranteed to be positive. Thus, we have:
cov(x, y) / (σ(x) ⋅ σ(y))
In addition to being bounded by −1 and 1, the correlation coefficient has some other useful properties. If x and y have a positive linear correlation, then ρ(x, y) is positive. If they have a negative linear correlation, it is negative. Given two pairs of data sets x, y and x', y', if |ρ(x, y)| > |ρ(x', y')|, then x and y have a stronger correlation with each other than do x' and y'.
While the correlation coefficient is useful for comparing correlations, the quantity itself should not be used to infer anything about the extent of the correlation. To do that with some rigor, it is necessary to introduce a few additional concepts.
Suppose that our data consists of only a pair of two-dimensional tuples: x = (2, 6) and y = (20, 60). In other words, the data set is [(2,20), (6,60)]. Notice that the slopes of the orthogonal projections of x and of y onto the diagonal are the same, and if we normalize these projections, they become the same vector. That means (xxμ) / ||xxμ|| = (yyμ) / ||yyμ||. But the dot product of a vector with itself is the square of its length: vv = ||v||2. Thus, we have:
cov(x, y) / (σ(x) ⋅ σ(y))
(xxμ) ⋅ (yyμ) / (||xxμ|| ⋅ ||yyμ||)
Thus, because the two dimensions have a positive linear correlation in this example, the correlation coefficient is 1.

4.4. Observations, Hypothesis Testing, and Significance

The correlation coefficient gives us a way to quantify that there may exist a correlation between two dimensions in a data set. However, this correlation may be an artifact of the way we happened to obtain the data. For example, we may have obtained a biased sample. We can account for this by modeling the likelihood of obtaining a biased sample given the assumption that no correlation exists. If it is very unlikely that we would obtain a sample with the observed correlation if no correlation actually exists, then we might conclude that it is likely that a correlation does exist.
To take this approach, we need to compare the correlation we observe against something (i.e., to decide how likely or unlikely it is if there is no correlation). This means we need to define mathematically what it means for there to be no correlation.
Definition (distribution):
A distribution over a set S is a function f: S ℝ. If S represents a set of possible observations of a system (e.g., a collection of system states), then a distribution f might represent the probability that the system is in that state, or the frequency with which that system state has been or can be observed.
Definition (p-value):
Let a subset of the real number line S ⊂ ℝ represent all possible system states that can be observed, and let f be a distribution over S that represents the probability of observing each of the possible states in S. Given some state s S, the probability of observing s is then f(s). We define the p-value at the observed state s to be the probability of observing any state that is equal to or more extreme than the state s.
One way to define which states are more extreme is to consider all the states from s to the nearest endpoint. Suppose S is an interval [a, b]. If s > (ab) / 2, then the p-value would be computed by taking the definite integral of f from s to b. Otherwise, it would be computed by taking the definite integral of f from s to a.
In order to use the p-value concept to reason about correlation coefficients, it is necessary to model how the correlation coefficient should behave if no correlation exists. One way to do this using the data alone is to consider every permutation of the data.
Assume we have a two-dimensional data set [(x1, y1), ..., (xn, yn)], also represented by x n and y n. One way to obtain a range of possible correlation coefficients is to compute the correlation coefficient for all n! possible permutations of the second dimension. Intuitively, if the data has no correlation, then performing this permutation should not change the correlation coefficient much. On the other hand, if the data does have a correlation, we should get a distribution of correlation coefficients with respect to which the true correlation coefficient is "extreme".

from random import shuffle
from math import sqrt

data = [(18, 28), (24, 18), (27, 31), (14, 15), (46, 23),
        (36, 19), (27, 10), (34, 25), (19, 15), (13, 13),
        (4, 2), (17, 20), (28, 12), (36, 11), (26, 14),
        (19, 19), (24, 13), (25, 6), (20, 8), (17, 22),
        (18, 8), (25, 12), (28, 27), (31, 28), (35, 22),
        (17, 8), (19, 19), (23, 23), (22, 11)]
x = [xi for (xi, yi) in data]
y = [yi for (xi, yi) in data]

def permute(x):
    shuffled = [xi for xi in x]
    return shuffled

def avg(x): # Average
    return sum(x)/len(x)

def stddev(x): # Standard deviation.
    m = avg(x)
    return sqrt(sum([(xi-m)**2 for xi in x])/len(x))

def cov(x, y): # Covariance.
    return sum([(xi-avg(x))*(yi-avg(y)) for (xi,yi) in zip(x,y)])/len(x)

def corr(x, y): # Correlation coefficient.
    if stddev(x)*stddev(y) != 0:
        return cov(x, y)/(stddev(x)*stddev(y))

def p(x, y):
    c0 = corr(x, y)
    corrs = []
    for k in range(0, 2000):
        y_permuted = permute(y)
        corrs.append(corr(x, y_permuted))
    return len([c for c in corrs if abs(c) >= abs(c0)])/len(corrs)
In the above, we compute the correlation coefficient for a large number of permuted data sets. We then compute the fraction of these data sets that have a correlation coefficient with a higher absolute value than the correlation coefficient of the original data. We can call this fraction the p-value. Ideally, we would try all possible permutations; since this is infeasible, we can use this technique to get an approximation.
Alternatively, we can use a library to compute the same information. The scipy.stats.pearsonr() function will return the correlation coefficient and the p-value.

import scipy.stats
print(scipy.stats.pearsonr(x, y))
Another approach is to mathematically derive a distribution of correlation coefficients based on some assumptions about how the observation vectors x and y are distributed if they are independent of one another (i.e., not correlated) and each follow some distribution (e.g., the normal distribution). Then, computing a definite integral of this distribution from some specific observation to the nearest boundary would yield the p-value for that observation. Taking this approach usually requires parameterizing the distribution using the size of the data set (i.e., the number of samples) because this can influence how the distribution of correlation coefficients should be computed.
Suppose that you have a data set of n observations in which each tuple is of the form (wi, xi, yi, zi), where w = (w1, ..., wn), x = (x1, ..., xn), y = (y1, ..., yn) and z = (z1, ..., zn). You compute some covariance and correlation coefficient quantities and find the following relationships:
cov(x, y)
cov(w, z)
ρ(y, z)
ρ(w, x)
ρ(w, x)
  • Can you conclude that the correlation between x and y is stronger than the correlation between w and z?
  • Can you conclude that the correlation between y and z is stronger than the correlation between w and x?
  • Suppose that we change the last inequality to ρ(y, z) < 0. Does this change your answers to any of the above?
Suppose you have a data set D consisting of tuples of the form (xi, yi) and you want to define the necessary transformations using the MapReduce paradigm to compute the covariance for this data set. Provide the definitions of the map and reduce operations necessary to accomplish this.
Suppose you have a data set D consisting of tuples of the form (xi, yi) and you want to compute the correlation coefficient for this data set with a transformation sequence that uses building blocks from the relational paradigm. The flow diagram for this sequence of transformations is provided below.
!table([ [['dr:prod`r:proj``(xi,yi)'], ['r:agg plus``(xi,yi,1)'], ['d:proj``(sx,sy,n)']], [['d:agg plus``(((xi-μ(x))^2)/n, ((y-μ(y))^2)/n)'], ['rd:proj`l:proj``(xi,yi,μ(x), μ(y), n)'], ['l:prod``(μ(x) = sx/n, μ(y) = sy/n, n)']], [['d:proj``(σ(x)^2, σ(y)^2)'], ['d:prod``(cov(x, y)'], ['l:agg plus``(1/n)*(x-μ(x))*(y-μ(y))']], [['r:prod``(σ(x), σ(y))'], ['r:proj``(cov(x,y), σ(x), σ(y))'], ['(ρ(x, y))']] ])

4.5. Sampling and Inference

Often, data sets are partial observations of the overall system. Despite this, we may still be able to infer things about the system's properties from those partial observations. In this subsection we go over just a few of the many methods available for making inferences based on data obtained via sampling.
Definition (sampling):
Given some data set S (which may not be known in its entirely and may only exist in theory, e.g., the data set of all humans, or the data set of all Earth-size planets in the galaxy), we call any subset TS a sample of S.
Typically, the data set being sampled will contain records or tuples, each corresponding to an item or an individual. Each tuple provides some information about the properties or measurements of that item or individual along certain dimensions (e.g., size, location, and so on).
Definition (probability sampling):
A probability sample of a data set S is any sample TS that is collected under the condition that for every tuple, the probabiility of selecting that tuple is known in advance.
Fact (proportional reasoning between independent variables):
Assume a data set S has information that can be used to derive two relevant binary characteristics describing the individuals or items in that data set (e.g., whether an individual is male or female, whether an individual resides in a particular neighborhood, and so on). We can define two (possibly overlapping) subsets AS and BS corresponding to those subsets that have each of those characteristics.
If the two characteristics are independent we can conclude that:
In other words, if we treat the tuples from B as a sample of S and we find the fraction of those tuples that are also in A, we can conclude that the same would have occurred if we had sampled the entirety of S.
Assume we know that within a certain geographical region, one out of every 10 people is a member of an online social networking mobile application that collects the geographical locations of its users. Suppose we take a sample of how many unique users were present in a particular neighborhood within that region over the course of a week, and we get a total of 1200 users. Assuming that there is no correlation between that particular neighborhood and the popularity of that mobile application (i.e., residency in that neighborhood and membership in the social network are independent variables), we can then infer the total population in that neighborhood by solving the following equation:
The above yields n = 12,000, so we can conclude that there are about 12,000 residents in that particular neighborhood based on our observations.
In the above example, we could have also worked in the other direction if we only knew the overall population in the region, the population of the neighborhood, and the popularity of the application within the population in that neighborhood. This would allow us to derive the total number of users of the application in the region.
Fact (capture-recapture):
Suppose that we can use probability sampling of a data set of size N, we know that the probability of selecting any individual tuple is equivalent to 1/N, and we do not know the size N of the data set. In addition, suppose we are allowed to "mark" any tuple that appears in a sample we take so that we can identify it (with complete certainty) as a previously marked tuple if we see it in subsequent samples.
Under these conditions, we can estimate the size of the entire data set (an unknown quantity) in the following way. First, we take a sample of size n of the data set, and we mark every tuple in that sample. Next, we take another (independent) sample of size K and count the number of marked tuples in that sample (suppose it is kK). We can then solve the following equation for N:
(nK) / k
Suppose that we want to know how many distinct people regularly visit a particular grocery store, and we can identify visitors in some way (e.g., by recording their unique identifiers). We can fix a particular time period and collect the identifiers of all the people who visit that store (suppose we collect n of them). We can then fix another time period, again record the identifiers of all K visitors during that time period, and then determine what fraction k/K of the identifiers from the first collection appears in the second collection. We could then estimate that the total number of individuals who visit the store regularly is around (nK) / k.
You are obtaining a probability sample of a population such that the probability of selecting every individual is the same. You take a sample of 1000 individuals and find that 200 of them are members of a particular social network. Answer the following questions.
  • Assuming that membership in the social network is completely independent of age, and that there are 20,000 individuals who are 18-24 years old in the overall population, provide one possible estimate k for the number of individuals in the 18-24 age range who are members of the social network.
  • Suppose you determine that the social network has 10,000 unique members. What is the size n of the population from which your sample was taken?
Fact (Algorithm R and reservoir sampling):
Using Algorithm R, it is possible to take a probability sample of size k from an incoming sequence of data set entries (e.g., tuples) without knowing the length of the sequence in advance. Furthermore, it is guaranteed that every item in the sequence had an equal probability of being added to the sample.
Assume we have enough storage in our database to store n + k tuples, and it currently holds a data set X that has n tuples. However, we expect that tomorrow, we will obtain a new data set Y that also has n tuples, so we cannot store it unless we delete X completely. We need to take a probability sample of size k from the collection XY such that any item has an equal likelihood of being added to the sample. Is this possible? Explain your answer. You may assume that every tuple in X and Y has a unique identifier and there is no overlap between X and Y.
You were using Algorithm R to perform reservoir sampling of packets as they were arriving, with a sample size of 100. You were also recording the number of packets there were corrupted, but you did not record the total number of packets that arrived.
Later on, someone asks you to compute the total number of packets that had arrived during the sampling period. You look at your sample and you find that out of 100 packets, 7 were corrupted. You also recorded that a total of 1400 packets had been corrupted during the sample collection period. How many packets arrived during the sample collection period?
Suppose there are exactly n entries in a data set of all neighborhoods and that these entries can be divided into two dimensions: x = (x1, ..., xn) and y = (y1, ..., yn). Furthermore, for all 1 ≤ in, we have that xi {0,1} and yi {0,1}, where each observation of the form (xi, yi) specifies whether the ith neighborhood has a hospital (the xi component) and whether it has a school (the yi component).
We compute the correlation coefficient and determine that ρ(x, y) = 0. If there are 12 neighborhoods with a school, there are 6 neighborhoods that have both a school and a hospital, and there are 8 hospitals in total, what can these facts reasonably imply about the total number of neighborhoods n?

5. Visualizations and Web Services

5.1. Web Services

In this section, we present examples of how a few existing tools can be used to implement and test a basic web service. More generally, there are innumerable frameworks for building server-side and client-side application components.
The following presents one possible web service implementation using the Flask framework.

import jsonschema
from flask import Flask, jsonify, abort, make_response, request
from flask.ext.httpauth import HTTPBasicAuth

app = Flask(__name__)
auth = HTTPBasicAuth()

users = [
  { 'id': 1, 'username': u'alice' },
  { 'id': 2, 'username': u'bob' }

schema = {
  "type": "object", 
  "properties": {"username" : {"type": "string"}},
  "required": ["username"]

@app.route('/client', methods=['OPTIONS'])
def show_api():
    return jsonify(schema)

@app.route('/client', methods=['GET'])
def show_client():
    return open('client.html','r').read()

@app.route('/app/api/v0.1/users', methods=['GET'])
def get_users(): # Server-side reusable name for function.
    print("I'm responding.")
    return jsonify({'users': users})

@app.route('/app/api/v0.1/users/', methods=['GET'])
def get_user(user_id):
    user = [user for user in users if user['id'] == user_id]
    if len(user) == 0:
    return jsonify({'user': user[0]})

def not_found(error):
    return make_response(jsonify({'error': 'Not found foo.'}), 404)

@app.route('/app/api/v0.1/users', methods=['POST'])
def create_user():
    if not request.json:
        print('Request not valid JSON.')

        jsonschema.validate(request.json, schema)
        user = { 'id': users[-1]['id'] + 1, 'username': request.json['username'] }
        return jsonify({'user': user}), 201
        print('Request does not follow schema.')

def foo(username):
    if username == 'alice':
        return 'ecila'
    return None

def unauthorized():
    return make_response(jsonify({'error': 'Unauthorized access.'}), 401)

if __name__ == '__main__':
The HTML page below can be used to test the POST request handling capabilities of the application.

  function postSomething() {
    var http = new XMLHttpRequest();
    var url = "http://localhost:5000/app/api/v0.1/users";
    var json = JSON.stringify({'name':'carl'});"POST", url, true);
    http.setRequestHeader("Content-type", "application/json");
    http.onreadystatechange = function() {
      if (http.readyState == 4 && http.status == 200) {
<button onclick="postSomething();">Post new user</button>

[link]  References

[1] "World's population increasingly urban with more than half living in urban areas".
[2] Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. West. "Growth, innovation, scaling, and the pace of life in cities". Proceedings of the National Academy of Sciences of the United States of America 2007;104(17):7301-7306.
[3] Robert Albright, and Alan Demers, Johannes Gehrke, Nitin Gupta, Hooyeon Lee, Rick Keilty, Gregory Sadowski, Ben Sowell, and Walker White. "SGL: A Scalable Language for Data-driven Games". Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data 2008.
[4] Walker White, Benjamin Sowell, Johannes Gehrke, and Alan Demers. "Declarative Processing for Computer Games". Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games 2008.
[5] W3C Working Group. "PROV Model Primer".
[6] Robert Ikeda and Jennifer Widom. "Data Lineage: A Survey".
[7] Y. Cui and J. Widom. "Lineage Tracing for General Data Warehouse Transformations". The VLDB Journal 2003;12(1):41-58.
[8] Gerd Gigerenzer. "Mindless statistics". The Journal of Socio-Economics 2004;33(5):587-606.
[9] Nihar B. Shah and Dengyong Zhou. "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing". CoRR 2014.

Appendix A. Other Resources

A.1. MongoDB and Related Resources

In this course, staff will use MongoDB to store and work with large data sets. This section lists some resources that you may find useful.
Some references, tutorials, and other materials of interest mentioned during lecture include the following.

A.2. Installation Resources for Other Software Packages and Libraries

This subsection contains links to other useful resources that may help with installing packages and libraries we use in this course.
Some resources that may be useful when working with the prov package:
  • when installing the prov library for Python on Windows, you may have some trouble installing the lxml package. Try obtaining the precompiled package here, instead, and installing it using pip install *.whl;
  • a short tutorial is available here.
As we use the JSON format for storing authentication, configuration, and provenance information, the JSON Schema is useful and there is a corresponding jsonschema Python package that you will need to install.
Two useful kinds of tools for solving problems that can be represented as systems of equations and inequalities are SMT solvers and linear optimization packages.
  • The SciPy library collection includes the optimize.linprog module for solving linear optimization problems, and the library is relatively straightforward to install:
    • using Windows, it may be easiest to obtain and install precompiled binaries for numpy and scipy (in that order);
    • using Mac OS X or Linux, running pip install numpy and pip install scipyshould be sufficient (you may need to download and install a Fortran compiler first under some configurations).
  • The Z3 SMT solver is now open source and very easy to use with Python; precompiled binaries are also available.