Data Mechanicsfor Pervasive Systems and Urban Applications

NOTE: This page contains all the examples presented during the lectures, as well as all the assigned projects. Click here to go back to the main page with the course information and schedule.

1. Introduction, Background, and Motivation
2. Modeling Data and Data Transformations
3. Systems, Models, and Algorithms
4. Statistical Analysis
5. Visualizations and Web Services
- 5.1. Visualizations
- 5.2. Web Services
- 5.3. Project #3: Visualizations, Web Services, and Complete Project
References
Appendix A. Using GitHub
- A.1. Retrieving and Submitting Assigned Projects on GitHub
Appendix B. Other Resources
- B.1. MongoDB and Related Resources
- B.2. Installation Resources for Other Software Packages and Libraries

[link]
1. Introduction, Background, and Motivation

[link]
1.1. Overview

With over half of the world's population living in cities [#], and given the possibility that cities are a more efficient [#] way to organize and distribute resources and activities, the rigorous study of cities as systems that can be modeled mathematically is a compelling proposition. With the advent of pervasive infrastructures of sensors and computational resources in urban environments (smart cities, software-defined cities, and the Internet of Things), there is a potential to inform and validate such models using actual data, and to exploit them using the interoperability of the pervasive computational infrastructure that is available.

In this course, we introduce and define the novel term data mechanics to refer specifically to the study of a particular aspect of large, instrumented systems such as cities: how data can flow through institutions and computational infrastructures, how these flows can interact, integrate, and cleave, and how they can inform decisions and operations (in both online and offline regimes). We choose the term mechanics specifically because it connotes the study of mechanics (e.g., classical mechanics) in physics, where universal laws that govern the behavior and interactions between many forms of matter are studied. We also choose this term to emphasize that, often, the data involved will be tied to the physical environment: geospatial and temporal information is a common and integral part of data sets produced and consumed by instrumented environments.

[link]
1.2. Data Mechanics Repository and Platform

This course is somewhat unusual in that there are specific, concrete software development goals towards which the staff and students are working. In particular, one goal of the course is to construct a new general-purpose service platform for collecting, integrating, analyzing, and employing (both in real time and offline) real data sets derived from urban sensors and services (particularly those in the city of Boston). We will consider several real data sets and data feeds, including:

We will also secure additional data sets from the above sources as well as a few others. In particular, some nationwide data repositories may be worth considering:

500 Cities: Local Data for Better Health.

For the purposes of student projects, there are no limits on other sources of data or computation (e.g., Twitter, Mechanical Turk, and so on) that can be employed in conjunction with some of the above.

Because this course involves the construction of a relatively large software application infrastructure, we will have the opportunity to introduce and practice a variety of standard software development and software engineering concepts and techniques. This includes source control, collaboration, documentation, modularity and encapsulation, testing and validation, inversion of control, and others. Students will also need to become familiar with how to use web service APIs used by government organizations (e.g., Socrata) to make queries and retrieve data.

The overall architecture of the service platform will have at least the following:

a database/storage backend ("repository") that houses:
- original and derived data sets, with annotations that include:
  - from where, when, and by what algorithm it was retrieved
  - using what integration algorithms it was derived
- algorithms for data retrieval or integration, with references that include:
  - when it was written and by whom
  - in what data sets it is derived
  - from what component algorithms it is composed (if it is such)
a web service ("platform") with an application program interface (API) for running analysis and optimization algorithms:
- a defined language for defining analysis and optimization algorithms over the data stored in the repository
- an interface for submitting and running algorithms
other features for data and result visualization, simulation using partial data, etc.

[link]
1.3. Mathematical Modeling, Analysis Algorithms, and Optimization Techniques

There are a variety of problems that it may be possible to address using the data in the repository and the capabilities of the platform. This course will cover a variety of useful online and offline optimization topics in a mathematically rigorous but potentially application-specific way, including:

a defined language for defining analysis and optimization algorithms over the data stored in the repository,
dual decomposition,
online optimization.

The goal is to apply some of these techniques to the data sets(including integrated or derived data sets) and solve practical problems. Some of the problems raised in discussions with the City of Boston DoIT and MassDOT teams are:

characterizing intersections and coming up with a metric that incorporates:
- intersection throughput (people per unit time),
- modes of transportation (public transport, biking, walking),
- intersection safety (vehicle speed, accidents, and so on),
- intersection organization (no left turns, and so on);
characterizing streets and deriving metrics for:
- number of parking spaces,
- probability of an accident (e.g., using different modes of transportation),
- senior and handicapped mobility;
characterizing neighborhoods:
- economic condition and gentrification,
- senior and handicapped accessibility;
how to allocate resources to optimize some of the metrics above:
- where to perform repairs,
- how to improve housing affordability,
- where to place bike racks of ride sharing stations,
- senior and handicapped accessibility;
answering immediate questions relevant to an individual (e.g., in an app):
- is a building accessible,
- is there a place to park (or will there likely be at some other time),
- is a neighborhood safe (or is it becoming less or more safe),
- is a neighborhood affordable (or is it becoming less or more affordable),
- are healthcare services nearby.
integrating public transportation data with other data (e.g., Waze):
- why are buses on a certain route late,
- performance and problems during unexpected events (e.g., snow).

The above list is far from complete, and we will update it as the course progresses. Students are encouraged to discuss project ideas with faculty and one another.

[link]
1.4. Project #0: Preparation and Trial Submission

The purpose of this project is to obtain the necessary credentials for subsequent projects and for everyone (including course staff) to become familiar with the submission process and address any issues.

Choose an initial team consisting of one or two members. If Alice and Bob form a team and have the BU login names alice@bu.edu and bob@bu.edu, the unique identifier for their team will be alice_bob, where, in cases with two members, the two usernames are in ascending alphabetical order and separated by an underscore. For a one-member team, it is simply the BU login name of the sole member (e.g., alice).
Depending on which data you think you might use, sign up for one of the data services that require an account (e.g., the MBTA Developer Portal). We may also make other data sets from these organizations and others over the course of the semester.
Create a GitHub account and follow the GitHub submission instructions by forking the public repository at https://github.com/Data-Mechanics/course-2018-spr-proj-zero, adding a single folder named using the group identifier (alice or alice_bob) that contains a single ASCII text file members.txt. Each team member should commit a line to that text file specifying a mapping from their GitHub username and their BU username. For example, if Alice and Bob have the GitHub usernames alicegh and bobgh, respectively, then the file should look as follows:
```
alicegh:alice
bobgh:bob
        
```
One of the team members should then make a pull request as specified in the instructions to submit the project. On subsequent projects, you will be forking a private repository once we add your user account to the Data-Mechanics organization (which we will retrieve from the members.txt file you add above).
Download and install Python 3 and MongoDB on your system if you do not have them already, then install the Data Mechanics Library (dml) and the PROV (prov) modules for Python.

[link]
2. Modeling Data and Data Transformations

To study rigorously the ways data can behave and interact within an infrastructure that generates, transforms, and consumes data (e.g., to make decisions), it is necessary to define formally what data and data transformations are. One traditional, widely used, and widely accepted model for data is the relational model: any data set is a relation (i.e., a subset of a product of sets), and transformations on data are functions between relations. While the relational model is sufficient to define any transformation on data sets, the MapReduce paradigm is one modern framework for defining transformations between data sets.

In modern contexts and paradigms, these models can be useful when studying relatively small collections of individual, curated data sets that do not change dramatically in the short term. However, these alone are not sufficient in a context in which data sets are overwhelmingly multitudinous, varying in structure, and continuously being generated, integrated, and transformed. One complementary discipline that can provide useful tools for dealing with numerous interdependent data sets is that of data provenance or data lineage. The provenance of a data set (or subset) is a formal record of its origin, which can include how the data was generated, from what data or other sources it was created or derived, what process was used or was responsible for creating or deriving it, and other such information. This information can be invaluable for a variety of reasons beyond the obvious ones (i.e., the origin of the data), such as:

the same data be generated again or reproduced if an error occurs,
a data set can be updated from the original source if the source has been updated or changed,
the source of an inconsistency or aberration of data can be investigated,
any of the above could be applied to a subset because recomputing or investigating the entire data set would be prohibitively costly.

[link]
2.1. Relational Data and the MapReduce Paradigm

The relational model for data can be expressed in a variety of ways: a data set is a relation on sets, a logical predicate governing terms, a collection of tuples or records with fields and values, a table of rows with labelled columns, and so on. Mathematically, they are all equivalent. In this course, we will adopt a particular model because it is well-suited for the tools and paradigms we will employ, and because it allows for one fairly clean mathematical integration of the study of relational data and data provenance.

[link]

Definition:

A data set (also known as a store or database) is a multiset R: a collection (possibly with duplicates) of tuples of the form (x₁,...,x_n) taken from the set product X₁ × ... × X_n. Typically, some distinguished set (e.g., the left-most in the set product) will be a set of keys, so that every tuple contains a key. Whether a set is a key or not often depends on the particular paradigm and context, however.

[link]

Definition:

A data transformation T: A → B is a mapping from one space of data sets A to another space of data sets B. Notice that an individual data set S (i.e., a relation or, equivalently, a set of tuples) is just an element of A.

Some of the typical building blocks for data transformations in the relational model are:

union and difference (intersection can also be defined in terms of these),
projection (sometimes generalized into extended projection),
selection (filtering),
renaming,
Cartesian product,
variants of join operations (many can be constructed using the above, but other variants have been added as extensions),
aggregation (an extension).

One common operation on relations that is not possible to express in traditional formulations but found in some relational database systems is the transitive closure of a data set. Normally, this requires an iterative process consisting of a sequence of join operations.

[link]

Example:

We can model and illustrate the transformations that constitute the MapReduce paradigm using Python. Note that selection and projection can be implemented directly using Python comprehensions, but we define wrappers below for purposes of illustration.


def union(R, S):
    return R + S

def difference(R, S):
    return [t for t in R if t not in S]

def intersect(R, S):
    return [t for t in R if t in S]

def project(R, p):
    return [p(t) for t in R]

def select(R, s):
    return [t for t in R if s(t)]
 
def product(R, S):
    return [(t,u) for t in R for u in S]

def aggregate(R, f):
    keys = {r[0] for r in R}
    return [(key, f([v for (k,v) in R if k == key])) for key in keys]

[link]

Example:

We consider a few simple examples that illustrate how transformations can be constructed within the relational model. We start by showing how select can be used with a predicate to filter a data set.


>>> def red(t): return t == 'tomato'
>>> select(['banana', 'tomato'], red)
['tomato']

Suppose we have two data sets and want to join them on a common field. The below sequence illustrates how that can be accomplished by building up the necessary expression out of simple parts.


>>> X = [('Alice', 22), ('Bob', 19)]
>>> Y = [('Alice', 'F'), ('Bob', 'M')]

>>> product(X,Y)
[(('Alice', 'F'), ('Alice', 22)), (('Alice', 'F'), ('Bob', 19)), (('Bob', 'M'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> select(product(X,Y), lambda t: t[0][0] == t[1][0])
[(('Alice', 'F'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> project(select(product(X,Y), lambda t: t[0][0] == t[1][0]), lambda t: (t[0][0], t[0][1], t[1][1]))
[('Alice', 'F', 22), ('Bob', 'M', 19)]

Finally, the sequence below illustrates how we can compute an aggregate value for each unique key in a data set (such as computing the total age by gender).


>>> X = [('Alice', 'F', 22), ('Bob', 'M', 19), ('Carl', 'M', 25), ('Eve', 'F', 27)]

>>> project(X, lambda t: (t[1], t[2]))
[('F', 22), ('M', 19), ('M', 25), ('F', 27)]

>>> aggregate(project(X, lambda t: (t[1], t[2])), sum)
[('F', 49), ('M', 44)]

The following sequence explains what is happening inside the aggregate function for a particular key.


>>> Y = project(X, lambda t: (t[1], t[2]))
>>> keys = {t[0] for t in Y}
>>> keys
{'F', 'M'}
>>> [v for (k,v) in Y if k == 'F']
[22, 27]
>>> sum([v for (k,v) in Y if k == 'F'])
49
>>> ('F', sum([v for (k,v) in Y if k == 'F']))
('F', 49)

[link]

Example:

Consider the following line of Python code that operates on a data set D containing some voting results broken down by voter, state where the voter participated in voting, and the candidate they chose:


R = sum([1 for (person, state, candidate) in D if state == "Massachusetts" and candidate == "Trump"])

We can identify which building blocks for transformations available in the relational model are being used in the above code; we can also draw a corresponding flow diagram that shows how they fit together.

!table([ ['r:select``D', 'r:project``...', 'r:agg sum``...', 'R'] ])

We can also rewrite the above using the building blocks for transformations in the relational model.


R = aggregate(project(select(D, lambda psc: if psc[1] == "Massachusetts" and psc[2] == "Trump"), lambda psc: 1), sum)

Notice that the order of the operations is important. If we tried to do the projection before the selection, the information that we are using to perform the selection would be gone after the projection. On the other hand, two selection (one immediately followed by another) can be done in any order. This is one reason why it makes sense to call the paradigm a relational algebra: there are algebraic laws that govern how the operations can be arranged.

In the MapReduce paradigm, a smaller set of building blocks inspired by the functional programming paradigm (supported by languages such as ML and Haskell) exist for defining transformations between data sets. Beyond adapting (with some modification) the map and reduce (a.k.a., "fold") functions from functional programming, the contribution of the MapReduce paradigm is the improvement in the performance of these operations on very large distributed data sets. Because of the elegance of the small set of building blocks, and because of the scalability advantages under appropriate circumstances, it is worth studying the paradigm's two building blocks for data transformations: map and reduce:

a map operation will apply some user-specified computation to every tuple in the data set, producing one or more new tuples,
a reduce operation will apply some user-specified aggregation computation to every set of tuples having the same key, producing a single result.

Notice that there is little restriction on the user-specified code other than the requirement that it be stateless in the sense that communication and coordination between parallel executions of the code is impossible. It is possible to express all the building blocks of the relational model using the building blocks of the MapReduce paradigm.

[link]

Example:

We can model and illustrate the two basic transformations that constitute the MapReduce paradigm in a concise way using Python.


def map(f, R):
    return [t for (k,v) in R for t in f(k,v)]
    
def reduce(f, R):
    keys = {k for (k,v) in R}
    return [f(k1, [v for (k2,v) in R if k1 == k2]) for k1 in keys]

A map operation applies some function f to every key-value tuple and produces zero or more new key-value tuples. A reduce operation collects all values under the same key and performs an aggregate operation f on those values. Notice that the operation can be applied to any subset of the tuples in any order, so it is often necessary to use an operation that is associated and commutative.

At the language design level, the relational model and the MapReduce paradigm are arguably complementary and simply represent different trade-offs; they can be used in conjunction on data sets represented as relations. Likewise, implementations of the two represent different performance trade-offs. Still, contexts in which they are used can also differ due to historical reasons or due to conventions and community standards.

[link]

Example:

We can use the Python implementations of the map and reduce operations in an example above to implement some common transformations in the relational model. For this example, we assume that the first field in each tuple in the data set is a unique key (in general, we assume there is a unique key field if we are working with the MapReduce paradigm). We illustrate how projection and aggregation can be implemented in the code below.


R = [('Alice', ('F', 23)), ('Bob', ('M', 19)), ('Carl', ('M', 22))]

# Projection keeps only gender and age.
# The original key (the name) is discarded.
X = map(lambda k,v: [(v[0], v[1])], R)

# Aggregation by the new key (i.e., gender).
Y = reduce(lambda k,vs: (k, sum(vs)), X)

Selection is not straightforward to implement because this is not a common use case for MapReduce workflows. However, it is possible by treating each tuple in the input data set as its own key and then keeping on the keys at the end.


R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]

X = map(lambda k,v: [((k,v), (k,v))] if v > 20 else [], R) # Selection.
Y = reduce(lambda k,vs: k, X) # Keep same tuples (use tuples as unique keys).

We can also perform a simple join operation, although we also need to "combine" the two collections of data. This particular join operation is also simple because each tuple in the input is only joined with one other tuple. Join operations were also not originally envisioned as a common use case for MapReduce workflows.


R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
S = [('Alice', 'F'), ('Bob', 'M'), ('Carl', 'M')]

X =   map(lambda k,v: [(k, ('Age', v))], R)\
    + map(lambda k,v: [(k, ('Gender', v))], S)
Y = reduce(\
        lambda k,vs:\
          (k,(vs[0][1], vs[1][1]) if vs[0][0] == 'Age' else (vs[1][1],vs[0][1])),\
        X\
      )

[link]

Example:

Suppose you have a data set containing tuples of the form (name, gender, age). You want to produce a result of the form (gender, total) where total is the sum of the age values of all tuples with the corresponding gender. The code below illustrates using Python how this can be done in the MapReduce paradigm.


INPUT = [('Alice', ('F', 19)),\
         ('Bob', ('M', 23)),\
         ('Carl', ('M', 20)),\
         ('Eve', ('F', 27))]
TEMP = map(lambda k,v: [(v[0], v[1])], INPUT)
OUTPUT = reduce(lambda k,vs: (k, sum(vs)), TEMP)

We provide an equivalent MapReduce paradigm implementation of the algorithm using MongoDB.


db.INPUT.insert({_id:"Alice", gender:"F", age:19});
db.INPUT.insert({_id:"Bob", gender:"M", age:23});
db.INPUT.insert({_id:"Carl", gender:"M", age:20});
db.INPUT.insert({_id:"Eve", gender:"F", age:27});

db.INPUT.mapReduce(
  function() {
    emit(this.gender, {age:this.age});
  },
  function(k, vs) {
    var total = 0;
    for (var i = 0; i < vs.length; i++)
        total += vs[i].age;
    return {age:total};
  },
  {out: "OUTPUT"}
);

[link]

Example:

Suppose we have a data set containing tuples of the form (name, city, age). We want to produce a result of the form (city, range) where range is defined as the difference between the oldest and youngest person in each city.

We can construct a sequence of transformations that will yield this result using the basic building blocks available in the relational model (projections, selections, aggregations, and so on).
Alternatively, we can use the MapReduce paradigm to construct a single-pass (one map operation and one reduce operation) MapReduce computation that computes the desired result. Hint: emit two copies of each entry (one for the computation of the maximum and one for the computation of the minimum).

One approach in the relational paradigm is to aggregate the minimum and maximum age for each city, negate the minimum ages, and aggregate once more to get the ranges.


NCA = [('Alice', 'Boston', 23), ('Bob', 'Boston', 19), ('Carl', 'Seattle', 25)]
CA = [(c,a) for (n,c,a) in NCA]
MIN = aggregate(CA, min)
MIN_NEG = [(c,-1*a) for (c,a) in MIN]
MAX = aggregate(CA, max)
RESULT = aggregate(union(MIN_NEG, MAX), sum)

Below is a flow diagram that represents the above transformations.

!table([ [null , null, 'r:proj``MIN', 'rd:union``MIN_NEG', null , null], ['r:proj``NCA', 'ru:aggmin`rd:agg max``CA', null , null , 'r:agg sum``...', 'RESULT'], [null , null, 'rru:union``MAX', null , null , null] ])

Using the MapReduce paradigm, we may prefer to follow the paradigm's two-stage breakdown of a computation by first converting every single entry in the input data set into an approximation of the range. For example, given only the record ('Alice', ('Boston', 23)), in the mapping stage we might estimate the range as ('Boston', (23, 23, 0)) where the second and third entries are the minimum and maximum "known so far" given the limited information (a single data point). Then, in the reduce stage, we would combine these estimates.


NCA = [('Alice', ('Boston', 23)), ('Bob', ('Boston', 19)), ('Carl', ('Seattle', 25))]
I = map(lambda k, v: [(v[0], (v[1], v[1], 0))], NCA)

def reducer(k, vs):
    age_lo = min([l for (l,h,r) in vs])
    age_hi = max([h for (l,h,r) in vs])
    age_ran = age_hi - age_lo 
    return (k, (age_lo, age_hi, age_ran))
RESULT = reduce(reducer, I)

[link]

Example:

Consider the following algorithm implemented as a collection of transformations drawn from the relational paradigm. The input data set consists of tuples of the form (intersection, date, time, cars). Each tuple represents a single accident took place at the specified intersection, occurred on the specified date and time, and involved the specified number of cars.


D = [('Commonwealth Ave./Mass Ave.', '2016-11-02', '11:34:00', 3), ...]
M = project(D, lambda i, date, time, cars: [(i, cars)])
R = aggregate(M, max)
H = [(i,d,t) for ((i,m), (j,d,t,c)) in product(R, D) if i==j and m==c]

A data flow diagram representing the transformations is provided below.

!table([ ['d:prod`r:proj``D', 'r:agg max``M', 'lld:prod``R'], ['r:select``...', 'r:proj``...', 'H'] ])

Complete the following tasks. To simplify the exercise, you may assume that there is at most one accident involving each possible quantity of cars (e.g., there is only one accident that involved five cars).

Write a description of what the algorithm computes. What does the output data set represent?
Provide an implementation of this algorithm using the MapReduce paradigm. You only need one map and one reduce operation.

[link]
2.2. Composing Transformations into Algorithms

Whether we are using the relation model or the MapReduce paradigm, the available building blocks can be used to assemble fairly complex transformations on data sets. Each transformation can be written either using the concrete syntax of a particular programming language that implements the paradigm, or as a data flow diagram that describes how starting and intermediate data sets are combined to derive new data sets over the course of the algorithm's operation.

[link]

Example:

We can use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the k-means clustering algorithm.


def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2

def plus(args):
    p = [0,0]
    for (x,y) in args:
        p[0] += x
        p[1] += y
    return tuple(p)

def scale(p, c):
    (x,y) = p
    return (x/c, y/c)

M = [(13,1), (2,12)]
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

OLD = []
while OLD != M:
    OLD = M

    MPD = [(m, p, dist(m,p)) for (m, p) in product(M, P)]
    PDs = [(p, dist(m,p)) for (m, p, d) in MPD]
    PD = aggregate(PDs, min)
    MP = [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
    MT = aggregate(MP, plus)

    M1 = [(m, 1) for (m, _) in MP]
    MC = aggregate(M1, sum)

    M = [scale(t,c) for ((m,t),(m2,c)) in product(MT, MC) if m == m2]
    print(sorted(M))

Below is a flow diagram describing the overall organization of the computation (nodes are data sets and edges are transformations). Note that the nodes labeled "..." are intermediate results that are implicit in the comprehension notation. For example, [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2] first filters product(MPD, PD) using a selection criteria if p==p2 and d==d2 and then performs a projection from tuples of the form ((m,p,d), (p2,d2)) to tuples of the form (m, p) to obtain the result.

!table([ ['rd:prod``P', null, null, 'd:agg min``PDs'], [null, 'r:proj``...', 'dr:prod`ru:proj``MPD', 'd:prod``PD'], ['ru:prod``M', null, null, 'd:selection``...'], ['u:proj``...', 'ld:prod``MC', 'l:agg sum``M1', 'd:proj``...'], ['u:selection``...', 'l:prod``MT', null, 'lu:proj`ll:agg plus``MP'] ])

[link]

Example:

We can use building blocks drawn from the MapReduce paradigm that are available in MongoDB to construct an implementation of the k-means clustering algorithm. This implementation illustrates many of the idiosyncrasies of the MapReduce abstraction made available by MongoDB.


db.system.js.save({ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});

function flatten(A) {
  db[A].find().forEach(function(a) { db[A].update({_id: a._id}, a.value); });
}

function prod(A, B, AB) {
  db[AB].remove({});
  db.createCollection(AB);
  db[A].find().forEach(function(a) {
    db[B].find().forEach(function(b) {
      db[AB].insert({left:a, right:b});
    });
  });
}

function union(A, B, AB) {
  db[AB].remove({});
  db.createCollection(AB);
  db[A].find().forEach(function(a) {
    db[AB].insert(a);
  });
  db[B].find().forEach(function(b) {
    db[AB].insert(b);
  });
}

function hash_means(M, HASH) {
  db[M].mapReduce(
    function() { emit("hash", {hash: this.x + this.y}); },
    function(k, vs) {
      var hash = 0;
      vs.forEach(function(v) {
        hash += v.hash;
      });
      return {hash: hash};
    },
    {out: HASH}
  );
}

// We'll only perform a single product operation. Using map-reduce, we can perform
// argmax and argmin more easily. We can also use map-reduce to compare progress.

db.M.remove({});
db.M.insert([{x:13,y:1},{x:2,y:12}]);
db.P.remove({});
db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);

var iterations = 0;
do {
  // Compute an initial hash of the means in order to have a baseline
  // against which to compare when deciding whether to loop again.
  hash_means("M", "HASHOLD");

  prod("M", "P", "MP");
  // At this point, entries in MP have the form
  // {_id:..., left:{x:13,y:1}, right:{x:4,y:5}}.

  // For each point, find the distance to the closest mean. The output after
  // flattening has entries of the form {_id:{x:?, y:?}, m:{x:?, y:?}, d:?}
  // where the identifier is the point.
  db.MPs.remove({});
  db.MP.mapReduce(
    function() {
      var point = {x:this.right.x, y:this.right.y};
      var mean = {x:this.left.x, y:this.left.y};
      emit(point, {m:mean, d:dist(point, mean)});
    },
    function(point, vs) {
      // Each entry in vs is of the form {m:{x:?, y:?}, d:?}.
      // We return the one that is closest to point.
      var j = 0;
      vs.forEach(function(v, i) {
        if (v.d < vs[j].d)
          j = i;
      });
      return vs[j]; // Has form {m:{x:?, y:?}, d:?}.
    },
    {out: "MPs"}
  );
  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, value:{m:{x:4, y:5}, d:1}}.

  flatten("MPs");
  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, m:{x:4, y:5}, d:1}.

  // For each mean (i.e., key), compute the average of all the points that were
  // "assigned" to that mean (because it was the closest mean to that point).
  db.MPs.mapReduce(
    function() {
      // The key is the mean and the value is the point together with its counter.
      var point = this._id;
      var point_with_count = {x:point.x, y:point.y, c:1};
      var mean = this.m;
      emit(mean, point_with_count);
    },
    function(key, vs) {
      // Remember that the reduce operations will be applied to the values for each key
      // in some arbitrary order, so our aggregation operation must be commutative (in
      // this case, it is vector addition).
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;
      });
      return {x:x, y:y, c:c};
    },
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
    }
  );
  // At this point, entries in M have the form
  // {_id:{x:2, y:3}, value:{x:4, y:5}}.

  flatten("M");
  // At this point, entries in MPs have the form
  // {_id:{x:2, y:3}, x:4, y:5}. The identifier
  // value does not matter as long as it is unique.

  // Compute the hash of the new set of means.
  hash_means("M", "HASHNEW");

  // Extract the two hashes in order to compare them in the loop condition.
  var hashold = db.HASHOLD.find({}).limit(1).toArray()[0].value.hash;
  var hashnew = db.HASHNEW.find({}).limit(1).toArray()[0].value.hash;
  print(hashold);
  print(hashnew);
  print(iterations);
  iterations++;
} while (hashold != hashnew);

[link]

Example:

If we are certain that our k-means algorithm will have a relatively small (e.g., constant) number of means, we can take advantage of this by only tracking the means in a local variable and using .updateMany() to distribute the means to all the points at the beginning of each iteration. This leads to a much more concise (and, for a small number of means, efficient) implementation of the algorithm than what is presented in a previous example. In particular, it is no longer necessary to encode a production operation within MongoDB.


db.system.js.save({ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});

db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);

var means = [{x:13,y:1}, {x:2,y:12}];
do {
  db.P.updateMany({}, {$set: {means: means}}); // Add a field to every object.
  db.P.mapReduce(
    function() {
      var closest = this.means[0];
      for (var i = 0; i < this.means.length; i++)
        if (dist(this.means[i], this) < dist(closest, this))
          closest = this.means[i];
      emit(closest, {x:this.x, y:this.y, c:1});
    },
    function(key, vs) {
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;
      });
      return {x:x, y:y, c:c};
    },
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
    }
  );
  means = db.M.find().toArray().map(function(r) { return {x:r.value.x, y:r.value.y}; });
  printjson(means);
} while (true);

We do not deal with the issue of convergence in the above example; an equality function on JSON/BSON objects (i.e., the list of means) would need to be defined to implement the loop termination condition. Below, we illustrate how the above implementation can be written within Python using PyMongo.


import pymongo
import bson.code

client = pymongo.MongoClient()
db = client.local

db.system.js.save({'_id':'dist', 'value': bson.code.Code("""
    function(u, v) {
        return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
    }
    """)})

db.P.insert_many([{'x':1,'y':2},{'x':4,'y':5},{'x':1,'y':3},{'x':10,'y':12},\
                  {'x':13,'y':14},{'x':13,'y':9},{'x':11,'y':11}])

means = [{'x':13,'y':1}, {'x':2,'y':12}]

while True:
    db.P.update_many({}, {'$set': {'means': means}})

    mapper = bson.code.Code("""
        function() {
            var closest = this.means[0];
            for (var i = 0; i < this.means.length; i++)
                if (dist(this.means[i], this) < dist(closest, this))
                    closest = this.means[i];
            emit(closest, {x:this.x, y:this.y, c:1});
        }
        """)
    reducer = bson.code.Code("""
        function(key, vs) {
            var x = 0, y = 0, c = 0;
            vs.forEach(function(v, i) {
                x += v.x;
                y += v.y;
                c += v.c;
            });
            return {x:x, y:y, c:c};
        }
        """)
    finalizer = bson.code.Code("""
        function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; }
        """)
    db.P.map_reduce(mapper, reducer, "M", finalize = finalizer)

    means = [{'x':t['value']['x'], 'y':t['value']['y']} for t in db.M.find()]
    print(means)

[link]

Example:

We can also use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the Floyd-Warshall shortest path algorithm.


N = ['a','b','c','d','e','f']
E = [('a','b'),('b','c'),('a','c'),('c','d'),('d','e'),('e','f'),('b','f')]

oo = float('inf') # This represents infinite distance.

P = product(N,N)
I = [((x,y),oo if x != y else 0) for (x,y) in P] # Zero distance to self, infinite distance to others.
D = [((x,y),1) for (x,y) in E] # Edge-connected nodes are one apart.

OUTPUT = aggregate(union(I,D), min)
STEP = []
while sorted(STEP) != sorted(OUTPUT):
    STEP = OUTPUT
    P = product(STEP, STEP) # All pairs of edges.
    NEW = union(STEP,[((x,v),k+m) for (((x,y),k),((u,v),m)) in P if u == y]) # Add distances of connected edge pairs.
    OUTPUT = aggregate(NEW, min) # Keep only shortest node-node distance entries.

SHORTEST = OUTPUT

[link]
2.3. Data Provenance

Data provenance is an overloaded term that refers, in various contexts and communities, to the source, origin, or lifecycle of a particular unit of data (which could be an individual data point, a subset of a data set, or an entire data set). In this course, we will use the term primarily to refer to dependency relationships between data sets (or relationships between individual entries in those data sets) that may be derived from one another (usually over time). As an area of study within computer science, data provenance (also called data lineage) is arguably still being delineated and developed by the community. However, some community standards for general-purpose representations of data provenance have been established (e.g., PROV [#]).

While the research literature explores various ways of categorizing approaches to data provenance, there are two dimensions that can be used to classify provenance techniques (surveyed in more detail in the literature [#]):

from where the data was generated (e.g., what data sets or individual data entries) and how it was generated (e.g., what algorithms were used);
whether the lineage is tracked at a fine granularity (e.g., per data entry such as a row in a data set) or a coarse granularity (e.g., per data set).

[link]

	coarse	fine
how	transformation	execution path through transformation algorithm
from where	source data sets	specific entries in source data sets

When data lineage is being tracked at a fine granularity, there are at least two approaches one can use to determine provenance of a single entry within a data set produced using a specific transformation. One approach is to track the provenance of every entry within the transformation itself (sometimes called tracing); another approach is to determine the provenance of an entry after the transformation has already been performed (e.g., by another party and/or at some point in the past). The latter scenario is more likely if storage of per-entry provenance meta-data is an issue, or if the transformations are black boxes that cannot be inspected, modified, or extended before they are applied to input data sets.

For a transformation that may combine input data set entries in some way, a large number of entries in the input data set can influence an individual entry in the output data set. For such transformations, finding the per-entry provenance for an entry in the output data set can be non-trivial. Without additional information about the transformation, the conservative assumption to make is that all entries in the input data set may contribute to every entry in the output data set.

In some cases, we may know more about a transformation. Perhaps its definition is broken down into building blocks found in the relational model (e.g., selections and projections), or it can be defined using map and reduce operations in the MapReduce paradigm. In such cases, studying the data lineage of individual data items it may be possible to derive standard techniques [#] for determining data lineage given the components that make up the transformation.

[link]

Fact:

For relational transformations that simply add or remove data set entries without changing their internal structure (i.e., schema) or data content (i.e., values), computing the per-entry provenance is relatively straightforward. In particular: for union, difference, intersection, selection, and product transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set (i.e., the fine-grained "where" provenance of an individual entry in the output) can be determined using a linear search through the input data set (or sets) for that entry.

[link]

Fact:

For projection transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set can be determined by applying the transformation to each entry in the input data set. Whenever this yields an output equivalent to the target, we know that the provenance for that output entry includes that input entry.

The above facts imply that for several relational transformation building blocks, per-entry provenance of output data set entries can be reconstructed efficiently. Note that by induction this also implies that any transformation composed of these building blocks is also amenable to efficient reconstruction of the provenance of any output data set entry.

If a transformation might combine subsets of the input entries to compute individual output entries (but we have no additional information about the transformation), then with regard to per-entry provenance a worst case scenario may apply.

[link]

Fact:

Suppose we know nothing about the internal structure of a transformation, but we want to determine what entries in the input data set influence a particular entry in the output data set. In this case, even running the entire transformation a second time (knowing the target entry) may provide no additional information about which subset of input data set entries produced that output data set entry. In this worst-case scenario, it would be necessary to run the transformation on all 2ⁿ subcollections of the input data set (for an input data set of size n), and to check for each one of those 2ⁿ outputs whether the particular entry of interest from the output data set was generated.

However, it is possible that a transformation that combines input entries into output entries falls into one of a small number of categories that make it possible to compute per-entry provenance more efficiently.

[link]

Fact:

A transformation is context-free if any two entries in the input data set that contribute to the same output entry always do so (regardless of what other input entries are present in the input data set). If a transformation is context-free, then the provenance information for a given output data set entry can be computed in quadratic time. First, we can partition the input data set using the entries of the output data set (i.e., create one bucket of input data set entries for each output data set entry). Then, we can determine which partition contributed to the output data set entry of interest using linear search.

[link]

Fact:

A transformation is key-preserving if any two entries under the same keys in the input data set always contribute to an entry with the same output key in the output data set. If a transformation is key-preserving, then the provenance information for a given entry is easy to determine by building up a map from input data set keys to output data set keys, and then performing a linear search over the input data set.

[link]

Example:

Any relational aggregation (that is, a key-based aggregation operation as defined in the relational model that produces an output data set in which all keys are unique) is always key-preserving and context-free.

The above example completes our understanding of the complexity of reconstructing per-entry provenance for all the building blocks in the relational model.

[link]

transformation f	per-entry provenance reconstruction approach	complexity for input data set with n entries
union intersection difference selection product	linear search over input data set to find entry	O(n) entry equality checks
projection	application of transformation once to each input data set entry	O(n) executions of f and O(n) entry equality checks
aggregation	application of transformation once to each input data set entry and construction of key-to-key map	O(n) executions of f and O(n log n) to build and use key-to-key map

[link]

Example:

A transformation f that takes a data set of points as its input, runs the k-means algorithm on those points, and returns a data set of means is not context-free. To see why, consider its behavior on following inputs for k = 2:

f([(0,0), (2,2)])

[(0,0), (2,2)]

f([(0,0), (2,2), (100,100)])

[(1,1), (100,100)]

Notice that in the first case, the input entries (0,0) and (2,2) each produce their own output entry. However, the introduction of a new point (100,100) results in (0,0) and (2,2) both contributing to the same output entry (1,1).

[link]

Example:

A transformation can be context-free but not key-preserving. For example, suppose a transformation aggregates some vectors by key but then discards the key via a projection:

[(j, (0,2)), (j, (2,0)), (k, (0,3)), (k, (3,0))]

↦

[(j, (2,2)), (k, (3,3))]

↦

[(2,2), (3,3)]

The above is context-free because there is no way that other input entries can have any influence over the way existing entries aggregate by key. However, they key j might map to any numeric value key in the output (depending on what the input entries are):

[(j, (0,0))]

↦

[(0,0)]

[(j, (0,0)), (j, (1,1))]

↦

[(1,1)]

[link]

Theorem:

Any key-preserving transformation is context-free. To see this, suppose that there is some transformation f that is key-preserving but not context-free. This would imply that there is some set of inputs (i, a), (j, b), and (k, c) for which the definition of context-free is not satisfied, such as a case in which two input entries each map to their own distinct output entries when on their own but map to the same output entry when the third input entry (k, c) is introduced:

f([(i, a), (j, b)])

[(l, d)]

f([(i, a), (j, b), (k, c)])

[(l, d), (m, e), (n, f)]

However, the above would imply that under different conditions, the keys i and j might either both map to a key l or might map to two different keys l and m. But this would imply that entries with the key j map to entries with either key l or key m. This is not key-preserving and contradicts our assumptions, so our premise must have been impossible.

The diagram below illustrates the relationships between the properties discussed above. Notice that whether a transformation is context-free and key-preserving is used to determine whether a transformation that combines input entries might have properties that allow us to compute per-entry provenance more efficiently. There is no need to check whether other transformations (such as projections and products) have these properties because per-entry provenance is already efficiently computable in those cases.

[link]

data set transformations

transformations that combine multiple
input entries into output entries
(example: k-means)

context-free transformations
(example: aggregation by key that drops the key)

key-preserving transformations
(example: relational aggregation by key)

other transformations
(examples: selections, projections)

We provide explicit algorithms for computing per-entry provenance in three of the cases outlined above.

[link]

Example:

If we introduce a function for computing powersets of entries in a data set (e.g., using a Python recipe for such a function), we can define an inefficient Python algorithm that can correctly determine per-entry provenance in the general case (i.e., not knowing any additional information about a transformation f).


from itertools import combinations, chain

def powerset(iterable):
    "powerset([1,2,3]) -> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def general_prov(f, X, y):
    for Y in reversed(powerset(X)): # From largest to smallest subset.
        if list(f(list(Y))) == [y]:
            return Y

[link]

Example:

The function below determines the per-entry provenance for a context-free transformation. For this example, we assume that all data set entries are of the form (key, value).


def context_free_prov(f, X, y):

    # Build the partitions of the input data set.
    partitions = []
    for (x_key, x_val) in X:
        found = False
        for i in range(0,len(partitions)):
            if len(f(partitions[i] + [(x_key, x_val)])) == 1:
                partitions[i].append((x_key, x_val))
                found = True
                break
        # Create a new partition if adding to any other
        # partition increases the size of the output data set.
        if found == False:
            partitions.append([(x_key, x_val)])

    # Find the corresponding partition.
    for partition in partitions:
        if y in f(partition):
             return partition

[link]

Example:

The function below determines the per-entry provenance for a key-preserving transformation. For this example, we assume that all data set entries are of the form (key, value).


def key_preserving_prov(f, X, y):
    keymap = {}
    
    # Build up the map from input data set
    # keys to output data set keys.
    for (x_key, x_val) in X:
        (y_key, y_val) = f([(x_key, x_val)])[0]
        keymap[x_key] = y_key
    (y_key, y_val) = y

    # Collect all the tuples that contribute
    # to the target result.
    pre_image = set()
    for (x_key, x_val) in X:
        if keymap[x_key] == y_key:
             pre_image.add((x_key, x_val))
    return pre_image

[link]

Exercise:

For each of the following informal descriptions of data transformations, determine whether computing the per-entry provenance of a row in the output data set is easy (e.g., linear, O(n log n), or quadratic time) or difficult (i.e., cannot be done in polynomial time), assuming that the transformation's implementation is a black box (except for the provided description).

Given a data set containing entries of the form (name, age), the transformation produces a list of all entries corresponding to individuals who are of a legal voting age.
Given a data set containing entries of the form (product, (department, price, date)) describing sales that take place in a department store, the transformation computes the total revenue for each department to produce entries of the form (department, revenue). Assume that each product appears in exactly one department.
Given a data set containing entries of the form (product, (department, price, date)) describing sales that take place in a department store, the transformation computes the total revenue from each department for each possible date to produce entries of the form ((date, department), revenue).
Given a data set specifying the geographical locations of a collection of radio receivers in a bounded region, determine an optimal placement of 100 radio towers within that region that minimizes the aggregate of the distances between each tower-receiver pair.

[link]

Exercise:

Given a data set containing entries of the form (restaurant, popularity, location) that describe restaurants within the various neighborhoods in a city, the transformation uses an unknown machine learning algorithm to group the restaurants by similarity of cuisine into categories (each restaurant is assigned to exactly one cuisine category) and produces a data set with entries of the form (cuisine, count) specifying the number of restaurants that specialize in each type of cuisine. If you have no additional information, how difficult will it be to determine the provenance of an individual entry in the output?

Tracking provenance at a coarse granularity (per data set) is relatively inexpensive in most applications (since the number of data sets is usually orders of magnitude less than the number of individual data set entries), and standards such as PROV provide authors of transformations a way to represent provenance information.

[link]

Example:

Suppose Alice and Bob want to assemble a script that retrieves some information from an online resource and stores it in a database. We can use the prov Python package to programmatically assemble a course granularity provenance record that conforms to the PROV standard and describes a particular execution of this script.

We first create the document object and define the namespaces we will use for symbols in the document.


doc = prov.model.ProvDocument()
doc.add_namespace('alg', 'http://datamechanics.io/algorithm/alice_bob/') # The scripts in / format.
doc.add_namespace('dat', 'http://datamechanics.io/data/alice_bob/') # The data sets in / format.
doc.add_namespace('ont', 'http://datamechanics.io/ontology#')
doc.add_namespace('log', 'http://datamechanics.io/log#') # The event log.
doc.add_namespace('bdp', 'https://data.cityofboston.gov/resource/')

We then use an agent to represent the script, en entity to represent a resource, and an activity to repsent this invocation of the script. We also associate this current activity with the script, and we and indicate that his activity used the entity representing the resource.


this_script = doc.agent('alg:example', {prov.model.PROV_TYPE:prov.model.PROV['SoftwareAgent'], 'ont:Extension':'py'})
resource = doc.entity('bdp:wc8w-nujj', 
    {'prov:label':'311, Service Requests', 
    prov.model.PROV_TYPE:'ont:DataResource', 'ont:Extension':'json'}
  )
this_run = doc.activity(
    'log:a'+str(uuid.uuid4()), startTime, endTime, 
    {prov.model.PROV_TYPE:'ont:Retrieval', 'ont:Query':'?type=Animal+Found&$select=type,latitude,longitude,OPEN_DT'}
  )
doc.wasAssociatedWith(this_run, this_script)
doc.used(this_run, resource, startTime)

Finally, we define an entity for the data obtained and indicate to what agent it was attributed, by what actvitity it was generated, and from what entity it was derived.


found = doc.entity('dat:found', {prov.model.PROV_LABEL:'Animals Found', prov.model.PROV_TYPE:'ont:DataSet'})
doc.wasAttributedTo(found, this_script)
doc.wasGeneratedBy(found, this_run, endTime)
doc.wasDerivedFrom(found, resource, this_run, this_run, this_run)

We can view our completed provenance record in, for example, JSON form using doc.serialize().

[link]
2.4. Project #1: Data Retrieval, Storage, Provenance, and Transformations

The purpose of this project is to practice using some of the tools available for retrieving and storing data sets within the context of a multi-contributor platform and repository, tracking provenance information for data sets, and performing some transformations using one or both of the presented paradigms.

For the purposes of this project description, we assume the submitting team consists of two members, Alice and Bob, with the team identifier alice_bob (in accordance with the requirements specified in Project #0).

1. Follow the GitHub submission instructions by forking the repository at https://github.com/Data-Mechanics/course-2018-spr-proj. Note that we may commit and push a few additional fixes or updates to this repository before the deadline, so check for updates and pull them into your fork on a regular basis. Set up a MongoDB and Python environment as necessary for the project (including the installation of all dependencies). You should be able to run the setup.js script to prepare the repository, and then to start your MongoDB server with authentication enabled and run the alice_bob/example.py example script.
2. As in Project #0, create a folder of the form alice_bob within the top-level directory of the project. All the code constituting your submission, including all of your scripts and algorithm files (i.e., for retrieval of data and for transforming data already within the repository), should be placed within this folder. Do not place data files within this folder, or submit any data sets or data files via GitHub.
  Note that the file name for each script should be reasonably descriptive, as it will act as the identifier for the script. Each script should contain exactly one extension of the dml.Algorithm base class (such that the class name matches exactly the filename). Consult alice_bob/example.py for a working example script and algorithm.
```
class example(dml.Algorithm):
    contributor = 'alice_bob'
    reads = []
    writes = ['alice_bob.lost', 'alice_bob.found']

    @staticmethod
    def execute(trial = False):
        ...

    @staticmethod
    def provenance(doc = prov.model.ProvDocument(), startTime = None, endTime = None):
        ...
              
```
  Beyond this requirement, the scripts should also follow reasonable modularity and encapsulation practices, and should logically perform related operations. A larger number of smaller, simpler, and more reusable scripts that each perform a small task is preferable.
1. Choose at least five publicly available data sets, data streams, services, or documents from at least three different data portals or services. Note that some of your data sets may have similar schemas and may contain similar information, but all five data sets cannot have very similar schemas or very similar information. Write a short narrative and justification (5-10 sentences) explaining how these data sets can be combined to answer an interesting question or solve a problem. You do not need to solve the actual problem in this project, and it is acceptable to merely combine data sets in a way that is likely to support one or more solutions involving the particular data you choose. Include this narrative in a README.md file within your directory (along with any documentation you may write in that file).
2. Implement algorithms that retrieve these data sets automatically (i.e., by implementing one or more dml.Algorithm subclasses and defining their reads and writes fields, as well as their execute methods). Any authentication credentials your scripts use should be included in the file auth.json. All your scripts should retrieve the credential information they need from your copy of the file by using the existing dml functionality for doing so.
  Do not submit your auth.json file and do not include hard-coded authentication credentials in your code. We already added it to the .gitignore file to ensure you do not accidentally submit the credentials file. The course staff will use their own authentication credentials when running your code. Your README.md file should list any idiosyncratic details associated with the services and/or credentials needed to run your scripts.
3. Implement algorithms that perform at least three non-trivial transformations that merge, combine, or otherwise process some of the five data sets into three new data sets. These new data sets should be inserted into the repository. You can choose one of the existing algorithms or tools we have considered so far, or something else (assuming you thoroughly document in your README.md file how to obtain and run those tools).
Extend each script that you implemented in Problem #2 so that it can produce the provenance information documenting the activities that take place during the operation of that script (i.e., by defining its provenance() method). Each algorithm should generate a single provenance document when its provenance() method is invoked.

[link]
3. Systems, Models, and Algorithms

When using mathematics to solve problems involving real-world entities and phenomenon, we can introduce the notions of a system (i.e., an organized collection of interdependent components and their possible states) that has distinct system states (which can capture spatial and temporal information about the system components, their behaviors, and their relationships). A mathematical model captures only some parts of the system and the possible system states, abstracting the other details away, and it does so using some particular symbolic language or representation.

Four common abstract mathematical objects used to represent models of systems are particularly relevant given our application domain of interest:

graphs consisting of a collection of nodes and edges (possibly weighted and directed) between those nodes,
vector spaces consisting of sets of vectors,
systems of equations and inequalities (or, equivalently, sets of constraints or set of logical formulas), and
functions (whether continuous or discrete).

In many cases, these mathematical objects can be used to represent each other (e.g., a graph can be represented as a function or as a system of equations), but each has its own characteristics that make it well-suited for modeling certain kinds of systems. Very often, two or more of these are combined when modeling a system.

Most data sets describing a real-world system are using one or more of the above languages for modeling that system. Identifying which of these is being used can inform our decisions about what techniques we can use to answer questions about that system or to solve problems involving that system. It can also tell us how we can convert to a different language or representation in order to convert, combine, or derive new data sets.

[link]
3.1. Systems, Models, and Metrics

We can provide abstract definitions for systems, models, and metrics. These then allow us to characterize satisfaction, search, and optimization problems in a general and consistent way.

[link]

Definition:

A system is a set S of distinct system states (possibly infinite). Often, each possible system state s ∈ S is called a model of the system.

One mathematical object that is commonly used to represent systems is a vector space where individual vectors are the system states (i.e., models) of that system.

[link]

Definition:

A constraint set is a set of logical formulas (written using a formal language that may contain terms such as integers, vector, matrices, sets, and so on), usually used to identify a subset of a system (i.e., a subset of the possible systems states or models). A system state or model satisfies a constraint or constraint set if the logical formulas are true for that model. The subset of all states that satisfy the constraints is sometimes called the feasible region.

When using vector spaces, it is often possible to represent constraints using a matrix equation, e.g.:

M ⋅ x

In the above, M might be a matrix from ℝ^{n × m}, x might be a variable vector in ℝ^m, and b might be a vector in ℝⁿ.

[link]

Definition:

Given a system S and a set of constraints, a constraint satisfaction problem is the problem of finding a model (or set of models) in S that satisfies that set of constraints.

[link]

Definition:

A metric over a system S is a function f: S → ℝ that maps system states to real numbers.

[link]

Definition:

Given a system S and metric f, an optimization problem is the problem of finding s ∈ S that maximizes or minimizes f:

argmax_{s ∈ S} f(s)

argmin_{s ∈ S} f(s)

[link]

Example:

Computing the argmax or argmin operation in the relational or MapReduce paradigms can be non-trivial, but it is still possible. For example, in the relational paradigm it is possible to compute it by combining two projections, an aggregation, and a selection.


def argmax(X, f):
    Y = [f(x) for x in X] # Projection.
    y_max = aggregate(Y, max)
    XF = [(x, f(x)) for x in X] # Projection.
    xs = [x for (x,y) in XF if y == y_max] # Selection.
    return xs[0]

[link]

Exercise:

Consider a state space S = {0,1}⁴ with states of the form x = (x₁, x₂, x₃, x₄) and the following set of constraints:

x₁ > x₂

x₃ ≥ x₄

What is the size of the state space?
Find a state in S that solves this constraint satisfaction problem.
Suppose that this problem involves maximizing the objective function f(x) = x₁ + x₂ + x₃ + x₄; what state in S solves this optimization problem?

[link]

Example:

Suppose we want to create an approximate, discrete version of the k-means algorithm. We can do so by treating the outcome of the k-means algorithm as the solution to an optimization problem. In order to make this concrete, we will work with the following set of points and Euclidean distances (as in a previous example).


P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2

For k = 1, we can create a state space S that consists of all possible positions for the mean within a particular bounded region. The metric would be the sum of the distances from all the points in P to the mean m. The solution would be the one that minimizes the metric.


S = list(product(range(0,20), range(0,20))) # Possible locations for one mean.

def metric(m):
    return sum([dist(m, p) for p in P])

o = min(S, key=metric)

For k = 2, we would need to create a state space that represents all possible pairs of means. We would also need to adjust the metric function to consider in the sum only the distances from each point to the closest (i.e., minimum-distance) mean to that point.


V = list(product(range(0,20), range(0,20))) # Possible locations for one mean.
S = list(product(V, V)) # Possible pairs of means.

def metric(M):
    return sum([min([dist(m, p) for m in M]) for p in P])

o = min(S, key=metric)

[link]

Exercise:

Consider the following problem: given a data set describing a set of 10,000 spectators and their heights, determine if it is possible to seat them in a stadium of 10,000 seats so that no one blocks anyone else's view. Determine whether this is a constraint satisfaction problem or an optimization problem, and identify the state space, constraint set, and (if applicable) metric.

[link]

Exercise:

For each of the following scenarios, determine whether it is an instance of a constraint satisfaction problem or an instance of an optimization problem. If it is an optimization problem, identify the objective function.

Find the lowest-cost route between two destinations on a transportation network.
Given a social network where any pair of individual members can be "connected", determine if the whole network is connected (i.e., there exists a path from every member to every other member).
Given a social network where any pair of individual members can be "connected", find the smallest group of individuals that must be removed in order to make the network disconnected.

[link]
3.2. Linear systems, satisfiability modulo theories, and linear programming

We often say a system or subset of a system is linear if it can be modeled using a Euclidean vector space and can be defined using a collection of linear constraints of the form:

a₁ ⋅ x₁ + ... + a_n ⋅ x_n ♦ c

Where (x₁, ..., x_n) ∈ ℝⁿ is a vector, the a_i are coefficients in ℝ, c in ℝ is a constant, and ♦ is a relation operator such as =, ≤, <, ≥, or >. Note that the constraints contain no higher-order x_i terms such as x_i².

There are three common techniques for solving constraint satisfaction and optimization problems involving linear systems:

finding an exact or approximate solution (e.g., using some combination of matrix inversion, QR factorization, and ordinary least squares),
checking if a solution exists (and finding a witness system state or model proving this), and
finding a solution that minimizes or maximizes some (linear) metric over the vector space.

The third item (solving an optimization problem) is arguably a generalization of the first approach if we remove the requirement that the metric must be linear (since the first approach minimizes the Euclidean distance metric, which is quadratic and thus non-linear).

[link]

Example:

Suppose we want to use an SMT solver to solve a flow maximization problem for a flow graph (the variable on each represents the flow amount, and the parenthesized number is the maximum possible flow on that edge):

!table([ [null, null, ('dd:x4 (6)`dr:x5 (3)``•'), null, null], [('r:x1 (?)``•'), ('ur:x2 (7)`dr:x3 (8)``•'), null, ('r:x7 (5)``•'), ('•')], [null, null, ('ur:x6 (4)``•'), null, null], ])

Here, the system is the set of possible states of the graph in terms of the amount of material flowing along each edge (we model the flow that goes along each edge in the graph as a variable ranging over ℝ, and the flow in the overall graph containing seven edges as a vector in ℝ⁷).


import z3

(x1,x2,x3,x4,x5,x6,x7) = [z3.Real('x'+str(i)) for i in range(1,8)]

S = z3.Solver()

# Only allow non-negative flows.
for x in (x1,x2,x3,x4,x5,x6,x7):
    S.add(x >= 0)
    
# Edge capacity constraints.
S.add(x2 <= 7, x3 <= 8, x4 <= 6)
S.add(x5 <= 3, x6 <= 4, x7 <= 5)

# Constraints derived from graph topology.
S.add(x1 == x2+x3, x2 == x4+x5, x3+x4 == x6, x5+x6 == x7)

S.add(x1 > 0) # We want a positive flow.

print(S.check())
print(S.model())

If we are allowed to vary only the variable for incoming edge, we can use the SMT solver together with binary search to find the maximum flow through the graph.


flow = 0
for i in range(5, -1, -1):
    S.push()
    S.add(x1 >= (2**i + flow))
    if str(S.check()) != 'unsat':
        flow += 2**i
    S.pop()

S.add(x1 >= flow)
print(S.model())

We could alternatively build up the constraint set from a starting matrix.


def dot(xs, ys):
    return sum([x*y for (x,y) in zip(xs, ys)])

x = [x1,x2,x3,x4,x5,x6,x7]

M = [
  [ 0,-1, 0, 0, 0, 0, 0],
  [ 0, 0,-1, 0, 0, 0, 0],
  [ 0, 0, 0,-1, 0, 0, 0],
  [ 0, 0, 0, 0,-1, 0, 0],
  [ 0, 0, 0, 0, 0,-1, 0],
  [ 0, 0, 0, 0, 0, 0,-1],

  [ 1, 0, 0, 0, 0, 0, 0],
  [ 0, 1, 0, 0, 0, 0, 0],
  [ 0, 0, 1, 0, 0, 0, 0],
  [ 0, 0, 0, 1, 0, 0, 0],
  [ 0, 0, 0, 0, 1, 0, 0],
  [ 0, 0, 0, 0, 0, 1, 0],
  [ 0, 0, 0, 0, 0, 0, 1],

  [-1, 1, 1, 0, 0, 0, 0],
  [ 0,-1, 0, 1, 1, 0, 0],
  [ 0, 0,-1,-1, 0, 1, 0],
  [ 0, 0, 0, 0,-1,-1, 1]
  ]

b = [-7,-8,-6,-3,-4,-5,
      0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0]

S = z3.Solver()

for i in range(len(M)):
    S.add(b[i] <= dot(M[i], x))

print(S.check())
print(S.model())

[link]

Example:

Suppose we want to use linear programming to solve the problem described in the example above.


from scipy.optimize import linprog

M = [
  [ 1,-1,-1, 0, 0, 0, 0],
  [ 0, 1, 0,-1,-1, 0, 0],
  [ 0, 0, 1, 1, 0,-1, 0],
  [ 0, 0, 0, 0, 1, 1,-1],
  ]
b = [0, 0, 0, 0]
c = [-1, 0, 0, 0, 0, 0, 0]
bounds = [(0, None), (0, 7), (0, 8), (0, 6), (0, 3), (0, 4), (0, 5)]

result = linprog(c, A_ub = M, b_ub = b, bounds=bounds, options={"disp": True})
print(result)

Note that this technique produces a potentially different solution from the one in the example above (though both are optimal).

[link]

Example:

Suppose that you want to distribute two types of infrastructure improvements across the 20 neighborhoods of a city: road repairs and electric grid updates. Road repairs cost $100,000 per mile and electric grid updates cost $50,000 per mile. You have access to a data set I that contains tuples of the form (neighborhood, road-miles, grid-miles) describing the quantity of each type of infrastructure in each neighborhood.

You want to distribute the cost of improvements equitably such that every neighborhood receives exactly the same amount of funding, and every neighborhood is able to use all that funding it receives. If your overall budget is capped at $10,000,000, is this possible?

One approach is to fix some constant c and create two variables x_i and y_i and three constraints for every entry (i, r_i, g_i) in I:

x_i ⋅ 100,000 + y_i ⋅ 50,000 ≥ c

x_i ≤ r_i

y_i ≤ g_i

The above constraints ensure that at least c funding is used in the neighborhood, but also that the amount of infrastructure repairs does not exceed the potential needs of the neighborhood. Thus, the state space consists of 20 ⋅ 2 dimensions (i.e., the two vectors x and y having one component for every neighborhood).

Note that if we wanted to check whether it is possible to disburse all the available funding equitably, we could set c = 10,000,000/20. Alternatively, if we wanted to maximize c, we could try larger and larger values of c until it cannot be increased without causing the problem to become unsolvable. Alternatively, we could treat c as another variable and define the objective function that must be maximized as f(x, y, c) = c.

[link]
3.3. Graph and Spatial Problems as Constraint Satisfaction and Optimization Problems

We have already seen how we can define some basic algorithms that have natural spatial interpretations (basic clustering and shortest) using data transformations assembled from building blocks drawn from the relational and the MapReduce paradigms. These algorithms employed two different representations: a vector space (clustering) and a graph (shortest paths). In this section, we review how algorithms such as these can be viewed as solutions to constraint satisfaction and/or optimization problems.

[link]

Definition:

Suppose we have a graph consisting of a set of nodes N = {n₁, n₂, ...} and a set of edges E of the form (n₁, n₂). The minimum spanning tree of this graph is the smallest subset T ⊂ E such that in the subgraph consisting of the edges in T, every node can be reached from every other node (i.e., by traversing a path along the edges in T).

[link]

Example:

The problem of finding a minimum spanning tree can be broken down into a constraint satisfaction problem:

the set of system states is the collection of edge subsets (i.e., all 2^|E| possible subsets);
the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another) and the requirement that the subset T must be a tree (or, equivalent, that it must have at most |N| − 1 edges).

This problem can also be broken down into an optimization problem that minimizes a metric:

the set of system states is the collection of edge subsets (i.e., all 2^|E| possible subsets);
the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another);
the metric is a function that computes the size of T (i.e., the number of edges in the spanning subtree).

[link]
3.4. Decomposition Techniques

General-purpose techniques for solving constraint satisfaction and optimization problems can sometimes be inefficient (potentially having a running time that is exponential in the number of state space dimensions and/or the number of constraints). This can be especially costly when state space dimensions or constraints are derived from large data sets having hundreds, thousands, or hundreds of thousands of entries.

In such scenarios, it may be possible to apply general-purpose decomposition techniques for solving constraint satisfaction and optimization problems in a scalable way. Usually, these rely on certain assumptions about the properties of the constraints and/or the objective function.

[link]

Definition:

Suppose we have a constraint satisfaction problem with an n-dimensional state space S containing system states of the form (x₁, ...,x_n) and a set of constraints C. If the set of constraints C can be separated into two disjoint subsets such that C = C₁ ∪ C₂ where C₁ and C₂ share no variables, then the problem can be decomposed into two independent subproblems.

[link]

Fact:

Suppose that a constraint satisfaction problem with state space S and constraint set C can be decomposed into two independent subproblems where C = C₁ ∪ C₂. If s₁ ∈ S is a solution to the subproblem corresponding to C₁ and s₂ ∈ S is a solution to the subproblem corresponding to C₂, then we can combine s₁ and s₂ to obtain a state that satisfies all the constraints in C (using a linear-time scan through the representation of s₁ and s₂).

[link]

Definition:

Suppose we have a constraint satisfaction problem with an (n+m)-dimensional state space S containing system states of the form (x₁, ...,x_n, y₁, ..., y_m) and a set of constraints C = C_x ∪ C_xy ∪ C_y where:

all constraints in C_x only contain variables from the list x₁, ...,x_n;
all constraints in C_y only contain variables from the list y₁, ...,y_m;
every constraint in C_xy has at least one variable from x₁, ...,x_n and at least one from y₁, ...,y_n.

Then the constraints contained in C_xy are called the complicating constraints and the variables found in those constraints are called the complicating variables.

[link]

Example:

Suppose we have a constraint satisfaction problem with states of the form (x₁, ...,x_n, y₁, ..., y_m) ∈ {0,1}^n+m and a constraint set C = C_x ∪ C_xy ∪ C_y. Suppose also that x₁ and y₁ are the only complicating variables and x₁ ≥ y₁ is the only complicating constraint. Then we can consider four different subsets of the state space:

all states in which x₁ = 0 and y₁ = 0;
all states in which x₁ = 1 and y₁ = 0;
all states in which x₁ = 0 and y₁ = 1;
all states in which x₁ = 1 and y₁ = 1.

Note that for each of the four scenarios, we can easily extend the two subproblems corresponding to C_x and C_y with one additional constraint (such as x₁ = 1 for C_x and y₁ = 0 for C_y). We can then eliminate the complicating constraint (which is redundant at that point), allowing us to decompose the new problem into two independent subproblems. If we find a solution to both subproblems in any of these four cases, then we have found a solution to the overall original problem corresponding to C.

This approach can lead to significant performance gains if the algorithmic complexity of solving the constraint satisfaction problem is worse than linear. For example, suppose that there are k constraints and solving the problem takes exponential time in the number of constraints (such as 2^k). If we can split the problem into two independent subproblems (each of size k/2), then we have can compute the overall running time needed to solve all four pairs of independent subproblems:

4 ⋅ 2 ⋅ 2^k/2

2^{k/2 + 3}

The running time 2^{k/2 + 3} is substantially faster than 2^k for large k. In fact, 2^{k/2 + 3} ≈ √(2^k).

In addition, note that the third scenario where x₁ = 0 and y₁ = 1 does not satisfy the complicated constraint x₁ ≥ y₁, so we need not consider it at all. Thus, the running time would actually be 3 ⋅ 2 ⋅ 2^k/2 if we avoid this case.

[link]

Fact:

Suppose we have an optimization problem with an (n+m)-dimensional state space S containing system states of the form (x₁, ...,x_n, y₁, ..., y_m), a set of constraints C = C_x ∪ C_y, and an objective function f(x₁, ...,x_n, y₁, ..., y_m) = g(x₁, ...,x_n) + h(y₁, ...,y_m). Note that there are no complicating constraints or variables.

In this scenario, we can decompose the problem into two independent optimization problems (with objective functions g and h, respectively). We can then find a solution s₁ that optimizes g under constraints C_x and a solution s₂ that optimizes h under constraints C_y. Finally, we can combine these two solutions s₁ and s₂ into a solution for the overall problem that optimizes f. As in the corresponding fact about constraint satisfaction problems, we can then use a linear-time scan through the representations of s₁ and s₂ to build this overall solution.

[link]
4. Statistical Analysis

Given a data set some of the optimization techniques we have studied can be used to find a best-fit model for that data. For example, given a data set of points in ℝ² representing some collection of observations measured along two dimensions, it is possible to find an ordinary least squares "best fit" linear function (using algebra, optimization, gradient descent, and so on). However, just because we have found a "best fit" model for a data set from some given space of functions does not mean that the model is actually a good one for that particular data set. Perhaps the relationship between the two dimensions is not linear, or even completely non-existent. How can we determine if this is the case?

In this section, we will present some statistical methods that can help us decide whether a linear model is actually appropriate for a given two-dimensional data set, and to quantify how appropriate it might be. We will review the mathematical foundations underlying the methods, how they can be applied, and how they can be realized as data transformations in the paradigms we have studied in earlier sections.

[link]
4.1. Review of Facts about Projections from Linear Algebra

We review a few useful definitions and facts about vectors in ℝⁿ from linear algebra. These provide a mathematical foundation for defining the statistical constructs discussed in this section.

[link]

Definition:

A vector is an ordered, finite tuple of real numbers (x₁, ..., x_n) ∈ ℝⁿ. Note that we will use individual variables to denote an entire vector (e.g., x ∈ ℝⁿ), and that same variable with a subscript index (e.g., x_i) to denote individual components of that vector; thus, we have that x = (x₁, ..., x_n).

[link]

Fact:

For a vector x ∈ ℝⁿ and a scalar s ∈ ℝ, we can scale x by computing s ⋅ x = (s ⋅ x₁, ..., s ⋅ x_n). Note that for a scalar s ∈ ℝ, we often use the shorthand notation x/s to mean (1/s) ⋅ x.

[link]

Definition:

For a vector x ∈ ℝⁿ the Euclidean length or norm of x is ||x|| = √(x₁² + ... + x_n²). We can also obtain a normalized unit vector of length one that points in the same direction as x by computing x/||x||.

[link]

Definition:

Given two vectors x ∈ ℝⁿ and y ∈ ℝⁿ, the dot product of the two vectors is x ⋅ y = x₁ ⋅ y₁ + ... + x_n ⋅ y_n.

[link]

Fact:

We can compute the projection of a vector x ∈ ℝⁿ onto another vector y ∈ ℝⁿ by computing (x ⋅ (y/||y||)) ⋅ (y/||y||).

[link]

Example:

Given two vectors (5,7) ∈ ℝ² and (3,0) ∈ ℝ², the normalized version of (3,0) is (1/3) ⋅ (3,0) = (1,0). The projection of (5,7) onto (3,0) is then:

((5,7) ⋅ ((1/3) ⋅ (3,0))) ⋅ ((1/3) ⋅ (3,0)

((5,7) ⋅ (1,0)) ⋅ (1,0)

(5 ⋅ 1 + 7 ⋅ 0) ⋅ (1,0)

5 ⋅ (1,0)

(5,0).

[link]

Theorem (Cauchy-Schwarz inequality):

Given two vectors x ∈ ℝⁿ and y ∈ ℝⁿ, the following inequality holds:

|x ⋅ y| / (||x|| ⋅ ||y||)

≤

Another way of writing the above is:

| (x/||x||) ⋅ (y/||y||) |

≤

The intuitive interpretation is as follows: if we try to find the projection of a unit vector x/||x|| onto another unit vector y/||y||, the resulting vector has length at most 1.

[link]
4.2. Defining Mean and Standard Deviation using Concepts in Linear Algebra

In this section, we will assume that we are working with a data set of the form [(x₁, y₁), ..., (x_n, y_n)] that contains n (not necessarily distinct) tuples and where for 1 ≤ i ≤ n we have (x_i, y_i) ∈ ℝ². For the purposes of analysis, we will also often use two projections of this data set as vectors in their own right: (x₁, ...,x_n) ∈ ℝⁿ and (y₁, ...,y_n) ∈ ℝⁿ. We will also use the shorthand notation x = (x₁, ...,x_n) and y = (y₁, ...,y_n).

[link]

Definition (arithmetic mean):

We define the arithmetic mean μ(x) of a one-dimensional or one-column data set x ∈ ℝⁿ where x = (x₁, ..., x_n) as:

μ(x)

(1/n) ⋅ (x₁ + ... + x_n)

[link]

Fact:

For x ∈ ℝⁿ, μ(x) is the quantity that minimizes the following objective function:

μ(x)

argmin_{a ∈ ℝ} (Σ_i=1ⁿ (x_i - a)²)

To prove the above, assume that the true mean is some specific value m. Next, we choose some arbitrary a in the formula and proceed algebraically from there by multiplying the terms under the summation and then using the distributive property to factor out the terms without an index subscript:

Σ_i=1ⁿ (x_i - a)²

Σ_i=1ⁿ ((x_i − m) + (m − a))²

Σ_i=1ⁿ ( (x_i − m)² + 2 ⋅ ((x_i − m) ⋅ (m − a)) + (m − a)² )

Σ_i=1ⁿ (x_i − m)² + 2 ⋅ (m − a) ⋅ (Σ_i=1ⁿ (x_i − m)) + Σ_i=1ⁿ (m − a)²

Σ_i=1ⁿ (x_i − m)² + 2 ⋅ (m − a) ⋅ (Σ_i=1ⁿ x_i − Σ_i=1ⁿ m)) + Σ_i=1ⁿ (m − a)²

In the above, notice that Σ_i=1ⁿ x_i = Σ_i=1ⁿ m because m is the true mean, so the second term is just 2 ⋅ (m − a) ⋅ 0 = 0. Thus, we can eliminate it. Furthermore, we have Σ_i=1ⁿ (m − a)² = n ⋅ (m − a)². Applying all these substitutions, we have:

Σ_i=1ⁿ (x_i - a)²

(Σ_i=1ⁿ (x_i − m)² ) + n ⋅ (m − a)²

Notice that setting a = m minimizes the above because the first term does not contain a at all, and the second term becomes 0 exactly when a = m. Thus, the a that minimizes the objective Σ_i=1ⁿ (x_i - a)² is exactly the mean.

[link]

Definition:

For a given vector x ∈ ℝⁿ, suppose we use μ(x) to create a vector with n entries (μ(x), μ(x), ..., μ(x)) in which every entry is μ(x). We can call this vector x_μ.

[link]

Fact:

The vector x_μ minimizes a particular objective function:

x_μ

argmin_{v is on the diagonal in ℝⁿ} ||x - v||

Another way of thinking about this is that x_μ as defined above is the projection of the vector x onto the diagonal line in ℝⁿ. This means that x_μ is also the point on the diagonal in ℝⁿ that is closest to x.

Notice that the geometric, vector-based interpretation of the mean preserves an important property of the mean of a data set x: rearranging the elements of x should not change the mean. In fact, this is guaranteed because the dimensions are interchangeable: for any permuted version of x (let us call it x'), the distance of x' from the diagonal will always be the same as the distance of x from the diagonal because we can simply relabel the dimensions to bring x' back to its original form x.

[link]

Example:

Suppose that our data consists of only two points in one dimension: x₁ = 10 and x₂ = 2. We can compute:

μ(x)

(1/2) ⋅ (10 + 2)

x_μ

(6, 6)

Notice that (6, 6) is on the diagonal in ℝ². Notice also that the vector x − x_μ = (10, 2) - (6, 6) = (4, -4) has a slope of − 1 (it is orthogonal to the diagonal). This is consistent with x_μ being a projection of x onto the diagonal.

[link]

Definition (standard deviation):

We define the standard deviation σ(x) of a one-dimensional or one-column data set x ∈ ℝⁿ where x = (x₁, ..., x_n) as:

σ(x)

√( (1/n) ⋅ ((x₁ − μ(x))² + ... + (x_n − μ(x))²) )

[link]

Fact:

For x ∈ ℝⁿ, σ(x) can be rewritten in terms of the norm (i.e., length) ||x − μ(x)|| of the vector x − μ(x):

σ(x)

(1/√(n)) ⋅ √( ((x₁ − μ(x))² + ... + (x_n − μ(x))²) )

(1/√(n)) ⋅ ||x − x_μ||

In other words, we can view the standard deviation of a data vector x as a normalized distance between x and the diagonal in ℝⁿ.

We might ask why the extra scalar term 1/√(n) is necessary in the above. After all, the term ||x − x_μ|| already measures the distance between the vector x and the diagonal. Furthermore, the individual x_i terms must represent some units specific to the domain from which the data is drawn (e.g., temperature, height, velocity, and so on), while n and √(n) have no units relevant to the domain of the data (they are just quantities associated with how many data points were collected).

The reason is that when the number of components in the vector x changes, the number of dimensions in the space from which the vector is drawn also changes. But the number of dimensions influences the lengths of vectors (if the individual vector components are about the same): the more dimensions there are, the longer a vector can become because the distances from all dimensions contribute to the length of a vector.

[link]

Example:

Consider the lengths of the following vectors (each drawn from a vector space having its own distinct number of dimensions):

||(1, 1)||

√(2)

||(1, 1, 1)||

√(3)

||(1, 1, 1, 1)||

√(4)

||(1, 1, 1, 1, 1)||

√(5)

⋮

||(1, 1, 1, 1, 1, ..., 1)||

√(n)

In general, we can see that the following holds for a vector of length n in which components are all the same value s:

||(s, s, s, ..., s)||

√(n ⋅ s²)

s ⋅ √(n)

The above example shows that if we want to compare the standard deviations of two distinct data sets that are of different sizes, we need to account for the fact that the vectors are drawn from different vector spaces (in terms of the number of dimensions). The scalar term 1/√(n) in the standard deviation formula accounts for this difference and makes comparisons possible.

One practical issue that can be a concern when defining statistical analysis workflows (e.g., using the relational or MapReduce paradigms) over large data sets is how optimizations based on algebraic laws may interact with the representation size of numerical values within the language, library, or framework being used.

[link]

Example:

Suppose that we have a large data set x ∈ ℝ¹⁰⁰⁰consisting of 1000 entries: x = (x₁, ..., x₁₀₀₀). Assume that 995 of these entries are approximately 0.1 while the other five are all larger (i.e., let us assume they are all 2). Furthermore, suppose that we have a representation size for floating point numbers such that the smallest floating point number we can represent is 0.001 (anything else is cut off and thus rounded to 0).

If we want to compute the mean, we might decide to first scale every entry in the vector x by dividing it by n to obtain the vector (x₁/n, ..., x₁₀₀₀/n). However, notice that this will lead to 0.1/1000 = 0.0001 as the true mathematical answer for most of the data. Due to rounding errors, 995 of the data points will be treated as if they were 0. Thus, we will have:

μ(x)

(2 + 2 + 2 + 2 + 2) / 1000

0.01

But notice that the true mean is quite different; the error we see above is an entire order of magnitude (and grows as the number of data points grows):

μ(x)

(2 + 2 + 2 + 2 + 2 + 0.1 ⋅ 995) / 1000

(10 + 99.5) / 1000

0.1095

One intuitive response might be to avoid the optimization and perform the summation of all components of x before performing division by n. But this may again lead to an issue when there are too many data points, as the total may exceed the maximum representation size of a number!

Issues such as the above are less common in contemporary tool chains because many libraries use representations for numerical data structures that allow arbitrary precision and size. If such a library is not available, however, we must ensure that our optimizations do not introduce such errors. In the specific scenario above, the solution might be to split the data up into smaller chunks, perform the optimization over those small chunks, and then aggregate the results of those individual computations (though this would only work for associative operations such as summation).

[link]
4.3. Covariance and Correlation

Given a data set that has two-dimensional entries of the form (x_i, y_i), it is possible to compare the two dimensions by comparing their standard deviation vectors.

[link]

Definition (covariance):

For x ∈ ℝⁿ and y ∈ ℝⁿ, we define the covariance between x and y as:

cov(x, y)

(1/n) ⋅ ((x₁ - μ(x)) ⋅ (y₁ - μ(y)) + ... + (x_n - μ(x)) ⋅ (y_n - μ(y)))

(1/n) ⋅ (Σ_i=1ⁿ (x_i - μ(x)) ⋅ (y_i - μ(y)))

(1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)

((1/√(n)) ⋅ (x − x_μ)) ⋅ ((1/√(n)) ⋅ (y − y_μ))

Notice that the last line in the above indicates that we can view the covariance as the dot product of the vectors (x − x_μ) and (y − y_μ) (each of which represent how x and y deviate from their respective means).

[link]

Definition (correlation coefficient):

For x ∈ ℝⁿ and y ∈ ℝⁿ, we define the correlation coefficient (also known as the Pearson product-moment correlation coefficient) between x and y as:

ρ(x, y)

cov(x, y) / (σ(x) ⋅ σ(y))

[link]

Fact:

For x ∈ ℝⁿ and y ∈ ℝⁿ, it must be that -1 ≤ ρ(x, y) ≤ 1. We can see this by noticing that the formula for ρ(x, y) appears in the Cauchy-Schwarz inequality:

ρ(x, y)

cov(x, y) / (σ(x) ⋅ σ(y))

((1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)) / ((1/√(n)) ⋅ ||x − x_μ|| ⋅ (1/√(n)) ⋅ ||y − y_μ||)

((1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)) / ((1/n) ⋅ ||x − x_μ|| ⋅ ||y − y_μ||)

((x − x_μ) ⋅ (y − y_μ)) / (||x − x_μ|| ⋅ ||y − y_μ||)

Notice that the only difference between the above and the Cauchy-Schwarz inequality is that in the above, the numerator is not guaranteed to be positive. Thus, we have:

−1

≤

cov(x, y) / (σ(x) ⋅ σ(y))

≤

In addition to being bounded by −1 and 1, the correlation coefficient has some other useful properties. If x and y have a positive linear correlation, then ρ(x, y) is positive. If they have a negative linear correlation, it is negative. Given two pairs of data sets x, y and x', y', if |ρ(x, y)| > |ρ(x', y')|, then x and y have a stronger correlation with each other than do x' and y'.

While the correlation coefficient is useful for comparing correlations, the quantity itself should not be used to infer anything about the extent of the correlation. To do that with some rigor, it is necessary to introduce a few additional concepts.

[link]

Example:

Suppose that our data consists of only a pair of two-dimensional tuples: x = (2, 6) and y = (20, 60). In other words, the data set is [(2,20), (6,60)]. Notice that the slopes of the orthogonal projections of x and of y onto the diagonal are the same, and if we normalize these projections, they become the same vector. That means (x − x_μ) / ||x − x_μ|| = (y − y_μ) / ||y − y_μ||. But the dot product of a vector with itself is the square of its length: v ⋅ v = ||v||². Thus, we have:

cov(x, y) / (σ(x) ⋅ σ(y))

(x − x_μ) ⋅ (y − y_μ) / (||x − x_μ|| ⋅ ||y − y_μ||)

Thus, because the two dimensions have a positive linear correlation in this example, the correlation coefficient is 1.

[link]
4.4. Observations, Hypothesis Testing, and Significance

The correlation coefficient gives us a way to quantify that there may exist a correlation between two dimensions in a data set. However, this correlation may be an artifact of the way we happened to obtain the data. For example, we may have obtained a biased sample. We can account for this by modeling the likelihood of obtaining a biased sample given the assumption that no correlation exists. If it is very unlikely that we would obtain a sample with the observed correlation if no correlation actually exists, then we might conclude that it is likely that a correlation does exist.

To take this approach, we need to compare the correlation we observe against something (i.e., to decide how likely or unlikely it is if there is no correlation). This means we need to define mathematically what it means for there to be no correlation.

[link]

Definition (distribution):

A distribution over a set S is a function f: S → ℝ. If S represents a set of possible observations of a system (e.g., a collection of system states), then a distribution f might represent the probability that the system is in that state, or the frequency with which that system state has been or can be observed.

[link]

Definition (p-value):

Let a subset of the real number line S ⊂ ℝ represent all possible system states that can be observed, and let f be a distribution over S that represents the probability of observing each of the possible states in S. Given some state s ∈ S, the probability of observing s is then f(s). We define the p-value at the observed state s to be the probability of observing any state that is equal to or more extreme than the state s.

One way to define which states are more extreme is to consider all the states from s to the nearest endpoint. Suppose S is an interval [a, b]. If s > (a − b) / 2, then the p-value would be computed by taking the definite integral of f from s to b. Otherwise, it would be computed by taking the definite integral of f from s to a.

In order to use the p-value concept to reason about correlation coefficients, it is necessary to model how the correlation coefficient should behave if no correlation exists. One way to do this using the data alone is to consider every permutation of the data.

[link]

Example:

Assume we have a two-dimensional data set [(x₁, y₁), ..., (x_n, y_n)], also represented by x ∈ ℝⁿ and y ∈ ℝⁿ. One way to obtain a range of possible correlation coefficients is to compute the correlation coefficient for all n! possible permutations of the second dimension. Intuitively, if the data has no correlation, then performing this permutation should not change the correlation coefficient much. On the other hand, if the data does have a correlation, we should get a distribution of correlation coefficients with respect to which the true correlation coefficient is "extreme".


from random import shuffle
from math import sqrt

data = [(18, 28), (24, 18), (27, 31), (14, 15), (46, 23),
        (36, 19), (27, 10), (34, 25), (19, 15), (13, 13),
        (4, 2), (17, 20), (28, 12), (36, 11), (26, 14),
        (19, 19), (24, 13), (25, 6), (20, 8), (17, 22),
        (18, 8), (25, 12), (28, 27), (31, 28), (35, 22),
        (17, 8), (19, 19), (23, 23), (22, 11)]
x = [xi for (xi, yi) in data]
y = [yi for (xi, yi) in data]

def permute(x):
    shuffled = [xi for xi in x]
    shuffle(shuffled)
    return shuffled

def avg(x): # Average
    return sum(x)/len(x)

def stddev(x): # Standard deviation.
    m = avg(x)
    return sqrt(sum([(xi-m)**2 for xi in x])/len(x))

def cov(x, y): # Covariance.
    return sum([(xi-avg(x))*(yi-avg(y)) for (xi,yi) in zip(x,y)])/len(x)

def corr(x, y): # Correlation coefficient.
    if stddev(x)*stddev(y) != 0:
        return cov(x, y)/(stddev(x)*stddev(y))

def p(x, y):
    c0 = corr(x, y)
    corrs = []
    for k in range(0, 2000):
        y_permuted = permute(y)
        corrs.append(corr(x, y_permuted))
    return len([c for c in corrs if abs(c) >= abs(c0)])/len(corrs)

In the above, we compute the correlation coefficient for a large number of permuted data sets. We then compute the fraction of these data sets that have a correlation coefficient with a higher absolute value than the correlation coefficient of the original data. We can call this fraction the p-value. Ideally, we would try all possible permutations; since this is infeasible, we can use this technique to get an approximation.

Alternatively, we can use a library to compute the same information. The scipy.stats.pearsonr() function will return the correlation coefficient and the p-value.


import scipy.stats
print(scipy.stats.pearsonr(x, y))

Another approach is to mathematically derive a distribution of correlation coefficients based on some assumptions about how the observation vectors x and y are distributed if they are independent of one another (i.e., not correlated) and each follow some distribution (e.g., the normal distribution). Then, computing a definite integral of this distribution from some specific observation to the nearest boundary would yield the p-value for that observation. Taking this approach usually requires parameterizing the distribution using the size of the data set (i.e., the number of samples) because this can influence how the distribution of correlation coefficients should be computed.

[link]

Example:

Suppose that you have a data set of n observations in which each tuple is of the form (w_i, x_i, y_i, z_i), where w = (w₁, ..., w_n), x = (x₁, ..., x_n), y = (y₁, ..., y_n) and z = (z₁, ..., z_n). You compute some covariance and correlation coefficient quantities and find the following relationships:

cov(x, y)

cov(w, z)

ρ(y, z)

ρ(w, x)

Can you conclude that the correlation between x and y is stronger than the correlation between w and z?
Can you conclude that the correlation between y and z is stronger than the correlation between w and x?
Suppose that we change the last inequality to ρ(y, z) < 0. Does this change your answers to any of the above?

[link]

Exercise:

Suppose you have a data set D consisting of tuples of the form (x_i, y_i) and you want to define the necessary transformations using the MapReduce paradigm to compute the covariance for this data set. Provide the definitions of the map and reduce operations necessary to accomplish this.

[link]

Example:

Suppose you have a data set D consisting of tuples of the form (x_i, y_i) and you want to compute the correlation coefficient for this data set with a transformation sequence that uses building blocks from the relational paradigm. The flow diagram for this sequence of transformations is provided below.

!table([ [['dr:prod`r:proj``(xi,yi)'], ['r:agg plus``(xi,yi,1)'], ['d:proj``(sx,sy,n)']], [['d:agg plus``(((xi-μ(x))^2)/n, ((y-μ(y))^2)/n)'], ['rd:proj`l:proj``(xi,yi,μ(x), μ(y), n)'], ['l:prod``(μ(x) = sx/n, μ(y) = sy/n, n)']], [['d:proj``(σ(x)^2, σ(y)^2)'], ['d:prod``(cov(x, y)'], ['l:agg plus``(1/n)*(x-μ(x))*(y-μ(y))']], [['r:prod``(σ(x), σ(y))'], ['r:proj``(cov(x,y), σ(x), σ(y))'], ['(ρ(x, y))']] ])

[link]
4.5. Sampling and Inference

Often, data sets are partial observations of the overall system. Despite this, we may still be able to infer things about the system's properties from those partial observations. In this subsection we go over just a few of the many methods available for making inferences based on data obtained via sampling.

[link]

Definition (sampling):

Given some data set S (which may not be known in its entirely and may only exist in theory, e.g., the data set of all humans, or the data set of all Earth-size planets in the galaxy), we call any subset T ⊂ S a sample of S.

Typically, the data set being sampled will contain records or tuples, each corresponding to an item or an individual. Each tuple provides some information about the properties or measurements of that item or individual along certain dimensions (e.g., size, location, and so on).

[link]

Definition (probability sampling):

A probability sample of a data set S is any sample T ⊂ S that is collected under the condition that for every tuple, the probabiility of selecting that tuple is known in advance.

[link]

Fact (proportional reasoning between independent variables):

Assume a data set S has information that can be used to derive two relevant binary characteristics describing the individuals or items in that data set (e.g., whether an individual is male or female, whether an individual resides in a particular neighborhood, and so on). We can define two (possibly overlapping) subsets A ⊂ S and B ⊂ S corresponding to those subsets that have each of those characteristics.

If the two characteristics are independent we can conclude that:

|A ∩ B|/|B|

|A ∩ S|/|S|

|A|/|S|

In other words, if we treat the tuples from B as a sample of S and we find the fraction of those tuples that are also in A, we can conclude that the same would have occurred if we had sampled the entirety of S.

[link]

Example:

Assume we know that within a certain geographical region, one out of every 10 people is a member of an online social networking mobile application that collects the geographical locations of its users. Suppose we take a sample of how many unique users were present in a particular neighborhood within that region over the course of a week, and we get a total of 1200 users. Assuming that there is no correlation between that particular neighborhood and the popularity of that mobile application (i.e., residency in that neighborhood and membership in the social network are independent variables), we can then infer the total population in that neighborhood by solving the following equation:

1/10

1200/n

The above yields n = 12,000, so we can conclude that there are about 12,000 residents in that particular neighborhood based on our observations.

In the above example, we could have also worked in the other direction if we only knew the overall population in the region, the population of the neighborhood, and the popularity of the application within the population in that neighborhood. This would allow us to derive the total number of users of the application in the region.

[link]

Fact (capture-recapture):

Suppose that we can use probability sampling of a data set of size N, we know that the probability of selecting any individual tuple is equivalent to 1/N, and we do not know the size N of the data set. In addition, suppose we are allowed to "mark" any tuple that appears in a sample we take so that we can identify it (with complete certainty) as a previously marked tuple if we see it in subsequent samples.

Under these conditions, we can estimate the size of the entire data set (an unknown quantity) in the following way. First, we take a sample of size n of the data set, and we mark every tuple in that sample. Next, we take another (independent) sample of size K and count the number of marked tuples in that sample (suppose it is k ≤ K). We can then solve the following equation for N:

k/K

n/N

(n ⋅ K) / k

[link]

Example:

Suppose that we want to know how many distinct people regularly visit a particular grocery store, and we can identify visitors in some way (e.g., by recording their unique identifiers). We can fix a particular time period and collect the identifiers of all the people who visit that store (suppose we collect n of them). We can then fix another time period, again record the identifiers of all K visitors during that time period, and then determine what fraction k/K of the identifiers from the first collection appears in the second collection. We could then estimate that the total number of individuals who visit the store regularly is around (n ⋅ K) / k.

[link]

Exercise:

You are obtaining a probability sample of a population such that the probability of selecting every individual is the same. You take a sample of 1000 individuals and find that 200 of them are members of a particular social network. Answer the following questions.

Assuming that membership in the social network is completely independent of age, and that there are 20,000 individuals who are 18-24 years old in the overall population, provide one possible estimate k for the number of individuals in the 18-24 age range who are members of the social network.
Suppose you determine that the social network has 10,000 unique members. What is the size n of the population from which your sample was taken?

[link]

Fact (Algorithm R and reservoir sampling):

Using Algorithm R, it is possible to take a probability sample of size k from an incoming sequence of data set entries (e.g., tuples) without knowing the length of the sequence in advance. Furthermore, it is guaranteed that every item in the sequence had an equal probability of being added to the sample.

[link]

Exercise:

Assume we have enough storage in our database to store n + k tuples, and it currently holds a data set X that has n tuples. However, we expect that tomorrow, we will obtain a new data set Y that also has n tuples, so we cannot store it unless we delete X completely. We need to take a probability sample of size k from the collection X ∪ Y such that any item has an equal likelihood of being added to the sample. Is this possible? Explain your answer. You may assume that every tuple in X and Y has a unique identifier and there is no overlap between X and Y.

[link]

Exercise:

You were using Algorithm R to perform reservoir sampling of packets as they were arriving, with a sample size of 100. You were also recording the number of packets there were corrupted, but you did not record the total number of packets that arrived.

Later on, someone asks you to compute the total number of packets that had arrived during the sampling period. You look at your sample and you find that out of 100 packets, 7 were corrupted. You also recorded that a total of 1400 packets had been corrupted during the sample collection period. How many packets arrived during the sample collection period?

[link]

Exercise:

Suppose there are exactly n entries in a data set of all neighborhoods and that these entries can be divided into two dimensions: x = (x₁, ..., x_n) and y = (y₁, ..., y_n). Furthermore, for all 1 ≤ i ≤ n, we have that x_i ∈ {0,1} and y_i ∈ {0,1}, where each observation of the form (x_i, y_i) specifies whether the ith neighborhood has a hospital (the x_i component) and whether it has a school (the y_i component).

We compute the correlation coefficient and determine that ρ(x, y) = 0. If there are 12 neighborhoods with a school, there are 6 neighborhoods that have both a school and a hospital, and there are 8 hospitals in total, what can these facts reasonably imply about the total number of neighborhoods n?

[link]
4.6. Project #2: Modeling, Optimization, and Statistical Analysis

The purpose of this project is to practice using the modeling, optimization, and statistical analysis tools we have covered to answer questions or solve problems using the data sets obtained by all the teams in Project #1.

For the purposes of this project description, we assume a team consisting of two members, Alice and Bob, with the team identifier alice_bob in accordance with Project #0. Teams consisting of up to four people are permitted for this project.

For this project, teams will work with the GitHub repository created for Project #1, consisting of all the teams' submissions (merged via merge requests at the time of submission) at https://github.com/Data-Mechanics/course-2018-spr-proj.
- If your Project #1 team is not going to change, you should be able to continue working on your project on your own fork. Pull all the updates from the project repository to sync your fork with it (note that this will now include all the other teams' projects and scripts from Project #1).
- If you are creating a new or larger team, you should either create a new fork of the project repository, or rename the folder in your working fork to accurately represent the members of your new team. It is up to you how you combine your individual scripts and whether you wish to include or discard any redundant scripts or configuration files, but your project structure should be consistent with that of a single team and single team project.
As before, the course staff may commit and push additional fixes or updates to the repository before the project deadline, so check for updates and pull them into your fork on a regular basis. If you are utilizing another team's data or scripts, or are allowing another team to utilize your data or scripts, do not duplicate them in your own file tree. Instead, reference them in their original location (using the other team's collection prefix identifier and/or folder path). If another team would like to use an updated version of your data sets and/or scripts, perform a pull request so that the course staff can merge your latest code in the project repository. At that point, the other team can pull that update from the project repository.
Choose a problem or family of problems that you wish to solve using some collection of relevant data sets. These data sets could already be available in the repository (thanks to a Project #1 submission by your team or another team), or they may be additional data sets you plan to obtain with new retrieval scripts as you did in Project #1. The problem itself could require solving a graph problem, performing a statistical analysis or analyses, solving a constraint satisfaction or optimization problem, developing a scoring mechanism, or achieving some other quantifiable goal for which at least some of the techniques we have learned so far might be useful. It could be the original problem you intended to solve in Project #1, or something new.
1. Write a short narrative and justification (at least 7-10 sentences) explaining how the data sets and techniques you hope to employ can address the problem you chose. Depending on what techniques you are using and what problem you are trying to solve, you may need to provide a justification or evaluation methodology for the techniques you are using (i.e., why you believe the chosen techniques will solve the problem you are trying to address). Include this narrative in the README.md file within your directory (you may only need to update your existing file).
2. Implement a non-trivial constraint satisfaction or optimization technique that requires the data sets you have collected and addresses aspects of the problem or problems described in part (a). You can choose one of the existing algorithms or tools we have considered so far, or something else that qualifies as a constraint satisfaction or optimization technique (assuming you thoroughly document in your README.md file how to obtain and run those tools). The technique can be a combinatorial or specialized algorithm (i.e., you can use k-means, gradient descent, dynamic programming, greedy algorithms, graph algorithms, and so on).
3. Implement a non-trivial statistical analysis or inference algorithm that uses the data sets you have collected and addresses aspects of the problem or problems described in part (a). As in part (b), you can choose one of the existing algorithms or tools we have considered so far, or something else (assuming you thoroughly document in your README.md file how to obtain and run those tools).
As in Project #1, your algorithms should be implemented within scripts that extend the dml.Algorithm base class, should follow reasonable modularity and encapsulation practices, and should perform logically related operations. If the results you compute consist of new data sets; these should be inserted into the repository along with all the others in the usual way (and should be considered derived data sets).
1. Ensure that whatever scripts or algorithms you implemented in Problem #2 can also produce the provenance information documenting the activities that take place and the data they generate during their operation. Each algorithm should produce a single provenance document when its provenance() method is invoked.
2. Ensure that whatever scripts or algorithms you implemented in Problem #2, add the capability to run the algorithm in trial mode, which should occur if the trial parameter of the execute() method is set to True. In trial mode, the algorithm should complete its execution very quickly (in at most a few seconds) by operating on a very small portion of the input data set(s). However, it should still run through most (or, ideally, all) the code paths in the algorithm definition when it does so. This will make it possible to easily test the algorithm without running it on the entire data set.

[link]
5. Visualizations and Web Services

[link]
5.1. Visualizations

[link]
5.2. Web Services

In this section, we present examples of how a few existing tools can be used to implement and test a basic web service. More generally, there are innumerable frameworks for building server-side and client-side application components.

[link]

Example:

The following presents one possible web service implementation using the Flask framework.


import jsonschema
from flask import Flask, jsonify, abort, make_response, request
from flask.ext.httpauth import HTTPBasicAuth

app = Flask(__name__)
auth = HTTPBasicAuth()

users = [
  { 'id': 1, 'username': u'alice' },
  { 'id': 2, 'username': u'bob' }
]

schema = {
  "type": "object", 
  "properties": {"username" : {"type": "string"}},
  "required": ["username"]
}

@app.route('/client', methods=['OPTIONS'])
def show_api():
    return jsonify(schema)

@app.route('/client', methods=['GET'])
@auth.login_required
def show_client():
    return open('client.html','r').read()

@app.route('/app/api/v0.1/users', methods=['GET'])
def get_users(): # Server-side reusable name for function.
    print("I'm responding.")
    return jsonify({'users': users})

@app.route('/app/api/v0.1/users/', methods=['GET'])
def get_user(user_id):
    user = [user for user in users if user['id'] == user_id]
    if len(user) == 0:
        abort(404)
    return jsonify({'user': user[0]})

@app.errorhandler(404)
def not_found(error):
    return make_response(jsonify({'error': 'Not found foo.'}), 404)

@app.route('/app/api/v0.1/users', methods=['POST'])
def create_user():
    print(request.json)
    if not request.json:
        print('Request not valid JSON.')
        abort(400)

    try:
        jsonschema.validate(request.json, schema)
        user = { 'id': users[-1]['id'] + 1, 'username': request.json['username'] }
        users.append(user)
        print(users)
        return jsonify({'user': user}), 201
    except:
        print('Request does not follow schema.')
        abort(400)

@auth.get_password
def foo(username):
    if username == 'alice':
        return 'ecila'
    return None

@auth.error_handler
def unauthorized():
    return make_response(jsonify({'error': 'Unauthorized access.'}), 401)

if __name__ == '__main__':
    app.run(debug=True)

The HTML page below can be used to test the POST request handling capabilities of the application.


<script>
  function postSomething() {
    var http = new XMLHttpRequest();
    var url = "http://localhost:5000/app/api/v0.1/users";
    var json = JSON.stringify({'name':'carl'});
    http.open("POST", url, true);
    http.setRequestHeader("Content-type", "application/json");
    http.onreadystatechange = function() {
      if (http.readyState == 4 && http.status == 200) {
        console.log(http.responseText);
      }
    }
    http.send(json);
  }
</script>
<button onclick="postSomething();">Post new user</button>

[link]
5.3. Project #3: Visualizations, Web Services, and Complete Project

The purpose of this project is to expose the data sets and capabilities you assembled in previous projects using visualizations and/or web services, and to present your completed implementation (and any associated results). This is also the last project submission, and should represent the culmination of your work during the semester.

Extend your project with two new features/components, where each feature/component is either an interactive web-based visualization that can be displayed in a standard web browser or a web service with a RESTful web API. You might choose to create two visualizations given the results you produced a previous project, or you might choose to create an interactive client-server (i.e., web interface and web service) application that allows users to invoke an algorithm or statistical analysis using their own specific parameters.
Complete a well-written and thorough description of your project (and the components you implemented) in the form of a report (which can be a README.md file within your repository, though HTML or PDF are also acceptable). It is expected that the report should come out to at least 3-5 pages (if printed in a 12-point font with 1.5 spacing at most on 8.5 by 11 in. sheets), but there's no upper limit on length.
The report should bring together all the text, diagrams, and images you generated and put them together into a coherent summary that explains what you did for someone who is not familiar with your project. Some ideas for what you may want to include (this is a rough outline in the form of introduction, methods, results, and future work, with suggested subtopics that may or may not apply in your case, so adjust or substitute as appropriate).
- an introduction/narrative describing the problem or topic you chose to address and any associated motivation or background:
  - list the data sets you used or other online resources you collected or retrieved automatically, and their origins, and
  - describe any data sets you assembled manually or programmatically using the above;
- specify which algorithms, analysis techniques, or tools you used, and describe any interesting issues you encountered with regard to:
  - usability and/or performance of the tool or technique,
  - necessary adjustments or transformations to the data in order to make it compatible with the tool or technique, and/or
  - limitations of the technique or the data;
- summarize the results and your conclusions about the problem or topic (whether or not they are definitive or open problems remain), and include screenshots of the visualizations (or the visualizations themselves);
- describe any ideas for future work that you had while working on the project that you think might be useful to pursue, or that you did not have a chance to pursue (this will be particularly useful to us for the next iteration of the course).
Complete a poster for the poster session that will take place 3-5 PM on Tuesday, May 1st at the Hariri Institute for Computing. The poster can contain the same information that your report contains, though visuals are usually more effective. You may choose to bring a demo to accompany your poster presentation. Submit your poster file as part of the pull request with your final submission. You may place the poster in the root directory of your project/group folder.

[link] References

[1]	"World's population increasingly urban with more than half living in urban areas". https://www.un.org/development/desa/en/news/population/world-urbanization-prospects.html
[2]	Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. West. "Growth, innovation, scaling, and the pace of life in cities". Proceedings of the National Academy of Sciences of the United States of America 2007;104(17):7301-7306. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852329/
[3]	Robert Albright, and Alan Demers, Johannes Gehrke, Nitin Gupta, Hooyeon Lee, Rick Keilty, Gregory Sadowski, Ben Sowell, and Walker White. "SGL: A Scalable Language for Data-driven Games". Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data 2008. http://www.cs.cornell.edu/~sowell/2008-sigmod-games-demo.pdf
[4]	Walker White, Benjamin Sowell, Johannes Gehrke, and Alan Demers. "Declarative Processing for Computer Games". Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games 2008. http://www.cs.cornell.edu/~sowell/2008-sandbox-declarative.pdf
[5]	W3C Working Group. "PROV Model Primer". https://www.w3.org/TR/prov-primer/
[6]	Robert Ikeda and Jennifer Widom. "Data Lineage: A Survey". http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf
[7]	Y. Cui and J. Widom. "Lineage Tracing for General Data Warehouse Transformations". The VLDB Journal 2003;12(1):41-58. http://ilpubs.stanford.edu:8090/525/1/2001-5.pdf
[8]	Gerd Gigerenzer. "Mindless statistics". The Journal of Socio-Economics 2004;33(5):587-606. http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf
[9]	Nihar B. Shah and Dengyong Zhou. "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing". CoRR 2014. http://www.eecs.berkeley.edu/~nihar/publications/double_or_nothing.pdf

[link]
Appendix A. Using GitHub

In this course, staff will use GitHub to distribute skeleton and evaluation code for assigned projects, and students will maintain their code, collaborate on their project, and submit their solution or implementation using GitHub. Students will need to register on GitHub. Students should be able to sign up for a student discount, as well, using their Boston University credentials.

[link]
A.1. Retrieving and Submitting Assigned Projects on GitHub

We assume that students are organized into groups (initially up to two students working together, though this limit will be raised to three for the later projects). Each group should designate a primary member. The workflow for student projects will be as follows.

For each project assigned during the course, a private repository will be created by course staff under the Data-Mechanics organization account with a name of the form course-YYYY-SSS-proj-NNNN where YYYY is the year, SSSS is the semester (spr or fal) and NNNN is zero, one, two, and so on.
- The first project (course-YYYY-SSS-proj-zero) will be a public repository so that those who are not members of the Data-Mechanics organization can see and fork it in the steps below, as that will be how the course staff obtain everyone's GitHub usernames.
To begin working on that particular project, one of the group's members will fork that project repository, and will ensure that the other group members have the necessary privileges to contribute to it by adding them as collaborators.
- Within the forked repository's file tree, each group will create a subdirectory of the form BUlogin1, BUlogin1_BUlogin2, or BUlogin1_BUlogin2_BUlogin3 (depending on the number of members), where the login names BUloginN are the official Boston University login names for the members, and they are ordered in ascending alphabetical order and separated by underscores. All changes constituting work on the project should be made within this subdirectory and nowhere else, unless otherwise specified in the posted project instructions.
- It will be the responsibility of the group to ensure that their repository can be synced with the original project repository (this step only needs to be performed once).
- The group must also regularly sync their fork with the original repository in case course staff have made any changes to it. Course staff will usually announce when such changes take place.
The group members may use as many branches as they need in their own fork, but their master branch will represent the group's submission. At some point before the project deadline, the group must submit a pull request to officially submit their work. Only the changes that were committed before the merge request is made will be accepted as submitted.

[link]
Appendix B. Other Resources

[link]
B.1. MongoDB and Related Resources

In this course, staff will use MongoDB to store and work with large data sets. This section lists some resources that you may find useful.

Some references, tutorials, and other materials of interest mentioned during lecture include the following.

The official documentation (https://docs.mongodb.org/manual/), and in particular:
- how to assemble relational queries (https://docs.mongodb.org/manual/core/aggregation-pipeline/) and
- how to assemble map-reduce queries (https://docs.mongodb.org/manual/core/map-reduce/).
Issues and feature requests mentioned during lecture include:
- flattening the results generated by mapReduce operations: https://jira.mongodb.org/browse/SERVER-2517 and
- deprecation and future removal of server-side eval() and stored procedure capabilities: https://jira.mongodb.org/browse/SERVER-17453.
A tutorial suggested by a student: http://openmymind.net/mongly/.
PyMongo Python library documentation: http://api.mongodb.org/python/current/.
Some resources dealing with using and building geospatial data set indices include:
- the GeoJSON format,
- using geospatial indices and queries in MongoDB,
- example illustrating how to create a GEO2D index using PyMongo, and
- extracts of OpenStreetMap data for metropolitan areas.

[link]
B.2. Installation Resources for Other Software Packages and Libraries

This subsection contains links to other useful resources that may help with installing packages and libraries we use in this course.

Some resources that may be useful when working with the prov package:

when installing the prov library for Python on Windows, you may have some trouble installing the lxml package. Try obtaining the precompiled package here, instead, and installing it using pip install *.whl;
a short tutorial is available here.

As we use the JSON format for storing authentication, configuration, and provenance information, the JSON Schema is useful and there is a corresponding jsonschema Python package that you will need to install.

Two useful kinds of tools for solving problems that can be represented as systems of equations and inequalities are SMT solvers and linear optimization packages.

The SciPy library collection includes the optimize.linprog module for solving linear optimization problems, and the library is relatively straightforward to install:
- using Windows, it may be easiest to obtain and install precompiled binaries for numpy and scipy (in that order);
- using Mac OS X or Linux, you will likely need to download and install a Fortran compiler first, after which running pip install numpy and pip install scipyshould be sufficient.
The Z3 SMT solver is now open source and very easy to use with Python; precompiled binaries are also available.

Data Mechanicsfor Pervasive Systems and Urban Applications

[link] 1. Introduction, Background, and Motivation

[link] 1.1. Overview

[link] 1.2. Data Mechanics Repository and Platform

[link] 1.3. Mathematical Modeling, Analysis Algorithms, and Optimization Techniques

[link] 1.4. Project #0: Preparation and Trial Submission

[link] 2. Modeling Data and Data Transformations

[link] 2.1. Relational Data and the MapReduce Paradigm

[link] 2.2. Composing Transformations into Algorithms

[link] 2.3. Data Provenance

[link] 2.4. Project #1: Data Retrieval, Storage, Provenance, and Transformations

[link] 3. Systems, Models, and Algorithms

[link] 3.1. Systems, Models, and Metrics

[link] 3.2. Linear systems, satisfiability modulo theories, and linear programming

[link] 3.3. Graph and Spatial Problems as Constraint Satisfaction and Optimization Problems

[link] 3.4. Decomposition Techniques

[link] 4. Statistical Analysis

[link] 4.1. Review of Facts about Projections from Linear Algebra

[link] 4.2. Defining Mean and Standard Deviation using Concepts in Linear Algebra

[link] 4.3. Covariance and Correlation

[link] 4.4. Observations, Hypothesis Testing, and Significance

[link] 4.5. Sampling and Inference

[link] 4.6. Project #2: Modeling, Optimization, and Statistical Analysis

[link] 5. Visualizations and Web Services

[link] 5.1. Visualizations

[link] 5.2. Web Services

[link] 5.3. Project #3: Visualizations, Web Services, and Complete Project

[link] References

[link] Appendix A. Using GitHub

[link] A.1. Retrieving and Submitting Assigned Projects on GitHub

[link] Appendix B. Other Resources

[link] B.1. MongoDB and Related Resources

[link] B.2. Installation Resources for Other Software Packages and Libraries

[link]
1. Introduction, Background, and Motivation

[link]
1.1. Overview

[link]
1.2. Data Mechanics Repository and Platform

[link]
1.3. Mathematical Modeling, Analysis Algorithms, and Optimization Techniques

[link]
1.4. Project #0: Preparation and Trial Submission

[link]
2. Modeling Data and Data Transformations

[link]
2.1. Relational Data and the MapReduce Paradigm

[link]
2.2. Composing Transformations into Algorithms

[link]
2.3. Data Provenance

[link]
2.4. Project #1: Data Retrieval, Storage, Provenance, and Transformations

[link]
3. Systems, Models, and Algorithms

[link]
3.1. Systems, Models, and Metrics

[link]
3.2. Linear systems, satisfiability modulo theories, and linear programming

[link]
3.3. Graph and Spatial Problems as Constraint Satisfaction and Optimization Problems

[link]
3.4. Decomposition Techniques

[link]
4. Statistical Analysis

[link]
4.1. Review of Facts about Projections from Linear Algebra

[link]
4.2. Defining Mean and Standard Deviation using Concepts in Linear Algebra

[link]
4.3. Covariance and Correlation

[link]
4.4. Observations, Hypothesis Testing, and Significance

[link]
4.5. Sampling and Inference

[link]
4.6. Project #2: Modeling, Optimization, and Statistical Analysis

[link]
5. Visualizations and Web Services

[link]
5.1. Visualizations

[link]
5.2. Web Services

[link]
5.3. Project #3: Visualizations, Web Services, and Complete Project

[link]
Appendix A. Using GitHub

[link]
A.1. Retrieving and Submitting Assigned Projects on GitHub

[link]
Appendix B. Other Resources

[link]
B.1. MongoDB and Related Resources

[link]
B.2. Installation Resources for Other Software Packages and Libraries