Data Mechanicsfor Pervasive Systems and Urban Applications

NOTE: This page contains all the examples presented during the lectures, as well as all the assigned projects. Click here to go back to the main page with the course information and schedule.

1. Introduction, Background, and Motivation
2. Modeling Data and Data Transformations
3. Systems, Models, and Algorithms
4. Statistical Analysis
5. Visualizations and Web Services
- 5.1. Visualizations
- 5.2. Web Services
- 5.3. Project #3: Visualizations, Web Services, and Complete Project
References
Appendix A. Using GitHub
- A.1. Retrieving and Submitting Assigned Projects on GitHub
Appendix B. Other Resources
- B.1. MongoDB and Related Resources
- B.2. Installation Resources for Other Software Packages and Libraries

[link] 1. Introduction, Background, and Motivation

[link] 1.1. Overview

With over half of the world's population living in cities [#], and given the possibility that cities are a more efficient [#] way to organize and distribute resources and activities, the rigorous study of cities as systems that can be modeled mathematically is a compelling proposition. With the advent of pervasive infrastructures of sensors and computational resources in urban environments (smart cities, software-defined cities, and the Internet of Things), there is a potential to inform and validate such models using actual data, and to exploit them using the interoperability of the pervasive computational infrastructure that is available.

In this course, we introduce and define the novel term data mechanics to refer specifically to the study of a particular aspect of large, instrumented systems such as cities: how data can flow through institutions and computational infrastructures, how these flows can interact, integrate, and cleave, and how they can inform decisions and operations (in both online and offline regimes). We choose the term mechanics specifically because it connotes the study of mechanics (e.g., classical mechanics) in physics, where universal laws that govern the behavior and interactions between many forms of matter are studied. We also choose this term to emphasize that, often, the data involved will be tied to the physical environment: geospatial and temporal information is a common and integral part of data sets produced and consumed by instrumented environments.

[link] 1.2. Data Mechanics Repository and Platform

This course is somewhat unusual in that there are specific, concrete software development goals towards which the staff and students are working. In particular, one goal of the course is to construct a new general-purpose service platform for collecting, integrating, analyzing, and employing (both in real time and offline) real data sets derived from urban sensors and services (particularly those in the city of Boston). We will consider several real data sets and data feeds, including:

We will also secure additional data sets from the above sources as well as a few others, and for the purposes of student projects, there are no limits on other sources of data or computation (e.g., Twitter, Mechanical Turk, and so on) that can be employed in conjunction with some of the above.

Because this course involves the construction of a relatively large software application infrastructure, we will have the opportunity to introduce and practice a variety of standard software development and software engineering concepts and techniques. This includes source control, collaboration, documentation, modularity and encapsulation, testing and validation, inversion of control, and others. Students will also need to become familiar with how to use web service APIs used by government organizations (e.g., Socrata) to make queries and retrieve data.

The overall architecture of the service platform will have at least the following:

a database/storage backend ("repository") that houses:
- original and derived data sets, with annotations that include:
  - from where, when, and by what algorithm it was retrieved
  - using what integration algorithms it was derived
- algorithms for data retrieval or integration, with references that include:
  - when it was written and by whom
  - in what data sets it is derived
  - from what component algorithms it is composed (if it is such)
a web service ("platform") witha n application program interface (API) for running analysis and optimization algorithms:
- a defined language for defining analysis and optimization algorithms over the data stored in the repository
- an interface for submitting and running algorithms
other features for data and result visualization, simulation using partial data, etc.

[link] 1.3. Mathematical Modeling, Analysis Algorithms, and Optimization Techniques

There are a variety of problems that it may be possible to address using the data in the repository and the capabilities of the platform. This course will cover a variety of useful online and offline optimization topics in a mathematically rigorous but potentially application-specific way, including:

a defined language for defining analysis and optimization algorithms over the data stored in the repository,
dual decomposition,
online optimization.

The goal is to apply some of these techniques to the data sets(including integrated or derived data sets) and solve practical problems. Some of the problems raised in discussions with the City of Boston DoIT and MassDOT teams are:

characterizing intersections and coming up with a metric that incorporates:
- intersection throughput (people per unit time),
- modes of transportation (public transport, biking, walking),
- intersection safety (vehicle speed, accidents, and so on),
- intersection organization (no left turns, and so on);
characterizing streets and deriving metrics for:
- number of parking spaces,
- probability of an accident (e.g., using different modes of transportation),
- senior and handicapped mobility;
characterizing neighborhoods:
- economic condition and gentrification,
- senior and handicapped accessibility;
how to allocate resources to optimize some of the metrics above:
- where to perform repairs,
- how to improve housing affordability,
- where to place bike racks of ride sharing stations,
- senior and handicapped accessibility;
answering immediate questions relevant to an individual (e.g., in an app):
- is a building accessible,
- is there a place to park (or will there likely be at some other time),
- is a neighborhood safe (or is it becoming less or more safe),
- is a neighborhood affordable (or is it becoming less or more affordable),
- are healthcare services nearby.
integrating public transportation data with other data (e.g., Waze):
- why are buses on a certain route late,
- performance and problems during unexpected events (e.g., snow).

The above list is far from complete, and we will update it as the course progresses. Students are encouraged to discuss project ideas with faculty and one another.

[link] 1.4. Project #0: Preparation and Trial Submission

The purpose of this project is to obtain the necessary credentials for subsequent projects and for everyone (including course staff) to become familiar with the submission process and address any issues.

Choose an initial team consisting of one or two members. If Alice and Bob form a team and have the BU login names alice@bu.edu and bob@bu.edu, the unique identifier for their team will be alice_bob, where, in cases with two members, the two usernames are in ascending alphabetical order and separated by an underscore. For a one-member team, it is simply the BU login name of the sole member (e.g., alice).
Depending on which data you think you might use, sign up for the City of Boston Data Portal and/or the MBTA Developer Portal. We will make other, larger raw data sets from these organizations available soon, once we obtain them and set up the appropriate servers.
Create a GitHub account and follow the GitHub submission instructions by forking the public repository at https://github.com/Data-Mechanics/course-2016-fal-proj-zero, adding a single folder named using the group identifier (alice or alice_bob) that contains a single ASCII text file members.txt. Each team member should commit a line to that text file specifying a mapping from their GitHub username and their BU username. For example, if Alice and Bob have the GitHub usernames alicegh and bobgh, respectively, then the file should look as follows:
```
alicegh:alice
bobgh:bob
        
```
One of the team members should then make a pull request as specified in the instructions to submit the project. On subsequent projects, you will be forking a private repository once we add your user account to the Data-Mechanics organization (which we will retrieve from the members.txt file you add above).
Download and install Python 3 and MongoDB on your system if you do not have them already, review the Socrata API (try the example below with different data sets and query parameters), then install the Data Mechanics Library (dml) and the PROV (prov) modules for Python.
```
import urllib.request
import json

url = "https://data.cityofboston.gov/resource/awu8-dc52.json?$limit=10"
response = urllib.request.urlopen(url).read().decode("utf-8")
print(json.dumps(json.loads(response), sort_keys=True, indent=2))
        
```

[link] 2. Modeling Data and Data Transformations

To study rigorously the ways data can behave and interact within an infrastructure that generates, transforms, and consumes data (e.g., to make decisions), it is necessary to define formally what data and data transformations are. One traditional, widely used, and widely accepted model for data is the relational model: any data set is a relation (i.e., a subset of a product of sets), and transformations on data are functions between relations. While the relational model is sufficient to define any transformation on data sets, the MapReduce paradigm is one modern framework for defining transformations between data sets.

In modern contexts and paradigms, these models can be useful when studying relatively small collections of individual, curated data sets that do not change dramatically in the short term. However, these alone are not sufficient in a context in which data sets are overwhelmingly multitudinous, varying in structure, and continuously being generated, integrated, and transformed. One complementary discipline that can provide useful tools for dealing with numerous interdependent data sets is that of data provenance or data lineage. The provenance of a data set (or subset) is a formal record of its origin, which can include how the data was generated, from what data or other sources it was created or derived, what process was used or was responsible for creating or deriving it, and other such information. This information can be invaluable for a variety of reasons beyond the obvious ones (i.e., the origin of the data), such as:

the same data be generated again or reproduced if an error occurs,
a data set can be updated from the original source if the source has been updated or changed,
the source of an inconsistency or aberration of data can be investigated,
any of the above could be applied to a subset because recomputing or investigating the entire data set would be prohibitively costly.

[link] 2.1. Relational Data and the MapReduce Paradigm

The relational model for data can be expressed in a variety of ways: a data set is a relation on sets, a logical predicate governing terms, a collection of tuples or records with fields and values, a table of rows with labelled columns, and so on. Mathematically, they are all equivalent. In this course, we will adopt a particular model because it is well-suited for the tools and paradigms we will employ, and because it allows for one fairly clean mathematical integration of the study of relational data and data provenance.

[link]

Definition: A data set (also known as a store or database) is a multiset R: a collection (possibly with duplicates) of tuples of the form (x₁,...,x_n) taken from the set product X₁ × ... × X_n. Typically, some distinguished set (e.g., the left-most in the set product) will be a set of keys, so that every tuple contains a key. Whether a set is a key or not often depends on the particular paradigm and context, however.

[link]

Definition: A data transformation T: A → B is a mapping from one space of data sets A to another space of data sets B. Notice that an individual data set S (i.e., a relation or, equivalently, a set of tuples) is just an element of A.

Some of the typical building blocks for data transformations in the relational model are:

union and difference (intersection can also be defined in terms of these),
projection (sometimes generalized into extended projection),
selection (filtering),
renaming,
Cartesian product,
variants of join operations (many can be constructed using the above, but other variants have been added as extensions),
aggregation (an extension).

One common operation on relations that is not possible to express in traditional formulations but found in some relational database systems is the transitive closure of a data set. Normally, this requires an iterative process consisting of a sequence of join operations.

[link]

Example: We can model and illustrate the transformations that constitute the MapReduce paradigm using Python. Note that selection and projection can be implemented directly using Python comprehensions, but we define wrappers below for purposes of illustration.


def union(R, S):
    return R + S

def difference(R, S):
    return [t for t in R if t not in S]

def intersect(R, S):
    return [t for t in R if t in S]

def project(R, p):
    return [p(t) for t in R]

def select(R, s):
    return [t for t in R if s(t)]
 
def product(R, S):
    return [(t,u) for t in R for u in S]

def aggregate(R, f):
    keys = {r[0] for r in R}
    return [(key, f([v for (k,v) in R if k == key])) for key in keys]

In the MapReduce paradigm, a smaller set of building blocks inspired by the functional programming paradigm (supported by languages such as ML and Haskell) exist for defining transformations between data sets. Beyond adapting (with some modification) the map and reduce (a.k.a., "fold") functions from functional programming, the contribution of the MapReduce paradigm is the improvement in the performance of these operations on very large distributed data sets. Because of the elegance of the small set of building blocks, and because of the scalability advantages under appropriate circumstances, it is worth studying the paradigm's two building blocks for data transformations: map and reduce:

a map operation will apply some user-specified computation to every tuple in the data set, producing one or more new tuples,
a reduce operation will apply some user-specified aggregation computation to every set of tuples having the same key, producing a single result.

Notice that there is little restriction on the user-specified code other than the requirement that it be stateless in the sense that communication and coordination between parallel executions of the code is impossible. It is possible to express all the building blocks of the relational model using the building blocks of the MapReduce paradigm.

[link]

Example: We consider a few simple examples that illustrate how transformations can be constructed within the relational model. We start by showing how select can be used with a predicate to filter a data set.


>>> def red(t): return t == 'tomato'
>>> select(['banana', 'tomato'], red)
['tomato']

Suppose we have two data sets and want to join them on a common field. The below sequence illustrates how that can be accomplished by building up the necessary expression out of simple parts.


>>> X = [('Alice', 22), ('Bob', 19)]
>>> Y = [('Alice', 'F'), ('Bob', 'M')]

>>> product(X,Y)
[(('Alice', 'F'), ('Alice', 22)), (('Alice', 'F'), ('Bob', 19)), (('Bob', 'M'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> select(product(X,Y), lambda t: t[0][0] == t[1][0])
[(('Alice', 'F'), ('Alice', 22)), (('Bob', 'M'), ('Bob', 19))]

>>> project(select(product(X,Y), lambda t: t[0][0] == t[1][0]), lambda t: (t[0][0], t[0][1], t[1][1]))
[('Alice', 'F', 22), ('Bob', 'M', 19)]

Finally, the sequence below illustrates how we can compute an aggregate value for each unique key in a data set (such as computing the total age by gender).


>>> X = [('Alice', 'F', 22), ('Bob', 'M', 19), ('Carl', 'M', 25), ('Eve', 'F', 27)]

>>> project(X, lambda t: (t[1], t[2]))
[('F', 22), ('M', 19), ('M', 25), ('F', 27)]

>>> aggregate(project(X, lambda t: (t[1], t[2])), sum)
[('F', 49), ('M', 44)]

The following sequence explains what is happening inside the aggregate function for a particular key.


>>> Y = project(X, lambda t: (t[1], t[2]))
>>> keys = {t[0] for t in Y}
>>> keys
{'F', 'M'}
>>> [v for (k,v) in Y if k == 'F']
[22, 27]
>>> sum([v for (k,v) in Y if k == 'F'])
49
>>> ('F', sum([v for (k,v) in Y if k == 'F']))
('F', 49)

[link]

Example: We can model and illustrate the two basic transformations that constitute the MapReduce paradigm in a concise way using Python.


def map(f, R):
    return [t for (k,v) in R for t in f(k,v)]
    
def reduce(f, R):
    keys = {k for (k,v) in R}
    return [f(k1, [v for (k2,v) in R if k1 == k2]) for k1 in keys]

A map operation applies some function f to every key-value tuple and produces zero or more new key-value tuples. A reduce operation collects all values under the same key and performs an aggregate operation f on those values. Notice that the operation can be applied to any subset of the tuples in any order, so it is often necessary to use an operation that is associated and commutative.

At the language design level, the relational model and the MapReduce paradigm are arguably complementary and simply represent different trade-offs; they can be used in conjunction on data sets represented as relations. Likewise, implementations of the two represent different performance trade-offs. Still, contexts in which they are used can also differ due to historical reasons or due to conventions and community standards.

[link]

Example: We can use the Python implementations of the map and reduce operations in an example above to implement some common transformations in the relational model. For this example, we assume that the first field in each tuple in the data set is a unique key (in general, we assume there is a unique key field if we are working with the MapReduce paradigm). We illustrate how selection can be implemented in the code below.


R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]

X = map(lambda k,v: [((k,v), (k,v))] if v > 20 else [], R) # Selection.
Y = reduce(lambda k,vs: k, X) # Keep same tuples (use tuples as unique keys).

We can also perform projection and aggregation.


R = [('Alice', ('F', 23)), ('Bob', ('M', 19)), ('Carl', ('M', 22))]

X = map(lambda k,v: [(v[0], v[1])], R) # Projection keeps only gender and age.
Y = reduce(lambda k,vs: (k, sum(vs)), X) # Aggregates ages by gender.

We can also perform a simple join operation, although we also need to "combine" the two collections of data. This particular join operation is also simple because each tuple in the input is only joined with one other tuple.


R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
S = [('Alice', 'F'), ('Bob', 'M'), ('Carl', 'M')]

X =   map(lambda k,v: [(k, ('Age', v))], R)\
    + map(lambda k,v: [(k, ('Gender', v))], S)
Y = reduce(\
        lambda k,vs:\
          (k,(vs[0][1], vs[1][1]) if vs[0][0] == 'Age' else (vs[1][1],vs[0][1])),\
        X\
      )

[link]

Example: Suppose you have a data set containing tuples of the form (name, gender, age). You want to produce a result of the form (gender, total) where total is the sum of the age values of all tuples with the corresponding gender. The code below illustrates using Python how this can be done in the MapReduce paradigm.


INPUT = [('Alice', ('F', 19)),\
         ('Bob', ('M', 23)),\
         ('Carl', ('M', 20)),\
         ('Eve', ('F', 27))]
TEMP = map(lambda k,v: [(v[0], v[1])], INPUT)
OUTPUT = reduce(lambda k,vs: (k, sum(vs)), TEMP)

We provide an equivalent MapReduce paradigm implementation of the algorithm using MongoDB.


db.INPUT.insert({_id:"Alice", gender:"F", age:19});
db.INPUT.insert({_id:"Bob", gender:"M", age:23});
db.INPUT.insert({_id:"Carl", gender:"M", age:20});
db.INPUT.insert({_id:"Eve", gender:"F", age:27});

db.INPUT.mapReduce(
  function() {
    emit(this.gender, {age:this.age});
  },
  function(k, vs) {
    var total = 0;
    for (var i = 0; i < vs.length; i++)
        total += vs[i].age;
    return {gender:k, age:total};
  },
  {out: "OUTPUT"}
);

[link]

Exercise: Suppose you have a data set containing tuples of the form (name, city, age). You want to produce a result of the form (city, range) where range is defined as the difference between the oldest and youngest person in each city.

Construct a sequence of transformations that will yield this result using the basic building blocks available in the relational model (projections, selections, aggregations, and so on).
Use the MapReduce paradigm to construct a single-pass (one map operation and one reduce operation) MapReduce computation that computes the desired result. Hint: emit two copies of each entry (one for the computation of the maximum and one for the computation of the minimum).

[link]

Exercise: Consider the following line of Python code that operates on a data set D containing some voting results broken down by voter, state where the voter participated in voting, and the candidate they chose:


sum([1 for (person, state, candidate) in D if state == "Massachusetts" and candidate == "Trump"])

Identify which building blocks for transformations available in the relational model are being used in the above code, and draw a corresponding flow diagram that shows how they fit together.

[link] 2.2. Composing Transformations into Algorithms

Whether we are using the relation model or the MapReduce paradigm, the available building blocks can be used to assemble fairly complex transformations on data sets. Each transformation can be written either using the concrete syntax of a particular programming language that implements the paradigm, or as a data flow diagram that describes how starting and imtermediate data sets are combined to derive new data sets over the course of the algorithm's operation.

[link]

Example: We can use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the k-means clustering algorithm.


def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2

def plus(args):
    p = [0,0]
    for (x,y) in args:
        p[0] += x
        p[1] += y
    return tuple(p)

def scale(p, c):
    (x,y) = p
    return (x/c, y/c)

M = [(13,1), (2,12)]
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

OLD = []
while OLD != M:
    OLD = M

    MPD = [(m, p, dist(m,p)) for (m, p) in product(M, P)]
    PDs = [(p, dist(m,p)) for (m, p, d) in MPD]
    PD = aggregate(PDs, min)
    MP = [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
    MT = aggregate(MP, plus)

    M1 = [(m, 1) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
    MC = aggregate(M1, sum)

    M = [scale(t,c) for ((m,t),(m2,c)) in product(MT, MC) if m == m2]
    print(sorted(M))

Below is a flow diagram describing the overall organization of the computation (nodes are data sets and edges are transformations). Note that the nodes labeled "..." are intermediate results that are implicit in the comprehension notation. For example, [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2] first filters product(MPD, PD) using a selection criteria if p==p2 and d==d2 and then performs a projection from tuples of the form ((m,p,d), (p2,d2)) to tuples of the form (m, p) to obtain the result.

!table([ ['rd:prod``P', null, null, 'd:agg min``PDs'], [null, 'r:proj``...', 'dr:prod`ru:proj``MPD', 'd:prod``PD'], ['ru:prod``M', null, null, 'd:selection``...'], ['u:proj``...', 'ld:prod``MC', 'l:agg sum``M1', 'l:proj`d:proj``...'], ['u:selection``...', 'l:prod``MT', null, 'll:agg plus``MP'] ])

[link]

Example: We can also use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the Floyd-Warshall shortest path algorithm.


N = ['a','b','c','d','e','f']
E = [('a','b'),('b','c'),('a','c'),('c','d'),('d','e'),('e','f'),('b','f')]

P = product(N,N)
I = [((x,y),100 if x != y else 0) for (x,y) in P]
D = [((x,y),1) for (x,y) in E]

OUTPUT = aggregate(union(I,D), min)
STEP = []
while sorted(STEP) != sorted(OUTPUT):
    STEP = OUTPUT
    P = product(STEP, STEP)
    NEW = union(STEP,[((x,v),k+m) for (((x,y),k),((u,v),m)) in P if u == y])
    OUTPUT = aggregate(NEW, min)

SHORTEST = OUTPUT

[link]

Example: We can use building blocks drawn from the MapReduce paradigm that are available in MongoDB to construct an implementation of the k-means clustering algorithm. This implementation illustrates many of the idiosyncracies of the MapReduce abstraction made available by MongoDB.


db.M.remove({});
db.P.remove({});
db.createCollection("M");
db.createCollection("P");

db.system.js.save({ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});

function flatten(X) {
    db[X].find().forEach(function(x) { db[X].update({_id: x._id}, x.value); });
}

function prod(X, Y, XY) {
  db[XY].remove({});
  db.createCollection(XY);
  db[X].find().forEach(function(x) {
    db[Y].find().forEach(function(y) {
      db[XY].insert({left:x, right:y});
    });
  });
}

function union(X, Y, XY) {
  db[XY].remove({});
  db.createCollection(XY);
  db[X].find().forEach(function(x) {
    db[XY].insert(x);
  });
  db[Y].find().forEach(function(y) {
    db[XY].insert(y);
  });
}

// We'll only perform a single product operation.
// Using map-reduce, we can perform argmax and argmin more easily.
// We can also use map-reduce to compare progress.

db.M.insert([{x:13,y:1}, {x:2,y:12}]);
db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);

db.createCollection("HASHOLD");
db.createCollection("HASHNEW");

var iter = 0;
do {

  db.M.mapReduce(
    function() { emit("hash", {hash: this.x + this.y})},
    function(k, vs) {
      var hash = 0;
      vs.forEach(function(v) {
        hash += v.hash;
      });
      return {hash: hash};
    },
    {out: "HASHOLD"}
  );

  prod("M", "P", "MP");
  db.MPD.remove({});
  db.MP.mapReduce(
    function() { emit(this._id, {m:this.left, p:this.right, d:dist(this.left, this.right)}); },
    function() { },
    {out: "MPD"}
  );
  flatten("MPD");

  // No need to flatten when multiple map-reduce operations are taking place.

  db.MPD2.remove({});
  db.MPD.mapReduce(
    function() { emit(this.p._id, this); },
    function(key, vs) {
      var j = 0;
      vs.forEach(function(v, i) {
        if (v.d < vs[j].d)
          j = i;
      });
      return vs[j];
    },
    {out: "MPD2"}
  );

  // Remember that the reduce operation will be applied multiple times, in any order.
  // Cannot assume all will come in at once. Must be associative and commutative.

  db.MPD2.mapReduce(
    function() { emit(this.value.m._id, {x: this.value.p.x, y:this.value.p.y, c:1}); },
    function(key, vs) {
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;         
      });
      return {x:x, y:y, c:c};
    },
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
    }
  );
  flatten("M");

  db.M.mapReduce(
    function() { emit("hash", {hash: this.x + this.y})},
    function(k, vs) {
      var hash = 0;
      vs.forEach(function(v) {
        hash += v.hash;
      });
      return {hash: hash};
    },
    {out: "HASHNEW"}
  );

  var o = db.HASHOLD.find({}).limit(1).toArray()[0].value.hash;
  var n = db.HASHNEW.find({}).limit(1).toArray()[0].value.hash;
  print(o);
  print(n);
  print(iter);
  iter++;
} while (o != n);

[link]

Example: If we are certain that our k-means algorithm will have a relatively small (e.g., constant) number of means, we can take advantage of this by only tracking the means in a local variable and using .updateMany() to distribute the means to all the points at the beginning of each iteration. This leads to a much more concise (and, for a small number of means, efficient) implementation of the algorithm than what is presented in a previous example. In particular, it is no longer necessary to encode a production operation within MongoDB.


db.system.js.save({ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});

db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);

var means = [{x:13,y:1}, {x:2,y:12}];
do {
  db.P.updateMany({}, {$set: {means: means}}); // Add a field to every object.
  db.P.mapReduce(
    function() {
      var closest = this.means[0];
      for (var i = 0; i < this.means.length; i++)
        if (dist(this.means[i], this) < dist(closest, this))
          closest = this.means[i];
      emit(closest, {x:this.x, y:this.y, c:1});
    },
    function(key, vs) {
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;
      });
      return {x:x, y:y, c:c};
    },
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
    }
  );
  means = db.M.find().toArray().map(function(r) { return {x:r.value.x, y:r.value.y}; });
  printjson(means);
} while (true);

We do not deal with the issue of convergence in this example; an equality function on JSON/BSON objects (i.e., the list of means) would need to be defined to implement the loop termination condition.

[link] 2.3. Data Provenance

Data provenance is an overloaded term that refers, in various contexts and communities, to the source, origin, or lifecycle of a particular unit of data (which could be an individual data point, a subset of a data set, or an entire data set). In this course, we will use the term primarily to refer to dependency relationships between data sets that may be derived from one another (usually over time). The study of data provenance (also called data lineage) is arguably still being developed. However, more recently, standards for general-purpose representations of data provenance have been established (e.g., PROV [#]).

While the research literature explores various ways of categorizing approaches to data provenance, there are two dimensions that can be used to classify provenance techniques (surveyed in more detail in the literature [#]):

from where the data was generated (e.g., what data sets or individual data entries) and how it was generated (e.g., what algorithms were used);
whether the linear is tracked at a fine granularity (e.g., per data entry such as a row in a data set) or a coarse granularity (e.g., per data set).

Because data transformations can be broken down into building blocks (e.g., selections and projections in the relational model, or map and reduce operations in the MapReduce paradigm), when studying data lineage of individual data items it may be possible to derive standard techniques [#] for determining data lineage given the components that make up the transformation.

There are at least two approaches one can use to determine provenance of a single entry within a data set produced using a specific transformation. One is to track the provenance of every entry within the transformation itself, and another is to determine the provenance of an entry after the transformation has already been performed (e.g., by another party and/or at some point in the past). The latter scenario is more likely if storage of per-entry provenance meta-data is an issue, or if the transformations are black boxes that cannot be modified or extended before they are applied.

[link]

Fact: For intersection, union, difference, projection, and selection transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set (i.e., the "where" provnenance of an individual entry in the output) can be determined using a linear search through the input data sets for that item.

For an aggregation transformation, a large number of entries in the input data set may influence an individual entry in the output data set. For such transformations, finding the per-entry provenance for an entry in the output data set can be non-trivial.

[link]

Fact: Suppose we know nothing about the internal structure of an aggregation transformation, but we want to determine what entries in the input data set influence a particular entry in the output data set. In this case, even running the entire transformation a second time (knowing the target entry) may provide no additional information about which subset of input data set entries produced that output data set entry. In this worst-case scenario, it would be necessary to run the transformation on all 2ⁿ subcollections of the input data set (for an input data set of size n), and to check for each one of those 2ⁿ outputs whether the particular entry of interest from the output data set was generated.

[link]

Fact: An aggregation transformation is key-preserving if any two entries under the same keys in the input data set always contribute to an entry with the same output key in the output data set. If an aggregation is key-preserving, then the provenance information for a given entry is easy to determine using linear search.

[link]

Fact: An aggregation transformation is context-free if any two entries in the input data set always contribute to the same output entry regardless of what other input entries are present. If an aggregation is context-free, then the provenance information for a given output data set entry can be computed in quadratic time. First, we can partition the input data set using the entries of the output data set (i.e., create one bucket of input data set entries for each output data set entry). Then, we can determine which partition contributed to the output data set entry of interest using linear search.

[link]

Example: If we introduce a function for computing powersets of entries in a data set (e.g., using a Python recipe for such a function), we can define an inefficient Python algorithm that can correctly determine per-entry provenance in the general case (not knowing any additional information about a transformation f).


from itertools import combinations, chain

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def general_prov(f, X, y):
    for Y in reversed(powerset(X)): # From largest to smallest subset.
        if list(f(list(Y))) == [y]:
            return Y

[link]

Example: The function below determines the per-entry provenance for a key-preserving transformation. For this example, we assume that all data set entries are of the form (key, value).


def key_preserving_prov(f, X, y):
    keymap = {}
    
    # Build up the map from input data set
    # keys to output data set keys.
    for (x_key, x_val) in X:
        (y_key, y_val) = f([(x_key, x_val)])[0]
        keymap[x_key] = y_key
    (y_key, y_val) = y

    # Collect all the tuples that contribute
    # to the target result.
    pre_image = set()
    for (x_key, x_val) in X:
        if keymap[x_key] == y_key:
             pre_image.add((x_key, x_val))
    return pre_image

[link]

Example: The function below determines the per-entry provenance for a context-free transformation. For this example, we assume that all data set entries are of the form (key, value).


def context_free_prov(f, X, y):

    # Build the partitions of the input data set.
    partitions = []
    for (x_key, x_val) in X:
        found = False
        for i in range(0,len(partitions)):
            if len(f(partitions[i] + [(x_key, x_val)])) == 1:
                partitions[i].append((x_key, x_val))
                found = True
                break
        # Create a new partition if adding to any other
        # partition increases the size of the output data set.
        if found == False:
            partitions.append([(x_key, x_val)])

    # Find the corresponding partition.
    for partition in partitions:
        if y in f(partition):
             return partition

[link]

Exercise: For each of the following informal descriptions of data transformations, determine whether computing the per-entry provenance of a row in the output data set is easy (i.e., linear time) or difficult (i.e., cannot be done in polynomial time), assuming that the transformation's implementation is a black box (except for the provided description).

Given a data set containing entries of the form (name, age), the transformation produces a list of all entries corresponding to individuals who are of a legal voting age.
Given a data set containing entries of the form (product, department, price, date) describing sales that take place in a department store, the transformation computes the total revenue from each department for each possible date.
Given a data set specifying the geographical locations of a collection of radio receivers in a bounded region, determine an optimal placement of 100 radio towers within that region that minimizes the aggregate of the distances between each tower-receiver pair.

Tracking provenance at a coarse granularity (per data set) is relatively inexpensive in most applications (since the number of data sets is usually orders of magnitude less than the number of individual data set entries), and standards such as PROV provide authors of transformations a way to represent provenance information.

[link]

Example: Suppose Alice and Bob want to assemble a script that retrieves some information from an online resource and stores it in a database. We can use the prov Python package to programmatically assemble a course granularity provenance record that conforms to the PROV standard and describes a particular execution of this script.

We first create the document object and define the namespaces we will use for symbols in the document.


doc = prov.model.ProvDocument()
doc.add_namespace('alg', 'http://datamechanics.io/algorithm/alice_bob/') # The scripts in / format.
doc.add_namespace('dat', 'http://datamechanics.io/data/alice_bob/') # The data sets in / format.
doc.add_namespace('ont', 'http://datamechanics.io/ontology#')
doc.add_namespace('log', 'http://datamechanics.io/log#') # The event log.
doc.add_namespace('bdp', 'https://data.cityofboston.gov/resource/')

We then use an agent to represent the script, en entity to represent a resource, and an activity to repsent this invocation of the script. We also associate this current activity with the script, and we and indicate that his activity used the entity representing the resource.


this_script = doc.agent('alg:example', {prov.model.PROV_TYPE:prov.model.PROV['SoftwareAgent'], 'ont:Extension':'py'})
resource = doc.entity('bdp:wc8w-nujj', 
    {'prov:label':'311, Service Requests', 
    prov.model.PROV_TYPE:'ont:DataResource', 'ont:Extension':'json'}
  )
this_run = doc.activity(
    'log:a'+str(uuid.uuid4()), startTime, endTime, 
    {prov.model.PROV_TYPE:'ont:Retrieval', 'ont:Query':'?type=Animal+Found&$select=type,latitude,longitude,OPEN_DT'}
  )
doc.wasAssociatedWith(this_run, this_script)
doc.used(this_run, resource, startTime)

Finally, we define an entity for the data obtained and indicate to what agent it was attributed, by what actvitity it was generated, and from what entity it was derived.


found = doc.entity('dat:found', {prov.model.PROV_LABEL:'Animals Found', prov.model.PROV_TYPE:'ont:DataSet'})
doc.wasAttributedTo(found, this_script)
doc.wasGeneratedBy(found, this_run, endTime)
doc.wasDerivedFrom(found, resource, this_run, this_run, this_run)

repo.record(doc.serialize()) # Record the provenance document.

We can view our completed provenance record in, for example, JSON form using doc.serialize().

[link] 2.4. Project #1: Data Retrieval, Storage, Provenance, and Transformations

The purpose of this project is to practice using some of the tools available for retrieving and storing data sets within the context of a multi-contributor platform and repository, tracking provenance information for data sets, and performing some transformations using one or both of the presented paradigms. For the purposes of this project description, we assume the submitting team consists of two members, Alice and Bob, with the team identifier alice_bob (in accordance with the requirements specified in Project #0).

1. Follow the GitHub submission instructions by forking the repository at https://github.com/Data-Mechanics/course-2016-fal-proj. Note that we may commit and push a few additional fixes or updates to this repository before the deadline, so check for updates and pull them into your fork on a regular basis. Set up a MongoDB and Python environment as necessary for the project (including the installation of all dependencies). You should be able to run the setup.js script to prepare the repository, and then to start your MongoDB server with authentication enabled and run the alice_bob/example.py example script.
2. As in Project #0, create a folder of the form alice_bob within the top-level directory of the project. All the code constituting your submission, including all of your scripts and algorithm files (i.e., for retrieval of data and for transforming data already within the repository), should be placed within this folder. Do not place data files within this folder, or submit any data sets or data files via GitHub.
  Note that the file name for each script should be reasonably descriptive, as it will act as the identifier for the script. Each script should contain exactly one extension of the dml.Algorithm base class (such that the class name matches exactly the filename). Consult alice_bob/example.py for a working example script and algorithm.
```
class example(dml.Algorithm):
    contributor = 'alice_bob'
    reads = []
    writes = ['alice_bob.lost', 'alice_bob.found']

    @staticmethod
    def execute(trial = False):
        ...

    @staticmethod
    def provenance(doc = prov.model.ProvDocument(), startTime = None, endTime = None):
        ...
              
```
  Beyond this requirement, the scripts should also follow reasonable modularity and encapsulation practices, and should logically perform related operations. A larger number of smaller, simpler, and more reusable scripts that each perform a small task is preferable.
1. Choose at least five publicly available data sets, data streams, services, or documents (note that some of your data sets may have similar schemas and may contain similar information, but all five data sets cannot have very similar schemas or very similar information). Write a short narrative and justification (5-10 sentences) explaining how these data sets can be combined to answer an interesting question or solve a problem. You do not need to solve the actual problem in this project, and it is acceptable to merely combine data sets in a way that is likely to support one or more solutions involving the particular data you choose. Include this narrative in a README.md file within your directory (along with any documentation you may write in that file).
2. Implement algorithms that retrieve these data sets automatically (i.e., by implementing one or more dml.Algorithm subclasses and defining their reads and writes fields, as well as their execute methods). Any authentication credentials your scripts use should be included in the file auth.json (such that the file conforms to the auth.schema.json schema). All your scripts should retrieve the credential information they need from your copy of the file by using the existing dml functionality for doing so.
  Do not submit your auth.json file and do not include hard-coded authentication credentials in your code. We already added it to the .gitignore file to ensure you do not accidentially submit the credentials file. The course staff will use their own authentication credentials when running your code. Your README.md file should list any idiosyncratic details associated with the services and/or credentials needed to run your scripts.
3. Implement algorithms that perform at least three non-trivial transformations that merge, combine, or otherwise process some of the five data sets into three new data sets. These new data sets should be inserted into the repository. You can choose one of the existing algorithms or tools we have considered so far, or something else (assuming you thoroughly document in your README.md file how to obtain and run those tools).
Extend each of the scripts you implemented in Problem #2 to also store the provenance information documenting the activities that take place during the operation of that script (i.e., by defining their provenance methods). Each algorithm should generate a single provenance document when its provenance() method is invoked, and should insert that document into the repository using record().

[link] 3. Systems, Models, and Algorithms

When using mathematics to solve problems involving real-world entities and phenomenon, we can introduce the notions of a system (i.e., an organized collection of interdependent components and their possible states) that has distinct system states (which can capture spatial and temporal information about the system components, their behaviors, and their relationships). A mathematical model captures only some parts of the system and the possible system states, abstracting the other details away, and it does so using some particular symbolic language or representation.

Four common abstract mathematical objects used to represent models of systems are particularly relevant given our application domain of interest:

graphs consisting of a collection of nodes and edges (possibly weighted and directed) between those nodes,
vector spaces consisting of sets of vectors,
systems of equations and inequalities (or, equivalently, sets of constraints or set of logical formulas), and
functions (whether continuous or discrete).

In many cases, these mathematical objects can be used to represent each other (e.g., a graph can be represented as a function or as a system of equations), but each has its own characteristics that make it well-suited for modeling certain kinds of systems. Very often, two or more of these are combined when modeling a system.

Most data sets describing a real-world system are using one or more of the above languages for modeling that system. Identifying which of these is being used can inform our decisions about what techniques we can use to answer questions about that system or to solve problems involving that system. It can also tell us how we can convert to a different language or representation in order to convert, combine, or derive new data sets.

[link] 3.1. Systems, Models, and Metrics

We can provide abstract definitions for systems, models, and metrics. These then allow us to characterize satisfaction, search, and optimization problems in a general and consistent way.

[link]

Definition: A system is a set S of distinct system states (possibly infinite). Often, each possible system state s ∈ S is called a model of the system.

One mathematical object that is commonly used to represent systems is a vector space where individual vectors are the system states (i.e., models) of that system.

[link]

Definition: A constraint set is a set of logical formulas (written using a formal language that may contain terms such as integers, vector, matrices, sets, and so on), usually used to identify a subset of a system (i.e., a subset of the possible systems states or models). A system state or model satisfies a constraint or constraint set if the logical formulas are true for that model.

When using vector spaces, it is often possible to represent constraints using a matrix equation, e.g.:

M ⋅ x

In the above, M might be a matrix from ℝ^{n × m}, x might be a variable vector in ℝ^m, and b might be a vector in ℝⁿ.

[link]

Definition: Given a system S and a set of constraints, a constraint satisfaction problem is the problem of finding a model (or set of models) in S that satisfies that set of constraints.

[link]

Definition: A metric over a system S is a function f: S → ℝ that maps system states to real numbers.

[link]

Definition: Given a system S and metric f, an optimization problem is the problem of finding s ∈ S that maximizes or minimizes f:

argmax_{s ∈ S} f(s)

argmin_{s ∈ S} f(s)

[link]

Example: Computing the argmax or argmin operation in the relational or MapReduce paradigms can be non-trivial, but it is still possible. For example, in the relational paradigm it is possible to compute it by combining two projections, an aggregation, and a selection.


def argmax(X, f):
    Y = [f(x) for x in X] # Projection.
    y_max = aggregate(Y, max)
    XF = [(x, f(x)) for x in X] # Projection.
    xs = [x for (x,y) in XF if y == y_max] # Selection.
    return xs[0]

[link]

Exercise: For each of the following scenarios, determine whether it is an instance of a constraint satisfaction problem or an instance of an optimization problem. If it is an optimization problem, identify the objective function.

Given a data set describing a set of spectators and their heights, determine if it is possible to seat them in a stadium so that no one blocks anyone else's view.
Find the lowest-cost route between two destinations on a transportation network.
Given a social network where any pair of individual members can be "connected", determine if the whole network is connected (i.e., there exists a path from every member to every other member).
Given a social network where any pair of individual members can be "connected", find the smallest group of individuals that must be removed in order to make the network disconnected.

[link] 3.2. Graph and Spatial Algorithms as Constraint Satisfaction and Optimization Problems

We have already seen how we can define some basic algorthms that have natural spatial interpretations (basic clustering and shortest) using data transformations assembled from building blocks drawn from the relational and the MapReduce paradigms. These algorithms employed two different representations: a vector space (clustering) and a graph (shortest paths). In this section, we review how algorithms such as these can be viewed as solutions to constraint satisfaction and/or optimization problems. We will also show how such problems can be handled using general-purpose techniques for solving constraint satisfaction and optimization problems, as well as how the relational and MapReduce paradigms can be used to implement these general-purpose techniques.

[link]

Definition: Suppose we have a graph consisting of a set of nodes N = {n₁, n₂, ...} and a set of edges E of the form (n₁, n₂). The minimum spanning tree of this graph is the smallest subset T ⊂ E such that in the subgraph consisting of the edges in T, every node can be reached from every other node (i.e., by traversing a path along the edges in T).

Notice how the problem of finding a minimum spanning tree can be broken down into a constraint satisfaction problem:

the set of system states is the collection of edge subsets (i.e., all 2^|E| possible subsets);
the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another) and the requirement that the subset T must be a tree (or, equivalent, that it must have at most |N| − 1 edges).

This problem can also be broken down into an optimization problem that minimizes a metric:

the set of system states is the collection of edge subsets (i.e., all 2^|E| possible subsets);
the set of constraints is the collection of reachability requirements (one for every pair nodes stating that those two nodes must be reachable from one another);
the metric is a functino that computes the size of T (i.e., the number of edges in the spanning subtree).

[link]

Example: We demonstrate how one possible algorithm for determing whether a spanning tree exists can be implemented within the MapReduce paradigm. For this example, we assume that our graph is represented as a list of edges. Note that this algorithm relies on the same facts about graphs and trees as Prim's algorithm, Kruskal's algorithm, and Borůvka's algorithm (it is most similar to the latter two).


U = {'a','b','c','d','e','f'}
E = {('a','b'), ('b','c'), ('a', 'c'), ('c','e'), ('a','e'), ('e','f'), ('f','d')}

We can convert the list of edges into an adjacency list representation in which each node is paired with its list of immediate neighbor nodes. This conversation can be accomplished in a variety of ways using the building blocks for data transformations, such as by aggregating the list of edges.


I = map(lambda k,v: [(k, v), (v, k)], E)
A = reduce(lambda k, vs: (k, frozenset(vs)), I)

For purposes of comparison, we provide an equivalent implementation using Python list comprehensions.


A = [(u, frozenset({w for (v,w) in E if u==v} | {w for (w,v) in E if u==v})) for u in U]

The algorithm then works by starting with a collection of forests (each containing a one-node tree) and, for every pair of forests encountered, combining as many trees as possible within those forests. If a spanning tree exists (i.e., the graph is connected), a forest with a single tree will be the result of the overall computation.


SUBTREES = [[a] for a in A]
FORESTS = [[t] for t in SUBTREES]

def pick(S):
    return list(S)[0]

def combinable(U, W):
    U_NODES = {u for (u, NBRS) in U}
    W_NODES = {w for (w, NBRS) in W}

    UW_EDGES = {(u,pick(NBRS & W_NODES)) for (u, NBRS) in U if NBRS & W_NODES != set()}
    WU_EDGES = {(w,pick(NBRS & U_NODES)) for (w, NBRS) in W if NBRS & U_NODES != set()}
    EDGES = UW_EDGES | WU_EDGES

    if EDGES != set():
        print(pick(EDGES))
        return True
    else:
        return False

def combine_trees(U, W):
    BOTH = {u for (u, nbrs) in U} | {w for (w,nbrs) in W}
    U = {(u, frozenset({w for w in nbrs if w not in BOTH})) for (u, nbrs) in U}
    W = {(u, frozenset({w for w in nbrs if w not in BOTH})) for (u, nbrs) in W}
    return U | W

def combine_forests(F1, F2):
    loop = True
    while loop: # Loop as long as merges occur.
        loop = False
        FOREST = []
        for T in F1:
            found = False
            for i in range(0,len(F2)):
                if combinable(T, F2[i]):
                    F2[i] = combine_trees(T, F2[i])
                    found = True
                    loop = True
            if not found:
                FOREST.append(T)
        F1 = FOREST
    return F1 + F2

As implemented, the combine_forests function can be used as an aggregate and applied directly to a list of forests.


from functools import reduce
print(reduce(combine_forests, FORESTS))

[link] 3.3. Linear systems, satisfiability modulo theories, and linear programming

We often say a system or subset of a system is linear if it can be modeled using a Euclidean vector space and can be defined using a collection of linear constraints of the form:

a₁ ⋅ x₁ + ... + a_n ⋅ x_n ♦ c

Where (x₁, ..., x_n) ∈ ℝⁿ is a vector, the a_i are coefficients in ℝ, c in ℝ is a constant, and ♦ is a relation operator such as =, ≤, <, ≥, or >. Note that the constraints contain no higher-order x_i terms such as x_i². There are three common techniques for solving constraint satisfaction and optimization problems involving linear systems:

finding an exact or approximate solution (e.g., using some combination of matrix inversion, QR factorization, and ordinary least squares),
checking if a solution exists (and finding a witness system state or model proving this), and
finding a solution that minimizes or maximizes some (linear) metric over the vector space.

The third item (solving an optimization problem) is arguably a generalization of the first approach if we remove the requirement that the metric must be linear (since the first approach minimizes the Euclidean distance metric, which is quadratic and thus non-linear).

[link]

Example: Suppose we want to use an SMT solver to solve a flow maximization problem for a flow graph (the variable on each represents the flow amount, and the parenthesized number is the maximum possible flow on that edge):

!table([ [null, null, ('dd:x4 (6)`dr:x5 (3)``γ'), null, null], [('r:x1 (?)``α'), ('ur:x2 (7)`dr:x3 (8)``β'), null, ('r:x7 (5)``ε'), ('ζ')], [null, null, ('ur:x6 (4)``δ'), null, null], ])

Here, the system is the set of possible states of the graph in terms of the amount of material flowing along each edge (we model the flow that goes along each edge in the graph as a variable ranging over ℝ, and the flow in the overall graph containing seven edges as a vector in ℝ⁷).


import z3

(x1,x2,x3,x4,x5,x6,x7) = [z3.Real('x'+str(i)) for i in range(1,8)]

S = z3.Solver()

# Only allow non-negative flows.
for x in (x1,x2,x3,x4,x5,x6,x7):
    S.add(x >= 0)
    
# Edge capacity constraints.
S.add(x2 <= 7, x3 <= 8, x4 <= 6)
S.add(x5 <= 3, x6 <= 4, x7 <= 5)

# Constraints derived from graph topology.
S.add(x1 <= x2+x3, x2 <= x4+x5, x3+x4 <= x6, x5+x6 <= x7)

S.add(x1 > 0) # We want a positive flow.

print(S.check())
print(S.model())

If we are allowed to vary only the variable for incoming edge, we can use the SMT solver together with binary search to find the maximum flow through the graph.


flow = 0
for i in range(5, -1, -1):
    S.push()
    S.add(x1 >= (2**i + flow))
    if str(S.check()) != 'unsat':
        flow += 2**i
    S.pop()

S.add(x1 >= flow)
print(S.model())

We could alternatively build up the constraint set from a starting matrix.


def dot(xs, ys):
    return sum([x*y for (x,y) in zip(xs, ys)])

x = [x1,x2,x3,x4,x5,x6,x7]

M = [
  [ 0,-1, 0, 0, 0, 0, 0],
  [ 0, 0,-1, 0, 0, 0, 0],
  [ 0, 0, 0,-1, 0, 0, 0],
  [ 0, 0, 0, 0,-1, 0, 0],
  [ 0, 0, 0, 0, 0,-1, 0],
  [ 0, 0, 0, 0, 0, 0,-1],

  [ 1, 0, 0, 0, 0, 0, 0],
  [ 0, 1, 0, 0, 0, 0, 0],
  [ 0, 0, 1, 0, 0, 0, 0],
  [ 0, 0, 0, 1, 0, 0, 0],
  [ 0, 0, 0, 0, 1, 0, 0],
  [ 0, 0, 0, 0, 0, 1, 0],
  [ 0, 0, 0, 0, 0, 0, 1],

  [-1, 1, 1, 0, 0, 0, 0],
  [ 0,-1, 0, 1, 1, 0, 0],
  [ 0, 0,-1,-1, 0, 1, 0],
  [ 0, 0, 0, 0,-1,-1, 1]
  ]

b = [-7,-8,-6,-3,-4,-5,
      0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0]

S = z3.Solver()

for i in range(len(M)):
    S.add(b[i] <= dot(M[i], x))

print(S.check())
print(S.model())

[link]

Example: Suppose we want to use linear programming to solve the problem described in the example above.


from scipy.optimize import linprog

M = [
  [ 1,-1,-1, 0, 0, 0, 0],
  [ 0, 1, 0,-1,-1, 0, 0],
  [ 0, 0, 1, 1, 0,-1, 0],
  [ 0, 0, 0, 0, 1, 1,-1],
  ]
b = [0, 0, 0, 0]
c = [-1, 0, 0, 0, 0, 0, 0]
bounds = [(0, None), (0, 7), (0, 8), (0, 6), (0, 3), (0, 4), (0, 5)]

result = linprog(c, A_ub = M, b_ub = b, bounds=bounds, options={"disp": True})
print(result)

Note that this technique produces a potentially different solution from the one in the example above (though both are optimal).

[link] 3.4. General-purpose Optimization Techniques

There exist general-purpose techniques for solving an optimization problem given a set of constraints and an objective function. Usually, these rely on certain general assumptions about the properties of the constraints and the objective function.

[link] 3.5. Project #2: Modeling, Optimization, and Statistical Analysis

The purpose of this project is to practice using the modeling, optimization, and statistical analysis tools we have covered to answer questions or solve problems using the data sets obtained by all the teams in Project #1. For the purposes of this project description, we assume a team consisting of two members, Alice and Bob, with the team identifier alice_bob in accordance with Project #0. Teams consisting of up to four people are permitted for this project.

For this project, teams will work with the GitHub repository created for Project #1, consisting of all the teams' submissions (merged via merge requests at the time of submission) at https://github.com/Data-Mechanics/course-2016-fal-proj.
- If your Project #1 team is not going to change, you should be able to continue working on your project on your own fork. Pull all the updates from the project repository to sync your fork with it (note that this will now include all the other teams' projects and scripts from Project #1).
- If you are creating a new or larger team, you should either create a new fork of the project repository, or rename the folder in your working fork to accurately represent the members of your new team. It is up to you how you combine your individual scripts and whether you wish to include or discard any redundant scripts or configuration files, but your project structure should be consistent with that of a single team and single team project.
As before, the course staff may commit and push additional fixes or updates to the repository before the project deadline, so check for updates and pull them into your fork on a regular basis. If you are utilizing another team's data or scripts, or are allowing another team to utilize your data or scripts, do not duplicate them in your own file tree. Instead, reference them in their original location (using the other team's collection prefix identifier and/or folder path). If another team would like to use an updated version of your data sets and/or scripts, perform a pull request so that the course staff can merge your latest code in the project repository. At that point, the other team can pull that update from the project repository.
Choose a problem or family of problems that you wish to solve using some collection of relevant data sets. These data sets could already be available in the repository (thanks to a Project #1 submission by your team or another team), or they may be additional data sets you plan to obtain with new retrieval scripts as you did in Project #1. The problem itself could involve solving a graph problem, performing a statistical analysis or analyses, solving a constraint satisfaction or optimization problem, developing a scoring mechanism, or achieving some other quantifiable goal for which at least some of the techniques we have learned so far might be useful. It could be the original problem you intended to solve in Project #1, or something new.
1. Write a short narrative and justification (at least 7-10 sentences) explaining how the data sets and techniques you hope to employ can address the problem you chose. Depending on what techniques you are using and what problem you are trying to solve, you may need to provide a justification or evaluation methodology for the techniques you are using (i.e., why you believe the chosen techniques will solve the problem you are trying to address). Include this narrative in the README.md file within your directory (you may only need to update your existing file).
2. Implement at least two additional non-trivial constraint satisfaction, optimization, statistical analysis, inference, or other algorithms that use the data sets and address the problem or problems described in part (a). You can choose one of the existing algorithms or tools we have considered so far, or something else (assuming you thoroughly document in your README.md file how to obtain and run those tools). As in Project #1, your algorithms should be implemented within scripts that extend the dml.Algorithm base class, should follow reasonable modularity and encapsulation practices, and should perform logically related operations.
  If you anticipate that an algorithm or technique you employ could be applied to much larger data sets, you should try to implement it as a transformation in the relational model or the MapReduce paradigm. In most cases, the results will consist of new data sets; these should be inserted into the repository along with all the others in the usual way (and should be considered derived data sets).
1. Ensure that whatever scripts or algorithms you implemented in Problem #2 also store the provenance information documenting the activities that take place and the data they generate during their operation. Each script should generate a single document once it finishes running, and should insert that document into the repository using record().
2. Ensure that whatever scripts or algorithms you implemented in Problem #2, add the capability to run the algorithm in trial mode, which should occur if the trial parameter of the execute() method is set to True. In trial mode, the algorithm should complete its execution very quickly (in at most a few seconds) by operating on a very small portion of the input data set(s). However, it should still run through most (or, ideally, all) the code paths in the algorithm definition when it does so. This will make it possible to easily test the algorithm without running it on the entire data set.

[link] 4. Statistical Analysis

Given a data set some of the optimization techniques we have studied can be used to find a best-fit model for that data. For example, given a data set of points in ℝ² representing some collection of observations measured along two dimensions, it is possible to find an ordinary least squares "best fit" linear function (using algebra, optimization, gradient descent, and so on). However, just because we have found a "best fit" model for a data set from some given space of functions does not mean that the model is actually a good one for that particular data set. Perhaps the relationship between the two dimensions is not linear, or even completely non-existent. How can we determine if this is the case?

In this section, we will present some statistical methods that can help us decide whether a linear model is actually appropriate for a given two-dimensional data set, and to quantify how appropriate it might be. We will review the mathematical foundations underlying the methods, how they can be applied, and how they can be realized as data transformations in the paradigms we have studied in earlier sections.

[link] 4.1. Review of Facts about Projections from Linear Algebra

We review a few useful definitions and facts about vectors in ℝⁿ from linear algebra. These provide a mathematical foundation for defining the statistical constructs discussed in this section.

[link]

Definition: A vector is an ordered, finite tuple of real numbers (x₁, ..., x_n) ∈ ℝⁿ. Note that we will use individual variables to denote an entire vector (e.g., x ∈ ℝⁿ), and that same variable with a subscript index (e.g., x_i) to denote individual components of that vector; thus, we have that x = (x₁, ..., x_n).

[link]

Fact: For a vector x ∈ ℝⁿ and a scalar s ∈ ℝ, we can scale x by computing s ⋅ x = (s ⋅ x₁, ..., s ⋅ x_n). Note that for a scalar s ∈ ℝ, we often use the shorthand notation x/s to mean (1/s) ⋅ x.

[link]

Definition: For a vector x ∈ ℝⁿ the Euclidean length or norm of x is ||x|| = √(x₁² + ... + x_n²). We can also obtain a normalized unit vector of length one that points in the same direction as x by computing x/||x||.

[link]

Definition: Given two vectors x ∈ ℝⁿ and y ∈ ℝⁿ, the dot product of the two vectors is x ⋅ y = x₁ ⋅ y₁ + ... + x_n ⋅ y_n.

[link]

Fact: We can compute the projection of a vector x ∈ ℝⁿ onto another vector y ∈ ℝⁿ by computing (x ⋅ (y/||y||)) ⋅ (y/||y||).

[link]

Example: Given two vectors (5,7) ∈ ℝ² and (3,0) ∈ ℝ², the normalized version of (3,0) is (1/3) ⋅ (3,0) = (1,0). The projection of (5,7) onto (3,0) is then:

((5,7) ⋅ ((1/3) ⋅ (3,0))) ⋅ ((1/3) ⋅ (3,0)

((5,7) ⋅ (1,0)) ⋅ (1,0)

(5 ⋅ 1 + 7 ⋅ 0) ⋅ (1,0)

5 ⋅ (1,0)

(5,0).

[link]

Theorem (Cauchy-Schwarz inequality): Given two vectors x ∈ ℝⁿ and y ∈ ℝⁿ, the following inequality holds:

|x ⋅ y| / (||x|| ⋅ ||y||)

≤

Another way of writing the above is:

| (x/||x||) ⋅ (y/||y||) |

≤

The intuitive interpretation is as follows: if we try to find the projection of a unit vector x/||x|| onto another unit vector y/||y||, the resulting vector has length at most 1.

[link] 4.2. Defining Mean and Standard Deviation using Concepts in Linear Algebra

In this section, we will assume that we are working with a data set of the form [(x₁, y₁), ..., (x_n, y_n)] that contains n (not necessarily distinct) tuples and where for 1 ≤ i ≤ n we have (x_i, y_i) ∈ ℝ². For the purposes of analysis, we will also often use two projections of this data set as vectors in their own right: (x₁, ...,x_n) ∈ ℝⁿ and (y₁, ...,y_n) ∈ ℝⁿ. We will also use the shorthand notation x = (x₁, ...,x_n) and y = (y₁, ...,y_n).

[link]

Definition (arithmetic mean): We define the arithmetic mean μ(x) of a one-dimensional or one-column data set x ∈ ℝⁿ where x = (x₁, ..., x_n) as:

μ(x)

(1/n) ⋅ (x₁ + ... + x_n)

[link]

Fact: For x ∈ ℝⁿ, μ(x) is the quantity that minimizes the following objective function:

μ(x)

argmin_{a ∈ ℝ} (Σ_i=1ⁿ (x_i - a)²)

To prove the above, assume that the true mean is some specific value m. Next, we choose some arbitrary a in the formula and proceed algebraically from there by multiplying the terms under the summation and then using the distributive property to factor out the terms without an index subscript:

Σ_i=1ⁿ (x_i - a)²

Σ_i=1ⁿ ((x_i − m) + (m − a))²

Σ_i=1ⁿ ( (x_i − m)² + 2 ⋅ ((x_i − m) ⋅ (m − a)) + (m − a)² )

Σ_i=1ⁿ (x_i − m)² + 2 ⋅ (m − a) ⋅ (Σ_i=1ⁿ (x_i − m)) + Σ_i=1ⁿ (m − a)²

Σ_i=1ⁿ (x_i − m)² + 2 ⋅ (m − a) ⋅ (Σ_i=1ⁿ x_i − Σ_i=1ⁿ m)) + Σ_i=1ⁿ (m − a)²

In the above, notice that Σ_i=1ⁿ x_i = Σ_i=1ⁿ m because m is the true mean, so the second term is just 2 ⋅ (m − a) ⋅ 0 = 0. Thus, we can eliminate it. Furthermore, we have Σ_i=1ⁿ (m − a)² = n ⋅ (m − a)². Applying all these substitutions, we have:

Σ_i=1ⁿ (x_i - a)²

(Σ_i=1ⁿ (x_i − m)² ) + n ⋅ (m − a)²

Notice that setting a = m minimizes the above because the first term does not contain a at all, and the second term becomes 0 exactly when a = m. Thus, the a that minimizes the objective Σ_i=1ⁿ (x_i - a)² is exactly the mean.

[link]

Definition: For a given vector x ∈ ℝⁿ, suppose we use μ(x) to create a vector with n entries (μ(x), μ(x), ..., μ(x)) in which every entry is μ(x). We can call this vector x_μ.

[link]

Fact: The vector x_μ minimizes a particular objective function:

x_μ

argmin_{v is on the diagonal in ℝⁿ} ||x - v||

Another way of thinking about this is that x_μ as defined above is the projection of the vector x onto the diagonal line in ℝⁿ. This means that x_μ is also the point on the diagonal in ℝⁿ that is closest to x.

Notice that the geometric, vector-based interpretation of the mean preserves an important property of the mean of a data set x: rearranging the elements of x should not change the mean. In fact, this is guaranteed because the dimensions are interchangeable: for any permuted version of x (let us call it x'), the distance of x' from the diagonal will always be the same as the distance of x from the diagonal because we can simply relabel the dimensions to bring x' back to its original form x.

[link]

Example: Suppose that our data consists of only two points in one dimension: x₁ = 10 and x₂ = 2. We can compute:

μ(x)

(1/2) ⋅ (10 + 2)

x_μ

(6, 6)

Notice that (6, 6) is on the diagonal in ℝ². Notice also that the vector x − x_μ = (10, 2) - (6, 6) = (4, -4) has a slope of − 1 (it is orthogonal to the diagonal). This is consistent with x_μ being a projection of x onto the diagonal.

[link]

Definition (standard deviation): We define the standard deviation σ(x) of a one-dimensional or one-column data set x ∈ ℝⁿ where x = (x₁, ..., x_n) as:

σ(x)

√( (1/n) ⋅ ((x₁ − μ(x))² + ... + (x_n − μ(x))²) )

[link]

Fact: For x ∈ ℝⁿ, σ(x) can be rewritten in terms of the norm (i.e., length) ||x − μ(x)|| of the vector x − μ(x):

σ(x)

(1/√(n)) ⋅ √( ((x₁ − μ(x))² + ... + (x_n − μ(x))²) )

(1/√(n)) ⋅ ||x − x_μ||

In other words, we can view the standard deviation of a data vector x as a normalized distance between x and the diagonal in ℝⁿ.

We might ask why the extra scalar term 1/√(n) is necessary in the above. After all, the term ||x − x_μ|| already measures the distance between the vector x and the diagonal. The reason is that when the number of components in the vector x changes, the number of dimensions in the space from which the vector is drawn also changes. But the number of dimensions influences the lengths of vectors (if the individual vector components are about the same): the more dimensions there are, the longer a vector can become because the distances from all dimensions contribute to the length of a vector.

[link]

Example: Consider the lengths of the following vectors (each drawn from a vector space having its own distinct number of dimensions):

||(1, 1)||

√(2)

||(1, 1, 1)||

√(3)

||(1, 1, 1, 1)||

√(4)

||(1, 1, 1, 1, 1)||

√(5)

⋮

||(1, 1, 1, 1, 1, ..., 1)||

√(n)

In general, we can see that the following holds for a vector of length n in which components are all the same value s:

||(s, s, s, ..., s)||

√(n ⋅ s²)

s ⋅ √(n)

The above example shows that if we want to compare the standard deviations of two distinct data sets that are of different sizes, we need to account for the fact that the vectors are drawn from different vector spaces (in terms of the number of dimensions). The scalar term 1/√(n) in the standard deviation formula accounts for this difference.

[link] 4.3. Covariance and Correlation

Given a data set that has two-dimensional entries of the form (x_i, y_i), it is possible to compare the two dimensions by comparing their standard deviation vectors.

[link]

Definition (covariance): For x ∈ ℝⁿ and y ∈ ℝⁿ, we define the covariance between x and y as:

cov(x, y)

(1/n) ⋅ ((x₁ - μ(x)) ⋅ (y₁ - μ(y)) + ... + (x_n - μ(x)) ⋅ (y_n - μ(y)))

(1/n) ⋅ (Σ_i=1ⁿ (x_i - μ(x)) ⋅ (y_i - μ(y)))

(1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)

((1/√(n)) ⋅ (x − x_μ)) ⋅ ((1/√(n)) ⋅ (y − y_μ))

Notice that the last line in the above indicates that we can view the covariance as the dot product of the vectors (x − x_μ) and (y − y_μ) (each of which represent how x and y deviate from their respective means).

[link]

Definition (correlation coefficient): For x ∈ ℝⁿ and y ∈ ℝⁿ, we define the correlation coefficient (also known as the Pearson product-moment correlation coefficient) between x and y as:

ρ(x, y)

cov(x, y) / (σ(x) ⋅ σ(y))

[link]

Fact: For x ∈ ℝⁿ and y ∈ ℝⁿ, it must be that -1 ≤ ρ(x, y) ≤ 1. We can see this by noticing that the formula for ρ(x, y) appears in the Cauchy-Schwarz inequality:

ρ(x, y)

cov(x, y) / (σ(x) ⋅ σ(y))

((1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)) / ((1/√(n)) ⋅ ||x − x_μ|| ⋅ (1/√(n)) ⋅ ||y − y_μ||)

((1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)) / ((1/n) ⋅ ||x − x_μ|| ⋅ ||y − y_μ||)

((x − x_μ) ⋅ (y − y_μ)) / (||x − x_μ|| ⋅ ||y − y_μ||)

Notice that the only difference between the above and the Cauchy-Schwarz inequality is that in the above, the numerator is not guaranteed to be positive. Thus, we have:

−1

≤

cov(x, y) / (σ(x) ⋅ σ(y))

≤

In addition to being bounded by −1 and 1, the correlation coefficient has some other useful properties. If x and y have a positive linear correlation, then ρ(x, y) is positive. If they have a negative linear correlation, it is negative. Given two pairs of data sets x, y and x', y', if |ρ(x, y)| > |ρ(x', y')|, then x and y have a stronger correlation with each other than do x' and y'.

While the correlation coefficient is useful for comparing correlations, the quantity itself should not be used to infer anything about the extent of the correlation. To do that with some rigor, it is necessary to introduce a few additional concepts.

[link]

Example: Suppose that our data consists of only a pair of two-dimensional tuples: x = (2, 6) and y = (20, 60). In other words, the data set is [(2,20), (6,60)]. Notice that the slopes of the orthogonal projections of x and of y onto the diagonal are the same, and if we normalize these projections, they become the same vector. That means (x − x_μ) / ||x − x_μ|| = (y − y_μ) / ||y − y_μ||. But the dot product of a vector with itself is the square of its length: v ⋅ v = ||v||². Thus, we have:

cov(x, y) / (σ(x) ⋅ σ(y))

(x − x_μ) ⋅ (y − y_μ) / (||x − x_μ|| ⋅ ||y − y_μ||)

Thus, because the two dimensions have a positive linear correlation in this example, the correlation coefficient is 1.

[link] 4.4. Observations, Hypothesis Testing, and Significance

The correlation coefficient gives us a way to quantify that there may exist a correlation between two dimensions in a data set. However, this correlation may be an artifact of the way we happened to obtain the data. For example, we may have obtained a biased sample. We can account for this by modeling the likelihood of obtaining a biased sample given the assumption that no correlation exists. If it is very unlikely that we would obtain a sample with the observed correlation if no correlation actually exists, then we might conclude that it is likely that a correlation does exist.

To take this approach, we need to compare the correlation we observe against something (i.e., to decide how likely or unlikely it is if there is no correlation). This means we need to define mathematically what it means for there to be no correlation.

[link]

Definition (distribution): A distribution over a set S is a function f: S → ℝ. If S represents a set of possible observations of a system (e.g., a collection of system states), then a distribution f might represent the probability that the system is in that state, or the frequency with which that system state has been or can be observed.

[link]

Definition (p-value): Let a subset of the real number line S ⊂ ℝ represent all possible system states that can be observed, and let f be a distribution over S that represents the probability of observing each of the possible states in S. Given some state s ∈ S, the probability of observing s is then f(s). We define the p-value at the observed state s to be the probability of observing any state that is equal to or more extreme than the state s.

One way to define which states are more extreme is to consider all the states from s to the nearest endpoint. Suppose S is an interval [a, b]. If s > (a − b) / 2, then the p-value would be computed by taking the definite integral of f from s to b. Otherwise, it would be computed by taking the definite integral of f from s to a.

In order to use the p-value concept to reason about correlation coefficients, it is necessary to model how the correlation coefficient should behave if no correlation exists. One way to do this using the data alone is to consider every permutation of the data.

[link]

Example: Assume we have a two-dimensional data set [(x₁, y₁), ..., (x_n, y_n)], also represented by x ∈ ℝⁿ and y ∈ ℝⁿ. One way to obtain a range of possible correlation coefficients is to compute the correlation coefficient for all n! possible permutations of the second dimension. Intuitively, if the data has no correlation, then performing this permutation should not change the correlation coefficient much. On the other hand, if the data does have a correlation, we should get a distribution of correlation coefficients with respect to which the true correlation coefficient is "extreme".


from random import shuffle
from math import sqrt

data = [(18, 28), (24, 18), (27, 31), (14, 15), (46, 23),
        (36, 19), (27, 10), (34, 25), (19, 15), (13, 13),
        (4, 2), (17, 20), (28, 12), (36, 11), (26, 14),
        (19, 19), (24, 13), (25, 6), (20, 8), (17, 22),
        (18, 8), (25, 12), (28, 27), (31, 28), (35, 22),
        (17, 8), (19, 19), (23, 23), (22, 11)]
x = [xi for (xi, yi) in data]
y = [yi for (xi, yi) in data]

def permute(x):
    shuffled = [xi for xi in x]
    shuffle(shuffled)
    return shuffled

def avg(x): # Average
    return sum(x)/len(x)

def stddev(x): # Standard deviation.
    m = avg(x)
    return sqrt(sum([(xi-m)**2 for xi in x])/len(x))

def cov(x, y): # Covariance.
    return sum([(xi-avg(x))*(yi-avg(y)) for (xi,yi) in zip(x,y)])/len(x)

def corr(x, y): # Correlation coefficient.
    if stddev(x)*stddev(y) != 0:
        return cov(x, y)/(stddev(x)*stddev(y))

def p(x, y):
    c0 = corr(x, y)
    corrs = []
    for k in range(0, 2000):
        y_permuted = permute(y)
        corrs.append(corr(x, y_permuted))
    return len([c for c in corrs if abs(c) > c0])/len(corrs)

In the above, we compute the correlation coefficient for a large number of permuted data sets. We then compute the fraction of these data sets that have a correlation coefficient with a higher absolute value than the correlation coefficient of the original data. We can call this fraction the p-value. Ideally, we would try all possible permutations; since this is infeasible, we can use this technique to get an approximation.

Alternatively, we can use a library to compute the same information. The scipy.stats.pearsonr() function will return the correlation coefficient and the p-value.


import scipy.stats
print(scipy.stats.pearsonr(x, y))

Another approach is to mathematically derive a distribution of correlation coefficients based on some assumptions about how the observation vectors x and y are distributed if they are independent of one another (i.e., not correlated) and each follow some distribution (e.g., the normal distribution). Then, computing a definite integral of this distribution from some specific observation to the nearest boundary would yield the p-value for that observation. Taking this approach usually requires parameterizing the distribution using the size of the data set (i.e., the number of samples) because this can influence how the distribution of correlation coefficients should be computed.

[link]

Example: Suppose that you have a data set of n observations in which each tuple is of the form (w_i, x_i, y_i, z_i), where w = (w₁, ..., w_n), x = (x₁, ..., x_n), y = (y₁, ..., y_n) and z = (z₁, ..., z_n). You compute some covariance and correlation coefficient quantities and find the following relationships:

cov(x, y)

cov(w, z)

ρ(y, z)

ρ(w, x)

Can you conclude that the correlation between x and y is stronger than the correlation between w and z?
Can you conclude that the correlation between y and z is stronger than the correlation between w and x?
Suppose that we change the last inequality to ρ(y, z) < 0. Does this change your answers to any of the above?

[link]

Example: Suppose you have a data set D consisting of tuples of the form (x_i, y_i) and you want to define the necessary transformations using the MapReduce paradigm to compute the covariance for this data set. Provide the definitions of the map and reduce operations necessary to accomplish this.

[link]

Example: Suppose you have a data set D consisting of tuples of the form (x_i, y_i) and you want to compute the correlation coefficient for this data set with a transformation sequence that uses building blocks from the relational paradigm. The flow diagram for this sequence of transformations is provided below.

!table([ [['dr:prod`r:proj``(xi,yi)'], ['r:agg plus``(xi,yi,1)'], ['d:proj``(sx,sy,n)']], [['d:agg plus``(((xi-μ(x))^2)/n, ((y-μ(y))^2)/n)'], ['rd:proj`l:proj``(xi,yi,μ(x), μ(y), n)'], ['l:prod``(μ(x) = sx/n, μ(y) = sy/n, n)']], [['d:proj``(σ(x)^2, σ(y)^2)'], ['d:prod``(cov(x, y)'], ['l:agg plus``(1/n)*(x-μ(x))*(y-μ(y))']], [['r:prod``(σ(x), σ(y))'], ['r:proj``(cov(x,y), σ(x), σ(y))'], ['(ρ(x, y))']] ])

[link] 4.5. Sampling and Inference

Often, data sets are partial observations of the overall system. Despite this, we may still be able to infer things about the system's properties from those partial observations. In this subsection we go over just a few of the many methods available for making inferences based on data obtained via sampling.

[link]

Definition (sampling): Given some data set S (which may not be known in its entirely and may only exist in theory, e.g., the data set of all humans, or the data set of all Earth-size planets in the galaxy), we call any subset T ⊂ S a sample of S.

Typically, the data set being sampled will contain records or tuples, each corresponding to an item or an individual. Each tuple provides some information about the properties or measurements of that item or individual along certain dimensions (e.g., size, location, and so on).

[link]

Definition (probability sampling): A probability sample of a data set S is any sample T ⊂ S that is collected under the condition that for every tuple, the probabiility of selecting that tuple is known in advance.

[link]

Fact: Assume a data set S has information that can be used to derive two relevant binary characteristics describing the individuals or items in that data set (e.g., whether an individual is male or female, whether an individual resides in a particular neighborhood, and so on). We can define two (possibly overlapping) subsets A ⊂ S and B ⊂ S corresponding to those subsets that have each of those characteristics.

If the two characteristics are independent we can conclude that:

|A ∩ B|/|B|

|A ∩ S|/|S|

|A|/|S|

In other words, if we treat the tuples from B as a sample of S and we find the fraction of those tuples that are also in A, we can conclude that the same would have occurred if we had sampled the entirety of S.

[link]

Example: Assume we know that within a certain geographical region, one out of every 10 people is a member of an online social networking mobile application that collects the geographical locations of its users. Suppose we take a sample of how many unique users were present in a particular neighborhood within that region over the course of a week, and we get a total of 1200 users. Assuming that there is no correlation between that particular neighborhood and the popularity of that mobile application (i.e., residency in that neighborhood and membership in the social network are independent variables), we can then infer the total population in that neighborhood by solving the following equation:

1/10

1200/n

The above yields n = 12,000, so we can conclude that there are about 12,000 residents in that particular neighborhood based on our observations.

In the above example, we could have also worked in the other direction if we only knew the overall population in the region, the population of the neighborhood, and the popularity of the application within the population in that neighborhood. This would allow us to derive the total number of users of the application in the region.

[link]

Fact (capture-recapture): Suppose that we can use probability sampling of a data set of size N, we know that the probability of selecting any individual tuple is equivalent to 1/N, and we do not know the size N of the data set. In addition, suppose we are allowed to "mark" any tuple that appears in a sample we take so that we can identify it (with complete certainty) as a previously marked tuple if we see it in subsequent samples.

Under these conditions, we can estimate the size of the entire data set (an unknown quantity) in the following way. First, we take a sample of size n of the data set, and we mark every tuple in that sample. Next, we take another (indepedent) sample of size K and count the number of marked tuples in that sample (suppose it is k ≤ K). We can then solve the following equation for N:

k/K

n/N

(n ⋅ K) / k

[link]

Example: Suppose that we want to know how many distinct people regularly visit a particular grocery store, and we can identify visitors in some way (e.g., by recording their unique identifiers). We can fix a particular time period and collect the identifiers of all the people who visit that store (suppose we collect n of them). We can then fix another time period, again record the identifiers of all K visitors during that time period, and then determine what fraction k/K of the identifiers from the first collection appears in the second collection. We could then estimate that the total number of individuals who visit the store regularly is around (n ⋅ K) / k.

[link]

Example: You are doing a probability sample of a population such that the probability of selecting every individual is the same. You take a sample of 1000 individuals and find that 200 of them are members of a particular social network. Answer the following questions.

Assuming that membership in the social network is completely independent of age, and that there are 20,000 individuals who are 18-24 years old in the overall population, provide one possible estimate for the number of individuals in the 18-24 age range who are members of the social network.
Suppose you determine that the social network has 10,000 unique members. What is the size of the population from which your sample was taken?

[link]

Fact (Algorithm R and reservoir sampling): Using Algorithm R, it is possible to take a probability sample of size k from an incoming sequence of data set entries (e.g., tuples) without knowing the length of the sequence in advance. Furthermore, it is guaranteed that every item in the sequence had an equal probability of being added to the sample.

[link]

Example: Assume we have enough storage in our database to store n + k tuples, and it currently holds a data set X that has n tuples. However, we expect that tomorrow, we will obtain a new data set Y that also has n tuples, so we cannot store it unless we delete X completely. We need to take a probability sample of size k from the union X ∪ Y such that any item has an equal likelihood of being added to the sample. Is this possible, and if so, how would you do it?

[link]

Example: You were using Algorithm R to perform reservoir sampling of packets as they were arriving, with a sample size of 100. You were also recording the number of packets there were corrupted, but you did not record the total number of packets that arrived.

Later on, someone asks you to compute the total number of packets that had arrived during the sampling period. You look at your sample and you find that out of 100 packets, 7 were corrupted. You also recorded that a total of 1400 packets had been corrupted during the sample collection period. How many packets arrived during the sample collection period?

[link]

Example: Suppose there are exactly n entries in a data set of all neighborhoods and that these entries can be divided into two dimensions: x = (x₁, ..., x_n) and y = (y₁, ..., y_n). Furthermore, for all 1 ≤ i ≤ n, we have that x_i ∈ {0,1} and y_i ∈ {0,1}, where each observation of the form (x_i, y_i) specifies whether the ith neighborhood has a hospital (the x_i component) and whether it has a school (the y_i component).

We compute the correlation coefficient and determine that ρ(x, y) = 0. If there are 12 neighborhoods with a school, there are 6 neighborhoods that have both a school and a hospital, and there are 8 hospitals in total, what can these facts reasonably imply about the total number of neighborhoods n?

[link] 5. Visualizations and Web Services

[link] 5.1. Visualizations

[link] 5.2. Web Services

In this section, we present examples of how a few existing tools can be used to implement and test a basic web service. More generally, there are innumerable frameworks for building server-side and client-side application components.

[link]

Example: The following presents one possible web service implementation using the Flask framework.


import jsonschema
from flask import Flask, jsonify, abort, make_response, request
from flask.ext.httpauth import HTTPBasicAuth

app = Flask(__name__)
auth = HTTPBasicAuth()

users = [
  { 'id': 1, 'username': u'alice' },
  { 'id': 2, 'username': u'bob' }
]

schema = {
  "type": "object", 
  "properties": {"username" : {"type": "string"}},
  "required": ["username"]
}

@app.route('/client', methods=['OPTIONS'])
def show_api():
    return jsonify(schema)

@app.route('/client', methods=['GET'])
@auth.login_required
def show_client():
    return open('client.html','r').read()

@app.route('/app/api/v0.1/users', methods=['GET'])
def get_users(): # Server-side reusable name for function.
    print("I'm responding.")
    return jsonify({'users': users})

@app.route('/app/api/v0.1/users/', methods=['GET'])
def get_user(user_id):
    user = [user for user in users if user['id'] == user_id]
    if len(user) == 0:
        abort(404)
    return jsonify({'user': user[0]})

@app.errorhandler(404)
def not_found(error):
    return make_response(jsonify({'error': 'Not found foo.'}), 404)

@app.route('/app/api/v0.1/users', methods=['POST'])
def create_user():
    print(request.json)
    if not request.json:
        print('Request not valid JSON.')
        abort(400)

    try:
        jsonschema.validate(request.json, schema)
        user = { 'id': users[-1]['id'] + 1, 'username': request.json['username'] }
        users.append(user)
        print(users)
        return jsonify({'user': user}), 201
    except:
        print('Request does not follow schema.')
        abort(400)

@auth.get_password
def foo(username):
    if username == 'alice':
        return 'ecila'
    return None

@auth.error_handler
def unauthorized():
    return make_response(jsonify({'error': 'Unauthorized access.'}), 401)

if __name__ == '__main__':
    app.run(debug=True)

The HTML page below can be used to test the POST request handling capabilities of the application.


<script>
  function postSomething() {
    var http = new XMLHttpRequest();
    var url = "http://localhost:5000/app/api/v0.1/users";
    var json = JSON.stringify({'name':'carl'});
    http.open("POST", url, true);
    http.setRequestHeader("Content-type", "application/json");
    http.onreadystatechange = function() {
      if (http.readyState == 4 && http.status == 200) {
        console.log(http.responseText);
      }
    }
    http.send(json);
  }
</script>
<button onclick="postSomething();">Post new user</button>

[link] 5.3. Project #3: Visualizations, Web Services, and Complete Project

The purpose of this project is to expose the data sets and capabilities you assembled in previous projects using visualizations and/or web services, and to present your completed implementation (and any associated results). This is also the last project submission, and should represent the culmination of your work during the semester.

Extend your project with two new features/components, where each feature/component is either an interactive web-based visualization that can be displayed in a standard web browser or a web service with a RESTful web API. You might choose to create two visualizations given the results you produced a previous project, or you might choose to create an interactive client-server (i.e., web interface and web service) application that allows users to invoke an algorithm or statistical analysis using their own specific parameters.
Complete a well-written and thorough description of your project (and the components you implemented) in the form of a report (which can be a README.md file within your repository, though HTML or PDF are also acceptable). It is expected that the report should come out to at least 3-5 pages (if printed in a 12-point font on 8.5 by 11 in. sheets), but there's no upper limit on length.
The report should bring together all the text, diagrams, and images you generated and put them together into a coherent summary that explains what you did for someone who is not familiar with your project. Some ideas for what you may want to include (this is a rough outline in the form of introduction, methods, results, and future work, with suggested subtopics that may or may not apply in your case, so adjust or substitute as appropriate).
- an introduction/narrative describing the problem or topic you chose to address and any associated motivation or background:
  - list the data sets you used or other online resources you collected or retrieved automatically, and their origins, and
  - describe any data sets you assembled manually or programmatically using the above;
- specify which algorithms, analysis techniques, or tools you used, and describe any interesting issues you encountered with regard to:
  - usability and/or performance of the tool or technique,
  - necessary adjustments or transformations to the data in order to make it compatible with the tool or technique, and/or
  - limitations of the technique or the data;
- summarize the results and your conclusions about the problem or topic (whether or not they are definitive or open problems remain), and include screenshots of the visualizations (or the visualizations themselves);
- describe any ideas for future work that you had while working on the project that you think might be useful to pursue, or that you did not have a chance to pursue (this will be particularly useful to us for the next iteration of the course).
Complete a poster for the poster session that will take place 3-5 PM on December 9th, 2016 at the Hariri Institute for Computing. The poster can contain the same information that your report contains, though visuals are usually more effective. You may choose to bring a demo to accompany your poster presentation. Submit your poster file as part of the pull request with your final submission. You may place the poster in the root directory of your project/group folder.

[link] References

[1]	"World's population increasingly urban with more than half living in urban areas". https://www.un.org/development/desa/en/news/population/world-urbanization-prospects.html
[2]	Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. West. "Growth, innovation, scaling, and the pace of life in cities". Proceedings of the National Academy of Sciences of the United States of America 2007;104(17):7301-7306. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852329/
[3]	Robert Albright, and Alan Demers, Johannes Gehrke, Nitin Gupta, Hooyeon Lee, Rick Keilty, Gregory Sadowski, Ben Sowell, and Walker White. "SGL: A Scalable Language for Data-driven Games". Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data 2008. http://www.cs.cornell.edu/~sowell/2008-sigmod-games-demo.pdf
[4]	Walker White, Benjamin Sowell, Johannes Gehrke, and Alan Demers. "Declarative Processing for Computer Games". Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games 2008. http://www.cs.cornell.edu/~sowell/2008-sandbox-declarative.pdf
[5]	W3C Working Group. "PROV Model Primer". https://www.w3.org/TR/prov-primer/
[6]	Robert Ikeda and Jennifer Widom. "Data Lineage: A Survey". http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf
[7]	Y. Cui and J. Widom. "Lineage Tracing for General Data Warehouse Transformations". The VLDB Journal 2003;12(1):41-58. http://ilpubs.stanford.edu:8090/525/1/2001-5.pdf
[8]	Gerd Gigerenzer. "Mindless statistics". The Journal of Socio-Economics 2004;33(5):587-606. http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf
[9]	Nihar B. Shah and Dengyong Zhou. "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing". CoRR 2014. http://www.eecs.berkeley.edu/~nihar/publications/double_or_nothing.pdf

[link] Appendix A. Using GitHub

In this course, staff will use GitHub to distribute skeleton and evaluation code for assigned projects, and students will maintain their code, collaborate on their project, and submit their solution or implementation using GitHub. Students will need to register on GitHub. Students should be able to sign up for a student discount, as well, using their Boston University credentials.

[link] A.1. Retrieving and Submitting Assigned Projects on GitHub

We assume that students are organized into groups (initially up to two students working together, though this limit will be raised to three for the later projects). Each group should designate a primary member. The workflow for student projects will be as follows.

For each project assigned during the course, a private repository will be created by course staff under the Data-Mechanics organization account with a name of the form course-YYYY-SSS-proj-NNNN where YYYY is the year, SSSS is the semester (spr or fal) and NNNN is zero, one, two, and so on.
- The first project (course-YYYY-SSS-proj-zero) will be a public repository so that those who are not members of the Data-Mechanics organozation can see and fork it in the steps below, as that will be how the course staff obtain everyone's GitHub usernames.
To begin working on that particular project, one of the group's members will fork that project repository, and will ensure that the other group members have the necessary privileges to contribute to it by adding them as collaborators.
- Within the forked repository's file tree, each group will create a subdirectory of the form BUlogin1, BUlogin1_BUlogin2, or BUlogin1_BUlogin2_BUlogin3 (depending on the number of members), where the login names BUloginN are the official Boston University login names for the members, and they are ordered in ascending alphabetical order and separated by underscores. All changes constituting work on the project should be made within this subdirectory and nowhere else, unless otherwise specified in the posted project instructions.
- It will be the responsibility of the group to ensure that their repository can be synced with the original project repository (this step only needs to be performed once).
- The group must also regularly sync their fork with the original repository in case course staff have made any changes to it. Course staff will usually announce when such changes take place.
The group members may use as many branches as they need in their own fork, but their master branch will represent the group's submission. At some point before the project deadline, the group must submit a pull request to official submit their work. Only the changes that were committed before the merge request is made will be accepted as submitted.

[link] Appendix B. Other Resources

[link] B.1. MongoDB and Related Resources

In this course, staff will use MongoDB to store and work with large data sets. This section lists some resources that you may find useful.

Some references, tutorials, and other materials of interest mentioned during lecture include the following.

The official documentation (https://docs.mongodb.org/manual/), and in particular:
- how to assemble relational queries (https://docs.mongodb.org/manual/core/aggregation-pipeline/) and
- how to assemble map-reduce queries (https://docs.mongodb.org/manual/core/map-reduce/).
Issues and feature requests mentioned during lecture include:
- flattening the results generated by mapReduce operations: https://jira.mongodb.org/browse/SERVER-2517 and
- deprecation and future removal of server-side eval() and stored procedure capabilities: https://jira.mongodb.org/browse/SERVER-17453.
A tutorial suggested by a student: http://openmymind.net/mongly/.
PyMongo Python library documentation: http://api.mongodb.org/python/current/.
Some resources dealing with using and building geospatial data set indices include:
- the GeoJSON format: http://geojson.org/geojson-spec.html,
- using geospatial indices and queries in MongoDB: https://docs.mongodb.org/manual/applications/geospatial-indexes/,
- example illustrating how to create a GEO2D index using PyMongo: http://api.mongodb.org/python/current/examples/geo.html, and
- extracts of OpenStreetMap data for metropolitan areas: https://mapzen.com/data/metro-extracts/.

[link] B.2. Installation Resources for Other Software Packages and Libraries

This subsection contains links to other useful resources that may help with installing packages and libraries we use in this course.

Some resources that may be useful when working with the prov package:

when installing the prov library for Python on Windows, you may have some trouble installing the lxml package. Try obtaining the precompiled package here, instead, and installing it using pip install *.whl;
a short tutorial: http://nbviewer.jupyter.org/github/trungdong/notebooks/blob/master/PROV%20Tutorial.ipynb.

As we use the JSON format for storing authentication, configuration, and provenance information, the JSON Schema is useful and there is a corresponding jsonschema Python package that you will need to install.

Two useful kinds of tools for solving problems that can be represented as systems of equations and inequalities are SMT solvers and linear optimization packages.

The SciPy library collection includes the optimize.linprog module for solving linear optimization problems, and the library is relatively straightforward to install:
- using Windows, it may be easiest to obtain and install precompiled binaries for numpy and scipy (in that order);
- using Mac OS X or Linux, you will likely need to download and install a Fortran compiler first, after which running pip install numpy and pip install scipyshould be sufficient.
The Z3 SMT solver is now open source and very easy to use with Python; precompiled binaries are also available.