BU CAS CS 591 L1 Spring 2016

NOTE: This page contains all the examples presented during the lectures, as well as all the homework assignments. Click here to go back to the main page with the course information and schedule.

1. Introduction, Background, and Motivation
2. Modeling Data and Data Transformations
3. Systems, Models, and Algorithms
4. Statistical Analysis
References
Appendix A. Using GitHub
- A.1. Retrieving and Submitting Assigned Projects on GitHub
Appendix B. Other Resources
- B.1. MongoDB and Related Resources
- B.2. Installation Resources for Other Software Packages and Libraries

[link] 1. Introduction, Background, and Motivation

[link] 1.1. Overview

With over half of the world's population living in cities [#], and given the possibility that cities are a more efficient [#] way to organize and distribute resources and activities, the rigorous study of cities as systems that can be modeled mathematically is a compelling proposition. With the advent of pervasive infrastructures of sensors and computational resources in urban environments (smart cities, software-defined cities, and the Internet of Things), there is a potential to inform and validate such models using actual data, and to exploit them using the interoperability of the pervasive computational infrastructure that is available.

In this course, we introduce and define the novel term data mechanics to refer specifically to the study of a particular aspect of large, instrumented systems such as cities: how data can flow through institutions and computational infrastructures, how these flows can interact, integrate, and cleave, and how they can inform decisions and operations (in both online and offline regimes). We choose the term mechanics specifically because it connotes the study of mechanics (e.g., classical mechanics) in physics, where universal laws that govern the behavior and interactions between many forms of matter are studied. We also choose this term to emphasize that, often, the data involved will be tied to the physical environment: geospatial and temporal information is a common and integral part of data sets produced and consumed by instrumented environments.

[link] 1.2. Data Mechanics Repository and Platform

This course is somewhat unusual in that there are specific, concrete software development goals towards which the staff and students are working. In particular, one goal of the course is to construct a new general-purpose service platform for collecting, integrating, analyzing, and employing (both in real time and offline) real data sets derived from urban sensors and services (particularly those in the city of Boston). We will consider several real data sets and data feeds, including:

We will also secure additional data sets from the above sources as well as a few others, and for the purposes of student projects, there are no limits on other sources of data or computation (e.g., Twitter, Mechanical Turk, and so on) that can be employed in conjunction with some of the above.

Because this course involves the construction of a relatively large software application infrastructure, we will have the opportunity to introduce and practice a variety of standard software development and software engineering concepts and techniques. This includes source control, collaboration, documentation, modularity and encapsulation, testing and validation, inversion of control, and others. Students will also need to become familiar with how to use web service APIs used by government organizations (e.g., Socrata) to make queries and retrieve data.

The overall architecture of the service platform will have at least the following:

a database/storage backend ("repository") that houses:
- original and derived data sets, with annotations that include:
  - from where, when, and by what algorithm it was retrieved
  - using what integration algorithms it was derived
- algorithms for data retrieval or integration, with references that include:
  - when it was written and by whom
  - in what data sets it is derived
  - from what component algorithms it is composed (if it is such)
a web service ("platform") witha n application program interface (API) for running analysis and optimization algorithms:
- a defined language for defining analysis and optimization algorithms over the data stored in the repository
- an interface for submitting and running algorithms
other features for data and result visualization, simulation using partial data, etc.

[link] 1.3. Mathematical Modeling, Analysis Algorithms, and Optimization Techniques

There are a variety of problems that it may be possible to address using the data in the repository and the capabilities of the platform. This course will cover a variety of useful online and offline optimization topics in a mathematically rigorous but potentially application-specific way, including:

a defined language for defining analysis and optimization algorithms over the data stored in the repository,
dual decomposition,
online optimization.

The goal is to apply some of these techniques to the data sets(including integrated or derived data sets) and solve practical problems. Some of the problems raised in discussions with the City of Boston DoIT and MassDOT teams are:

characterizing intersections and coming up with a metric that incorporates:
- intersection throughput (people per unit time),
- modes of transportation (public transport, biking, walking),
- intersection safety (vehicle speed, accidents, and so on),
- intersection organization (no left turns, and so on);
characterizing streets and deriving metrics for:
- number of parking spaces,
- probability of an accident (e.g., using different modes of transportation),
- senior and handicapped mobility;
characterizing neighborhoods:
- economic condition and gentrification,
- senior and handicapped accessibility;
how to allocate resources to optimize some of the metrics above:
- where to perform repairs,
- how to improve housing affordability,
- where to place bike racks of ride sharing stations,
- senior and handicapped accessibility;
answering immediate questions relevant to an individual (e.g., in an app):
- is a building accessible,
- is there a place to park (or will there likely be at some other time),
- is a neighborhood safe (or is it becoming less or more safe),
- is a neighborhood affordable (or is it becoming less or more affordable),
- are healthcare services nearby.
integrating public transportation data with other data (e.g., Waze):
- why are buses on a certain route late,
- performance and problems during unexpected events (e.g., snow).

The above list is far from complete, and we will update it as the course progresses. Students are encouraged to discuss project ideas with faculty and one another.

[link] 1.4. Project #0: Preparation and Trial Submission

The purpose of this project is to obtain the necessary credentials for subsequent projects and for everyone (including course staff) to become familiar with the submission process and address any issues.

Choose an initial team consisting of one or two members. If Alice and Bob form a team and have the BU login names alice@bu.edu and bob@bu.edu, the unique identifier for their team will be alice_bob, where, in cases with two members, the two usernames are in ascending alphabetical order and separated by an underscore. For a one-member team, it is simply the BU login name of the sole member (e.g., alice).
Depending on which data you think you might use, sign up for the City of Boston Data Portal and/or the MBTA Developer Portal. We will make other, larger raw data sets from these organizations available soon, once we obtain them and set up the appropriate servers.
Create a GitHub account and follow the GitHub submission instructions by forking the public repository at https://github.com/Data-Mechanics/course-2016-spr-proj-zero, adding a single folder named using the group identifier (alice or alice_bob) that contains a single ASCII text file members.txt. Each team member should commit a line to that text file specifying a mapping from their GitHub username and their BU username. For example, if Alice and Bob have the GitHub usernames alicegh and bobgh, respectively, then the file should look as follows:

alicegh:alice
bobgh:bob

One of the team members should then make a pull request as specified in the instructions to submit the project. On subsequent projects, you will be forking a private repository once we add your user account to the Data-Mechanics organization (which we will retrieve from the members.txt file you add above).
Download and install Python 3 and MongoDB on your system if you do not have them already, review the Socrata API (try the example below with different data sets and query parameters), and try installing the prov module for Python.

import urllib.request
import json

url = "https://data.cityofboston.gov/resource/awu8-dc52.json?$limit=10"
response = urllib.request.urlopen(url).read().decode("utf-8")
print(json.dumps(json.loads(response), sort_keys=True, indent=2))

[link] 2. Modeling Data and Data Transformations

To study rigorously the ways data can behave and interact within an infrastructure that generates, transforms, and consumes data (e.g., to make decisions), it is necessary to define formally what data and data transformations are. One traditional, widely used, and widely accepted model for data is the relational model: any data set is a relation (i.e., a subset of a product of sets), and transformations on data are functions between relations. While the relational model is sufficient to define any transformation on data sets, the MapReduce paradigm is one modern framework for defining transformations between data sets.

In modern contexts and paradigms, these models can be useful when studying relatively small collections of individual, curated data sets that do not change dramatically in the short term. However, these alone are not sufficient in a context in which data sets are overwhelmingly multitudinous, varying in structure, and continuously being generated, integrated, and transformed. One complementary discipline that can provide useful tools for dealing with numerous interdependent data sets is that of data provenance or data lineage. The provenance of a data set (or subset) is a formal record of its origin, which can include how the data was generated, from what data or other sources it was created or derived, what process was used or was responsible for creating or deriving it, and other such information. This information can be invaluable for a variety of reasons beyond the obvious ones (i.e., the origin of the data), such as:

the same data be generated again or reproduced if an error occurs,
a data set can be updated from the original source if the source has been updated or changed,
the source of an inconsistency or aberration of data can be investigated,
any of the above could be applied to a subset because recomputing or investigating the entire data set would be prohibitively costly.

[link] 2.1. Relational Data and the MapReduce Paradigm

The relational model for data can be expressed in a variety of ways: a data set is a relation on sets, a logical predicate governing terms, a collection of tuples or records with fields and values, a table of rows with labelled columns, and so on. Mathematically, they are all equivalent. In this course, we will adopt a particular model because it is well-suited for the tools and paradigms we will employ, and because it allows for one fairly clean mathematical integration of the study of relational data and data provenance.

[link]

Definition: A data set (also known as a store or database) is a multiset R: a collection (possibly with duplicates) of tuples of the form (x₁,...,x_n) taken from the set product X₁ × ... × X_n. Typically, some distinguished set (e.g., the left-most in the set product) will be a set of keys, so that every tuple contains a key. Whether a set is a key or not often depends on the particular paradigm and context, however.

[link]

Definition: A data transformation T: A → B is a mapping from one space of data sets A to another space of data sets B. Notice that an individual data set S (i.e., a relation or, equivalently, a set of tuples) is just an element of A.

Some of the typical building blocks for data transformations in the relational model are:

union and difference (intersection can also be defined in terms of these),
projection (sometimes generalized into extended projection),
selection (filtering),
renaming,
Cartesian product,
variants of join operations (many can be constructed using the above, but other variants have been added as extensions),
aggregation (an extension).

One common operation on relations that is not possible to express in traditional formulations but found in some relational database systems is the transitive closure of a data set. Normally, this requires an iterative process consisting of a sequence of join operations.

[link]

Example: We can model and illustrate the transformations that constitute the MapReduce paradigm using Python. Note that selection and projection can be implemented directly using Python comprehensions, but we define wrappers below for purposes of illustration.


def union(R, S):
    return R + S

def difference(R, S):
    return [t for t in R if t not in S]

def intersect(R, S):
    return [t for t in R if t in S]

def project(R, p):
    return [p(t) for t in R]

def select(R, s):
    return [t for t in R if s(t)]
 
def product(R, S):
    return [(t,u) for t in R for u in S]

def aggregate(R, f):
    keys = {r[0] for r in R}
    return [(key, f([v for (k,v) in R if k == key])) for key in keys]

In the MapReduce paradigm, a smaller set of building blocks inspired by the functional programming paradigm (supported by languages such as ML and Haskell) exist for defining transformations between data sets. Beyond adapting (with some modification) the map and reduce (a.k.a., "fold") functions from functional programming, the contribution of the MapReduce paradigm is the improvement in the performance of these operations on very large distributed data sets. Because of the elegance of the small set of building blocks, and because of the scalability advantages under appropriate circumstances, it is worth studying the paradigm's two building blocks for data transformations: map and reduce:

a map operation will apply some user-specified computation to every tuple in the data set, producing one or more new tuples,
a reduce operation will apply some user-specified aggregation computation to every set of tuples having the same key, producing a single result.

Notice that there is little restriction on the user-specified code other than the requirement that it be stateless in the sense that communication and coordination between parallel executions of the code is impossible. It is possible to express all the building blocks of the relational model using the building blocks of the MapReduce paradigm.

[link]

Example: We can model and illustrate the two basic transformations that constitute the MapReduce paradigm in a concise way using Python.


def map(f, R):
    return [t for (k,v) in R for t in f(k,v)]
    
def reduce(f, R):
    keys = {k for (k,v) in R}
    return [f(k1, [v for (k2,v) in R if k1 == k2]) for k1 in keys]

A map operation applies some function f to every key-value tuple and produces zero or more new key-value tuples. A reduce operation collects all values under the same key and performs an aggregate operation f on those values. Notice that the operation can be applied to any subset of the tuples in any order, so it is often necessary to use an operation that is associated and commutative.

At the language design level, the relational model and the MapReduce paradigm are arguably complementary and simply represent different trade-offs; they can be used in conjunction on data sets represented as relations. Likewise, implementations of the two represent different performance trade-offs. Still, contexts in which they are used can also differ due to historical reasons or due to conventions and community standards.

[link]

Example: We can use the Python implementations of the map and reduce operations in an example above to implement some common transformations in the relational model. For this example, we assume that the first field in each tuple in the data set is a unique key (in general, we assume there is a unique key field if we are working with the MapReduce paradigm). We illustrate how selection can be implemented in the code below.


R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]

X = map(lambda k,v: [((k,v), (k,v))] if v > 20 else [], R) # Selection.
Y = reduce(lambda k,vs: k, X) # Keep same tuples (use tuples as unique keys).

We can also perform projection and aggregation.


R = [('Alice', ('F', 23)), ('Bob', ('M', 19)), ('Carl', ('M', 22))]

X = map(lambda k,v: [(v[0], v[1])], R) # Projection keeps only gender and age.
Y = reduce(lambda k,vs: (k, sum(vs)), X) # Aggregates ages by gender.

We can also perform a simple join operation, although we also need to "combine" the two collections of data. This particular join operation is also simple because each tuple in the input is only joined with one other tuple.


R = [('Alice', 23), ('Bob', 19), ('Carl', 22)]
S = [('Alice', 'F'), ('Bob', 'M'), ('Carl', 'M')]

X =   map(lambda k,v: [(k, ('Age', v))], R)\
    + map(lambda k,v: [(k, ('Gender', v))], S)
Y = reduce(\
        lambda k,vs:\
          (k,(vs[0][1], vs[1][1]) if vs[0][0] == 'Age' else (vs[1][1],vs[0][1])),\
        X\
      )

[link]

Exercise: Suppose you have a data set containing tuples of the form (name, city, age). You want to produce a result of the form (city, range) where range is defined as the difference between the oldest and youngest person in each city.

Construct a sequence of transformations that will yield this result using the basic building blocks available in the relational model (projections, selections, aggregations, and so on).
Use the MapReduce paradigm to construct a single-pass (one map operation and one reduce operation) MapReduce computation that computes the desired result. Hint: emit two copies of each entry (one for the computation of the maximum and one for the computation of the minimum).

[link]

Exercise: Consider the following line of Python code that operates on a data set D containing some voting results broken down by voter, state where the voter participated in voting, and the candidate they chose:


sum([1 for (person, state, candidate) in D if state == "Massachusetts" and candidate == "Trump"])

Identify which building blocks for transformations available in the relational model are being used in the above code, and draw a corresponding flow diagram that shows how they fit together.

[link] 2.2. Data Provenance

Data provenance is an overloaded term that refers, in various contexts and communities, to the source, origin, or lifecycle of a particular unit of data (which could be an individual data point, a subset of a data set, or an entire data set). In this course, we will use the term primarily to refer to dependency relationships between data sets that may be derived from one another (usually over time). The study of data provenance (also called data lineage) is arguably still being developed. However, more recently, standards for general-purpose representations of data provenance have been established (e.g., PROV [#]).

While the research literature explores various ways of categorizing approaches to data provenance, there are two dimensions that can be used to classify provenance techniques (surveyed in more detail in the literature [#]):

from where the data was generated (e.g., what data sets or individual data entries) and how it was generated (e.g., what algorithms were used);
whether the linear is tracked at a fine granularity (e.g., per data entry such as a row in a data set) or a coarse granularity (e.g., per data set).

Because data transformations can be broken down into building blocks (e.g., selections and projections in the relational model, or map and reduce operations in the MapReduce paradigm), when studying data lineage of individual data items it may be possible to derive standard techniques [#] for determining data lineage given the components that make up the transformation.

There are at least two approaches one can use to determine provenance of a single entry within a data set produced using a specific transformation. One is to track the provenance of every entry within the transformation itself, and another is to determine the provenance of an entry after the transformation has already been performed (e.g., by another party and/or at some point in the past). The latter scenario is more likely if storage of per-entry provenance meta-data is an issue, or if the transformations are black boxes that cannot be modified or extended before they are applied.

[link]

Fact: For intersection, union, difference, projection, and selection transformations, the per-entry provenance describing which input data set entries influenced an individual entry in the output data set (i.e., the "where" provnenance of an individual entry in the output) can be determined using a linear search through the input data sets for that item.

[link]

Fact: For an aggregation transformation, finding the per-entry provenance for an entry in the output data set is non-trivial. Note that even running the entire transformation again would not yield the subset of input data set entries that produced the output data set entry in question. In the worst, case, it would be necessary to run the transformation on all 2ⁿ subcollections of the input data set (for an input data set of size n) and to check for each one whether the given entry in the output data set was generated.

If we add the additional condition that the aggregation transformation is key-preserving (which means any two entries under the same keys in the input data set always contribute to an entry with the same output key in the output data set), then the provenance information is easy to determine using linear search.

If we add the additional condition that the aggregation transformation is context-free (which means any two entries in the input data set always contribute to the same output entry regardless of what other input entries are present), then the provenance information can be determined by partitioning the input data set using the entries of the output data set (i.e., there is one output data set entry per input data set entry), and the determinine which partition contributed to the output data set entry of interest using linear search. This could be done in polynomial time.

[link]

Exercise: For each of the following informal descriptions of data transformations, determine whether computing the per-entry provenance of a row in the output data set is easy (i.e., linear time) or difficult (i.e., cannot be done in polynomial time), assuming that the transformation's implementation is a black box (except for the provided description).

Given a data set containing entries of the form (name, age), the transformation produces a list of all entries corresponding to individuals who are of a legal voting age.
Given a data set containing entries of the form (product, department, price, date) describing sales that take place in a department store, the transformation computes the total revenue from each department for each possible date.
Given a data set specifying the geographical locations of a collection of radio receivers in a bounded region, determine an optimal placement of 100 radio towers within that region that minimizes the aggregate of the distances between each tower-receiver pair.

Tracking provenance at a coarse granularity (per data set) is relatively inexpensive in most applications (since the number of data sets is usually orders of magnitude less than the number of individual data set entries), and standards such as PROV provide authors of transformations a way to represent provenance information.

[link]

Example: Suppose Alice and Bob want to assemble a script that retrieves some information from an online resource and stores it in a database. We can use the prov Python package to programmatically assemble a course granularity provenance record that conforms to the PROV standard and describes a particular execution of this script.

We first create the document object and define the namespaces we will use for symbols in the document.


doc = prov.model.ProvDocument()
doc.add_namespace('alg', 'http://datamechanics.io/algorithm/alice_bob/') # The scripts in / format.
doc.add_namespace('dat', 'http://datamechanics.io/data/alice_bob/') # The data sets in / format.
doc.add_namespace('ont', 'http://datamechanics.io/ontology#')
doc.add_namespace('log', 'http://datamechanics.io/log#') # The event log.
doc.add_namespace('bdp', 'https://data.cityofboston.gov/resource/')

We then use an agent to represent the script, en entity to represent a resource, and an activity to repsent this invocation of the script. We also associate this current activity with the script, and we and indicate that his activity used the entity representing the resource.


this_script = doc.agent('alg:example', {prov.model.PROV_TYPE:prov.model.PROV['SoftwareAgent'], 'ont:Extension':'py'})
resource = doc.entity('bdp:wc8w-nujj', 
    {'prov:label':'311, Service Requests', 
    prov.model.PROV_TYPE:'ont:DataResource', 'ont:Extension':'json'}
  )
this_run = doc.activity(
    'log:a'+str(uuid.uuid4()), startTime, endTime, 
    {prov.model.PROV_TYPE:'ont:Retrieval', 'ont:Query':'?type=Animal+Found&$select=type,latitude,longitude,OPEN_DT'}
  )
doc.wasAssociatedWith(this_run, this_script)
doc.used(this_run, resource, startTime)

Finally, we define an entity for the data obtained and indicate to what agent it was attributed, by what actvitity it was generated, and from what entity it was derived.


found = doc.entity('dat:found', {prov.model.PROV_LABEL:'Animals Found', prov.model.PROV_TYPE:'ont:DataSet'})
doc.wasAttributedTo(found, this_script)
doc.wasGeneratedBy(found, this_run, endTime)
doc.wasDerivedFrom(found, resource, this_run, this_run, this_run)

repo.record(doc.serialize()) # Record the provenance document.

We can view our completed provenance record in, for example, JSON form using doc.serialize().

[link] 2.3. Composing Transformations into Algorithms

Whether we are using the relation model or the MapReduce paradigm, the available building blocks can be used to assemble fairly complex transformations on data sets. Each transformation can be written either using the concrete syntax of a particular programming language that implements the paradigm, or as a data flow diagram that describes how starting and imtermediate data sets are combined to derive new data sets over the course of the algorithm's operation.

[link]

Example: We can use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the k-means clustering algorithm.


def dist(p, q):
    (x1,y1) = p
    (x2,y2) = q
    return (x1-x2)**2 + (y1-y2)**2

def plus(args):
    p = [0,0]
    for (x,y) in args:
        p[0] += x
        p[1] += y
    return tuple(p)

def scale(p, c):
    (x,y) = p
    return (x/c, y/c)

M = [(13,1), (2,12)]
P = [(1,2),(4,5),(1,3),(10,12),(13,14),(13,9),(11,11)]

OLD = []
while OLD != M:
    OLD = M

    MPD = [(m, p, dist(m,p)) for (m, p) in product(M, P)]
    PDs = [(p, dist(m,p)) for (m, p, d) in MPD]
    PD = aggregate(PDs, min)
    MP = [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
    MT = aggregate(MP, plus)

    M1 = [(m, 1) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2]
    MC = aggregate(M1, sum)

    M = [scale(t,c) for ((m,t),(m2,c)) in product(MT, MC) if m == m2]
    print(sorted(M))

Below is a flow diagram describing the overall organization of the computation (nodes are data sets and edges are transformations). Note that the nodes labeled "..." are intermediate results that are implicit in the comprehension notation. For example, [(m, p) for ((m,p,d), (p2,d2)) in product(MPD, PD) if p==p2 and d==d2] first filters product(MPD, PD) using a selection criteria if p==p2 and d==d2 and then performs a projection from tuples of the form ((m,p,d), (p2,d2)) to tuples of the form (m, p) to obtain the result.

table([ ['rd:prod``P', null, null, 'd:agg min``PDs'], [null, 'r:proj``...', 'dr:prod`ru:proj``MPD', 'd:prod``PD'], ['ru:prod``M', null, null, 'd:selection``...'], ['u:proj``...', 'ld:prod``MC', 'l:agg sum``M1', 'l:proj`d:proj``...'], ['u:selection``...', 'l:prod``MT', null, 'll:agg plus``MP'] ])

[link]

Example: We can also use building blocks drawn from the relational model (defined for Python in an example above) to construct an implementation of the Floyd-Warshall shortest path algorithm.


N = ['a','b','c','d','e','f']
E = [('a','b'),('b','c'),('a','c'),('c','d'),('d','e'),('e','f'),('b','f')]

P = product(N,N)
I = [((x,y),100 if x != y else 0) for (x,y) in P]
D = [((x,y),1) for (x,y) in E]

OUTPUT = aggregate(union(I,D), min)
STEP = []
while sorted(STEP) != sorted(OUTPUT):
    STEP = OUTPUT
    P = product(STEP, STEP)
    NEW = union(STEP,[((x,v),k+m) for (((x,y),k),((u,v),m)) in P if u == y])
    OUTPUT = aggregate(NEW, min)

SHORTEST = OUTPUT

[link]

Example: We can use building blocks drawn from the MapReduce paradigm that are available in MongoDB to construct an implementation of the k-means clustering algorithm. This implementation illustrates many of the idiosyncracies of the MapReduce abstraction made available by MongoDB.


db.M.remove({});
db.P.remove({});
db.createCollection("M");
db.createCollection("P");

db.system.js.save({ _id:"dist" , value:function(u, v) {
  return Math.pow(u.x - v.x, 2) + Math.pow(u.y - v.y, 2);
}});

function flatten(X) {
    db[X].find().forEach(function(x) { db[X].update({_id: x._id}, x.value); });
}

function prod(X, Y, XY) {
  db[XY].remove({});
  db.createCollection(XY);
  db[X].find().forEach(function(x) {
    db[Y].find().forEach(function(y) {
      db[XY].insert({left:x, right:y});
    });
  });
}

function union(X, Y, XY) {
  db[XY].remove({});
  db.createCollection(XY);
  db[X].find().forEach(function(x) {
    db[XY].insert(x);
  });
  db[Y].find().forEach(function(y) {
    db[XY].insert(y);
  });
}

// We'll only perform a single product operation.
// Using map-reduce, we can perform argmax and argmin more easily.
// We can also use map-reduce to compare progress.

db.M.insert([{x:13,y:1}, {x:2,y:12}]);
db.P.insert([{x:1,y:2},{x:4,y:5},{x:1,y:3},{x:10,y:12},{x:13,y:14},{x:13,y:9},{x:11,y:11}]);

db.createCollection("HASHOLD");
db.createCollection("HASHNEW");

var iter = 0;
do {

  db.M.mapReduce(
    function() { emit("hash", {hash: this.x + this.y})},
    function(k, vs) {
      var hash = 0;
      vs.forEach(function(v) {
        hash += v.hash;
      });
      return {hash: hash};
    },
    {out: "HASHOLD"}
  );

  prod("M", "P", "MP");
  db.MPD.remove({});
  db.MP.mapReduce(
    function() { emit(this._id, {m:this.left, p:this.right, d:dist(this.left, this.right)}); },
    function() { },
    {out: "MPD"}
  );
  flatten("MPD");

  // No need to flatten when multiple map-reduce operations are taking place.

  db.MPD2.remove({});
  db.MPD.mapReduce(
    function() { emit(this.p._id, this); },
    function(key, vs) {
      var j = 0;
      vs.forEach(function(v, i) {
        if (v.d < vs[j].d)
          j = i;
      });
      return vs[j];
    },
    {out: "MPD2"}
  );

  // Remember that the reduce operation will be applied multiple times, in any order.
  // Cannot assume all will come in at once. Must be associative and commutative.

  db.MPD2.mapReduce(
    function() { emit(this.value.m._id, {x: this.value.p.x, y:this.value.p.y, c:1}); },
    function(key, vs) {
      var x = 0, y = 0, c = 0;
      vs.forEach(function(v, i) {
        x += v.x;
        y += v.y;
        c += v.c;         
      });
      return {x:x, y:y, c:c};
    },
    { finalize: function(k, v) { return {x: v.x/v.c, y: v.y/v.c}; },
      out: "M"
    }
  );
  flatten("M");

  db.M.mapReduce(
    function() { emit("hash", {hash: this.x + this.y})},
    function(k, vs) {
      var hash = 0;
      vs.forEach(function(v) {
        hash += v.hash;
      });
      return {hash: hash};
    },
    {out: "HASHNEW"}
  );

  var o = db.HASHOLD.find({}).limit(1).toArray()[0].value.hash;
  var n = db.HASHNEW.find({}).limit(1).toArray()[0].value.hash;
  print(o);
  print(n);
  print(iter);
  iter++;
} while (o != n);

[link] 2.4. Project #1: Data Retrieval, Storage, Provenance, and Transformation

The purpose of this project is to practice using some of the tools necessary for retrieving and storing data sets, tracking provenance information for data sets, and performing some transformations using one or both of the presented paradigms. For the purposes of this project description, we assume a team consisting of two members, Alice and Bob, with the team identifier alice_bob in accordance with Project #0.

1. Follow the GitHub submission instructions by forking the public repository at https://github.com/Data-Mechanics/course-2016-spr-proj-one. Note that we may commit and push a few additional fixes or updates to this repository before the deadline, so check for updates and pull them into your fork on a regular basis. Set up a MongoDB and Python environment as necessary for the project. You should be able to run the setup.js script to prepare the repository, and then to start your MongoDB server with authentication enabled and run the alice_bob example script.
2. As in Project #0, create a folder of the form alice_bob within the top-level directory of the project. All your scripts for retrieval of data and for transforming data already within the repository should be placed within this folder. Note that the file name for each script should be reasonably descriptive, as it will act as the identifier for the script. The scripts should also follow reasonable modularity and encapsulation practices, and should logically perform related operations. A larger number of smaller simpler, and more reusable scripts that each perform a small task is preferable.
1. Choose at least three publicly available data sets, data streams, services, or documents and write a short narrative and justification (5-10 sentences) explaining how these data sets can be combined to answer an interesting question or solve a problem. You do not need to solve the actual problem in this project, and it is acceptable to merely combine data sets in a way that is likely to support one or more solutions involving the particular data you choose. Include this narrative in a README.md file within your directory (along with any documentation you may write in that file).
2. Implement scripts that retrieve these data sets automatically. Any authentication credentials your scripts use should be included in a auth.json file, and all your scripts should retrieve the credential information from that file when it is passed to them over the command line, as seen below.
  
  python example.py auth.json
  
  Do not submit your auth.json file and do not include hard-coded authentication credentials in your code. You can use a .gitignore file to ensure you do not accidentially submit the credentials file. The course staff will use their own authentication credentials when running your code. Your README.md file should list any idiosyncratic details associated with the credentials needed to run your scripts.
3. Implement scripts that perform at least two non-trivial transformations that merge, combine, or otherwise process these data sets into two new data sets. These new data sets should be inserted into the repository. You can choose one of the existing algorithms or tools we have considered so far, or something else (assuming you thoroughly document in your README.md file how to obtain and run those tools).
1. Extend each of the scripts you implemented in Problem #2 to also store the provenance information documenting the activities that take place during the operation of that script. Each script should generate a single document once it finishes running, and should insert that document into the repository using record().
2. Create a separate JSON file plan.json in the directory that encodes the overall plan describing what all the scripts do. It should not have explicit time stamps, but it should be a provenance document conforming to the PROV standard that is the union of the all the PROV documents that the individual scripts generate. The information in this file, together with the scripts, should be sufficient to reproduce the data obtained and generated by the scripts.

[link] 3. Systems, Models, and Algorithms

When using mathematics to solve problems involving real-world entities and phenomenon, we can introduce the notions of a system (i.e., an organized collection of interdependent components and their possible states) that has distinct system states (which can capture spatial and temporal information about the system components, their behaviors, and their relationships). A mathematical model captures only some parts of the system and the possible system states, abstracting the other details away, and it does so using some particular symbolic language or representation.

Four common abstract mathematical objects used to represent models of systems are particularly relevant given our application domain of interest:

graphs consisting of a collection of nodes and edges (possibly weighted and directed) between those nodes,
vector spaces consisting of sets of vectors,
systems of equations and inequalities (or, equivalently, sets of constraints or set of logical formulas), and
functions (whether continuous or discrete).

In many cases, these mathematical objects can be used to represent each other (e.g., a graph can be represented as a function or as a system of equations), but each has its own characteristics that make it well-suited for modeling certain kinds of systems. Very often, two or more of these are combined when modeling a system.

Most data sets describing a real-world system are using one or more of the above languages for modeling that system. Identifying which of these is being used can inform our decisions about what techniques we can use to answer questions about that system or to solve problems involving that system. It can also tell us how we can convert to a different language or representation in order to convert, combine, or derive new data sets.

[link] 3.1. Systems, Models, and Metrics

We can provide abstract definitions for systems, models, and metrics. These then allow us to characterize satisfaction, search, and optimization problems in a general and consistent way.

[link]

Definition: A system is a set S of distinct system states (possibly infinite). Often, each possible system state s ∈ S is called a model of the system.

One mathematical object that is commonly used to represent systems is a vector space where individual vectors are the system states (i.e., models) of that system.

[link]

Definition: A constraint set is a set of logical formulas (written using a formal language that may contain terms such as integers, vector, matrices, sets, and so on), usually used to identify a subset of a system (i.e., a subset of the possible systems states or models). A system state or model satisfies a constraint or constraint set if the logical formulas are true for that model.

When using vector spaces, it is often possible to represent constraints using a matrix equation, e.g.:

M ⋅ x

In the above, M might be a matrix from ℝ^{n × m}, x might be a variable vector in ℝ^m, and b might be a vector in ℝⁿ.

[link]

Definition: A constraint satisfaction problem is the problem of finding a model or set of models that satisfy a constraint set.

[link]

Definition: A metric over a system S is a function f: S → ℝ that maps system states to real numbers.

[link]

Definition: Given a system S and metric f, an optimization problem is the problem of finding s ∈ S that maximizes or minimizes f:

argmax_{s ∈ S} f(s)

argmin_{s ∈ S} f(s)

[link]

Exercise: For each of the following scenarios, determine whether it is an instance of a constraint satisfaction problem or an instance of an optimization problem. If it is an optimization problem, identify the objective function.

Given a data set describing a set of spectators and their heights, determine if it is possible to seat them in a stadium so that no one blocks anyone else's view.
Find the lowest-cost route between two destinations on a transportation network.
Given a social network where any pair of individual members can be "connected", determine if the whole network is connected (i.e., there exists a path from every member to every other member).
Given a social network where any pair of individual members can be "connected", find the smallest group of individuals that must be removed in order to make the network disconnected.

[link] 3.2. Spatial and Graph Algorithms

In the section above, we used some basic algorthms that have natural spatial interpretations (basic clustering, shortest paths, and shortest routes) in order to illustrate how to build data transformations using building blocks drawn from the relational model and the MapReduce paradigm. These two algorithms employed two different representations: a vector space (clustering) and a graph (shortest paths and routes). In this section, we review a few other such algorithms.

[link] 3.3. Linear systems, satisfiability modulo theories, and linear programming

We often say a system or subset of a system is linear if it can be modeled using a Euclidean vector space and can be defined using a collection of linear constraints of the form:

a₁ ⋅ x₁ + ... + a_n ⋅ x_n ♦ c

Where (x₁, ..., x_n) ∈ ℝⁿ is a vector, the a_i are coefficients in ℝ, c in ℝ is a constant, and ♦ is a relation operator such as =, ≤, <, ≥, or >. Note that the constraints contain no higher-order x_i terms such as x_i². There are three common techniques for solving constraint satisfaction and optimization problems involving linear systems:

finding an exact or approximate solution (e.g., using some combination of matrix inversion, QR factorization, and ordinary least squares),
checking if a solution exists (and finding a witness system state or model proving this), and
finding a solution that minimizes or maximizes some (linear) metric over the vector space.

The third item (solving an optimization problem) is arguably a generalization of the first approach if we remove the requirement that the metric must be linear (since the first approach minimizes the Euclidean distance metric, which is quadratic and thus non-linear).

[link]

Example: Suppose we want to use an SMT solver to solve a flow maximization problem for a flow graph (the variable on each represents the flow amount, and the parenthesized number is the maximum possible flow on that edge):

table([ [null, null, ('dd:x4 (6)`dr:x5 (3)``γ'), null, null], [('r:x1 (?)``α'), ('ur:x2 (7)`dr:x3 (8)``β'), null, ('r:x7 (5)``ε'), ('ζ')], [null, null, ('ur:x6 (4)``δ'), null, null], ])

Here, the system is the set of possible states of the graph in terms of the amount of material flowing along each edge (we model the flow that goes along each edge in the graph as a variable ranging over ℝ, and the flow in the overall graph containing seven edges as a vector in ℝ⁷).


import z3

(x1,x2,x3,x4,x5,x6,x7) = [z3.Real('x'+str(i)) for i in range(1,8)]

S = z3.Solver()

# Only allow non-negative flows.
for x in (x1,x2,x3,x4,x5,x6,x7):
    S.add(x >= 0)
    
# Edge capacity constraints.
S.add(x2 <= 7, x3 <= 8, x4 <= 6)
S.add(x5 <= 3, x6 <= 4, x7 <= 5)

# Constraints derived from graph topology.
S.add(x1 <= x2+x3, x2 <= x4+x5, x3+x4 <= x6, x5+x6 <= x7)

S.add(x1 > 0) # We want a positive flow.

print(S.check())
print(S.model())

If we are allowed to vary only the variable for incoming edge, we can use the SMT solver together with binary search to find the maximum flow through the graph.


flow = 0
for i in range(5, -1, -1):
    S.push()
    S.add(x1 >= (2**i + flow))
    if str(S.check()) != 'unsat':
        flow += 2**i
    S.pop()

S.add(x1 >= flow)
print(S.model())

We could alternatively build up the constraint set from a starting matrix.


def dot(xs, ys):
    return sum([x*y for (x,y) in zip(xs, ys)])

x = [x1,x2,x3,x4,x5,x6,x7]

M = [
  [ 0,-1, 0, 0, 0, 0, 0],
  [ 0, 0,-1, 0, 0, 0, 0],
  [ 0, 0, 0,-1, 0, 0, 0],
  [ 0, 0, 0, 0,-1, 0, 0],
  [ 0, 0, 0, 0, 0,-1, 0],
  [ 0, 0, 0, 0, 0, 0,-1],

  [ 1, 0, 0, 0, 0, 0, 0],
  [ 0, 1, 0, 0, 0, 0, 0],
  [ 0, 0, 1, 0, 0, 0, 0],
  [ 0, 0, 0, 1, 0, 0, 0],
  [ 0, 0, 0, 0, 1, 0, 0],
  [ 0, 0, 0, 0, 0, 1, 0],
  [ 0, 0, 0, 0, 0, 0, 1],

  [-1, 1, 1, 0, 0, 0, 0],
  [ 0,-1, 0, 1, 1, 0, 0],
  [ 0, 0,-1,-1, 0, 1, 0],
  [ 0, 0, 0, 0,-1,-1, 1]
  ]

b = [-7,-8,-6,-3,-4,-5,
      0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0]

S = z3.Solver()

for i in range(len(M)):
    S.add(b[i] <= dot(M[i], x))

print(S.check())
print(S.model())

[link]

Example: Suppose we want to use linear programming to solve the problem described in the example above.


from scipy.optimize import linprog

M = [
  [ 1,-1,-1, 0, 0, 0, 0],
  [ 0, 1, 0,-1,-1, 0, 0],
  [ 0, 0, 1, 1, 0,-1, 0],
  [ 0, 0, 0, 0, 1, 1,-1],
  ]
b = [0, 0, 0, 0]
c = [-1, 0, 0, 0, 0, 0, 0]
bounds = [(0, None), (0, 7), (0, 8), (0, 6), (0, 3), (0, 4), (0, 5)]

result = linprog(c, A_ub = M, b_ub = b, bounds=bounds, options={"disp": True})
print(result)

Note that this technique produces a potentially different solution from the one in the example above (though both are optimal).

[link] 3.4. General-purpose Optimization Techniques

There exist general-purpose techniques for solving an optimization problem given a set of constraints and an objective function. Usually, these rely on certain general assumptions about the properties of the constraints and the objective function.

[link]

Example: Suppose we want to find a "best fit" line y = a ⋅ x + b for a collection of k points in two dimensions. For the purposes of this example, we will assume that there is a solution that fits the data exactly. This amounts to solving a system of the form A ⋅ v = w where each row i of the matrix A has two entries x_i and 1, v is a vertical vector with two variable entries a and b, and w is a vertical vector where each row i is one entry y_i.

We can define a closed form for the gradient for this optimization problem in the following way. First, we define a closed form for the error function:

ε(a, b)

Σ_i [((ax_i + b) − y_i)²]

Σ_i [(a² x_i² + 2ab x_i + b²) − 2y_i (ax_i + b) + y_i²]

a²(Σ_i x_i²) + ab (Σ_i 2x_i) − a(Σ_i 2y_i x_i ) − b(Σ_i 2y_i) + kb² + Σ_i y_i²

We now need a closed form for the gradient of ε with respect to a and b. We can obtain this by taking the partial derivatives of ε with respect to a and b.

∂ε(a, b)/∂a

a(2Σ_i x_i²) + b (Σ_i 2x_i) − (Σ_i 2y_i x_i)

∂ε(a, b)/∂b

2kb + a (Σ_i 2x_i) − (Σ_i 2y_i)

b(2k) + a (Σ_i 2x_i) − (Σ_i 2y_i)

The above suggests that we can perform an aggregate computation to obtain a matrix representation of the gradient:

∇ε

2Σ_i x_i²	Σ_i 2x_i	− Σ_i 2y_i x_i
(Σ_i 2x_i)	2k	− (Σ_i 2y_i)

⋅

[link] 4. Statistical Analysis

Given a data set some of the optimization techniques we have studied can be used to find a best-fit model for that data. For example, given a data set of points in ℝ² representing some collection of observations measured along two dimensions, it is possible to find an ordinary least squares "best fit" linear function (using algebra, optimization, gradient descent, and so on). However, just because we have found a "best fit" model for a data set from some given space of functions does not mean that the model is actually a good one for that particular data set. Perhaps the relationship between the two dimensions is not linear, or even completely non-existent. How can we determine if this is the case?

In this section, we will present some statistical methods that can help us decide whether a linear model is actually appropriate for a given two-dimensional data set, and to quantify how appropriate it might be. We will review the mathematical foundations underlying the methods, how they can be applied, and how they can be realized as data transformations in the paradigms we have studied in earlier sections.

[link] 4.1. Review of Facts about Projections from Linear Algebra

We review a few useful facts about vectors in ℝⁿ from linear algebra. These provide a mathematical foundation for defining the statistical constructs discussed in this section.

[link]

Fact: For a vector x ∈ ℝⁿ and a scalar s ∈ ℝ, we can scale x by computing s ⋅ x = (s ⋅ x₁, ..., s ⋅ x_n).

[link]

Fact: For a vector x ∈ ℝⁿ the length of x is ||x|| = √(x₁² + ... + x_n²). We can also obtain a normalized unit vector of length one that points in the same direction as x by computing x/||x||.

[link]

Fact: Given two vectors x ∈ ℝⁿ and y ∈ ℝⁿ, the dot product of the two vector is x ⋅ y = x₁ ⋅ y₁ + ... + x_n ⋅ y_n. We can compute the projection of x onto y by computing x ⋅ (y/||y||).

[link]

Example: Given two vectors (5,7) ∈ ℝ² and (3,0) ∈ ℝ², the normalized version of (3,0) is (1/3) ⋅ (3,0) = (1,0). The projection of (5,7) onto (3,0) is then (5,7) ⋅ ((1/3) ⋅ (3,0)) = (5,7) ⋅ (1,0) = (5,0).

[link]

Theorem (Cauchy-Schwarz inequality): Given two vectors x ∈ ℝⁿ and y ∈ ℝⁿ, the following inequality holds:

|x ⋅ y| / ||x|| ⋅ ||y||

≤

Another way of writing the above is:

|| (x/||x||) ⋅ (y/||y||) ||

≤

The intuitive interpretation is as follows: if we try to find the projection of a unit vector x/||x|| onto another unit vector y/||y||, the resulting vector has length at most 1.

[link] 4.2. Mean, Standard Deviation, Covariance, and Correlation

In this section, we will assume that we are working with a data set of the form [(x₁, y₁), ..., (x_n, y_n)] that contains n (not necessarily distinct) tuples and where for 1 ≤ i ≤ n we have (x_i, y_i) ∈ ℝ². For the purposes of analysis, we will also often use two projections of this data set as vectors in their own right: (x₁, ...,x_n) ∈ ℝⁿ and (y₁, ...,y_n) ∈ ℝⁿ. We will also use the shorthand notation x = (x₁, ...,x_n) and y = (y₁, ...,y_n).

[link]

Definition (arithmetic mean): We define the arithmetic mean μ(x) of a one-dimensional or one-column data set x ∈ ℝⁿ where x = (x₁, ..., x_n) as:

μ(x)

(1/n) ⋅ (x₁ + ... + x_n)

[link]

Fact: For x ∈ ℝⁿ, μ(x) is the quantity that minimizes the following objective function:

μ(x)

argmin_{a ∈ ℝ} (Σ_i=1ⁿ (x_i - a)²)

To prove the above, let's suppose that we choose some arbitrary a in the formula, introduce the true mean m, and proceed from there:

Σ_i=1ⁿ (x_i - a)²

Σ_i=1ⁿ ((x_i − m) + (m − a))²

Σ_i=1ⁿ (x_i − m)² + 2 ⋅ (Σ_i=1ⁿ (x_i − m) ⋅ (m − a)) + Σ_i=1ⁿ (m − a)²

(Σ_i=1ⁿ (x_i − m)² ) + n ⋅ (m − a)²

Notice that setting a = m minimizes the above because the second term becomes 0.

Suppose we use μ(x) to create a vector with n entries (μ(x), μ(x), ..., μ(x)) in which every entry is μ(x). We can call this vector x_μ. Then we can say that this is the vector that minimizes the following objective function:

x_μ

argmin_{v is on the diagonal in ℝⁿ} ||x - v||

Another way of thinking about it is that x_μ as defined above is the projection of the vector x onto the diagonal line in ℝⁿ. This meas that x_μ is also the point on the diagonal that is closest to x.

[link]

Example: Suppose that our data consists of only two points in one dimension: x₁ = 10 and x₂ = 2. We can compute:

μ(x)

(1/2) ⋅ (10 + 2)

x_μ

(6, 6)

Notice that (6, 6) is on the diagonal in ℝ². Notice also that the vector x − x_μ = (10, 2) - (6, 6) = (4, -4) has a slope of − 1 (it is orthogonal to the diagonal). This is consistent with x_μ being a projection of x onto the diagonal.

[link]

Definition (standard deviation): We define the standard deviation σ(x) of a one-dimensional or one-column data set x ∈ ℝⁿ where x = (x₁, ..., x_n) as:

σ(x)

√( (1/n) ⋅ ((x₁ − μ(x))² + ... + (x_n − μ(x))²) )

[link]

Fact: For x ∈ ℝⁿ, σ(x) can be rewritten in terms of the norm (i.e., length) ||x − μ(x)|| of the vector x − μ(x):

σ(x)

(1/√(n)) ⋅ √( ((x₁ − μ(x))² + ... + (x_n − μ(x))²) )

(1/√(n)) ⋅ ||x − x_μ||

In other words, we can view the standard deviation as a normalized distance between a vector x and the diagonal in ℝⁿ.

[link]

Definition (covariance): For x ∈ ℝⁿ and y ∈ ℝⁿ, we define the covariance between x and y as:

cov(x, y)

(1/n) ⋅ ((x₁ - μ(x)) ⋅ (y₁ - μ(y)) + ... + (x_n - μ(x)) ⋅ (y_n - μ(y)))

(1/n) ⋅ (Σ_i=1ⁿ (x_i - μ(x)) ⋅ (y_i - μ(y)))

(1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)

Notice that the last line in the above indicates that we can view the covariance as the dot product of the vectors (x − x_μ) and (y − y_μ) that represent how x and y deviate from their respective means.

[link]

Definition (correlation coefficient): For x ∈ ℝⁿ and y ∈ ℝⁿ, we define the correlation coefficient (also known as the Pearson product-moment correlation coefficient) between x and y as:

ρ(x, y)

cov(x, y) / (σ(x) ⋅ σ(y))

[link]

Fact: For x ∈ ℝⁿ and y ∈ ℝⁿ, it must be that -1 ≤ ρ(x, y) ≤ 1. We can see this by noticing that the formula for ρ(x, y) appears in the Cauchy-Schwarz inequality:

ρ(x, y)

cov(x, y) / (σ(x) ⋅ σ(y))

((1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)) / ((1/√(n)) ⋅ ||x − x_μ|| ⋅ (1/√(n)) ⋅ ||y − y_μ||)

((1/n) ⋅ (x − x_μ) ⋅ (y − y_μ)) / ((1/n) ⋅ ||x − x_μ|| ⋅ ||y − y_μ||)

((x − x_μ) ⋅ (y − y_μ)) / (||x − x_μ|| ⋅ ||y − y_μ||)

Notice that the only difference between the above and the Cauchy-Schwarz inequality is that in the above, the numerator is not guaranteed to be positive. Thus, we have:

−1

≤

cov(x, y) / (σ(x) ⋅ σ(y))

≤

In addition to being bounded by −1 and 1, the correlation coefficient has some other useful properties. If x and y have a positive linear correlation, then ρ(x, y) is positive. If they have a negative linear correlation, it is negative. Given two pairs of data sets x, y and x', y', if |ρ(x, y)| > |ρ(x', y')|, then x and y have a stronger correlation with each other than do x' and y'.

While the correlation coefficient is useful for comparing correlations, the quantity itself should not be used to infer anything about the extent of the correlation. To do that with some rigor, it is necessary to introduce a few additional concepts.

[link]

Example: Suppose that our data consists of only a pair of two-dimensional tuples: x = (2, 6) and y = (20, 60). In other words, the data set is [(2,20), (6,60)]. Notice that the slopes of the orthogonal projections of x and of y onto the diagonal are the same, and if we normalize these projections, they become the same vector. That means (x − x_μ) / ||x − x_μ|| = (y − y_μ) / ||y − y_μ||. But the dot product of a vector with itself is the square of its length: v ⋅ v = ||v||². Thus, we have:

cov(x, y) / (σ(x) ⋅ σ(y))

(x − x_μ) ⋅ (y − y_μ) / (||x − x_μ|| ⋅ ||y − y_μ||)

Thus, because the two dimensions have a positive linear correlation in this example, the correlation coefficient is 1.

[link] 4.3. Observations, Hypothesis Testing, and Significance

The correlation coefficient gives us a way to quantify that there may exist a correlation between two dimensions in a data set. However, this correlation may be an artifact of the way we happened to obtain the data. For example, we may have obtained a biased sample. We can account for this by modeling the likelihood of obtaining a biased sample given the assumption that no correlation exists. If it is very unlikely that we would obtain a sample with the observed correlation if no correlation actually exists, then we might conclude that it is likely that a correlation does exist.

To take this approach, we need to compare the correlation we observe against something (i.e., to decide how likely or unlikely it is if there is no correlation). This means we need to define mathematically what it means for there to be no correlation.

[link]

Definition (distribution): A distribution over a set S is a function f: S → ℝ. If S represents a set of possible observations of a system (e.g., a collection of system states), then a distribution f might represent the probability that the system is in that state, or the frequency with which that system state has been or can be observed.

[link]

Definition (p-value): Let a subset of the real number line S ⊂ ℝ represent all possible system states that can be observed, and let f be a distribution over S that represents the probability of observing each of the possible states in S. Given some state s ∈ S, the probability of observing s is then f(s). We define the p-value at the observed state s to be the probability of observing any state that is equal to or more extreme than the state s.

One way to define which states are more extreme is to consider all the states from s to the nearest endpoint. Suppose S is an interval [a, b]. If s > (a − b) / 2, then the p-value would be computed by taking the definite integral of f from s to b. Otherwise, it would be computed by taking the definite integral of f from s to a.

In order to use the p-value concept to reason about correlation coefficients, it is necessary to model how the correlation coefficient should behave if no correlation exists. One way to do this using the data alone is to consider every permutation of the data.

[link]

Example: Assume we have a two-dimensional data set [(x₁, y₁), ..., (x_n, y_n)], also represented by x ∈ ℝⁿ and y ∈ ℝⁿ. One way to obtain a range of possible correlation coefficients is to compute the correlation coefficient for all n! possible permutations of the second dimension. Intuitively, if the data has no correlation, then performing this permutation should not change the correlation coefficient much. On the other hand, if the data does have a correlation, we should get a distribution of correlation coefficients with respect to which the true correlation coefficient is "extreme".


from random import shuffle
from math import sqrt

data = [(18, 28), (24, 18), (27, 31), (14, 15), (46, 23),
        (36, 19), (27, 10), (34, 25), (19, 15), (13, 13),
        (4, 2), (17, 20), (28, 12), (36, 11), (26, 14),
        (19, 19), (24, 13), (25, 6), (20, 8), (17, 22),
        (18, 8), (25, 12), (28, 27), (31, 28), (35, 22),
        (17, 8), (19, 19), (23, 23), (22, 11)]
x = [xi for (xi, yi) in data]
y = [yi for (xi, yi) in data]

def permute(x):
    shuffled = [xi for xi in x]
    shuffle(shuffled)
    return shuffled

def avg(x): # Average
    return sum(x)/len(x)

def stddev(x): # Standard deviation.
    m = avg(x)
    return sqrt(sum([(xi-m)**2 for xi in x])/len(x))

def cov(x, y): # Covariance.
    return sum([(xi-avg(x))*(yi-avg(y)) for (xi,yi) in zip(x,y)])/len(x)

def corr(x, y): # Correlation coefficient.
    if stddev(x)*stddev(y) != 0:
        return cov(x, y)/(stddev(x)*stddev(y))

def p(x, y):
    c0 = corr(x, y)
    corrs = []
    for k in range(0, 2000):
        y_permuted = permute(y)
        corrs.append(corr(x, y_permuted))
    return len([c for c in corrs if abs(c) > c0])/len(corrs)

In the above, we compute the correlation coefficient for a large number of permuted data sets. We then compute the fraction of these data sets that have a correlation coefficient with a higher absolute value than the correlation coefficient of the original data. We can call this fraction the p-value. Ideally, we would try all possible permutations; since this is infeasible, we can use this technique to get an approximation.

Alternatively, we can use a library to compute the same information. The scipy.stats.pearsonr() function will return the correlation coefficient and the p-value.


import scipy.stats
print(scipy.stats.pearsonr(x, y))

Another approach is to mathematically derive a distribution of correlation coefficients based on some assumptions about how the observation vectors x and y are distributed if they are independent of one another (i.e., not correlated) and each follow some distribution (e.g., the normal distribution). Then, computing a definite integral of this distribution from some specific observation to the nearest boundary would yield the p-value for that observation. Taking this approach usually requires parameterizing the distribution using the size of the data set (i.e., the number of samples) because this can influence how the distribution of correlation coefficients should be computed.

[link]

Example: Suppose that you have a data set of n observations in which each tuple is of the form (w_i, x_i, y_i, z_i), where w = (w₁, ..., w_n), x = (x₁, ..., x_n), y = (y₁, ..., y_n) and z = (z₁, ..., z_n). You compute some covariance and correlation coefficient quantities and find the following relationships:

cov(x, y)

cov(w, z)

ρ(y, z)

ρ(w, x)

Can you conclude that the correlation between x and y is stronger than the correlation between w and z?
Can you conclude that the correlation between y and z is stronger than the correlation between w and x?
Suppose that we change the last inequality to ρ(y, z) < 0. Does this change your answers to any of the above?

[link]

Example: Suppose you have a data set D consisting of tuples of the form (x_i, y_i) and you want to define the necessary transformations using the MapReduce paradigm to compute the covariance for this data set. Provide the definitions of the map and reduce operations necessary to accomplish this.

[link]

Example: Suppose you have a data set D consisting of tuples of the form (x_i, y_i) and you want to compute the correlation coefficient for this data set with a transformation sequence that uses building blocks from the relational paradigm. The flow diagram for this sequence of transformations is provided below.

table([ [['dr:prod`r:proj``(xi,yi)'], ['r:agg plus``(xi,yi,1)'], ['d:proj``(sx,sy,n)']], [['d:agg plus``(((xi-μ(x))^2)/n, ((y-μ(y))^2)/n)'], ['rd:proj`l:proj``(xi,yi,μ(x), μ(y), n)'], ['l:prod``(μ(x) = sx/n, μ(y) = sy/n, n)']], [['d:proj``(σ(x)^2, σ(y)^2)'], ['d:prod``(cov(x, y)'], ['l:agg plus``(1/n)*(x-μ(x))*(y-μ(y))']], [['r:prod``(σ(x), σ(y))'], ['r:proj``(cov(x,y), σ(x), σ(y))'], ['(ρ(x, y))']] ])

[link] 4.4. Sampling and Inference

Often, data sets are partial observations of the overall system. Despite this, we may still be able to infer things about the system's properties from those partial observations. In this subsection we go over just a few of the many methods available for making inferences based on data obtained via sampling.

[link]

Definition (sampling): Given some data set S (which may not be known in its entirely and may only exist in theory, e.g., the data set of all humans, or the data set of all Earth-size planets in the galaxy), we call any subset T ⊂ S a sample of S.

Typically, the data set being sampled will contain records or tuples, each corresponding to an item or an individual. Each tuple provides some information about the properties or measurements of that item or individual along certain dimensions (e.g., size, location, and so on).

[link]

Definition (probability sampling): A probability sample of a data set S is any sample T ⊂ S that is collected under the condition that for every tuple, the probabiility of selecting that tuple is known in advance.

[link]

Fact: Assume a data set S has information that can be used to derive two relevant binary characteristics describing the individuals or items in that data set (e.g., whether an individual is male or female, whether an individual resides in a particular neighborhood, and so on). We can define two (possibly overlapping) subsets A ⊂ S and B ⊂ S corresponding to those subsets that have each of those characteristics.

If the two characteristics are independent we can conclude that:

|A ∩ B|/|B|

|A ∩ S|/|S|

|A|/|S|

In other words, if we treat the tuples from B as a sample of S and we find the fraction of those tuples that are also in A, we can conclude that the same would have occurred if we had sampled the entirety of S.

[link]

Example: Assume we know that within a certain geographical region, one out of every 10 people is a member of an online social networking mobile application that collects the geographical locations of its users. Suppose we take a sample of how many unique users were present in a particular neighborhood within that region over the course of a week, and we get a total of 1200 users. Assuming that there is no correlation between that particular neighborhood and the popularity of that mobile application (i.e., residency in that neighborhood and membership in the social network are independent variables), we can then infer the total population in that neighborhood by solving the following equation:

1/10

1200/n

The above yields n = 12,000, so we can conclude that there are about 12,000 residents in that particular neighborhood based on our observations.

In the above example, we could have also worked in the other direction if we only knew the overall population in the region, the population of the neighborhood, and the popularity of the application within the population in that neighborhood. This would allow us to derive the total number of users of the application in the region.

[link]

Fact (capture-recapture): Suppose that we can use probability sampling of a data set of size N, we know that the probability of selecting any individual tuple is equivalent to 1/N, and we do not know the size N of the data set. In addition, suppose we are allowed to "mark" any tuple that appears in a sample we take so that we can identify it (with complete certainty) as a previously marked tuple if we see it in subsequent samples.

Under these conditions, we can estimate the size of the entire data set (an unknown quantity) in the following way. First, we take a sample of size n of the data set, and we mark every tuple in that sample. Next, we take another (indepedent) sample of size K and count the number of marked tuples in that sample (suppose it is k ≤ K). We can then solve the following equation for N:

k/K

n/N

(n ⋅ K) / k

[link]

Example: Suppose that we want to know how many distinct people regularly visit a particular grocery store, and we can identify visitors in some way (e.g., by recording their unique identifiers). We can fix a particular time period and collect the identifiers of all the people who visit that store (suppose we collect n of them). We can then fix another time period, again record the identifiers of all K visitors during that time period, and then determine what fraction k/K of the identifiers from the first collection appears in the second collection. We could then estimate that the total number of individuals who visit the store regularly is around (n ⋅ K) / k.

[link]

Example: You are doing a probability sample of a population such that the probability of selecting every individual is the same. You take a sample of 1000 individuals and find that 200 of them are members of a particular social network. Answer the following questions.

Assuming that membership in the social network is completely independent of age, and that there are 20,000 individuals who are 18-24 years old in the overall population, provide one possible estimate for the number of individuals in the 18-24 age range who are members of the social network.
Suppose you determine that the social network has 10,000 unique members. What is the size of the population from which your sample was taken?

[link]

Fact (Algorithm R and reservoir sampling): Using Algorithm R, it is possible to take a probability sample of size k from an incoming sequence of data set entries (e.g., tuples) without knowing the length of the sequence in advance. Furthermore, it is guaranteed that every item in the sequence had an equal probability of being added to the sample.

[link]

Example: Assume we have enough storage in our database to store n + k tuples, and it currently holds a data set X that has n tuples. However, we expect that tomorrow, we will obtain a new data set Y that also has n tuples, so we cannot store it unless we delete X completely. We need to take a probability sample of size k from the union X ∪ Y such that any item has an equal likelihood of being added to the sample. Is this possible, and if so, how would you do it?

[link]

Example: You were using Algorithm R to perform reservoir sampling of packets as they were arriving, with a sample size of 100. You were also recording the number of packets there were corrupted, but you did not record the total number of packets that arrived.

Later on, someone asks you to compute the total number of packets that had arrived during the sampling period. You look at your sample and you find that out of 100 packets, 7 were corrupted. You also recorded that a total of 1400 packets had been corrupted during the sample collection period. How many packets arrived during the sample collection period?

[link] 4.5. Project #2: Modeling, Optimization, Statistical Analysis, and Visualization

The purpose of this project is to practice using the modeling, optimization, and statistical analysis tools we have covered to answer questions or solve problems using the data sets obtained by all the teams in Project #1. For the purposes of this project description, we assume a team consisting of two members, Alice and Bob, with the team identifier alice_bob in accordance with Project #0. Teams consisting of up to four people are permitted for this project.

The GitHub repository for Project #1, consisting of all the teams' submissions (merged via merge requests at the time of submission) has been renamed to https://github.com/Data-Mechanics/course-2016-spr-proj-two.
- If your Project #1 team is not going to change, you should be able to continue working on your project on your own fork. Pull all the updates from the project repository to sync your fork with it (note that this will now include all the other teams' projects and scripts from Project #1).
- If you are creating a new or larger team, you should either create a new fork of the project repository, or rename the folder in your working fork to accurately represent the members of your new team. It is up to you how you combine your individual scripts and whether you wish to include or discard any redundant scripts or configuration files, but your project structure should be consistent with that of a single team and single team project.
As before, the course staff may commit and push additional fixes or updates to the repository before the project deadline, so check for updates and pull them into your fork on a regular basis. If you are utilizing another team's data or scripts, or are allowing another team to utilize your data or scripts, do not duplicate them in your own file tree. Instead, reference them in their original location (using the other team's collection prefix identifier and/or folder path). If another team would like to use an updated version of your data sets and/or scripts, perform a pull request so that the course staff can merge your latest code in the project repository. At that point, the other team can pull that update from the project repository.
Choose a problem or family of problems that you wish to solve using some collection of relevant data sets. These data sets could already be available in the repository (thanks to a Project #1 submission by your team or another team), or they may be additional data sets you plan to obtain with new retrieval scripts as you did in Project #1. The problem itself could involve solving a graph problem, performing a statistical analysis or analyses, solving a constraint satisfaction or optimization problem, developing a scoring mechanism, or achieving some other quantifiable goal for which at least some of the techniques we have learned so far might be useful. It could be the original problem you intended to solve in Project #1, or something new.
1. Write a short narrative and justification (at least 7-10 sentences) explaining how the data sets and techniques you hope to employ can address the problem you chose. Depending on what techniques you are using and what problem you are trying to solve, you may need to provide a justification or evaluation methodology for the techniques you are using (i.e., why you believe the chosen techniques will solve the problem you are trying to address). Include this narrative in the README.md file within your directory (you may only need to update your existing file).
2. Implement at least two non-trivial constraint satisfaction, optimization, statistical analysis, inference, or other algorithms that use the data sets and address the problem or problems described in part (a). You can choose one of the existing algorithms or tools we have considered so far, or something else (assuming you thoroughly document in your README.md file how to obtain and run those tools). If you anticipate that an algorithm or technique you employ could be applied to much larger data sets, you should try to implement it as a transformation in the relational model or the MapReduce paradigm. In most cases, the results will consist of new data sets; these should be inserted into the repository along with all the others in the usual way (and should be considered derived data sets).
3. Implement at least two interactive information visualizations of two or more of your data sets. These could be the data sets representing your final results, or some intermediate data sets; a single visualization can incorporate multiple data sets. The visualizations could be plots, histograms, heat maps, graphs, trees, maps, or something else. Ideally, it should be possible to view and interact with these visualization in a standard browser running on a laptop.
1. Ensure that whatever scripts or algorithms you implemented in Problem #2 also store the provenance information documenting the activities that take place and the data they generate during their operation. Each script should generate a single document once it finishes running, and should insert that document into the repository using record().
2. Create or update the plan.json file in your project directory that encodes the overall plan describing what all the scripts do. It should not have explicit time stamps, but it should be a provenance document conforming to the PROV standard that is the union of the all the PROV documents that the individual scripts generate. As before, the information in this file, together with the scripts, should be sufficient to reproduce the data obtained and generated by the scripts.

[link] References

[1]	"World's population increasingly urban with more than half living in urban areas". https://www.un.org/development/desa/en/news/population/world-urbanization-prospects.html
[2]	Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. West. "Growth, innovation, scaling, and the pace of life in cities". Proceedings of the National Academy of Sciences of the United States of America 2007;104(17):7301-7306. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852329/
[3]	Robert Albright, and Alan Demers, Johannes Gehrke, Nitin Gupta, Hooyeon Lee, Rick Keilty, Gregory Sadowski, Ben Sowell, and Walker White. "SGL: A Scalable Language for Data-driven Games". Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data 2008. http://www.cs.cornell.edu/~sowell/2008-sigmod-games-demo.pdf
[4]	Walker White, Benjamin Sowell, Johannes Gehrke, and Alan Demers. "Declarative Processing for Computer Games". Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games 2008. http://www.cs.cornell.edu/~sowell/2008-sandbox-declarative.pdf
[5]	W3C Working Group. "PROV Model Primer". https://www.w3.org/TR/prov-primer/
[6]	Robert Ikeda and Jennifer Widom. "Data Lineage: A Survey". http://ilpubs.stanford.edu:8090/918/1/lin_final.pdf
[7]	Y. Cui and J. Widom. "Lineage Tracing for General Data Warehouse Transformations". The VLDB Journal 2003;12(1):41-58. http://ilpubs.stanford.edu:8090/525/1/2001-5.pdf
[8]	Gerd Gigerenzer. "Mindless statistics". The Journal of Socio-Economics 2004;33(5):587-606. http://www.unh.edu/halelab/BIOL933/papers/2004_Gigerenzer_JSE.pdf
[9]	Nihar B. Shah and Dengyong Zhou. "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing". CoRR 2014. http://www.eecs.berkeley.edu/~nihar/publications/double_or_nothing.pdf

[link] Appendix A. Using GitHub

In this course, staff will use GitHub to distribute skeleton and evaluation code for assigned projects, and students will maintain their code, collaborate on their project, and submit their solution or implementation using GitHub. Students will need to register on GitHub. Students should be able to sign up for a student discount, as well, using their Boston University credentials.

[link] A.1. Retrieving and Submitting Assigned Projects on GitHub

We assume that students are organized into groups (initially up to two students working together, though this limit will be raised to three for the later projects). Each group should designate a primary member. The workflow for student projects will be as follows.

For each project assigned during the course, a private repository will be created by course staff under the Data-Mechanics organization account with a name of the form course-YYYY-SSS-proj-NNNN where YYYY is the year, SSSS is the semester (spr or fal) and NNNN is zero, one, two, and so on.
- The first project (course-YYYY-SSS-proj-zero) will be a public repository so that those who are not members of the Data-Mechanics organozation can see and fork it in the steps below, as that will be how the course staff obtain everyone's GitHub usernames.
To begin working on that particular project, one of the group's members will fork that project repository, and will ensure that the other group members have the necessary privileges to contribute to it by adding them as collaborators.
- Within the forked repository's file tree, each group will create a subdirectory of the form BUlogin1, BUlogin1_BUlogin2, or BUlogin1_BUlogin2_BUlogin3 (depending on the number of members), where the login names BUloginN are the official Boston University login names for the members, and they are ordered in ascending alphabetical order and separated by underscores. All changes constituting work on the project should be made within this subdirectory and nowhere else, unless otherwise specified in the posted project instructions.
- It will be the responsibility of the group to ensure that their repository can be synced with the original project repository (this step only needs to be performed once).
- The group must also regularly sync their fork with the original repository in case course staff have made any changes to it. Course staff will usually announce when such changes take place.
The group members may use as many branches as they need in their own fork, but their master branch will represent the group's submission. At some point before the project deadline, the group must submit a pull request to official submit their work. Only the changes that were committed before the merge request is made will be accepted as submitted.

[link] Appendix B. Other Resources

[link] B.1. MongoDB and Related Resources

In this course, staff will use MongoDB to store and work with large data sets. This section lists some resources that you may find useful.

Some references, tutorials, and other materials of interest mentioned during lecture include the following.

The official documentation (https://docs.mongodb.org/manual/), and in particular:
- how to assemble relational queries (https://docs.mongodb.org/manual/core/aggregation-pipeline/) and
- how to assemble map-reduce queries (https://docs.mongodb.org/manual/core/map-reduce/).
Issues and feature requests mentioned during lecture include:
- flattening the results generated by mapReduce operations: https://jira.mongodb.org/browse/SERVER-2517 and
- deprecation and future removal of server-side eval() and stored procedure capabilities: https://jira.mongodb.org/browse/SERVER-17453.
A tutorial suggested by a student: http://openmymind.net/mongly/.
PyMongo Python library documentation: http://api.mongodb.org/python/current/.
Some resources dealing with using and building geospatial data set indices include:
- the GeoJSON format: http://geojson.org/geojson-spec.html,
- using geospatial indices and queries in MongoDB: https://docs.mongodb.org/manual/applications/geospatial-indexes/,
- example illustrating how to create a GEO2D index using PyMongo: http://api.mongodb.org/python/current/examples/geo.html, and
- extracts of OpenStreetMap data for metropolitan areas: https://mapzen.com/data/metro-extracts/.

[link] B.2. Installation Resources for Other Software Packages and Libraries

This subsection contains links to other useful resources that may help with installing packages and libraries we use in this course.

Some resources that may be useful when working with the prov package:

when installing the prov library for Python on Windows, you may have some trouble installing the lxml package. Try obtaining the precompiled package here, instead, and installing it using pip install *.whl;
a short tutorial: http://nbviewer.jupyter.org/github/trungdong/notebooks/blob/master/PROV%20Tutorial.ipynb.

As we use the JSON format for storing authentication, configuration, and provenance information, the JSON Schema is useful and there is a corresponding jsonschema Python package that you will need to install.

Two useful kinds of tools for solving problems that can be represented as systems of equations and inequalities are SMT solvers and linear optimization packages.

The SciPy library collection includes the optimize.linprog module for solving linear optimization problems, and the library is relatively straightforward to install:
- using Windows, it may be easiest to obtain and install precompiled binaries for numpy and scipy (in that order);
- using Mac OS X or Linux, you will likely need to download and install a Fortran compiler first, after which running pip install numpy and pip install scipyshould be sufficient.
The Z3 SMT solver is now open source and very easy to use with Python; precompiled binaries are also available.