As easy as ABC

Let's Start

Here is a quick example, run in a Jupyter notebook, to give you a flavor of the project, using scikit-learn and the famous “digits” dataset.

Step 1: Install the library:

pip install dataworkspaces

Step 2: Create a workspace:

mkdir quickstart
cd ./quickstart
dws init --create-resources code,results

This created our workspace (which is a git repository under the covers) and initialized it with two subdirectories, one for the source code, and one for the results. These are special subdirectories, in that they are resources which can be tracked and versioned independently.

Step 3: Add our source data to the workspace. The data resides in an external, third-party Git repository and is simple to add:

git clone https://github.com/jfischer/sklearn-digits-dataset.git
dws add git --role=source-data --read-only ./sklearn-digits-dataset

The first line (git clone …) makes a local copy of the Git repository for the Digits dataset. The second line (dws add git …) adds the repository to the workspace as a resource to be tracked as part of our project. The --role option tells Data Workspaces how we will use the resource (as source data), and the --read-only option indicates that we should treat the repository as read-only and never try to push it to its origin [2] (as you do not have write permissions to the origin copy of this repository).

[2] In Git, each remote copy of a repository is assigned a name. By convention, the origin is the copy from which the local copy was cloned.

We can see the list of resources in our workspace via the command dws report status:

$ dws report status
Status for workspace: quickstart
Resources for workspace: quickstart
| Resource               | Role        | Type             | Parameters                                                                |
|________________________|_____________|__________________|___________________________________________________________________________|
| sklearn-digits-dataset | source-data | git              | remote_origin_url=https://github.com/jfischer/sklearn-digits-dataset.git, |
|                        |             |                  | relative_local_path=sklearn-digits-dataset,                               |
|                        |             |                  | branch=master,                                                            |
|                        |             |                  | read_only=True                                                            |
| code                   | code        | git-subdirectory | relative_path=code                                                        |
| results                | results     | git-subdirectory | relative_path=results                                                     |
No resources for the following roles: intermediate-data.


Step 4: Create a Jupyter notebook for running our experiments:

cd ./code
jupyter notebook

This will bring up the Jupyter app in your browser. Click on the New dropdown (on the right side) and select “Python 3”. Once in the notebook, click on the current title (“Untitled”, at the top, next to “Jupyter”) and change the title to digits-svc.

Step 5: Type the following Python code in the first cell:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from dataworkspaces.kits.scikit_learn import LineagePredictor, load_dataset_from_resource

# load the data from filesystem into a "Bunch"
dataset = load_dataset_from_resource('sklearn-digits-dataset')

# Instantiate a support vector classifier and wrap it for dws
classifier = LineagePredictor(SVC(gamma=0.001),
                              'multiclass_classification',
                              input_resource=dataset.resource,
                              model_save_file='digits.joblib')

# split the training and test data
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5, shuffle=False)

# train and score the classifier
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

This code is the same as you would write for scikit-learn without dws, except that:

  1. we load the dataset from a resource rather than calling the lower-level NumPy functions (although you can call those if you prefer; see the sketch after this list), and
  2. we wrap the support vector classifier instance with a LineagePredictor.

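For comparison, here is roughly what the same cell looks like without dws, loading the data directly with NumPy. The CSV file names (data.csv and target.csv) are assumptions about the dataset repository's layout, so check the repository for the actual files:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load the digits data straight from the cloned repository; the file
# names here are assumptions, adjust them to the repository's contents.
data = np.loadtxt('../sklearn-digits-dataset/data.csv', delimiter=',')
target = np.loadtxt('../sklearn-digits-dataset/target.csv', delimiter=',')

# A plain SVC with no LineagePredictor wrapper, so no lineage is recorded
classifier = SVC(gamma=0.001)

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.5, shuffle=False)

classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)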

Step 6: Run the cell. It will take a few seconds to train and test the model. You should then see:

Wrote results to results:results.json

0.9688542825361512

Now, you can save and shut down your notebook. If you look at the directory quickstart/results, you should see a saved model file, digits.joblib, and a results file, results.json, with information about your run. We can format and view the results file with the command dws report results:

$ dws report results
Results file at results:/results.json

General Properties
| Key                    | Value                      |
|________________________|____________________________|
| step                   | digits-svc                 |
| start_time             | 2020-01-14T12:54:00.473892 |
| execution_time_seconds | 0.13                       |
| run_description        | None                       |

Parameters
| Key                     | Value |
|_________________________|_______|
| C                       | 1.0   |
| cache_size              | 200   |
| class_weight            | None  |
| coef0                   | 0.0   |
| decision_function_shape | ovr   |
| degree                  | 3     |
| gamma                   | 0.001 |
| kernel                  | rbf   |
| max_iter                | -1    |
| probability             | False |
| random_state            | None  |
| shrinking               | True  |
| tol                     | 0.001 |
| verbose                 | False |

Metrics
| Key      | Value |
|__________|_______|
| accuracy | 0.969 |

Metrics: classification_report
| Key          | Value                                                                                                 |
|______________|_______________________________________________________________________________________________________|
| 0.0          | precision: 1.0, recall: 0.9886363636363636, f1-score: 0.9942857142857142, support: 88                 |
| 1.0          | precision: 0.9887640449438202, recall: 0.967032967032967, f1-score: 0.9777777777777779, support: 91   |
| 2.0          | precision: 0.9883720930232558, recall: 0.9883720930232558, f1-score: 0.9883720930232558, support: 86  |
| 3.0          | precision: 0.9753086419753086, recall: 0.8681318681318682, f1-score: 0.9186046511627908, support: 91  |
| 4.0          | precision: 0.9887640449438202, recall: 0.9565217391304348, f1-score: 0.9723756906077348, support: 92  |
| 5.0          | precision: 0.946236559139785, recall: 0.967032967032967, f1-score: 0.9565217391304348, support: 91    |
| 6.0          | precision: 0.989010989010989, recall: 0.989010989010989, f1-score: 0.989010989010989, support: 91     |
| 7.0          | precision: 0.9565217391304348, recall: 0.9887640449438202, f1-score: 0.9723756906077348, support: 89  |
| 8.0          | precision: 0.9361702127659575, recall: 1.0, f1-score: 0.967032967032967, support: 88                  |
| 9.0          | precision: 0.9278350515463918, recall: 0.9782608695652174, f1-score: 0.9523809523809524, support: 92  |
| micro avg    | precision: 0.9688542825361512, recall: 0.9688542825361512, f1-score: 0.9688542825361512, support: 899 |
| macro avg    | precision: 0.9696983376479764, recall: 0.9691763901507882, f1-score: 0.9688738265020351, support: 899 |
| weighted avg | precision: 0.9696092010839529, recall: 0.9688542825361512, f1-score: 0.9686644837258652, support: 899 |

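Since results.json is just JSON, you can also work with the run's outputs directly. A minimal sketch, to be run in a new cell of the same notebook; the 'metrics'/'accuracy' key names are assumptions inferred from the report above:

import json
from joblib import load

# Read the raw results file; the key names used below are assumptions
# inferred from the formatted report above.
with open('../results/results.json') as f:
    results = json.load(f)
print(results['metrics']['accuracy'])

# Reload the saved model and reuse it; depending on how LineagePredictor
# saves, the loaded object may be the wrapper or the underlying SVC.
classifier = load('../results/digits.joblib')
print(classifier.predict(X_test[:5]))  # reuses the notebook's test split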

Step 7: Let us take a snapshot, which will record the state of the workspace and save the data lineage along with our results:

dws snapshot -m "first run with SVC" SVC-1

SVC-1 is the tag of our snapshot. If you look in quickstart/results, you will see that the results (currently just results.json) have been moved to the subdirectory snapshots/HOSTNAME-SVC-1 (where HOSTNAME is the hostname of your local machine). A file, lineage.json, containing a full data lineage graph for our experiment has also been created in that directory.
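
To see this on disk, you can list the snapshot directory, for example from a Python prompt in the quickstart directory (the layout is as described above):

from pathlib import Path

# Everything the snapshot wrote lives under results/snapshots; expect a
# HOSTNAME-SVC-1/ directory with results.json and lineage.json inside.
for path in sorted(Path('results/snapshots').rglob('*')):
    print(path)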

We can see the history of snapshots with the command dws report history:

$ dws report history

History of snapshots
| Hash    | Tags  | Created             | accuracy | classification_report     | Message            |
|_________|_______|_____________________|__________|___________________________|____________________|
| f1401a8 | SVC-1 | 2020-01-14T13:00:39 |    0.969 | {'0.0': {'precision': 1.. | first run with SVC |
1 snapshots total

We can also see the lineage for this snapshot with the command dws report lineage --snapshot SVC-1:

$ dws report lineage --snapshot SVC-1
Lineage for SVC-1
| Resource               | Type        | Details                                  | Inputs                                 |
|________________________|_____________|__________________________________________|________________________________________|
| results                | Step        | digits-svc at 2020-01-14 12:54:00.473892 | sklearn-digits-dataset (Hash:635b7182) |
| sklearn-digits-dataset | Source Data | Hash:635b7182                            | None                                   |

This report shows us that the results resource was written by the digits-svc step, which had as its input the resource sklearn-digits-dataset. We also know the specific version of this resource (hash 635b7182) and that it is source data, not written by another step.

Some things you can do from here:

  • Run more experiments and save their results by snapshotting the workspace. If, at some point, we want to go back to our first experiment, we can run: dws restore SVC-1. This will restore the state of the source data and code subdirectories, but leave the full history of the results.
  • Upload your workspace to GitHub or any other Git hosting service, either as a backup or to share with others. Others can download it via dws clone.
  • More complex scenarios involving multi-step data pipelines can easily be automated. See the documentation for details.
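
As a taste of that last point, here is a rough sketch of a single pipeline step instrumented with the library's Lineage API. The LineageBuilder methods shown (as_script_step, with_parameters, with_input_path) and write_results are assumptions based on the project's Lineage API documentation, so verify them against the current docs:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from dataworkspaces.kits.scikit_learn import load_dataset_from_resource
from dataworkspaces.lineage import LineageBuilder

dataset = load_dataset_from_resource('sklearn-digits-dataset')
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5, shuffle=False)

# Declare this script as a step, with its parameters and input resource,
# so dws can record it in the lineage graph (method names are assumptions).
builder = (LineageBuilder()
           .as_script_step()
           .with_parameters({'gamma': 0.001})
           .with_input_path('./sklearn-digits-dataset'))

with builder.eval() as lineage:
    classifier = SVC(gamma=0.001)
    classifier.fit(X_train, y_train)
    # Record this step's metrics in the results resource
    lineage.write_results({'accuracy': classifier.score(X_test, y_test)})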