Open Data and Componentization
Jo Walsh and Rufus Pollock
[open knowledge foundation]
What We'd Like to be Able To Do
Sure, but Specifically By
Having Lots of Data
Plugging It Together
Having Lots of Data (in a nice form) is Non-Trivial
What Were Governments Up to Before WWI?
The Cleaned Data
It's Not Always So Simple
So I've now created/parsed a whole bunch of data
Which I can Happily Use
Government Revenue 1880 - 1914 (1900=100)
UK Real Wages 1500-1850
OK, that's great: But What About Reuse
The Many Minds Principle
The Coolest Thing To Do With Your Data Will Be Thought of By Someone Else
Long version: The set of useful things one can do with a given informational resource is always larger than can be done (or even thought of) by one individual or group.
So We Should Make it Available - Upload the Data Somewhere
One Ring to Rule Them All
- Everyone everywhere uploads to some central repository
- Using the same metadata formats
- Using the same data formats
The Revolution Will be Decentralized
Small Pieces, Loosely Joined
Production Should Be Decentralized and Federated
How Do We Make This Happen
How Do We Support Decentralization of Data Creation
Recombination of the Produced Data
Consider the Miracle of 'apt'
Atomization and Packaging
Componentization is the process of atomizing (breaking down) resources into separate reusable packages that can be easily recombined.
- Break Things Down into Suitable Sized Chunks
- Not all of us want all 70TB of CO2 data
- Add an external interface (knowledge API)
- Otherwise we can't put humpty-dumpty back together again
- Allow Divide and Conquer to Work
- Web APIs (but we still want the dump)
- Code itself (distinction between code and data will disappear)
Human Genome Project
- Social value of decentralized collaboration
- Knowledge APIs: Unique identifiers for genes
- Very large amounts of semi-versioned data
- The API: unique identifiers (DOI, standard terms H20)
- Academic papers offer limited level of reuse and recombination
- Not easily atomizable
- Open Street Map: One Ring to Rule Them All?
- What are the Standards which will form the APIs
- How do we scale and decentralize?
So Where Do We Start
We Can't Do All of This Right Now
The APIs (the entry points) Are the Most Difficult Thing
We Can Only Progress By Experimentation
We Can At Least Wrap the Data Up
- In a Form Suitable for Automatable Downloading
- Include the build scripts (how did you get from source to here)
- datapkg ...
datapkg wrap mydata.csv mydata2.csv ...
# edit the metadata
# somewhere else ...
datapkg install econdata
# in my file
data = datapkg.load('econdata', 'mydata1.csv')
Make it Available in a Simple Way
- Put the metadata (and a pointer) in a Registry
The Start of Componentization
- Let's Remember it took Code 30 years