Opening Data

OpenTech 2008
July 5th 2008

Rufus Pollock

[open knowledge foundation]


About the Foundation

OKFN website

Founded 2004 / Not-for-profit / A variety of projects

KnowledgeForge, CKAN, Open Shakespeare, Open Economics ...

What is Open Knowledge?


Content/Data/Information (Genes to Geodata, Stats to Sonnets)

Open = Freedom to Access / Use / Re-use / Redistribute

So How's This Relevant?

After all, Openness isn't an End-in-Itself!

What We'd Like to be Able To Do

Understand and Create

Whether I'm a fund-manager making investments


Or an academic looking to cure cancer

Sure, but Specifically By

Having Lots of Material


Plugging It Together

Getting Material (in a nice form) is Often Non-Trivial

US Unemployment


But the Original Data Ain't So Nice

US Unemployment Raw

So We Clean It ...

Cleaned Data

So I've now created/parsed a whole bunch of data

Which I can Happily Use

US Unemployment Figures: 1940-2006

OK, that's great: But What About Reuse?

Maybe I want to link this with loan defaults, interest rates ...
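
A minimal sketch of what "linking" two cleaned series could look like, assuming each has already been parsed into a mapping from year to value. The function name `join_on_key` and the numbers are hypothetical placeholders, not real figures.

```python
def join_on_key(left, right):
    """Inner-join two {key: value} mappings, keeping only keys present in both."""
    shared = sorted(left.keys() & right.keys())
    return {key: (left[key], right[key]) for key in shared}


if __name__ == "__main__":
    # Placeholder values for illustration only; not real statistics.
    unemployment = {2000: 1.0, 2001: 2.0}
    interest_rates = {2001: 3.0, 2002: 4.0}
    print(join_on_key(unemployment, interest_rates))
```

The join is only this easy once both series share a key and a clean form, which is the point of the cleaning step.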

The Many Minds Principle

The Coolest Thing To Do With Your Material Will Be Thought of By Someone Else

Long version: The set of useful things one can do with a given informational resource is always larger than can be done (or even thought of) by one individual or group.

So We Should Make it Available - Upload the Material Somewhere

One Ring to Rule Them All

- Everyone everywhere uploads to some central repository
- Using the same metadata formats
- Using the same data formats


The Revolution Will be Decentralized

Small Pieces, Loosely Joined

Production Should Be Decentralized and Federated

How Do We Make This Happen?

How Do We Support Decentralization of Creation
and Recombination of the Produced Material?

Consider the Miracle of 'apt'

apt-get in operation


Atomization and Packaging

Componentization is the process of atomizing (breaking down) resources into separate reusable packages that can be easily recombined.

NB: If the Data is OPEN

Putting Humpty-Dumpty Back Together is Much Easier

(And a Clear Openness 'Standard' Matters)

2 Related but Distinct Aspects

Knowledge APIs + (Automated) Discovery and Installation

Ignore Knowledge APIs Here (Hard!)

- Domain Specific
- Require Coordination
- Hard to Plan in Advance, Progress By Experimentation

Automated Discovery and Installation

- Wrap the material up and make it available
- In a form suitable for automatable downloading
- Basic metadata: id, license, etc
- Register so it can be found ...
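
The first three steps above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not how datapkg itself works: the `key: value` layout of `metadata.txt` and the helper names `write_metadata` and `build_package` are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def write_metadata(pkg_dir, fields):
    """Write 'key: value' lines to metadata.txt (this file format is an assumption)."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    path = Path(pkg_dir) / "metadata.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def build_package(pkg_dir, out_dir):
    """Bundle the package directory into <name>.tar.gz and return the archive path."""
    pkg_dir = Path(pkg_dir)
    archive = Path(out_dir) / f"{pkg_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(pkg_dir, arcname=pkg_dir.name)
    return archive


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "mypkg"
        pkg.mkdir()
        # An empty placeholder data file; real data would be copied in here.
        (pkg / "mydata1.csv").write_text("year,value\n", encoding="utf-8")
        write_metadata(pkg, {"id": "mypkg", "license": "(name of an open license)"})
        archive = build_package(pkg, tmp)
        with tarfile.open(archive) as tar:
            print(sorted(tar.getnames()))
```

A single archive with a predictable metadata file inside is what makes the download step automatable; registration is a separate step.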

Automated Installation: datapkg


    datapkg create mypkg
    # go into mypkg and add some data
    cp ~/mydata1.csv ./
    # edit the metadata
    vi metadata.txt
    # tar it up and upload it somewhere 
    scp -r .
    # or register on CKAN
    datapkg register .

Where to Register?


Freshmeat/CPAN/... for Open Data/Knowledge

Getting and Using

    # later and somewhere else ...
    datapkg install mypkg
    # or just
    datapkg install
    datapkg list-installed

    # USING
    # in my code
    import datapkg
    data = datapkg.resource_stream('mypkg', 'mydata1.csv')
    # plot it etc
    # datapkg plot ...
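
Once a package hands back a file-like stream, turning it into usable rows needs only the standard library. The sketch below uses an in-memory `io.StringIO` as a stand-in for that stream, so it runs without datapkg installed. The helper name `read_rows`, the column names, and the single row are hypothetical placeholders. If the real stream is binary, wrapping it in `io.TextIOWrapper` first would be one option.

```python
import csv
import io


def read_rows(stream):
    """Parse a text stream of CSV data into a list of dicts keyed by the header row."""
    return list(csv.DictReader(stream))


if __name__ == "__main__":
    # Stand-in for the stream returned by a package; placeholder data, not real figures.
    stream = io.StringIO("year,value\n2000,1.5\n")
    for row in read_rows(stream):
        print(row["year"], float(row["value"]))
```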

The Start of the 'Debian of Data'


Rufus Pollock

(later today)