CKAN and Building the Debian of Data

Chaos Communication Congress
December 28th 2009

Rufus Pollock and Daniel Dietrich

[open knowledge foundation]

[http://www.okfn.org/]

About the Foundation

Open Knowledge Foundation

Founded 2004 / Not-for-profit

Promoting, Creating and Disseminating Open Knowledge

Genes to Geodata, Stats to Sonnets ...

We're A Community

Structure

Organized around a set of projects and working groups

Principles: open knowledge, meritocracy and tolerance

What Does Open Mean?

Open Knowledge Definition

http://www.opendefinition.org/

Open = Freedom to Access / Use / Re-use / Redistribute

So How's This Relevant?

Openness isn't an End-in-Itself!

1. Why: What We Want
(Really, Really Want)

To Create and Use Information

Whether I'm a citizen deciding how to vote

or

A researcher working on global warming

or ...

Sure, but Specifically By

Having Lots of Data

AND

Plugging It Together

US Wheat: 1860-Present

[Charts: US wheat output, acreage, yield and price]

http://www.openeconomics.net/store/517d7c4e-3cb7-4e8f-aaa1-745dd665ad1f

Getting Data Often Ain't Easy!

US Unemployment

But the Original Data Ain't So Nice

US Unemployment Raw

So We Clean It ...

http://knowledgeforge.net/econ/svn/trunk/data/bls/usa_bls_employment/data.py

Cleaned Data

So I've now created/parsed a whole bunch of data

Which I can Happily Use

US Unemployment Figures: 1940-2006

US Unemployment Figures: 1940-2008

http://www.openeconomics.net/plot/chart/usa_bls_employment

OK, that's great, but:
How do we SCALE?

I want to link this with lots of other data (interest rates, etc.) and I'm only one person

2. Building (Large-Scale) Data Infrastructures

The DBs of Cyberspace

(The Real Vision of Cyberspace)

Larger than Any Single Individual (or Corporation)

How Do We Build Complex Things?

...

Lots of Labour

How Do We Build Complex Things II?

[Image: worker on the Empire State Building]

But as things get bigger, they become too much for one mind ...

Divide and Conquer: Componentization

Break data down into chunks (packages) that can be individually managed

Need to Put Humpty-Dumpty Back Together Again

[Image: broken Humpty Dumpty]

Componentization isn't just atomization, it's also about 'packaging' ...

Two Different Models

One Ring to Rule Them All

- One centralized system
- A single set of APIs/formats
- Probably closed
- Most data is currently like this

NO!

The Revolution Will be Decentralized

Remember: The Many Minds Principle

The Best Thing To Do With Your Data Will Be Thought of By Someone Else

Small Pieces, Loosely Joined

Production Should Be Decentralized and Federated

Sharing and Separation are Key

Requires OPENNESS!

3. Making It Happen for Data

Consider the Miracle of 'apt'

apt-get in operation

2 Related but Distinct Aspects

APIs + Distribution

Ignore (Knowledge) APIs Here (Hard!)

- Domain Specific (Geodata ain't Genomics)
- Require Coordination
- Hard to Plan in Advance, Progress By Experimentation

Distribution

- Package: Wrap the material up (+ basic metadata)
- In a form suited to automated upload/download
- Register so it can be found ...
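The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how CKAN or datapkg work internally: the `name: value` layout of `metadata.txt`, the `.tar.gz` archive format, the field names, and the in-memory `registry` dict standing in for a real registry are all hypothetical choices.

```python
import tarfile
import tempfile
from pathlib import Path


def write_metadata(package_dir, metadata):
    """Write name/value pairs to metadata.txt, one 'name: value' per line (assumed layout)."""
    lines = [f"{name}: {value}" for name, value in metadata.items()]
    (package_dir / "metadata.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


def build_archive(package_dir, out_dir):
    """Wrap the package directory in a single .tar.gz that tools can upload or fetch."""
    archive = out_dir / f"{package_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(package_dir, arcname=package_dir.name)
    return archive


def register(registry, metadata, archive):
    """Record the package so it can be found (a plain dict stands in for a registry)."""
    registry[metadata["name"]] = dict(metadata, download=archive.name)


registry = {}
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    pkg = tmp / "mynewdatapkg"
    pkg.mkdir()
    # A header-only placeholder file; real data would go here.
    (pkg / "data.csv").write_text("year,value\n", encoding="utf-8")
    metadata = {"name": "mynewdatapkg", "title": "My new data package"}

    write_metadata(pkg, metadata)          # 1. Package: material + basic metadata
    archive = build_archive(pkg, tmp)      # 2. A form suited to automation
    register(registry, metadata, archive)  # 3. Register so it can be found

    with tarfile.open(archive) as tar:
        members = tar.getnames()

print(members)
print(registry)
```

The point of keeping the three functions separate is that each step can be swapped out (a different archive format, a different registry) without touching the others.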

The Vision

The Registry: http://www.ckan.net/

Freshmeat/CPAN/Gems ... for Open Data

Anyone can add material (760 pkgs and counting)

A CKAN Package

CKAN Helping Power data.gov.uk

apt-get: datapkg

http://www.okfn.org/datapkg/

A Data Packaging Swiss Army Knife

Getting and Using


    # Search for a package on CKAN.net
    $ datapkg search ckan:// windhover
    ...
    datapkgdemo -- ...
    ...

    # Get info
    $ datapkg info ckan://datapkgdemo

    # Install (currently just download + unpack) to the current directory:
    $ datapkg install ckan://datapkgdemo .
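
The install step is described as download plus unpack. A minimal Python sketch of that idea follows; it is not datapkg's actual implementation. The `install` function, the `.tar.gz` format, and the use of a local `file://` URL in place of a real registry download are assumptions made for illustration.

```python
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def install(url, target_dir):
    """Download the archive at `url` and unpack it under `target_dir`."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # urlretrieve returns a local path to the fetched archive.
    archive_path, _headers = urllib.request.urlretrieve(url)
    # Only unpack archives you trust; recent Python versions also offer
    # extraction filters for untrusted input.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    return sorted(p.relative_to(target_dir).as_posix() for p in target_dir.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a tiny stand-in archive so the example runs without network access.
    src = tmp / "src" / "datapkgdemo"
    src.mkdir(parents=True)
    (src / "metadata.txt").write_text("name: datapkgdemo\n", encoding="utf-8")
    archive = tmp / "datapkgdemo.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="datapkgdemo")

    installed = install(archive.as_uri(), tmp / "installed")

print(installed)
```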
  
  

Creating and Registering


    # Have some existing data
    $ cd my_data_directory

    # Create a metadata file (metadata.txt) of name/value pairs (like Debian, R, etc.)
    $ vim metadata.txt

    # Register on CKAN
    $ datapkg register . ckan://

    # Check it has registered OK
    $ datapkg info ckan://mynewdatapkg
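
The slides describe metadata.txt only as name/value pairs in the style of Debian or R, so the exact syntax is not given here. Assuming one `name: value` pair per line, a file like that could be read and sanity-checked before registering along these lines. The field names (`name`, `title`, `url`), the `#` comment handling, and the sample values are hypothetical.

```python
SAMPLE = """\
# Hypothetical metadata.txt
name: mynewdatapkg
title: My New Data Package
url: http://www.example.org/mynewdatapkg
"""


def parse_metadata(text):
    """Parse 'name: value' lines into a dict, skipping blanks and '#' comments."""
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values such as URLs stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a name/value line: {line!r}")
        metadata[name.strip().lower()] = value.strip()
    return metadata


def check_required(metadata, required=("name",)):
    """Return the required fields that are missing or empty."""
    return [field for field in required if not metadata.get(field)]


metadata = parse_metadata(SAMPLE)
missing = check_required(metadata)
print(metadata)
print(missing)
```

Splitting on the first colon only is the one design choice that matters here: it keeps values that themselves contain colons, such as URLs, in one piece.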
  
  

i18n + decentralization: http://de.ckan.net/

http://wiki.okfn.org/de/

4. Conclusion

The Start of the 'Debian of Data'

'Data' package managers wanted ...

Data and Code are Becoming One

Hack Code, Hack Data

Thank You

Rufus Pollock and Daniel Dietrich
rufus.pollock / daniel.dietrich @okfn.org

http://www.okfn.org/
http://www.ckan.net/
http://www.okfn.org/datapkg/

Credits

Images

giza_pyramid.jpg: http://www.flickr.com/photos/cornelluniversitylibrary/3672461369/
cyberspace.jpg: thevirtualism.com/history_project/virtual.html
worker_on_empire_state.jpg: PD (Lewis Hine for US Federal Government)
tolkein_ring.jpg: http://en.wikipedia.org/wiki/File:Ringinscription.jpg
lego.jpg: http://www.flickr.com/photos/oskay/265899811/
humpty_dumpty.jpg: http://www.flickr.com/photos/aussiegall/298669543/