CKAN and Building the Debian of Data

Chaos Communication Congress
December 28th 2009

Rufus Pollock and Daniel Dietrich

[open knowledge foundation]


About the Foundation

Open Knowledge Foundation

Founded 2004 / Not-for-profit

Promoting, Creating and Disseminating Open Knowledge

Genes to Geodata, Stats to Sonnets ...

We're A Community


Organized around a set of projects and working groups

Principles: open knowledge, meritocracy and tolerance

What's Open Mean?

Open Knowledge Definition

Open = Freedom to Access / Use / Re-use / Redistribute

So How's This Relevant?

Openness isn't an End-in-Itself!

1. Why: What We Want
(Really, Really Want)

To Create and Use Information

Whether I'm a citizen deciding how to vote


A researcher working on global warming

or ...

Sure, but Specifically By

Having Lots of Data


Plugging It Together

US Wheat: 1860-Present

[Charts: US wheat data (output, acreage, yield, price)]

Getting Data Often Ain't Easy!

US Unemployment

But the Original Data Ain't So Nice

US Unemployment Raw

So We Clean It ...

Cleaned Data

So I've now created/parsed a whole bunch of data

Which I can Happily Use

US Unemployment Figures: 1940-2006

US Unemployment Figures: 1940-2008

OK, that's great, but:
How do we SCALE?

I want to link this with lots of other data (interest rates, etc.), and I'm only one person

2. Building (Large-Scale) Data Infrastructures

The DBs of Cyberspace

(The Real Vision of Cyberspace)

Larger than Any Single Individual (or Corporation)

How Do We Build Complex Things?


Lots of Labour

How Do We Build Complex Things II?

[Photo: worker on the Empire State Building]

But as things get bigger, they become too much for one mind ...

Divide and Conquer: Componentization

Break data down into chunks (packages) that can be individually managed

Need to Put Humpty-Dumpty Back Together Again

Broken Humpty

Componentization isn't just atomization, it's also about 'packaging' ...

Two Different Models

One Ring to Rule Them All

- One centralized system
- A single set of APIs/formats
- Probably closed
- Most data is currently like this


The Revolution Will be Decentralized

Remember: The Many Minds Principle

The Best Thing To Do With Your Data Will Be Thought of By Someone Else

Small Pieces, Loosely Joined

Production Should Be Decentralized and Federated

Sharing and Separation are Key

Requires OPENNESS!

3. Making It Happen for Data

Consider the Miracle of 'apt'

apt-get in operation

2 Related but Distinct Aspects

APIs + Distribution

Ignore (Knowledge) APIs Here (Hard!)

- Domain Specific (Geodata ain't Genomics)
- Require Coordination
- Hard to Plan in Advance, Progress By Experimentation


Focus on Distribution

- Package: wrap the material up (+ basic metadata)
- In a form suited to automated upload/download
- Register it so it can be found ...
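
The steps listed above (package, automate, register) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataPackage` and `Registry` names, the metadata fields and the in-memory dictionary are all hypothetical, and none of this is CKAN's or datapkg's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class DataPackage:
    """A chunk of data wrapped up with basic metadata (hypothetical structure)."""
    name: str
    title: str = ""
    files: list = field(default_factory=list)


class Registry:
    """In-memory stand-in for a package registry (not the CKAN API)."""

    def __init__(self):
        self._packages = {}

    def register(self, package):
        # Refuse duplicates so each name identifies exactly one package.
        if package.name in self._packages:
            raise ValueError(f"package already registered: {package.name}")
        self._packages[package.name] = package

    def get(self, name):
        return self._packages[name]

    def search(self, term):
        # Case-insensitive substring match on name and title.
        term = term.lower()
        return sorted(
            name for name, pkg in self._packages.items()
            if term in name.lower() or term in pkg.title.lower()
        )


registry = Registry()
registry.register(
    DataPackage(name="example-pkg", title="Example data", files=["data.csv"])
)
print(registry.search("example"))
```

A real registry would sit behind a network service so that many people can add and find packages; the dictionary here only stands in for that lookup.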

The Vision

The Registry:

Freshmeat/CPAN/Gems ... for Open Data

Anyone can add material (760 pkgs + counting)

A CKAN Package

CKAN Helping Power

apt-get: datapkg

A Data Packaging Swiss Army Knife

Getting and Using

    # search for a package
    $ datapkg search ckan:// windhover
    datapkgdemo -- ...

    # Get info
    $ datapkg info ckan://datapkgdemo

    # Install (currently just download + unpack) to the current directory:
    $ datapkg install ckan://datapkgdemo .

Creating and Registering

    # Have some existing data
    $ cd my_data_directory

    # Create a metadata file (metadata.txt) of name/value pairs (like Debian, R, etc.)
    $ vim metadata.txt

    # Register on CKAN
    $ datapkg register . ckan://

    # Check it has registered OK
    $ datapkg info ckan://mynewdatapkg
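
The metadata file is described only as name/value pairs in the style of Debian control files. Assuming one `name: value` pair per line (an assumption; the exact datapkg format is not shown here), such a file could be read with the email parser from the Python standard library, which handles RFC 822 style headers. The `title` and `license` fields and their values are made up for illustration.

```python
from email.parser import Parser

# Hypothetical metadata.txt contents; field names and values are illustrative only.
SAMPLE = """\
name: mynewdatapkg
title: My new data package
license: example-license
"""


def parse_metadata(text):
    """Parse 'name: value' lines into a dict (assumed format, not datapkg's own parser)."""
    message = Parser().parsestr(text, headersonly=True)
    return dict(message.items())


metadata = parse_metadata(SAMPLE)
print(metadata["name"])
```

Reusing an existing header parser keeps the sketch short; a dedicated format would need its own rules for multi-line values and repeated fields.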

i18n + decentralization:

4. Conclusion

The Start of the 'Debian of Data'

'Data' package managers wanted ...

Data and Code are Becoming One

Hack Code, Hack Data


Rufus Pollock and Daniel Dietrich
rufus.pollock / daniel.dietrich



worker_on_empire_state.jpg: PD (Lewis Hine for US Federal Government)