Storing information in ways that frustrate further analysis condemns those data to illustrating only one point about the world. Conversely, structuring data for flexible reuse allows your analysis AND my analysis, and more in the future. For example, consider the total frustration involved in trying to copy data from an existing PDF file. Or, more broadly, the limits on potential investigation that stem from limited sharing of data. If we could be simpler about structuring information, sharing would become easier, and the information would be more useful as building blocks in other people's work. Only sharing the final tables is not really sharing.
Thinking about structuring data comes from the same territory as GTD ("getting things done") - simpler is better, and setting some simple rules up front is better than building many complex systems later. Storing numerical data in the simplest-possible structure means you can send it to someone else, and there's a good chance that the information will be useful without hours of reformatting. I'm mostly thinking about spreadsheets here, and when I say simple structure I really do mean simple - maybe just a list of names and values, no subtotals, no clever cross-tabulation. But the point here is not to get stuck on a particular template or format - it's to work in the mindset that all data will be passed on, and that looser, less complex information stands a better chance of being re-used.
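As a rough sketch of what "just a list of names and values" could look like outside a spreadsheet: the neighborhood names and household counts below are invented purely for illustration, and Python's standard `csv` module is only one convenient way to do this, not a recommendation of a particular tool.

```python
import csv
import io

# Hypothetical example data: one name and one value per row.
# No subtotal rows, no merged headers, no notes mixed in with the numbers.
rows = [
    {"neighborhood": "Northside", "households": 1200},
    {"neighborhood": "Riverside", "households": 850},
    {"neighborhood": "Old Town", "households": 430},
]

# Write the list as plain CSV text, a format nearly any tool can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["neighborhood", "households"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Whoever receives the file can rebuild the total themselves,
# so the total never needs to be stored as an extra row.
received = list(csv.DictReader(io.StringIO(csv_text)))
total_households = sum(int(r["households"]) for r in received)

print(csv_text)
print("total:", total_households)
```

The total is computed when it is needed instead of being saved as a row in the data, so the file stays a clean list that someone else can sort, filter, or combine with their own numbers.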
The idea of effective data structures is not a unique benefit for aspirant open source planners. It's a good idea in any context, especially teamwork or conducting any research where you aren't trying to fit the facts to your existing conclusion. A decent structure becomes most relevant if you want to share with outsiders. In my silo of interest, numbers describing transport mode share might have a particular importance - maybe I'm interested in tracking changes in auto use. For someone else, the same information could be an input into a carbon footprint analysis, or something else entirely that isn't on my horizon right now. If I spend time assembling my data and then lock it down, I'm probably preventing someone else from going in a fresh direction with a boost from whatever groundwork I laid. Is the purpose of my report to illustrate some points with robustness and confidence? Probably. Does that mean I can't also share the information in a simple and flexible way, and explain the ways I processed it? No, but mostly we get used to seeing everyone's final, formatted, pdf'd data.
I can't find the right analogy for this - something like making bricks and also giving away mud and straw, with the straw in neat bundles... But it would be really exciting if we all got into the habit of storing numbers in very simple ways, then sharing the information in beautiful formatted pdfs and also those simple raw data files.
So what does this mean in practical, file-format terms? For now, I'm entirely neglecting issues of copyright and data sharing agreements, which can be serious stumbling blocks, not to mention more complex data such as the source files for geographic data that go into maps. There are also exciting things going on in other fields, such as info sharing in pharma trials, or Swivel's data sharing and charting tools.
Here are some very basic ideas, a non-exhaustive list.
For structuring,
- try to be self-documenting - use logical names for columns, etc.
- keep spreadsheets simple - no merged cells, no sub-totals, no text mixed in with numbers, nothing that prevents someone from taking your data in a new direction.
- preserve base tables as much as possible - don't keep your only copy of the data in a formatted-for-print Excel workbook. A spreadsheet with a row for every item is easiest to work with (but lousy to describe with words - I'll come back to this and talk about the utter magic of analysis with pivot tables in another post).
- really learn to use Excel (or your comparable spreadsheet tool).
- make your final tables available in multiple file formats, not just a PDF.
- give out text versions of everything (why not?).
- make your source files available.
- document your sources - not just "Census 2000". SF1? SF2? Which fields? You kept track of that information, right?
- ask others for their base data - rather like a nudist beach, everyone will be more comfortable if they see others out there. And if you can require someone to share, even better.
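To make the "row for every item" idea from the list above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the commuter counts are invented, the column names are just one possible choice, the `source` text is a placeholder for wherever you would record the exact table and field, and the `pivot` helper is a hypothetical stand-in for what a spreadsheet pivot table does.

```python
from collections import defaultdict

# Hypothetical base table: one row per item, with self-documenting column names.
# The counts are invented; "source" is a placeholder for the exact table and field.
SOURCE_NOTE = "placeholder: name the survey, table, and field here"
base_table = [
    {"year": 2000, "mode": "auto", "commuters": 700, "source": SOURCE_NOTE},
    {"year": 2000, "mode": "transit", "commuters": 200, "source": SOURCE_NOTE},
    {"year": 2000, "mode": "walk", "commuters": 100, "source": SOURCE_NOTE},
    {"year": 2010, "mode": "auto", "commuters": 650, "source": SOURCE_NOTE},
    {"year": 2010, "mode": "transit", "commuters": 250, "source": SOURCE_NOTE},
    {"year": 2010, "mode": "walk", "commuters": 100, "source": SOURCE_NOTE},
]


def pivot(rows, row_key, col_key, value_key):
    """Cross-tabulate a row-per-item table by summing value_key."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return {k: dict(v) for k, v in table.items()}


# The cross-tab is derived from the base table, never stored in place of it.
by_year = pivot(base_table, "year", "mode", "commuters")

# Mode share is one possible analysis of the same rows.
auto_share_2000 = by_year[2000]["auto"] / sum(by_year[2000].values())

print(by_year)
print("auto share in 2000:", auto_share_2000)
```

The same six rows could just as easily be pivoted by mode instead of by year, or multiplied by per-mode emission factors for a carbon footprint estimate, which is the sort of fresh direction a formatted-for-print table makes hard.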