Saturday, June 29, 2013

What Makes Data Meaningful?

Telling Tales

Photo of woman listening by The New Institute
Context, of course. Especially narrative context. Putting data into the context of a story makes that data more intuitively understandable, more interesting, more accessible, maybe even more provocative. This was the theme of MAPC's annual Data Day event on June 22: "Data and Story Telling." The keynote presentation was by Boston Globe staff showcasing their 68 Blocks: Life, Death, Hope project. 68 Blocks is a multi-media showcase focusing on the Bowdoin-Geneva section of Dorchester, a Boston neighborhood that's been plagued by a murder rate that is triple that of the city as a whole. As Globe staff explained it, the point of their project was to understand why violence is so persistent in this neighborhood, to tell its stories in a way that hasn't happened before, and maybe, to contribute to awareness and understanding that can be part of the search for solutions.

The work they did was essentially ethnographic. A couple of reporters rented an apartment and lived in the neighborhood for 5 months, inserting themselves into the community and into the lives of a number of families and individuals. They also brought with them the documentary resources of the Globe, like photographers and videographers and sound engineers. And they cleverly tapped into both official sources of data and the modern social media cyberscape. The latter is what seems to have caught MAPC's attention for Data Day: using interactive maps to show violence and complaints and the more mundane demographic changes, using Instagram photos by neighborhood residents and allowing them to supplement these images with their own oral stories, etc. The reporters who spoke focused on the stories themselves and the experience of living and working in the neighborhood. At some point, however, Ted McEnroe, the moderator (and PR guy for The Boston Foundation, the event's sponsor), pointed out that the word "data" had hardly been used at all in the panel's discussion. How did the Globe use data in the 68 Blocks project? I have to admit that I was bothered by this question. Wasn't it ALL data? Some was quantitative (e.g. statistics) and some was qualitative (e.g. stories, images). But we all knew what McEnroe meant by 'data': the numbers and statistics.

Letting (or Making) the Data Speak

One of the Globe staff responded, "We didn't want the story to sound or look like numbers. ... We used numbers as a way to find the story." They had of course delved into the (quantitative) 'data.' They flooded the City of Boston with Freedom of Information Act requests for public records: school statistics, numbers and types of resident complaints, service calls, sanitation, code enforcement, property records, etc. (This apparently caused some level of panic at City Hall, but that's another story.) They did the same with the Boston Police Department (e.g. 911 calls). Acquiring this data, cleaning it up, and figuring out how to present it was of course a monumental task. And they did manage to acquire a trove of data (which they are interested in sharing with academic institutions; contact Chris Marstall at the Globe). It's clear enough how the reporters used this data in their stories: citing statistics to support claims, using numbers to home in on issues or places of interest. But what was new to me was the activity of the Globe's 'Data Journalists' - techies of varying savvy whose job it was to make sense of the quantitative data and to figure out how to present it. They knew that their presentation of the data should work in support of the stories written by the journalists, but that was about the only guidance they had. Like true data jockeys - unburdened by either theory or expertise - they took the abstracted data and looked for creative ways to present it. They sifted and sorted and experimented with different visualizations and platforms that would "allow the data to tell its own story." This sounded a little naive to me, or at least misleading.

Data is always the product of some human author - subjective at some level, or at least context-dependent. Data is not the same thing as the phenomenon it describes or enumerates. Data is a construct. Someone made a decision about what phenomena to record (e.g. crimes), what to pay attention to and what to ignore, how to count or code it, where to separate and where to aggregate, where to be precise and where to be general, and on and on, ad nauseam. The result - the data - is not a simple reflection of the phenomenon of interest. I think that a lot of us want to act as if data were authorless - just free-floating facts needing to be collected and collated and then communicated. This fiction is convenient because it allows us to act as if we're working with manageable units of unfiltered observation, our perspective unsullied by some other author's dirty fingerprints. But they're there (the fingerprints), whether you see them or not. This is why metadata (i.e. data documentation) and topical expertise are so important when working with data. But the Globe took a deliberately naive approach toward the data. Their goal was to get past the preconceived solutions and cliches and stereotypes that typify discussions of neighborhood violence. Let's look at the neighborhood afresh, they said. And to be fair, their approach is a robust one - presenting the data and stories in as many ways, from as many angles, and from as many perspectives as they could manage. But for all the focus on story-telling, which is essentially linear, their approach was very non-linear, a challenging thing to reconcile.

The audience that attended this event (about 200 or so) was divided about equally among representatives from area non-profits, representatives of municipal governments, small businesses, and college students. One of the MAPC staff confided to me that they were a little nervous about how the Globe's presentation on its 68 Blocks project would be received. At first glance, it might appear to be another sensationalistic, voyeuristic tour of violence and grief in a poor, minority community. But it clearly wasn't perceived that way. One woman stood up to praise the Globe. She represented an anti-violence group and had recently lost her own son to violence. She wanted to thank the Globe staff for their respectful and sympathetic coverage of her tragedy and that of others. But the more common question was simply "How can we do that? How can my organization leverage these tools to tell our stories?" Never mind that year's worth of deep, ethnographic journalism, tell us about the cool web tools. Here is a listing of the tools mentioned:

Data Science Toolkit. Open-source tools to geocode data.
TimelineJS. Open-source tool that enables you to build visually-rich interactive timelines.
Data for Radicals. Illustrated guide to making a data-driven map with TileMill.
myNeighborhood Census Viewer. U.S. Census 2010 – Data for the City of Boston. Interactive map tool from the Boston Redevelopment Authority.
Google Fusion Tables. Tool for sharing, visualizing (maps and graphs), and collaborating with data.
Google Refine. Tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases.

Cleaning Up Dirty Data

Alvin Chang, one of the Globe's lead Data Journalists, said, "People often think that data is just out there. Data is not just out there." Even if the data you want is available (a big 'if'), it is rarely in the form you need. It needs to be cleaned up, reshaped, reformatted to fit your purposes. This is one of the underappreciated aspects of the Big Data revolution.

In the case of the 68 Blocks project, Globe journalists faced a significant hurdle in compiling data about the Bowdoin-Geneva neighborhood: the neighborhood does not exist as a unit of measurement for any public agency. You cannot simply call up City Hall, or the Police Department, or the school district office, or even the U.S. Census and ask for records for the Bowdoin-Geneva neighborhood. Nobody gathers or holds information for such a place. Bowdoin-Geneva is a segment of the Dorchester neighborhood of Boston (which is itself fuzzily defined). It is spread across several ZIP codes which extend outside the neighborhood, is somewhat overlapped by a little more than three Census Tracts, and is served by various schools in and out of the neighborhood ... you get the idea. How do you gather information about a place that is not an official unit of measurement? This is an old geographic problem and there is no simple solution. It is a familiar challenge for geospatial analysts. Options are to (re)gather or (re)compile the data according to the area of interest, or to slice and dice the overlapping data units (i.e. lines or polygons) that are available (e.g. ZIP codes, Census Tracts), and make some serious statistical assumptions. Either way, the choices are labor-intensive and highly prone to error. But the results, if successful, are powerful. If you can tie different data sets together based upon location, whether or not they were originally collected with that purpose in mind, you have opened up the possibility of combining data sets from different sources and examining their relationships.
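To make the "slice and dice" option concrete, here is a minimal sketch of simple area weighting, one common way to estimate a total for an unofficial area from overlapping official units. I don't know what method the Globe actually used. The tract names, populations, and overlap shares below are all invented for illustration, and in practice the overlap shares would come from GIS software.

```python
# Hypothetical figures for illustration only; not data from the 68 Blocks project.
# "share_inside" is the fraction of each tract's area that falls inside
# the neighborhood of interest (assumed known, e.g. from a GIS overlay).
tracts = {
    "tract_A": {"population": 4000, "share_inside": 1.0},
    "tract_B": {"population": 5000, "share_inside": 0.5},
    "tract_C": {"population": 3000, "share_inside": 0.25},
}


def areal_estimate(units, field):
    """Estimate a total for the target area by area-weighting each source unit.

    Assumes the quantity is spread evenly within every unit. That is the
    "serious statistical assumption" that makes this approach error-prone.
    """
    return sum(unit[field] * unit["share_inside"] for unit in units.values())


estimate = areal_estimate(tracts, "population")
print(estimate)  # 4000*1.0 + 5000*0.5 + 3000*0.25 = 7250.0
```

The even-spread assumption is exactly where things go wrong: if most of tract_B's residents live in the half outside the neighborhood, the estimate overcounts, and nothing in the numbers will tell you so.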

The ability to clean up dirty data - to reshape and combine disparate sets of data - and to find connections and relationships not otherwise visible in the source data is a powerful ability. It can also be threatening.

Making Connections with the Data

During the afternoon plenary, Latanya Sweeney, from Harvard University's Data Privacy Lab, spoke about the privacy issues surrounding Big Data, and specifically, the increased capacity of commercial organizations to link together databases and thereby discover information about individuals that should be private. The example she used was healthcare data, and she demonstrated her example with theDataMap tool - a network visualization tool that allows you to see how an individual's healthcare data is shared between different organizations, from doctors and hospitals, to government agencies, to pharmaceutical companies and other private entities. One of the more profound implications from her research, which the tool shows, is the proliferation of entities sharing in an individual's information. But even more startling is how easily data privacy standards to protect individuals can be circumvented because of the proliferation of data sharing connections. While data privacy laws, such as HIPAA, require that individual healthcare records be "de-identified" before being shared, so that outside organizations cannot see the names or personally identifying information connected with those records (e.g. diagnoses for diseases, hospital admission history, etc.), it is quite possible for those organizations to deduce or reconstruct that individually identifying information. The method is a classic step in "data cleaning" and preparation - finding "key" variables or characteristics that can clearly link records across different databases. It turns out that birth dates are very powerful in this respect, especially when they can be combined with gender and geographic location. Statistically, it is HIGHLY unlikely that anyone living in your neighborhood has both the same gender and the same birth date as you. You can see how this works in a somewhat creepy application at aboutmyinfo.net, which was developed from Dr. Sweeney's research.
If you can clean up dirty data, you might be able to see the dirty laundry. What tales we can tell then!
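The linking step described above can be sketched in a few lines. This is only a toy illustration of the general idea, not Dr. Sweeney's actual method or code. Every record is made up, and I'm assuming ZIP code stands in for "geographic location."

```python
from collections import defaultdict

# Entirely made-up records for illustration.
# A "de-identified" health file: no names, but birth date, sex, and ZIP remain.
deidentified_health = [
    {"birth_date": "1975-03-14", "sex": "F", "zip": "02122", "diagnosis": "asthma"},
    {"birth_date": "1982-11-02", "sex": "M", "zip": "02121", "diagnosis": "diabetes"},
]

# A separate file that does carry names (say, a public list of some kind).
public_list = [
    {"name": "Person A", "birth_date": "1975-03-14", "sex": "F", "zip": "02122"},
    {"name": "Person B", "birth_date": "1990-07-30", "sex": "M", "zip": "02121"},
    {"name": "Person C", "birth_date": "1982-11-02", "sex": "M", "zip": "02121"},
    {"name": "Person D", "birth_date": "1982-11-02", "sex": "M", "zip": "02121"},
]

# The "key" variables shared by both files.
KEY = ("birth_date", "sex", "zip")


def link(deidentified, identified, key=KEY):
    """Attach names to de-identified records whose key matches exactly one person."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in key)].append(person["name"])

    linked = []
    for record in deidentified:
        names = index.get(tuple(record[k] for k in key), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            linked.append((names[0], record["diagnosis"]))
    return linked


print(link(deidentified_health, public_list))  # [('Person A', 'asthma')]
```

The second health record stays anonymous only because two people in the named file happen to share its birth date, sex, and ZIP. In real populations that kind of coincidence is rare, which is the whole problem.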

Creative Presentation of Data

In a completely rational society, evidence of a problem would be enough to motivate action when action was warranted. But since we don't live in that society, we must find other ways to motivate ourselves and our neighbors. Teens at the Urbano Project have taken a creative approach to data in an effort to spark discourse, and possibly, social change. Urbano is a non-profit organization that invites professional artists to work with high school youth "to effect social change through participatory works of contemporary art and performance." This spring they focused on the issue of transportation equity. Several pairs of teens focused on specific statistics of problems or inequity around the Boston region's mass transit system.

Five teens and their artist mentors, Risa Horn and Alison Kotin, talked about the project and development of their art during a final afternoon presentation. The inspiration for the project came while the teens were visiting different Boston neighborhoods as part of their larger theme "Crossing Urban Boundaries." The youth noticed how dramatically different the transit experience was for different communities (and how much they hated getting on certain bus lines). They researched the issue of public transit in Boston, the problems faced by the system, and the inequities of experience. Armed with facts, they faced a challenge: how to express their data artistically and in a way that would inspire discourse and maybe even action.

Like most contemporary art, their work is abstract and symbolic. But it is grounded in the data. Every item and aspect of their art represents a quantum of the data. Each black bracelet is one hour a year lost in extra waiting. Each orange bucket lid is $300 million in debt burden. Every whistle is some number of crimes committed on the T. Their art was developed to be worn, allowing them to take their wobbling, clattering, clinking work onto subways and buses and other public venues. And it was meant to draw attention and questions, which it did. But they went even further, arranging a meeting with MBTA senior staff to deliver their artistic messages. Amazingly, MBTA officials (stone-faced and stuffily dressed, according to the youth) gave them 3 hours of their time for the meeting. By all accounts, this was a painfully awkward encounter. The two groups sat on opposite sides of a large conference table and proceeded to talk at, and past, one another. The funny thing is that, even from the youths' telling, it sounds like the two groups were actually in agreement about the data and the need for solutions. Same data, same basic interpretation, and lots of confusion.

There is a long and venerable tradition of artistic expression in the service of social activism. When done well, art resonates with people - much differently than arid facts or wonkish policy discourse. But what is it that resonates? What message is communicated or received? What happens to the data when it becomes embedded in art? Should we even call it data when it is in this form? From my experiences with policy campaigns and social justice organizations, artistic expression and dry data discourse operate side by side ... or maybe it's along a continuum. Inside the legislative chamber, soberly dressed witnesses read aloud carefully researched statistics and analyses, or relate personal stories with a visceral effect - often heartbreaking or infuriating. Outside on the street their allies are dressed in costumes, performing an outrageous skit or stunt, highlighting the ridiculous or unjust state of affairs. In the end, if the campaign is successful, it still won't be clear what moved people to action.

Context

Clearly, "data" are more than disembodied facts. Context matters - both the way in which the data are situated and the way they are communicated. A lot of honest effort goes into trying to "reveal" the meaning of the data, although it sometimes seems that what we are actually trying to do is invest meaning into the data. I don't mean the latter to sound cynical. I believe that data are real, and that we have a responsibility to be faithful to the data. But given the incredible diversity of ways in which data can be honestly handled and understood, it seems naive - and boring - to think that there are simple truths to be extracted or that the data exist outside of our purposes.
