BTRIPP (btripp) wrote,

Woke up in a Soho doorway, a policeman knew my name ...

Well, that was enjoyable … this is yet another great read showing up in my hands via the LibraryThing.com “Early Reviewers” program – not something that's “a given” in LTER, as there have been a lot of clunkers over the years – but this one was both informative and entertaining, and engaging throughout. The only “whuh?” element here was that it's not really “new” … this is the paperback version (officially coming out next week). I'm guessing that the hardcover of this was offered via LTER last year, as there are a whole bunch of reviews up for it already. This is Dataclysm: Love, Sex, Race, and Identity – What Our Online Lives Tell Us about Our Offline Selves by Christian Rudder, the co-founder and in-house data wonk for the OkCupid dating site. I'm assuming that this (coming out only a year later) is substantially the same as the hardcover, but it has a new subtitle [previously: “Who We Are (When We Think No One's Looking)”], which always makes me wonder if there's been an update between the versions.

On one level, this is a bit of a “morality tale”, sort of in the mode of Ryan Holiday's Trust Me, I'm Lying (although in a much less sleazy milieu), where a master of a “dark art” comes clean about it, and tries to make amends. Unlike Holiday's “gaming” dang near the entire information infrastructure, here Rudder is talking about data … and how it can look into our lives … and eventually he ends up wringing his hands a bit (and noting that he does very little “social media”, won't post pictures of his family, etc., in an effort to minimize how transparent his life is to the algorithms churning through that data) at how even with the slightest digital “trail of breadcrumbs”, the number crunchers can find out remarkably accurate and personal information – typified by the story of how Target, by analyzing the shopping patterns of a teen girl, knew she was pregnant (and started sending out pregnancy-related fliers) before her family did … or, more unsettling, the patent that Amazon has taken out for an “anticipatory shipping” system that will send out products that its analysis of the data indicates that you need/want, before you even order them. However, that's sort of jumping to the end of the story here … the book is NOT a doom-and-gloom “the computers are going to rule us” dystopian tale, but a look from the inside of how all that stuff works … although the cautionary element is certainly hovering over the book in its name – a portmanteau mashing up “data” with “cataclysm” – which echoes one mind-blowing factoid he has in here: as long ago as 2012, Facebook was collecting 500 terabytes of information every day!

Dataclysm is set up in three parts, “What Brings Us Together”, “What Pulls Us Apart”, and “What Makes Us Who We Are”, each with 4-5 chapters looking at specific elements thereof. It starts with the basics, gathering the data. Early on he puts up a caveat, true for most data sets involved (in academic studies), which have a tendency to be based on white American college kids … he adds:
I understand how it happens: in person, getting a real representative data set is often more difficult than the actual experiment you'd like to perform. You're a professor or postdoc who wants to push forward, so you take what's called a “convenience sample” – and that means the students at your university. But it's a big problem, especially when you're researching belief and behavior. It even has a name. It's called WEIRD research: white, educated, industrialized, rich, and democratic. And most published social research papers are WEIRD.
He adds in a footnote, from an article in Slate, that this profile only represents about 12% of the world's population, and differs from the others in “moral decision making, reasoning style, fairness, even things like visual perception”. While it's not exactly a case of GIGO, it certainly warns that when trying to extrapolate WEIRD data to a global model, you're probably going to be off from the start.

There is a lot of humor in this book … an entertaining review could be whipped up by just repeating the jokes … while this is tempting, I'll try to limit myself to a few choice ones. One that stood out sufficiently that it got its own little bookmark when I was going through this, was an odd footnote which reads: “* Definition of true ignorance: getting your "what the kids are into" intel from the Securities and Exchange Commission” … OK, so standing on its own it isn't quite the self-deprecating gut-buster it ought to be, so I'll have to “explain the joke” (trust me, I have a lot of experience in having to do this). This is at the very start of the “Writing on the Wall” chapter, which starts out talking about home-sickness among troops a century or more ago … noting that “in the American Civil War nostalgia was such a problem it put some 5,000 troops out of action, and 74 men died of it”, and then suggesting that the best scientists of 1863, on “either side of the Potomac”, were furiously working “to develop the ultimate war-ending superweapon: high school yearbooks” (assuming that this “cures” nostalgia). He asks if they still have high school yearbooks, what with Facebook around … but then points out that in a recent FB quarterly report (hence the SEC angle) they noted a drop in use among the under-18 crowd, possibly requiring the printed book again. Yeah, it's funnier when you're reading through it.

This, however, sets up the issue of writing … in less than a generation, kids are writing vastly more than any of their predecessor demographics ever imagined. Rudder cuts to the chase in terms of internet writing, and focuses on Twitter … writing 140 characters at a time. Many commentators have bewailed how the web was going to destroy the language, and that we'd lose the use of longer, more sophisticated words. Here the author compares Twitter's list of most commonly used words with that of the Oxford English Corpus (all 2.5 billion words) … in each case, the top 100 words are considered, which make up half the writing. Counter-intuitively, the Twitter list has an average word length significantly longer than the OEC's … 4.3 characters vs 3.4 (yes, there are a whole bunch of 2-3 letter words on those lists) … and what's even more remarkable is that the average word length of something like Shakespeare’s Hamlet clocks in shorter than that of a similar word-count sample from Twitter (3.99 vs 4.80 – and that's with the @'s and #'s stripped out of the Twitter numbers). Another amazing source of data is Google Books, which has so far digitized over 30 million books, going back as far as 1800. Using that data, all sorts of interesting things can be tracked … for instance, there's a fascinating graph in here which looks at mentions of food items … things like “steak” or “sausage” go back to 1800, but the “winner” (currently peaking at over 8 mentions per million words) is “pizza”, despite not noticeably appearing in the data until the 1940s.

Rudder similarly goes into the OkCupid data to see how message length relates to getting responses (which, after all, is the point on a dating site), and then flips into looking at “social graphs” (he uses his own as an example, plus one combining his data with his wife's). To get an idea of how these look, check out a post I did when LinkedIn was discontinuing its cool (but no doubt resource draining) “inMaps”, where I included a copy of my (final) LinkedIn map. These can be very predictive, working in part off of Milgram's famous “six degrees” experiments. The author cites studies which show how couples' relationship longevity can be quite accurately predicted by how these combined maps develop.

Obviously, one of the biggest “big dogs” on the data end of the Internet is Google:
Google has become a repository for humanity's collective id. It hears our confessions, our concerns, our secrets. It's doctor, priest, psychiatrist, confidante, and above all, Google doesn't have to ask us a thing, because the question is always implied in the blank space of the interface. … What a person searches for often gives you the person himself.
An amazing example of how this “works” is that researchers using the Google Trends tool have been able to “track epidemics of flu and dengue fever in real time”, which has developed into “Google Flu”, which follows searches for symptoms and remedies, and reports the trends to the CDC.

Going back to the OkCupid data, Rudder describes doing analysis on profile text … and brings up a remarkable mathematical entity called Zipf's Law:
{The} counterintuitive relationship between the popularity of a word (its rank in a given vocabulary) and the number of times it appears is described by something called Zipf's law, an observed statistical property of language that, like so much of the best math, lies somewhere between miracle and coincidence. It states that in any large body of text, a word's popularity (its place in the lexicon, with 1 being the highest ranking) multiplied by the number of times it shows up, is the same for every word in the text. Or, very elegantly: rank x number = constant … This law holds for the Bible, the collected lyrics of '60s pop songs, and the canonical corpus of English literature … and it certainly holds for profile text.
He then presents a table with words ranked from 10 in various steps down to 29,055 out of James Joyce's Ulysses (to pick an example of “highly idiosyncratic” language) … the “constant” values here do vary somewhat, but are pretty close to a common number. One of the things he is able to do with this is to make comparative charts of how frequently words appear in different groups' profiles. The example he starts with is comparing the word rankings of “white men” with “everybody else”, with the first few words being “the”, “pizza”, and (the band) “Phish”. There's a diagonal which is the “common” line, and, not surprisingly, “the” and “pizza” are both on that line, and way up in the top/top corner. However, “Phish” is about 80% up on the “white men” side, and only about 30% over towards the “everybody else” side. He then adds another dozen or so words, with things like “orange” and “rollercoaster” showing up on the diagonal, and “snowmobiling” at about 0% for “everybody else” and around 60% for white men (on the other hand “Kpop” - Korean Pop, ends up at about 0% for white guys). He then starts breaking these down into various racial groups, with both men and women, and finds rather surprising stuff … “{These lists} are our shibboleths. As such they are something no one could generate a priori, by typing things into Google Trends or by searching millions of hashtags. Sometimes, it takes a blind algorithm to really see the data.”
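For the math geeks, the “rank x number = constant” relationship quoted above is easy to poke at yourself. What follows is just a minimal sketch of my own (the function name `zipf_table` and the crude word-splitting rule are assumptions, not anything from the book): it counts the words in a text, ranks them by frequency, and reports rank times count for each one.

```python
import re
from collections import Counter


def zipf_table(text, top=10):
    """Return (rank, word, count, rank * count) rows for the most common words.

    Hypothetical helper for illustration; the tokenizer is deliberately crude
    (lowercase runs of letters and apostrophes).
    """
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    # Rank 1 is the most frequent word, as in the quoted description.
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Any large plain-text file will do; the path here is a placeholder.
    with open("some_large_text.txt", encoding="utf-8") as handle:
        for row in zipf_table(handle.read(), top=20):
            print(row)
```

On a short sample the last column will bounce around a lot; by the book's account the products only settle near a common number on a large body of text, and even then only approximately.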

He holds one final amazing thing for last here … it's called “Parsons code” and it's the engine that enables the Shazam app to recognize music from very small samples … “... almost any piece of music can be identified by the up/down pattern in the melody – you can ignore everything else: key, rhythm, lyrics, arrangement … To know the song, you just need a map of the notes' rise and fall. This melodic contour is called the song's Parsons code, named for the musicologist who developed it in the 1970s.” This is a string of letters, U for melody up, D for melody down, and R for repeated note … he charts out “Happy Birthday” and “Yesterday” for examples. His closing paragraph is:
Like an app straining for a song, data science is about finding patterns. Time after time I – and many other people doing work like me – have had to devise methods, structures, even shortcuts to find the signal amidst the noise. We're all looking for our own Parsons code. Something so simple and yet so powerful is a once-in-a-lifetime discovery, but luckily there are a lot of lifetimes out there.
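The U/D/R encoding described above is simple enough to sketch in a few lines. This is my own illustration, not code from the book: the function name `parsons_code` is made up, and representing the melody as MIDI note numbers is an assumption (any numbers that go up when the pitch goes up would work).

```python
def parsons_code(pitches):
    """Encode a melody as U (up), D (down), or R (repeat) between successive notes.

    `pitches` is a sequence of numbers where larger means higher; MIDI note
    numbers are assumed here purely for convenience.
    """
    letters = []
    for previous, current in zip(pitches, pitches[1:]):
        if current > previous:
            letters.append("U")
        elif current < previous:
            letters.append("D")
        else:
            letters.append("R")
    return "".join(letters)


# Opening of "Happy Birthday" as G G A G C B (MIDI 67 67 69 67 72 71).
print(parsons_code([67, 67, 69, 67, 72, 71]))  # RUDUD
```

Notice that key and rhythm never enter into it; transposing every note by the same amount gives the identical string.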
Again, Dataclysm was a delight, full of “I did not know that!” moments, entertaining stories, and some sobering realities. The paperback is just coming out in a few days, and right now the big boys have it for pre-order at a 45% discount (and this is evidently popular enough that used copies of the hardcover edition are still more expensive than the discounted rate on the new paperback). If you're a “web denizen” like me, or a math geek, or somebody interested in digging behind the surface of social realities, you will really enjoy this book. Highly recommended!


Tags: book review