Big Data vs Quality Data

theLoneFuturist: I’m not certain why learning Hadoop isn’t more attractive to you. If you are fine with R, doesn’t having lots of data interest you?
theLoneFuturist: Don’t get me wrong, there are probably unexciting tasks associated with big data, but you’d then get to run your algorithms over big data. And lack of data is an often cited problem for learning/adaptive algorithms. But of no interest to you?
isomorphisms: The BIG DATA fad seems to be based on “let’s turn a generic algorithm loose on exabytes!”
isomorphisms: No matter how the data was gathered, what its underlying shape/logic is, what’s left out.

isomorphisms: For example, Twitter text analysis. At a high level I might ask “How are attitudes changing?” “How do people talk about women differently than men?” “Do attitudes toward Barack Obama depend on the state of the US economy?” These are questions whose answers aren’t easy to turn into just a few numbers.

isomorphisms: My parody of a big-data faddist’s response would be all the sophistication of: listen twitter | Hadoop_grep Obama | uniq -c | well_known_sentiment_analysis_algo. Hooray! Now I know how people feel about Obama! /sarcasm
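
To make the parody concrete, here is a minimal sketch of what such a pipeline amounts to: grep for a keyword, then count words from a sentiment lexicon. Everything in it is an assumption for illustration only: the tiny `POSITIVE` and `NEGATIVE` word lists, the `naive_sentiment` function and the sample tweets are hypothetical, and it does not correspond to any real Hadoop or Twitter API.

```python
import re
from collections import Counter

# Hypothetical toy lexicon, invented for this sketch.
POSITIVE = {"love", "great", "good", "hope"}
NEGATIVE = {"hate", "bad", "awful", "fear"}

def naive_sentiment(tweets, keyword):
    """Keep tweets that mention `keyword`, then label each by lexicon word counts."""
    tally = Counter()
    for text in tweets:
        words = re.findall(r"[a-z']+", text.lower())
        if keyword.lower() not in words:
            continue  # the "grep" step: drop tweets without the keyword
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        tally[label] += 1
    return tally

# Made-up sample tweets.
tweets = [
    "I love Obama",
    "Obama is bad on the economy",
    "Yeah, great job Obama, just great",  # sarcasm gets scored as praise
    "nice weather today",
]
print(naive_sentiment(tweets, "Obama"))
```

The sarcastic tweet is counted as positive, which is the point of the parody: the counts say nothing about how the data was gathered or what it leaves out.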

isomorphisms: In the ‘modelling vs scavenging’ war (cf Leo Breiman) I’m more on the modelling side. So I find some aspects of the ML / bigdata craze unsavoury.
isomorphisms: But the emergence of petareams is certainly a paradigm shift. I don’t think the Big Data faddists are wrong in that. That environmental difference will change things as surely as cheap computing power changed statistics. (Why learn statistical theory when you can bootstrap?) As far as the art of the possible goes, more clickstreams being recorded makes more analysis doable.
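
For readers unfamiliar with the bootstrap aside, a minimal sketch of a percentile bootstrap confidence interval follows. It assumes a small made-up sample; the `bootstrap_ci` function, its parameters and the data are hypothetical choices for illustration, not anything from the conversation.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the statistic,
    and read the interval off the sorted resampled estimates."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up data for illustration.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]
lower, upper = bootstrap_ci(data)
print(lower, upper)
```

Cheap computation replaces a derivation of the sampling distribution here: the interval comes from brute-force resampling rather than from theory.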

isomorphisms: Anyway, to answer your question, no, having a lot of data doesn’t interest me.
isomorphisms: I’d rather have interesting data than lots of it.
theLoneFuturist: Thing is, interesting data is probably a subset of big data. Mechanically define and separate out what counts as “interesting” and you can get it.
isomorphisms: Definitely not, think about historical data.
isomorphisms: For example, Angus Maddison’s estimates of ancient incomes; the archaeological or geological record; unscanned text (like the Book of Kells: are you going to OCR an illuminated manuscript? You would miss the Celtic knots).
isomorphisms: Even if everything were OCR-scanned properly, with no problems with tables, the interpretive work that historians do would be hard to code up in an algorithm. To me they dig up much more interesting information than petabytes of clickstream logs contain.
isomorphisms: Or these internal documents they just found from Al-Qaeda? Which would you rather have, 100 GB of server logs or 10 kB worth of text from Osama bin Laden at a crucial moment?
isomorphisms: Also, we talk about text being “unstructured data”, how about “I smell sulphur coming from over there” (during an archaeological dig) or “This kind of quartz shouldn’t be at this depth in this part of the world” or, you know, “Hey look are those dinosaur footprints?”
isomorphisms: The kind of stuff a fisherman might notice. THAT’S unstructured data.

theLoneFuturist: Sure, though if enough historical records get scanned, they too become the dread big data. I do catch your point, though.

About isomorphismes

Argonaut: someone engaged in a dangerous but potentially rewarding adventure.