Big data & Ed, Feb 2014


Stumbled upon Paco Nathan via my Twitter filter for ML. Talk about wading into a foreign country where most of the lingo is "huh?" A bit of homework to do. I have a copy of his slides from his January talk in Seattle. His opening gambit:

Machine Learning in production apps is less and less about algorithms (even though that work is quite fun and vital)

Performing real work is more about:
• socializing a problem within an organization
• feature engineering (“Beyond Product Managers”)
• tournaments in CI/CD environments
• operationalizing high-ROI apps at scale
• etc.

So I’ll just crawl out on a limb and state that leveraging great frameworks to build data workflows is more important than chasing after diminishing returns on highly nuanced algorithms

And while mulling Curry's attack on associationism and how that may play out in an ANT sensibility, I came across this quote in Sørensen's book [1], p. 81:

As a result of these kinds of so-called pseudo-questions – a term that indicates that they are not questions that look for an answer, but questions to which the answers are already known (Lindblad & Sahlström 1998) – children are performed as trivial machines, to use a term from systems theorist Niklas Luhmann (2002, p. 77). These are pupils that, given a specific input, execute a certain function, which results in a specific output.

What kinds of machines are humans enacted as by machine learning?


Should have been keeping a better log of what has been going on in and outside my head. I went back to the presentation Sendhil Mullainathan gave at HeadCon '13 at Edge. Among other things, he raises the importance of machine learning (ML), a term used in AI for machines working out patterns, rules and strategies via a bunch of different approaches. This has shuffled this writing under what I call one of my biggies, the machine-human partnership.

I had stumbled across ML when I was reading Mark Stevenson's An Optimist's Tour of the Future [2]. Stevenson spoke to Hod Lipson, who had developed a piece of AI software (Eureqa [3]) which could take any data set of numbers and generate a model from scratch! You could feed it data from the classic pendulum experiment and it would generate Newton's second law, for example.
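To get a feel for what "generate a model from data" might mean, here is a toy sketch. It is not how Eureqa works (as far as I understand it, Eureqa searches a far larger space of expressions); it just fits a handful of hand-picked candidate formulas to pendulum data and keeps the one with the smallest error. The data are synthetic, and the candidate list, the function names and the value of g are all my own assumptions.

```python
import math

# Synthetic "pendulum experiment": period T (s) for several lengths L (m).
# These values are generated from the textbook formula, not measured;
# G = 9.81 m/s^2 is an assumption.
G = 9.81
lengths = [0.1, 0.25, 0.5, 1.0, 1.5, 2.0]
periods = [2 * math.pi * math.sqrt(L / G) for L in lengths]

# Hypothetical candidate model shapes, each of the form T = a * f(L).
CANDIDATES = {
    "a * L": lambda L: L,
    "a * sqrt(L)": lambda L: math.sqrt(L),
    "a * L**2": lambda L: L ** 2,
    "a * log(1 + L)": lambda L: math.log(1 + L),
}


def fit_scale(f, xs, ys):
    """Return the least-squares a for y = a * f(x) and the sum of squared errors."""
    fx = [f(x) for x in xs]
    a = sum(u * y for u, y in zip(fx, ys)) / sum(u * u for u in fx)
    sse = sum((a * u - y) ** 2 for u, y in zip(fx, ys))
    return a, sse


def best_model(xs, ys):
    """Fit every candidate shape and return (name, a, sse) of the best fit."""
    results = {name: fit_scale(f, xs, ys) for name, f in CANDIDATES.items()}
    name = min(results, key=lambda n: results[n][1])
    a, sse = results[name]
    return name, a, sse


name, a, sse = best_model(lengths, periods)
print(name, a, sse)
```

The point of the toy: the "discovery" is only as good as the menu of candidate shapes someone put in, which is where the search over expressions does the real work in the proper tools.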

So I thought that I had to get my head around ML much better. AI used to be a bit of a joke way back. With the help of Moore's law it is much less so now and promises a good deal more. I think this is an important part of the puzzle of Big Data (BD) in the social sciences and particularly in education. So I went looking and came across the amazing Andrew Ng at Stanford. And, as it happens, there is a set of his lectures on YouTube and he runs a course on Coursera. Argh... back to the joys of linear algebra. But what interests me most are the approaches/styles of approaching ML. There is a bunch of intersections here, i.e. the way we think about human learning; retiring the idea of associationism vs communication theory [4].

8th-11th Feb

These are email exchanges.

R: This will be an interesting collab because you are into writing as an extreme sport, and I am far less adventurous. I don't mean re form - style I am happy to experiment with - but content, I am far more cautious. I get the sense you see big data as a bad thing - a brave new world of mind police and sensors etc. For me they are merely interesting phenomena which I'd like to speculate about to see how it might change how and what we (can) know. The trick will be to find what works for both - at the moment I think I will be much more comfortable throwing up questions than making judgements.

On 11 Feb 2014, at 11:21 am, cj wrote:

C: I think there are a few things - we could drag up a very old set of categories I invented on the fly before a workshop eons ago which ended up as a bit of a classic :)

boosters, doomsters, critics and anti-schoolers - maybe modified

But the hype is always around anything digital - sell, sell, sell

The Bowker piece is a good sobering account of the difficulties.

It would be fun to write a bit of it as a fictional piece - a sensor on every child! (no sensored child left behind :) )

There is a very long chain to be established between BD and teacher at desk wondering about whether or not the sensor telling her that the kid is hungry — or the Prin looking at a chart which maps the time spent in location Z in his school…. etc.  

On 11 Feb 2014, at 10:16 am, Radhika Gorur wrote:

R: Great. Gives us room to play. That link you sent was good! Maybe we can ask: when we have big data in Ed, what will it look like? What might we be able to do? To what effect for learners, teachers and researchers?

This is a bit of a tangent - but, when folk take on new software, unless it is really simple, and even then, they tend to opt for the surface stuff, the stuff to do the job for them, i.e. most folk probably use ~10% of what bloatware like MS Word offers. BD will come to practitioners in two ways - via software/algorithms which pre-digest etc. to produce… or via software that allows them to do their own thing - a bit like the Gapminder software. So there is still the “sense making” to do. How does the click frequency on a sports site by student x correlate with their new-found enthusiasm for statistics? etc.

But all of this - is still so heavily dependent upon code, code underneath code etc. And there is still the interesting point about what sensors (mice or chips embedded in school uniforms or whatever) collect, how, why etc.

10th Feb

R: Have you checked out the Bowker paper to which I sent a ref - biodiversity data diversity - that's a good bridge too - he talks of research where the desired end product is a database - almost 'theory-neutral' database for diverse users - like the human genome project - PISA a bit like that? 

On 10 Feb 2014, at 5:20 pm, cj wrote:

C: We are on the same page, i.e. what is the point to make….  :)  and PISA prolly only a way in just to make the point about how many trips to the moon and back a lot of big data looks like - i.e. do the conversion into books - folk do that all the time - condescending but an attempt to get a grip on ridiculously large numbers.

I’ve been reading Clough - she is smart - and really, really like Thrift’s stuff - geographers have always had the more interesting stuff, I think.

I really like the tech unconscious - not sure it is ‘like’ the measurement of time stuff but the analogy is helpful. But I am guessing this is way past where the ‘audience’ may wish to be taken.

I’d like to set a mini agenda out of the paper - just need to opt for a good one.

Much of the stuff on BD is really poor, i.e. look at how BIG it is. I still think that drawing the computational (code for anything in 1’s and 0’s) into Ed in ways other than how we teach history better would be a good thing - so a kind of meta frame for it, i.e. the digital teaching stuff is really a side show, a poorly run one at that, the actual game is a good deal different… I do like Mike Savage’s stuff - even though Law has been one of the folk scribbling a 'hey guys, the caper has changed' kind of line - but the frames they have set up are a tad wishy-washy - maybe first steps - better than none.

The thing I like about Clough is that she does not just talk about the digital but also includes the bio stuff etc. - that we are doing 1’s and 0’s is more of an accident than anything else—the bigger will to map, to effectively do particle physics on everything else is, it seems to me, what we are seeing the edges of. :)

R: Re dialog genre – I think I sent you the Gillborn paper – The Colour of Numbers – that is also playful but gets the point across well. I like some aspects of the dialog genre but in parts it can get a bit self-conscious. Don’t mind trying it, but not sure if the journal will be happy with it – we would need to check it with them before we spend time and energy on it.
Most of all, for me, the thing I need to become clear about with any piece of writing is: what is the point we want to make?
Here are some possibilities:
1. There is a tendency to think of PISA as ‘big data’ but it is not – big data is characterised by x, y and z, and PISA is pretty conventional because it does not have the features x, y and z – maybe useful to do, but will take the wind out of their sails, so perhaps not such a good idea
2. Do a Savage and Burrows on them – i.e., think about why large-scale and ‘big’ data demand that we leave behind our conventional modes of data ‘gathering’ and analysis; focus on what needs to be done differently – this idea attracts me.
3. Elaborate the possibilities, pitfalls and performativity of PISA as ‘big data’
Once we know what we want to say, we can think about ‘how’.

8th Feb

C: I think the Savage and Burrows paper and the Bowker one would both be important to our Big Data paper.
The idea of data as by-product, I think, was what I was trying to articulate earlier - that data are now incidentally picked up. Their question with regard to what does this mean for social science research - or, to put it in terms I have been using, what are the implications of big data for how we study the making of social knowledge is a good angle to take.
Also, not sure if you are familiar with Thrift's notion of the 'technological unconscious' - the idea of juxtaposition and the capacious background that gets built up out of very few templates has always attracted me.
The Bowker paper's focus (sorry - thought I'd attach the paper but can't find it in my folders - let me know if you can't access it - Bowker 2000, Biodiversity and Datadiversity in SSS) on the convergence of heterogeneous databases also offers a way to understand what's different about current data gathering and data analysis practices, besides size.

R: Yes - exactly. My head is in a few spaces right now - deep in the DECRA and also deep in the Australia-NZ test highlights (missed the live telecast) and busy emailing to finalise travel plans (I leave on Tuesday) - so I'm struggling to be coherent, but that is exactly what I meant. It is not that the video camera data is indiscriminate just because it is automatic - as you say, it is still selective in terms of recording light. But that is a different level of discrimination to PISA or Census etc. which depend on very elaborate models involving complex indicators etc.
Two other ideas to think about -
• Is the video camera more like an "intermediary" and PISA more a "mediator" of 'data' in Latour's terms?
• Wondering about 'indicators' - PISA can measure abstract things because it has these complex models and indicators that 'materialise' abstract things like 'equity' - big data I think work with some kind of raw data, not complex indicators - this is a hunch to be investigated - not sure if this is the case.
My instinct is that there is something 'raw' about big data at the 'collection' stage. There are also differences at the analysis stage but I am not able to think how to articulate them yet.
Nice to have these ideas to worry away at in the back of one's head whilst doing other things.

On Saturday, 8 February 2014 9:20 PM, cj wrote:
C: That's a good example to play with. The "data" from a vid camera is still subject to a very crude "model", i.e. the recording of light - like a camera, which it is - but then, to be useful, it has to be "used" to, say, identify a car number plate or a face - all algorithms. I think that is a generative idea though.

R: Yes of course I agree that nothing is not based on a model - but … not sure how to explain this, but suppose you have a security camera which keeps on filming irrespective of what goes on - i.e., it does not discriminate between black, white, tall, short, old, young, male, female or even in and out - just keeps going whatever is in front of it - that kind of data collection is very different to a survey designed to elicit a particular slice of data. That is a slice too - of whoever goes past that camera - but there is a difference between that kind of data collection and PISA. Right?

On Saturday, 8 February 2014 5:58 PM, cj wrote:
C: Nah - I think Monsieur Callon might disagree, as would Law. Sensors are programmed, algorithmic. They are all premised on some kind of model. And yes, you are right about size and PISA, but I think a bridge might have to be built between PISA or maybe ABS ed stats or whatever to the other stuff. There are folk who are imagining a "sensor-based" education. Effectively the early stuff for this is around so-called learning analytics, click counting. I don't know if they'd like it, but if the paper was a kind of opener for "them" from big data to the more interesting stuff, the algorithmic, it might help nudge some of the ed folk to say ok yeah - this stuff is important. :)

R: The difference is that sensors are indiscriminate and they are not designed on a model or to a plan - what they pick up is incidental. PISA is based on a model, and it is not vast - in fact it covers only a very narrow range of things - I still don't think PISA comes under 'big data'. It is not as big as the ABS data or census etc. and even that is not 'big data', I think…

On Saturday, 8 February 2014 2:01 PM, cj wrote:
C: Yeah - it's good and solid and covers most of the ideas. It is all about sensors - but in my head read in the broad. In a sense a PISA test or perhaps question is a kind of sensor - it picks up a right or a wrong, much like a sensor may be designed to pick up light or no light.

7th Feb

R: A nice, alliterative account…


Sent a link to this to Radhika. It speaks to part of the puzzle in an interesting way. I've dropped comments into my slow hunch file.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License