Homework for week 1 of the #LAK13 MOOC is to begin thinking about something that I want to understand better using an analytic approach. One thing that’s been bugging me is how to get a better understanding of vulnerability. That is, identifying which students are less likely to complete their module so that they can be contacted at an early stage and offered advice and guidance – this can help to better prepare them for their studies and/or suggest alternative study routes.
First time around, this sounds relatively straightforward, even for a non-statistician like me. I can get hold of some kind of stats package (or a friendly statistician) and run historical sets of student completion data against a number of student characteristics (age, gender, employment status, etc) and study behaviours (timing/frequency of VLE engagement, assignment scores, etc) to come up with a predictive model which suggests the students who are most ‘vulnerable’, ie least likely to get to the end of their module. If all goes to plan, students sharing characteristics or behaviours associated with vulnerability can be contacted at an early stage and appropriate guidance provided. So far, so easy. If my statistician friend has come up with a fail-safe predictive model, my vulnerable students will get to the end of their module (or at least, more of them will).
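As a rough illustration of what that stats package (or friendly statistician) might do, here’s a minimal sketch using a logistic regression on invented data – the field names (age, VLE logins) and all the numbers are made up, and a real model would use many more characteristics and behaviours:

```python
# Hypothetical sketch: fit a logistic regression on past completion data
# to estimate each student's probability of not completing.
# The features (age, vle_logins) and the synthetic cohort are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic historical cohort: we artificially make younger students
# with low VLE engagement less likely to complete.
age = rng.integers(18, 60, n)
vle_logins = rng.poisson(20, n)
risk = 1 / (1 + np.exp(0.15 * (age - 25) + 0.1 * (vle_logins - 15)))
dropped_out = rng.random(n) < risk

model = LogisticRegression().fit(np.column_stack([age, vle_logins]), dropped_out)

# Score a new cohort: a 19-year-old with 5 logins vs a 45-year-old with 30.
new_students = np.array([[19, 5], [45, 30]])
p_drop = model.predict_proba(new_students)[:, 1]  # estimated non-completion risk
```

The output is a risk score per student; the institution would then contact those above some threshold, which is where the resourcing questions further down come in.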
So, year 2 comes around and I need to decide which students to contact. I could rely on the model developed from the original analysis, but how long does that predictive model remain valid? I could re-run the analysis on the most recent data set, but here’s my dilemma: by identifying students deemed vulnerable and providing a support intervention, I have influenced their outcome. This implies that even if being under 25 years of age, say, is the key issue in getting to the end of the module, having contacted young students and improved their chances previously, age won’t now appear as a characteristic associated with vulnerability even though it remains relevant.
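This feedback loop can be shown with a toy simulation – all numbers invented. In year 1, being under 25 carries extra risk and the model picks it up; in year 2, a fully effective intervention on that group removes the extra risk from the observed outcomes, so a naively re-fitted model finds almost no age effect even though the underlying mechanism hasn’t changed:

```python
# Toy simulation of the feedback problem: an effective intervention on the
# flagged group can wash the predictor out of next year's re-fitted model.
# All risk figures are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
age = rng.integers(18, 60, n)
young = age < 25
base_risk = np.where(young, 0.5, 0.2)  # under-25s genuinely at higher risk

# Year 1: no intervention, so the age effect shows up in the data.
drop_y1 = rng.random(n) < base_risk
coef_y1 = LogisticRegression().fit(young.reshape(-1, 1), drop_y1).coef_[0, 0]

# Year 2: flagged (young) students get support that removes their extra risk,
# so everyone's observed risk is now 0.2 even though the mechanism persists.
drop_y2 = rng.random(n) < np.full(n, 0.2)
coef_y2 = LogisticRegression().fit(young.reshape(-1, 1), drop_y2).coef_[0, 0]
# coef_y2 sits near zero: age has "disappeared" from the re-fitted model.
```

A large year-1 coefficient and a near-zero year-2 coefficient is exactly the trap: stop the intervention because the model no longer flags age, and the risk presumably returns.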
My issue then is to better understand how to take into account intervention and its impact on outcome. I’ve had a useful heads up that this is linked to Bayesian statistics, but I know nothing about it so need to start some background reading. I’m hoping that there is a neat statistical tool that can be applied (I’m fairly sure there will be) and I won’t need to exert my brain too greatly.
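For what it’s worth, one simple Bayesian idea (a Beta–Binomial update, with invented numbers) gives a flavour of that background reading: treat a group’s completion rate as uncertain, use last year’s evidence as a prior, and update it with this year’s outcomes rather than starting from scratch each year:

```python
# Hedged sketch of a Beta-Binomial update. A Beta prior combined with
# Binomial (completed / not completed) data has a closed-form posterior:
# just add successes to alpha and failures to beta. All counts invented.

# Year 1: of 100 under-25 students, 50 completed -> posterior Beta(1+50, 1+50)
alpha, beta = 1 + 50, 1 + 50

# Year 2 (after the intervention): 80 of 100 completed
alpha, beta = alpha + 80, beta + 20

posterior_mean = alpha / (alpha + beta)  # updated belief in the completion rate
```

The point is that last year’s belief isn’t thrown away when this year’s (intervention-affected) data arrives – whether this is the right tool for the intervention problem itself is exactly what the reading needs to establish.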
My data set will generally be large – I work at a university with hundreds of thousands of students each year, and most modules will have hundreds registered, so there shouldn’t be too many issues with reliability (but that’s an assumption on my part). Some of the data is messy though – it’s not stored in a single place, and some of it (particularly the VLE activity logs) would not be easy to knock into shape so that I can easily make sense of it.
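Purely as a sketch of that knocking-into-shape step, assuming a hypothetical raw log format (one row per VLE click, with a student id and a timestamp), something like pandas can aggregate the logs into a per-student, per-week activity count:

```python
# Illustrative only: turn raw VLE click logs (hypothetical format) into
# a per-student weekly login count, the sort of behaviour feature a
# predictive model could use. The rows below are invented.
import pandas as pd

logs = pd.DataFrame({
    "student_id": ["s1", "s1", "s2", "s1", "s2"],
    "timestamp": pd.to_datetime([
        "2013-02-04 09:00", "2013-02-05 14:00", "2013-02-04 10:30",
        "2013-02-12 11:00", "2013-02-20 16:45",
    ]),
})

# Count clicks per student per calendar week.
weekly = logs.groupby(["student_id", pd.Grouper(key="timestamp", freq="W")]).size()
```

Even this tidy version glosses over the real mess – duplicate events, missing ids, logs split across systems – which is where most of the effort would actually go.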
Concerns around this are whether the approach is too knee-jerk – how meaningful are the predictions if an analysis suggests that the most relevant characteristic one year is being in employment and the next year it is having no employment? I’d expect the characteristics/behaviours to be reasonably constant, ie there is something about the module or mode of study that makes it more problematic for certain types of student. Concern number 2 relates to the ethical issues of labelling students as doomed to fail, and the need to refresh those labels over time as students become more competent (or their circumstances change). And what if there are just too many students flagged as vulnerable? Institutional resource constraints will necessarily limit which students get additional support, so there’s a matter of drawing a line, with students on one side getting more than those on the other – so even if we’re aware that students share characteristics with others who have previously not completed, we may not act on that information. Hmmm.