Discovery: Script solution for Instance survey #62
Some questions:
Getting people to install something is a high barrier to entry for a survey. It might make sense to start with a Google Forms sort of approach for the first iteration of this, and to have a bundled app that operators could choose to opt into starting with the Nutmeg release.
That app could later be used for other really useful pulse-of-the-community things like determining which sites use which feature flags, and other things that would be useful to know for support and deprecation purposes. Some sites might not want to give up their enrollment numbers, but they might be at least willing to share which features they're using if it can be collected automatically.
Given that the nature of the project (quantifying current Open edX Instances) is fairly opaque to begin with, my initial thought is that we can err on the side of fair/reasonable accuracy. But I'm very curious for @e0d's thoughts. For a framework, I'd say strike a balance between reasonable accuracy and the realistic timeline for this project, which is to gather, analyze, and present the data at the April conference (sorry... that's very un-Agile-like, with such a hard deadline!)
For the purposes of the first survey, I think it's enough to be able to report the current number of learners, the current number of enrollments, and perhaps the number of certificates/credentials granted to date. Assuming we run the survey annually, next year we could put a time frame around it (i.e., "in CY 2022"). Again, curious for @e0d's thoughts.
At the moment, I think once a year is realistic in terms of what our goals are (an annual impact report), but I also hope this project expands with community involvement, and I can see scenarios where more frequent updates could be of interest to the Marketing WG, for example. So if it's not a burden to site operators, perhaps biannually or quarterly as a start?
Would the Google Form then be filled out manually for each Instance? I can see that also being a barrier to operators who are running many Instances. Even if we only got a ~10% install rate in the first go-round, that's still more data than we have now, and it would set the bar to raise next year. Maybe there's a hybrid approach where we give folks the option of either an install or the Google Form? And I like the bundled-app idea with Nutmeg as a long-term sustainable solution.
The general theme with the technical discovery is that we can get rough numbers in a relatively straightforward manner, but that true accuracy involves accounting for a number of edge cases that I don't think are worth it for the first pass at this problem.
The fastest and most reliable way to get this is a count on CourseOverview. There are a few caveats here: just because a course exists doesn't mean that anyone can see it or use it, and there are a few fields that can help guide us on that. Recommended approach: a simple count of CourseOverview rows, ignoring any subtleties about scheduling or enrollments.
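For illustration, a minimal sketch of that count, assuming the CourseOverview model from edx-platform and a Django shell on the LMS:

```python
# Sketch of the recommended course count; the import path is
# CourseOverview's location in edx-platform. Run from a Django shell
# on the LMS.
from openedx.core.djangoapps.content.course_overviews.models import CourseOverview

# Count every course the LMS knows about, deliberately ignoring
# scheduling/visibility subtleties as recommended above.
total_courses = CourseOverview.objects.count()
print(f"courses: {total_courses}")
```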
This would require a count on the User table. This can also be distorted by banned users (spam accounts), or by dummy users created for the purposes of an LTI launch where Open edX is an LTI provider. Banned users are an obscure edge case, though. Recommended approach: a simple count of the User model, minus a simple count of the LtiUser model.
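A sketch of that subtraction, assuming the LtiUser model from edx-platform's lti_provider app (the import path may differ across releases):

```python
# Sketch of the recommended learner count: all Django users minus the
# dummy users created for LTI launches. LtiUser's import path here is
# its location in recent edx-platform releases; adjust if yours differs.
from django.contrib.auth import get_user_model
from lms.djangoapps.lti_provider.models import LtiUser

User = get_user_model()
learners = User.objects.count() - LtiUser.objects.count()
print(f"learners: {learners}")
```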
@jmakowski1123: This could be a count of all currently active enrollments, or of all enrollments that were ever made. The latter would mean that we'd still count an enrollment if someone enrolled in a course and then unenrolled some time later. When counting all enrollments ever made, we wouldn't double-count re-enrollments; i.e., if someone enrolled in a course, unenrolled, and re-enrolled, that would still count as only one enrollment. Getting all enrollments that were ever made is slightly cheaper, but both are relatively straightforward to get; it's just a matter of whether we filter on the enrollment's active flag.
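Both counts, sketched against edx-platform's CourseEnrollment model (assuming its is_active flag is the filter in question):

```python
# Sketch of both enrollment counts described above. Re-enrollments reuse
# the same CourseEnrollment row, so neither count double-counts them.
from common.djangoapps.student.models import CourseEnrollment

# Every enrollment ever made, including learners who later unenrolled.
all_enrollments = CourseEnrollment.objects.count()

# Only enrollments that are active right now.
active_enrollments = CourseEnrollment.objects.filter(is_active=True).count()
```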
We can get this from the
We can get a count of courses by language, but this might be pretty messy and unreliable data. This can be queried using the
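If the field meant here is CourseOverview.language (an assumption on my part), the grouping could look like this:

```python
# Sketch of a per-language course count, assuming CourseOverview.language
# is the field in question. Values are often null or inconsistently set,
# which is the messiness noted above.
from django.db.models import Count
from openedx.core.djangoapps.content.course_overviews.models import CourseOverview

courses_by_language = (
    CourseOverview.objects.values("language")
    .annotate(num_courses=Count("id"))
    .order_by("-num_courses")
)
for row in courses_by_language:
    print(row["language"], row["num_courses"])
```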
Same approach and caveats as (5).
If we want to do this as a survey app in the Django Admin (accessible by site operators), we'd need the following:
Installation Options
There are two main ways I could see us going with this:
I actually prefer building this into edx-platform because it is so tightly coupled with that repository (at least for the data being collected here). It needs to directly query a number of edx-platform data models, and we'd want those tests to run during CI to make sure nothing breaks from release to release. It would also be really convenient if, whenever you're looking to deprecate a feature flag, you could add it to the list of things that the survey app scans for. However, doing so would put us in a situation where we wouldn't be getting results back until people started running Nutmeg in the middle of this year (and long after the conference). An alternative is to initially develop it as a plugin app (sketched below), but fold it into edx-platform in time for Nutmeg. I really don't think we're going to get many people to install it this way, though.
Options to consider
There can be at least two high-level goals for such a script:
I suspect that more people will be willing to give (2) than (1), so it might be worth giving an option to separate the two. I am assuming that this will be strictly opt-in.
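For reference, the plugin-app route would hang off Open edX's plugin entry points. A hypothetical setup.py (the package and app names are placeholders, not an existing project):

```python
# Hypothetical setup.py for the plugin-app option. The LMS discovers
# plugin Django apps through the "lms.djangoapp" entry point group;
# every name below is a placeholder for illustration.
from setuptools import find_packages, setup

setup(
    name="openedx-instance-survey",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        "lms.djangoapp": [
            "instance_survey = instance_survey.apps:InstanceSurveyConfig",
        ],
    },
)
```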
Yes, it would be in this case. But so would the Admin option for sending the data. I suppose we could add a setting that says, "Just always send this information every X months if you haven't before," and default it to False? So most people wouldn't use it; only those that run a hundred sites and want to opt in would.
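One possible shape for that setting (both names are hypothetical):

```python
# Hypothetical settings for the opt-in auto-send behavior described
# above; defaulting to False keeps the app strictly opt-in.
SURVEY_AUTO_SEND = False               # off unless an operator enables it
SURVEY_AUTO_SEND_INTERVAL_MONTHS = 6   # "every X months" once enabled
```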
@jmakowski1123: FWIW, I think that we should send this year's survey out via a Google Form and have folks fill it in as before, and then target doing this in the Django Admin for Nutmeg. I really can't see folks installing this as a separate plugin in useful numbers; it's just going to be so much faster for them to fill in a form. My best guess is a couple of weeks of work if the UI is really bare-bones, not counting any analysis work we'd do on the other end. Most of the effort is in the admin interface and in making sure we don't bring sites down when running these large queries, though we should probably go through group estimation.
The draft form was built in FormAssembly.
Works for me. I default to Google Forms because that's the only thing I've used. Happy to defer to those who have used other products in this area.
Sure, I can give some queries for them to run. It'd be nice if edX could run them on their read replica to test early, but it's not absolutely required. That's probably only a couple of hours of actual work, with the caveats I put in the recommended queries above. It might take more calendar time if someone at edX is testing and we get weird results that we need to debug. @jmakowski1123: Assigning this to you to weigh in on. Please feel free to move it to "Done" if you're okay with the conclusions here, or assign it back to me if you have feedback, questions, or other areas you feel need further investigation. Thank you.
This makes sense to me. I suggest we prune the number and types of questions we ask in the form, in order to make this as easy and quick as possible. Maybe we even limit it to query-based questions for now. Then we can focus on a more well-rounded question set that aligns with the long-term Nutmeg install option.
Context
tCRIL is generating an Impact Report that quantifies the landscape of Open edX Instances globally. A large portion of this data will be elicited directly from Providers, within the boundaries of the standard Provider contract. Draft survey questions are here. To facilitate survey uptake, we'd like to automate the process of answering some or all of the questions. The results of the survey will be analyzed and summarized in aggregate, and the anonymized results shared publicly. We will present the results at the Open edX conference in April.
Acceptance Criteria:
The Provider is given a quick and seamless method by which to autogenerate data to answer the following questions for each of their Instances:
The end-result data is captured in .csv (or similar), with a clear connection between each Instance URL and its corresponding data listed above.
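As one illustration of that output, a hypothetical CSV export (column names and values are placeholders):

```python
# Hypothetical sketch of the acceptance-criteria export: one row per
# Instance URL with its autogenerated answers alongside. All column
# names and values below are placeholders.
import csv

rows = [
    {"instance_url": "https://lms.example.com", "courses": 120,
     "learners": 45000, "enrollments": 90000, "certificates": 8000},
]

with open("instance_survey.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```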
Approach:
The purpose of this ticket is to explore solutions for the method by which to autogenerate data, and to propose a recommended method/approach. Based on a brainstorming session during the January 5 Standup, one highly viable approach is to write a script that Providers can embed into each of their Instances.