Analytics accuracy check #2045

Open
simholt opened this issue Feb 21, 2020 · 9 comments
Labels: Icebox, Need Review, Question, Reporting/Statistics

Comments

@simholt
Contributor

simholt commented Feb 21, 2020

Descriptive summary

@clarallebot noticed some high download numbers for three recent deposits and wonders how accurate they are. The daily download counts are similar or identical across the three (85 for each on 2/20), while the pageviews for the corresponding records are in single digits.

Analytics are important to creators; our numbers need to be as accurate as possible.

I looked at some other Public items that were recently deposited, and they also have very high download counts:
1144 downloads for this thesis but only 12 pageviews
928 downloads with 15 pageviews

Expected behavior

Analytics should exclude bots and crawlers.

Actual behavior

  1. Peer Review of Research Data Submissions v692td402
    Went live 2/18/20
    Analytics
    2/18: 43 downloads
    2/19: 86
    2/20: 85
    Record pageviews: 2

  2. Remediation Data Management Plans vm40xz548
    Went live 2/18/20
    Analytics
    2/18: 47 downloads
    2/19: 86
    2/20: 85
    Record pageviews: 4

  3. Give Them What They Want 9593v2274
    Went live 2/19/20, mid-day
    Analytics
    2/19: 52 downloads
    2/20: 85
    Record pageviews: 4
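
To put the gap in one number per deposit, here is a small sketch that totals the daily downloads listed above and divides by the record pageviews. The figures are copied from this report; the dictionary layout and function name are only for illustration.

```python
# Figures copied from the three deposits listed above.
deposits = {
    "v692td402": {"downloads": [43, 86, 85], "pageviews": 2},
    "vm40xz548": {"downloads": [47, 86, 85], "pageviews": 4},
    "9593v2274": {"downloads": [52, 85], "pageviews": 4},
}


def downloads_per_view(record):
    """Total downloads divided by record pageviews."""
    return sum(record["downloads"]) / record["pageviews"]


ratios = {work_id: downloads_per_view(rec) for work_id, rec in deposits.items()}

for work_id, ratio in ratios.items():
    total = sum(deposits[work_id]["downloads"])
    views = deposits[work_id]["pageviews"]
    print(f"{work_id}: {total} downloads, {views} pageviews, "
          f"{ratio:.2f} downloads per pageview")
```

Every deposit shows dozens of downloads per pageview, which is the pattern the report is questioning.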

@simholt added the Reporting/Statistics and Priority: High labels Feb 21, 2020
@decimalator
Member

From @KennaW

We maintain a dynamic exclusion list of known robots and crawlers at https://github.com/atmire/COUNTER-Robots. All COUNTER-compliant entities use this list to eliminate bots and crawlers. I hope it helps.

Do let us know if you find any bots or crawlers not on this list; our Robots and Crawlers working group will review and update the list accordingly.
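
A minimal sketch of how an exclusion list like this could be applied to recorded hits, assuming each entry is a regular expression matched case-insensitively against the user-agent string. The handful of patterns below are illustrative stand-ins, not the actual contents of the COUNTER-Robots list, and the hit dictionaries are a hypothetical shape.

```python
import re

# Illustrative patterns only; the real list lives in the
# atmire/COUNTER-Robots repository and is much longer.
SAMPLE_PATTERNS = ["bot", "crawl", "spider", "^curl", "^Wget"]


def compile_robot_patterns(patterns):
    """Compile each pattern once, ignoring case (an assumption)."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def is_robot(user_agent, compiled):
    """True if any exclusion pattern matches the user-agent string."""
    return any(rx.search(user_agent) for rx in compiled)


def filter_human_hits(hits, compiled):
    """Keep only hits whose user agent matches no exclusion pattern."""
    return [h for h in hits if not is_robot(h["user_agent"], compiled)]


# Hypothetical hits for demonstration.
sample_hits = [
    {"path": "/downloads/abc123",
     "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"path": "/downloads/abc123",
     "user_agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"},
    {"path": "/downloads/abc123", "user_agent": "curl/7.68.0"},
]

compiled = compile_robot_patterns(SAMPLE_PATTERNS)
human_hits = filter_human_hits(sample_hits, compiled)
print(len(human_hits), "of", len(sample_hits), "hits kept")
```

Matching on user agent only catches crawlers that identify themselves, so this is a lower bound on bot traffic rather than a complete filter.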

@KennaW

KennaW commented Mar 6, 2020

I talked with the university's Google Analytics contact (Kelly Holcomb :) ), and she recommended trying to route the 'real' traffic through a custom URL campaign: https://support.google.com/analytics/answer/1033863?hl=en

https://ga-dev-tools.appspot.com/campaign-url-builder/
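
For reference, a custom campaign URL is an ordinary link with `utm_source`, `utm_medium`, and `utm_campaign` query parameters appended. Below is a minimal sketch of building one; the example URL and the parameter values are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_campaign_params(url, source, medium, campaign):
    """Append utm_* campaign parameters, keeping any existing query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical work URL and campaign values.
tagged = add_campaign_params(
    "https://example.org/concern/theses/abc123",
    source="newsletter",
    medium="email",
    campaign="spring_deposits",
)
print(tagged)
```

Links shared this way can then be reported on separately from untagged traffic.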

@carakey

carakey commented Sep 12, 2023

With the recent change from Google Analytics 3 to GA 4, we've been looking at the views and downloads for SA again. Reliable usage statistics are still important to creators. "Reliable" and "accurate" mean real humans viewing and downloading SA content.

Some artificially high counts may be caused by counting thumbnail hits, which would be resolved with #1889. The main culprit seems to be bot traffic, and we should leverage GA4 improvements to bot filtering.

@KennaW and @CGillen did some exploratory work and likely have more to say.

@CGillen self-assigned this Jan 8, 2024
@CGillen
Contributor

CGillen commented Jan 8, 2024

After a little more exploring, here is what it looks like for 1/7/2024:
Google Analytics reports 49437 Downloads.
Logs read about 45558. The log figure is a rough estimate, since our log parsing utility doesn't quite have the right tool set to run this type of analysis.
For both of these, thumbnail downloads were excluded.

This seems within reason of being accurate for raw download visits. Not sure on 'reliability.'

Regular page visits are way off. Again, log parsing is imperfect and is likely over-reporting because of clear bot traffic, but excluding downloads (and thumbnails), admin/dashboard, and edit/new interfaces, we got around 1M page visits for 1/7/2024.
GA4 reports 1053.
Unfortunately, GA4 automatically applies whatever bot detection and filtering it wants w/o any transparency to us. It's impossible to tell whether all our traffic looks like bot traffic to them for some reason or whether we're not reporting correctly.
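
A rough sketch of the kind of log classification described above, not the actual log parsing utility. It assumes access logs in the common/combined format, download URLs under `/downloads/`, thumbnails requested with `?file=thumbnail`, and admin pages under `/dashboard` or `/admin`; all of those path conventions are assumptions for illustration.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Pull the request path out of a combined-log-format line (GET only).
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[^"]*"')


def classify(path):
    """Bucket a request path as thumbnail, download, excluded, or page."""
    parts = urlsplit(path)
    if parts.path.startswith("/downloads/"):
        if parse_qs(parts.query).get("file") == ["thumbnail"]:
            return "thumbnail"
        return "download"
    if parts.path.startswith(("/dashboard", "/admin")):
        return "excluded"
    if parts.path.endswith(("/edit", "/new")):
        return "excluded"
    return "page"


def count_hits(lines):
    """Count requests per bucket, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[classify(match.group(1))] += 1
    return counts


# Hypothetical log lines for demonstration.
sample_log = [
    '203.0.113.5 - - [07/Jan/2024:10:00:00 +0000] "GET /downloads/abc123 HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [07/Jan/2024:10:00:01 +0000] "GET /downloads/abc123?file=thumbnail HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [07/Jan/2024:10:00:02 +0000] "GET /concern/theses/abc123 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [07/Jan/2024:10:00:03 +0000] "GET /dashboard/my/works HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [07/Jan/2024:10:00:04 +0000] "GET /concern/theses/abc123/edit HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

counts = count_hits(sample_log)
print(dict(counts))
```

The `download` and `page` buckets are the ones to compare against the GA4 Download and page_view figures; nothing here filters bots, so the log counts will still run high.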

@CGillen
Contributor

CGillen commented Jan 10, 2024

Still not sure why page_view is not as high as it was previously. Investigation continues.
@carakey For future clarity, do you want to remove page_view tracking on download? This would make page_view reflect actual page traffic, and Download would remain only download traffic. As it is, page_view includes all of Download.
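
To illustrate the difference, here is a toy tally of which events each request would send before and after such a change. The event names `page_view` and `Download` come from this thread; the function names and the request mix are hypothetical and do not reflect the site's tracking code.

```python
from collections import Counter


def events_for(request_kind, download_also_sends_page_view):
    """Return the analytics events one request would send.

    request_kind is "page" or "download".
    """
    if request_kind == "download":
        if download_also_sends_page_view:
            return ["page_view", "Download"]
        return ["Download"]
    return ["page_view"]


def tally(request_kinds, download_also_sends_page_view):
    """Sum the events sent for a sequence of requests."""
    counts = Counter()
    for kind in request_kinds:
        counts.update(events_for(kind, download_also_sends_page_view))
    return counts


# Hypothetical traffic: 3 landing-page visits and 5 downloads.
requests = ["page"] * 3 + ["download"] * 5

before = tally(requests, download_also_sends_page_view=True)
after = tally(requests, download_also_sends_page_view=False)
print("before:", dict(before))
print("after: ", dict(after))
```

With the overlap removed, page_view counts only landing-page visits and the two metrics can be compared directly.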

@carakey

carakey commented Jan 10, 2024

@carakey For future clarity, do you want to remove page_view tracking on download? This would make page_view reflect actual page traffic, and Download would remain only download traffic. As it is, page_view includes all of Download.

Yes, I think that would improve understandability of our stats. Thanks!

@CGillen
Contributor

CGillen commented Mar 12, 2024

Ok, I'm seeing analytics in this kind of breakdown:

GA4:
page_views: 1k - 2.2k for regular days, peaking at 5.4k
Downloads: 6k - 12k for regular days, peaking at 27k - 33k

Previous GA:
page_views: 1.1k - 2k for regular days, peaking at 3.5k
Downloads: 8k - 13k for regular days, peaking at 35k - 111k (hugely anomalous; these figures cover 20 months, compared to 3 months for GA4)

To me, this seems pretty reasonably accurate now.
@carakey?

@carakey

carakey commented Mar 12, 2024

@CGillen I think there's been solid improvement. I agree these numbers seem reasonable, or at least I don't have any data to say otherwise.

I think Clara's original concern for this ticket was about download numbers being much higher than page views. We're still seeing that at this macro level, with 2K page views vs 10K downloads daily. This doesn't agree with how library folks expect users to navigate to works: search, arrive at the landing page (+1 page view), and then decide to download (+1 download), or sometimes decide not to download, which would result in more views than downloads overall. I think at least one of these things is happening:

  1. The assumption is wrong, and the majority of users get to SA with a download link from Google/Scholar or other referring source;
  2. What GA calls "page_views" and "Downloads" aren't the same as how humans/librarians understand these words;
  3. The original suspicion, that tons of bot traffic is racking up downloads while bypassing views.

...Or is it something else entirely? Do we have any way to know?
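
One possible way to probe the first and third explanations, offered only as a sketch: bucket download hits by their HTTP referrer to see how many arrive from the site's own landing pages, from an outside referrer, or with no referrer at all. The site host and the hit records below are hypothetical, and a missing referrer is only suggestive, since real users can also arrive without one.

```python
from collections import Counter
from urllib.parse import urlsplit

SITE_HOST = "repository.example.edu"  # hypothetical host name


def referrer_bucket(referrer, site_host=SITE_HOST):
    """Classify a referrer as internal, external, or none."""
    if not referrer or referrer == "-":
        return "none"
    host = urlsplit(referrer).hostname
    if host is None:
        return "none"
    return "internal" if host == site_host else "external"


def download_referrers(download_hits):
    """Count download hits per referrer bucket."""
    return Counter(referrer_bucket(hit.get("referrer")) for hit in download_hits)


# Hypothetical download hits for demonstration.
sample_downloads = [
    {"path": "/downloads/abc123",
     "referrer": "https://repository.example.edu/concern/theses/abc123"},
    {"path": "/downloads/abc123",
     "referrer": "https://scholar.google.com/scholar?q=data+management"},
    {"path": "/downloads/abc123", "referrer": "-"},
    {"path": "/downloads/abc123", "referrer": None},
]

buckets = download_referrers(sample_downloads)
print(dict(buckets))
```

A large "external" share would support the direct-link explanation, while a large "none" share combined with robot-like user agents would point toward bot traffic.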

@shieldsb moved this to In progress in March-April 2024 Sprint May 29, 2024
@CGillen reopened this May 29, 2024
@shieldsb reopened this May 31, 2024
@carakey removed the Priority: High label Feb 20, 2025
@carakey added the Icebox and Need Review labels Feb 20, 2025
@carakey

carakey commented Feb 20, 2025

We can revisit after public-facing stats are restored to the site (#2602).
