Many instances have restrictive robots.txt #20

Open
Minoru opened this issue Nov 4, 2021 · 1 comment

Comments

@Minoru
Owner

Minoru commented Nov 4, 2021

I just implemented support for robots.txt (#4), and I'm seeing a drop in the number of "alive" instances. Apparently Pleroma used to ship a deny-all robots.txt, and these days it's configurable.

I'm happy that this code works, but I'm unhappy that it hurts the statistics this much.

I think I'll deploy this spider as-is, and then start a conversation on what should be done about this. An argument could be made that, since the spider only accesses a fixed number of well-known locations, it should be exempt from robots.txt. OTOH, it's a robot, so robots.txt clearly applies.
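
For context, the check described above amounts to consulting a host's robots.txt before fetching its NodeInfo document. Below is a minimal sketch using Python's standard urllib.robotparser; it is not the spider's actual code, and the user agent string and function name are hypothetical placeholders.

```python
# Minimal sketch of the robots.txt check described above, not the spider's
# actual implementation. The user agent string is a hypothetical placeholder.
import urllib.robotparser

def may_fetch_nodeinfo(host: str, user_agent: str = "example-fediverse-spider") -> bool:
    """Return True if the host's robots.txt allows fetching its NodeInfo URL."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"https://{host}/robots.txt")
    parser.read()  # a missing robots.txt is treated as "allow everything"
    return parser.can_fetch(user_agent, f"https://{host}/.well-known/nodeinfo")

# A deny-all robots.txt ("User-agent: *" / "Disallow: /"), which Pleroma
# reportedly used to ship by default, makes this return False, so the
# instance is skipped and drops out of the "alive" count.
```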

@Minoru
Owner Author

Minoru commented May 10, 2022

My logs indicate that 2477 nodes forbid access to their NodeInfo using robots.txt. That's a sizeable share (roughly 31%), considering there are 7995 instances in my "alive" list at the moment.
