Many instances have restrictive robots.txt #20

Open
Minoru opened this issue Nov 4, 2021 · 1 comment

Comments

@Minoru
Owner

Minoru commented Nov 4, 2021

I just implemented support for robots.txt (#4), and I'm seeing a drop in the number of "alive" instances. Apparently Pleroma used to ship a deny-all robots.txt, and these days it's configurable.

I'm happy that this code works, but I'm unhappy that it hurts the statistics this much.

I think I'll deploy this spider as-is, and then start a conversation on what should be done about this. An argument could be made that, since the spider only accesses a fixed number of well-known locations, it should be exempt from robots.txt. OTOH, it's a robot, so robots.txt clearly applies.
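
For context, the check described above amounts to consulting a host's robots.txt before fetching its NodeInfo document. Below is a minimal sketch using Python's standard urllib.robotparser; it is not the spider's actual code, and the user agent string and function name are hypothetical placeholders.

```python
# Minimal sketch of the robots.txt check described above, not the spider's
# actual implementation. The user agent string is a hypothetical placeholder.
import urllib.robotparser

def may_fetch_nodeinfo(host: str, user_agent: str = "example-fediverse-spider") -> bool:
    """Return True if the host's robots.txt allows fetching its NodeInfo URL."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"https://{host}/robots.txt")
    parser.read()  # a missing robots.txt is treated as "allow everything"
    return parser.can_fetch(user_agent, f"https://{host}/.well-known/nodeinfo")

# A deny-all robots.txt ("User-agent: *" / "Disallow: /"), which Pleroma
# reportedly used to ship by default, makes this return False, so the
# instance is skipped and drops out of the "alive" count.
```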

@Minoru
Owner Author

Minoru commented May 10, 2022

My logs indicate that 2477 nodes forbid access to their NodeInfo using robots.txt. That's a sizeable share (roughly 31%), considering there are 7995 instances in my "alive" list at the moment.
