Improves detection for various bots #7788

Merged (12 commits) on Aug 19, 2024
94 changes: 86 additions & 8 deletions Tests/fixtures/bots.yml
@@ -2486,18 +2486,45 @@
 -
   user_agent: Mozilla/5.0 (compatible; Qwantify/2.2w; +https://www.qwant.com/)
   bot:
-    name: Qwantify
+    name: Qwantbot
     category: Crawler
-    url: https://www.qwant.com/
+    url: https://help.qwant.com/bot/
     producer:
       name: Qwant Corporation
       url: https://www.qwant.com/
 -
   user_agent: Mozilla/5.0 (compatible; Qwantify-prod34997/1.0; +https://help.qwant.com/bot/)
   bot:
-    name: Qwantify
+    name: Qwantbot
     category: Crawler
-    url: https://www.qwant.com/
+    url: https://help.qwant.com/bot/
     producer:
       name: Qwant Corporation
       url: https://www.qwant.com/
+-
+  user_agent: Mozilla/5.0 (compatible; Qwantbot-prod12345/1.0; +Qwantbot@qwant.com)
+  bot:
+    name: Qwantbot
+    category: Crawler
+    url: https://help.qwant.com/bot/
+    producer:
+      name: Qwant Corporation
+      url: https://www.qwant.com/
+-
+  user_agent: Mozilla/5.0 (compatible; Qwantbot-news/2.0; +https://www.qwant.com/)
+  bot:
+    name: Qwantbot
+    category: Crawler
+    url: https://help.qwant.com/bot/
+    producer:
+      name: Qwant Corporation
+      url: https://www.qwant.com/
+-
+  user_agent: Mozilla/5.0 (compatible; Qwantbot-dev12345/1.0; +Qwantbot@qwant.com)
+  bot:
+    name: Qwantbot
+    category: Crawler
+    url: https://help.qwant.com/bot/
+    producer:
+      name: Qwant Corporation
+      url: https://www.qwant.com/
@@ -5716,9 +5743,18 @@
 -
   user_agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
   bot:
-    name: ChatGPT
+    name: ChatGPT-User
     category: Crawler
-    url: https://platform.openai.com/docs/plugins/bot
+    url: https://platform.openai.com/docs/bots
     producer:
       name: OpenAI OpCo, LLC
       url: https://openai.com/
+-
+  user_agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
+  bot:
+    name: OAI-SearchBot
+    category: Crawler
+    url: https://platform.openai.com/docs/bots
+    producer:
+      name: OpenAI OpCo, LLC
+      url: https://openai.com/
@@ -5856,7 +5892,7 @@
   bot:
     name: GPTBot
     category: Crawler
-    url: https://platform.openai.com/docs/gptbot
+    url: https://platform.openai.com/docs/bots
     producer:
       name: OpenAI OpCo, LLC
       url: https://openai.com/
@@ -8134,7 +8170,16 @@
   bot:
     name: CyberFind Crawler
     category: Crawler
-    url: https://find.tf/
+    url: https://www.cyberfind.net/bot.html
     producer:
       name: Find.tf
       url: https://find.tf/
+-
+  user_agent: Mozilla/5.0 (compatible; CyberFindCrawler; +https://cyberfind.net/bot.html)/1.0 (https://cyberfind.net/bot.html)
+  bot:
+    name: CyberFind Crawler
+    category: Crawler
+    url: https://www.cyberfind.net/bot.html
+    producer:
+      name: Find.tf
+      url: https://find.tf/
@@ -8156,3 +8201,36 @@
     producer:
       name: Automattic, Inc.
       url: https://automattic.com/
+-
+  user_agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/bot
+  bot:
+    name: Lumar
+    category: Crawler
+    url: https://deepcrawl.com/bot
+    producer:
+      name: Lumar
+      url: https://www.lumar.io/
+-
+  user_agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) https://deepcrawl.com/testing
+  bot:
+    name: Lumar
+    category: Crawler
+    url: https://deepcrawl.com/bot
+    producer:
+      name: Lumar
+      url: https://www.lumar.io/
+-
+  user_agent: Mozilla/7.0 (compatible; Golfe/1.1; +http://www.goo-olfe.ae/bot.html)
+  bot:
+    name: Golfe
+    category: Crawler
+    url: http://www.goo-olfe.ae/bot.html
+-
+  user_agent: Mozilla/5.0 (compatible; SpiderLing; +https://nlp.fi.muni.cz/projects/biwec/)
+  bot:
+    name: SpiderLing
+    category: Crawler
+    url: https://nlp.fi.muni.cz/projects/biwec/
+    producer:
+      name: Natural Language Processing Centre
+      url: https://nlp.fi.muni.cz/
67 changes: 44 additions & 23 deletions regexes/bots.yml
@@ -805,6 +805,14 @@
     name: 'Visual Meta'
     url: 'https://www.shopalike.cz/'
 
+- regex: 'deepcrawl\.com'
+  name: 'Lumar'
+  category: 'Crawler'
+  url: 'https://deepcrawl.com/bot'
+  producer:
+    name: 'Lumar'
+    url: 'https://www.lumar.io/'
+
 - regex: 'Googlebot-News'
   name: 'Googlebot News'
   category: 'Search bot'
@@ -1292,10 +1300,10 @@
     name: 'QueryEye Inc.'
     url: 'http://queryeye.com'
 
-- regex: 'Qwantify'
-  name: 'Qwantify'
+- regex: 'Qwantify|Qwantbot'
+  name: 'Qwantbot'
   category: 'Crawler'
-  url: 'https://www.qwant.com/'
+  url: 'https://help.qwant.com/bot/'
   producer:
     name: 'Qwant Corporation'
     url: 'https://www.qwant.com/'
@@ -2430,10 +2438,10 @@
     name: 'Carbon60 Operating Co. Ltd.'
     url: 'https://www.carbon60.com/'
 
-- regex: 'CyberFind Crawler'
+- regex: 'CyberFind ?Crawler'
   name: 'CyberFind Crawler'
   category: 'Crawler'
-  url: 'https://find.tf/'
+  url: 'https://www.cyberfind.net/bot.html'
   producer:
     name: 'Find.tf'
     url: 'https://find.tf/'
@@ -3593,10 +3601,26 @@
   category: 'Site Monitor'
   url: 'https://github.com/louislam/uptime-kuma'
 
+- regex: 'OAI-SearchBot'
+  name: 'OAI-SearchBot'
+  category: 'Crawler'
+  url: 'https://platform.openai.com/docs/bots'
+  producer:
+    name: 'OpenAI OpCo, LLC'
+    url: 'https://openai.com/'
+
+- regex: 'GPTBot/[\d.]+'
+  name: 'GPTBot'
+  category: 'Crawler'
+  url: 'https://platform.openai.com/docs/bots'
+  producer:
+    name: 'OpenAI OpCo, LLC'
+    url: 'https://openai.com/'
+
 - regex: 'ChatGPT-User'
-  name: 'ChatGPT'
+  name: 'ChatGPT-User'
   category: 'Crawler'
-  url: 'https://platform.openai.com/docs/plugins/bot'
+  url: 'https://platform.openai.com/docs/bots'
   producer:
     name: 'OpenAI OpCo, LLC'
     url: 'https://openai.com/'
@@ -3622,14 +3646,6 @@
     name: 'DGC Verwaltungs GmbH'
     url: 'https://dgc.org/'
 
-- regex: 'deepcrawl\.com'
-  name: 'Lumar'
-  category: 'Crawler'
-  url: 'https://deepcrawl.com/bot'
-  producer:
-    name: 'Lumar'
-    url: 'https://www.lumar.io/'
-
 - regex: 'researchscan\.comsys\.rwth-aachen\.de'
   name: 'Research Scan'
   category: 'Crawler'
@@ -3646,14 +3662,6 @@
     name: 'Sprious LLC'
     url: 'https://sprious.com/'
 
-- regex: 'GPTBot/[\d.]+'
-  name: 'GPTBot'
-  category: 'Crawler'
-  url: 'https://platform.openai.com/docs/gptbot'
-  producer:
-    name: 'OpenAI OpCo, LLC'
-    url: 'https://openai.com/'
-
 - regex: 'Ant(?:\.com beta|Bot)(?:/([\d+.]+))?'
   name: 'Ant'
   category: 'Crawler'
@@ -4740,6 +4748,19 @@
name: 'Muck Rack, LLC'
url: 'https://muckrack.com/'

- regex: 'Golfe'
name: 'Golfe'
category: 'Crawler'
url: 'http://www.goo-olfe.ae/bot.html'

- regex: 'SpiderLing'
name: 'SpiderLing'
category: 'Crawler'
url: 'https://nlp.fi.muni.cz/projects/biwec/'
producer:
name: 'Natural Language Processing Centre'
url: 'https://nlp.fi.muni.cz/'

# Generic bots
- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus| CM62| HD65))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherx?web|kirkland-signature|LinkChain|survey-security-dot-txt|infrawatch|Time/|r00ts3c-owned-you|nvdorz|Root Slut|NiggaBalls|BotPoke|GlobalWebSearch|xx032_bo9vs83_2a|sslshed|geckotrail|Wordup|^xenu|^(?:chrome|firefox|Abcd|Dark|KvshClient|Node.js|Report Runner|url|Zeus|ZmEu)$'
name: 'Generic Bot'
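
For anyone who wants to sanity-check the new rules against the fixtures, here is a minimal Python sketch. It is not the project's matcher: it assumes rules are tried in file order, that the first match wins, and that matching is case-insensitive. The `detect_bot` helper and the trailing `Googlebot` rule are hypothetical stand-ins; every other pattern and every user agent is copied from this diff.

```python
import re

# (pattern, bot name) pairs in the order they appear in regexes/bots.yml
# after this change. Only the last rule is invented for illustration.
RULES = [
    (r"deepcrawl\.com", "Lumar"),
    (r"Qwantify|Qwantbot", "Qwantbot"),
    (r"CyberFind ?Crawler", "CyberFind Crawler"),
    (r"OAI-SearchBot", "OAI-SearchBot"),
    (r"GPTBot/[\d.]+", "GPTBot"),
    (r"ChatGPT-User", "ChatGPT-User"),
    (r"Golfe", "Golfe"),
    (r"SpiderLing", "SpiderLing"),
    (r"Googlebot", "Googlebot"),  # hypothetical stand-in for a later Googlebot rule
]


def detect_bot(user_agent):
    """Return the name of the first rule that matches, or None.

    Assumption: first match wins and matching ignores case.
    """
    for pattern, name in RULES:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


# User agents copied from Tests/fixtures/bots.yml in this diff.
FIXTURES = [
    "Mozilla/5.0 (compatible; Qwantbot-news/2.0; +https://www.qwant.com/)",
    "Mozilla/5.0 (compatible; CyberFindCrawler; +https://cyberfind.net/bot.html)"
    "/1.0 (https://cyberfind.net/bot.html)",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html) "
    "https://deepcrawl.com/bot",
]

if __name__ == "__main__":
    for ua in FIXTURES:
        print(detect_bot(ua), "<-", ua)
```

Under the first-match assumption, the Lumar user agents also contain `Googlebot/2.1`, so the `deepcrawl\.com` rule has to be tried before any Googlebot rule. That is one plausible reading of why the diff moves the Lumar entry up to sit just above `Googlebot-News`.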