Adds detection for various bots #7589

Merged 14 commits on Feb 15, 2024
151 changes: 150 additions & 1 deletion Tests/fixtures/bots.yml
@@ -4066,7 +4066,10 @@
bot:
name: Project Resonance
category: Crawler
- url: http://project-resonance.com
+ url: https://project-resonance.com/
+ producer:
+ name: RedHunt Labs Limited
+ url: https://redhuntlabs.com/
-
user_agent: Mozilla/5.0 (compatible; DataXu/1.0; +http://dataxu.com)
bot:
@@ -6686,3 +6689,149 @@
user_agent: Zeus
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (WhatsMyIP.org GeoIP_Lookups) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org HTTP_Compression_Test) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org HTTP_Headers) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org PageRank_WebStats_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org Text_to_Code_Ratio_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org MAC_Address_Lookup_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org URL_Shortener_Preview_Tool) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: Mozilla/5.0 (WhatsMyIP.org Random_Website_Loader) http://whatsmyip.org/ua
bot:
name: WhatsMyIP.org
category: Service Agent
url: https://www.whatsmyip.org/ua/
-
user_agent: keycdn-tools/perf
bot:
name: KeyCDN Tools
category: Service Agent
url: https://tools.keycdn.com/
producer:
name: proinity LLC
url: https://www.keycdn.com/
-
user_agent: keycdn-tools/br
bot:
name: KeyCDN Tools
category: Service Agent
url: https://tools.keycdn.com/
producer:
name: proinity LLC
url: https://www.keycdn.com/
-
user_agent: keycdn-tools/h2
bot:
name: KeyCDN Tools
category: Service Agent
url: https://tools.keycdn.com/
producer:
name: proinity LLC
url: https://www.keycdn.com/
-
user_agent: Mozilla/5.0 (compatible; AmazonAdBot/1.0; +https://adbot.amazon.com)
bot:
name: Amazon AdBot
category: Crawler
url: https://adbot.amazon.com/
producer:
name: Amazon.com, Inc.
url: https://www.amazon.com/
-
user_agent: SenutoBot/1.0 (compatible; SenutoBot/1.0; +https://www.senuto.com/)
bot:
name: Senuto
category: Crawler
url: https://www.senuto.com/
producer:
name: Senuto Sp. z o.o.
url: https://www.senuto.com/
-
user_agent: Automattic Analytics Crawler/0.2; http://wordpress.com/crawler/
bot:
name: Automattic Analytics
category: Crawler
url: https://wordpress.com/crawler/
producer:
name: Wordpress.org
url: https://wordpress.org/
-
user_agent: IDG/EU (http://spaziodati.eu/)
bot:
name: SpazioDati
category: Crawler
url: https://www.spaziodati.eu/
producer:
name: SpazioDati s.r.l.
url: https://www.spaziodati.eu/
-
user_agent: GozleBot; http://gozle.com.tm
bot:
name: Gozle
category: Crawler
url: https://gozle.com.tm/en/blog/post/1
producer:
name: Doly Horjun HJ
url: https://gozle.com.tm/
-
user_agent: Quantcastbot/2.0 (+http://www.quantcast.com/bot)
bot:
name: Quantcast
category: Crawler
url: https://www.quantcast.com/bot/
producer:
name: Quantcast Corp.
url: https://www.quantcast.com/
-
user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/102.0.5005.182 Safari/537.36 FontRadar
bot:
name: FontRadar
category: Crawler
url: https://www.fontradar.com/
producer:
name: EMDASH SAS
url: https://www.fontradar.com/
-
user_agent: survey-security-dot-txt/0.1
bot:
name: Generic Bot
-
user_agent: WebAuthn Adoption Study (Contact mb364@hdm-stuttgart.de)
bot:
name: Generic Bot
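The fixtures above pair each user agent with the bot name a detector is expected to return. A minimal Python sketch of that check follows, using the specific patterns this PR adds to regexes/bots.yml. The `detect_bot` helper, the first-match-wins loop, and the case-insensitive search are assumptions made for illustration, not the project's actual matching code.

```python
import re

# Patterns and names copied from the regexes/bots.yml hunks of this PR.
NEW_BOT_PATTERNS = [
    (r"AmazonAdBot/[\d.]+", "Amazon AdBot"),
    (r"Automattic Analytics Crawler/[\d.]+", "Automattic Analytics"),
    (r"keycdn-tools/", "KeyCDN Tools"),
    (r"WhatsMyIP\.org", "WhatsMyIP.org"),
    (r"SenutoBot/[\d.]+", "Senuto"),
    (r"spaziodati", "SpazioDati"),
    (r"GozleBot", "Gozle"),
    (r"Quantcastbot/[\d.]+", "Quantcast"),
    (r"FontRadar", "FontRadar"),
]


def detect_bot(user_agent):
    """Hypothetical helper: name of the first matching pattern, else None."""
    for pattern, name in NEW_BOT_PATTERNS:
        # Case-insensitive search is an assumption of this sketch.
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


# A few of the fixture user agents from the diff above.
SAMPLE_FIXTURES = {
    "keycdn-tools/perf": "KeyCDN Tools",
    "IDG/EU (http://spaziodati.eu/)": "SpazioDati",
    "GozleBot; http://gozle.com.tm": "Gozle",
    "Quantcastbot/2.0 (+http://www.quantcast.com/bot)": "Quantcast",
}

for ua, expected in SAMPLE_FIXTURES.items():
    assert detect_bot(ua) == expected, ua
```

Only patterns introduced by this PR are listed, so a user agent covered by an older rule would return `None` here.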
80 changes: 76 additions & 4 deletions regexes/bots.yml
@@ -85,14 +85,22 @@
name: 'Alexa Internet'
url: 'https://www.alexa.com'

- - regex: 'Amazonbot'
+ - regex: 'Amazonbot/[\d.]+'
name: 'Amazon Bot'
category: 'Crawler'
url: 'https://developer.amazon.com/support/amazonbot'
producer:
name: 'Amazon.com, Inc.'
url: 'https://www.amazon.com/'

- regex: 'AmazonAdBot/[\d.]+'
name: 'Amazon AdBot'
category: 'Crawler'
url: 'https://adbot.amazon.com/'
producer:
name: 'Amazon.com, Inc.'
url: 'https://www.amazon.com/'

- regex: 'Amazon[ -]Route ?53[ -]Health[ -]Check[ -]Service'
name: 'Amazon Route53 Health Check'
category: 'Service Agent'
@@ -1784,6 +1792,14 @@
name: 'WPBeginner, LLC'
url: 'https://www.wpbeginner.com/'

- regex: 'Automattic Analytics Crawler/[\d.]+'
name: 'Automattic Analytics'
category: 'Crawler'
url: 'https://wordpress.com/crawler/'
producer:
name: 'Wordpress.org'
url: 'https://wordpress.org/'

- regex: 'WordPress'
name: 'WordPress'
category: 'Service Agent'
@@ -2441,7 +2457,10 @@
- regex: 'Project-Resonance'
name: 'Project Resonance'
category: 'Crawler'
- url: 'http://project-resonance.com'
+ url: 'https://project-resonance.com/'
+ producer:
+ name: 'RedHunt Labs Limited'
+ url: 'https://redhuntlabs.com/'

- regex: 'DataXu/[\d.]+'
name: 'DataXu'
@@ -3909,11 +3928,19 @@
name: 'Shareaholic, Inc.'
url: 'https://www.shareaholic.com/'

- - regex: 'keycdn-tools'
+ - regex: 'keycdn-tools:'
name: 'KeyCDN Tools'
category: 'Service Agent'
url: 'https://tools.keycdn.com/geo'

- regex: 'keycdn-tools/'
name: 'KeyCDN Tools'
category: 'Service Agent'
url: 'https://tools.keycdn.com/'
producer:
name: 'proinity LLC'
url: 'https://www.keycdn.com/'

- regex: 'Arquivo-web-crawler'
name: 'Arquivo.pt'
category: 'Crawler'
@@ -3922,9 +3949,54 @@
name: 'FCT|FCCN'
url: 'https://www.fct.pt/'

- regex: 'WhatsMyIP\.org'
name: 'WhatsMyIP.org'
category: 'Service Agent'
url: 'https://www.whatsmyip.org/ua/'

- regex: 'SenutoBot/[\d.]+'
name: 'Senuto'
category: 'Crawler'
url: 'https://www.senuto.com/'
producer:
name: 'Senuto Sp. z o.o.'
url: 'https://www.senuto.com/'

- regex: 'spaziodati'
name: 'SpazioDati'
category: 'Crawler'
url: 'https://www.spaziodati.eu/'
producer:
name: 'SpazioDati s.r.l.'
url: 'https://www.spaziodati.eu/'

- regex: 'GozleBot'
name: 'Gozle'
category: 'Crawler'
url: 'https://gozle.com.tm/en/blog/post/1'
producer:
name: 'Doly Horjun HJ'
url: 'https://gozle.com.tm/'

- regex: 'Quantcastbot/[\d.]+'
name: 'Quantcast'
category: 'Crawler'
url: 'https://www.quantcast.com/bot/'
producer:
name: 'Quantcast Corp.'
url: 'https://www.quantcast.com/'

- regex: 'FontRadar'
name: 'FontRadar'
category: 'Crawler'
url: 'https://www.fontradar.com/'
producer:
name: 'EMDASH SAS'
url: 'https://www.fontradar.com/'

# Generic detections
- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|tweetedtimes\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|daumoa,damoa,daum,daumos,duamoa,duam,duamos|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|kirkland-signature|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
name: 'Generic Bot'

- - regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|monitor|project(?!or)|research|resolver|robots|scraper|spider|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
+ - regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|monitor|project(?!or)|research|resolver|robots|scraper|security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
name: 'Generic Bot'
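The last change adds `security` and `study` to the generic keyword rule, which appears to be what lets the two new `Generic Bot` fixtures (`survey-security-dot-txt/0.1` and the WebAuthn study agent) match. The Python sketch below compares a reduced form of the rule before and after the change. It is an approximation under stated assumptions: the `bot` branch is left out because its variable-width lookbehind is rejected by Python's `re` module, the two new keywords are appended at the end rather than in alphabetical position (which does not change whether a match exists), and matching is assumed to be case-insensitive. `generic_rule` is a hypothetical helper.

```python
import re

# Keyword alternation of the generic rule before this PR, without the
# leading "bot" branch (its lookbehind is variable-width, which Python's
# re module does not accept).
OLD_WORDS = (
    r"analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|"
    r"indexer|monitor|project(?!or)|research|resolver|robots|scraper|spider|"
    r"transcoder|uptime|user[ _]?agent|validator"
)
# The PR adds two keywords; appended here instead of sorted in.
NEW_WORDS = OLD_WORDS + r"|security|study"


def generic_rule(words):
    """Hypothetical helper: wrap keywords in the rule's prefix and suffix."""
    return re.compile(
        r"[a-z0-9_-]*(?:" + words + r")(?:[^a-z]|$)", re.IGNORECASE
    )


OLD_RULE = generic_rule(OLD_WORDS)
NEW_RULE = generic_rule(NEW_WORDS)

# The two Generic Bot fixtures added to Tests/fixtures/bots.yml.
NEW_FIXTURES = [
    "survey-security-dot-txt/0.1",
    "WebAuthn Adoption Study (Contact mb364@hdm-stuttgart.de)",
]

for ua in NEW_FIXTURES:
    print(ua, bool(OLD_RULE.search(ua)), bool(NEW_RULE.search(ua)))
```

The trailing `(?:[^a-z]|$)` requires each keyword to be followed by a non-letter or the end of the string, so a word such as `studying` does not trigger the new `study` keyword.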