The Link Checker web service runs cached and otherwise optimized broken link checks.
Routes:
- /checkUrls checks a batch at once
- /checkUrls/stream returns results as they arrive using JSON streaming
- /version returns the server version
- /stats returns the link checker stats
- /stats/domains returns detailed domain stats
- /livez, /readyz health checks
- get the binary into $GOPATH/bin:
  go get -u github.com/siemens/link-checker-service
  then start the service:
  link-checker-service serve
- download it from the releases
- run the service dockerized, without installing Go, and navigate to the sample UI:
  building from scratch:
  docker-compose up --build --force-recreate
  or using a published image:
  docker-compose up
  (replace the image tag in docker-compose.yml if necessary)
- run from source:
  go run . serve
For a website author who wants to provide link checking functionality, few options are available. Browser requests to other domains are most likely to be blocked by CORS. Building the link-checking functionality into the back-end might compromise the stability of the service through exhaustion of various resources.
Thus, to minimize risk, a link checker should be isolated into a separate service. While there are several websites providing the functionality, these may not have access to hosts on a private network, and are otherwise not under your control.
Checking whether a link is broken seems like a trivial task, but consider checking a thousand links a thousand times. Several optimizations, as well as workarounds for the peculiarities of server, gateway, CDN or proxy implementations, will need to be applied. This repository contains an implementation of such a service.
Start the server, e.g. link-checker-service serve, and send the following request body to http://localhost:8080/checkUrls:
{
"urls": [
{
"url":"https://google.com",
"context": "0"
},
{
"url":"https://ashdfkjhdf.com/kajhsd",
"context": "1"
}
]
}
e.g. via HTTPie on Windows cmd:
http POST localhost:8080/checkUrls urls:="[{"""url""":"""https://google.com""","""context""":"""0"""},{"""url""":"""https://baskldjha.com/loaksd""","""context""":"""1"""}]"
or in *sh:
http POST localhost:8080/checkUrls urls:='[{"url":"https://google.com","context":"0"},{"url":"https://baskldjha.com/loaksd","context":"1"}]'
The context field allows correlating the requests on the client side.
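The same request body can also be assembled programmatically. The following is a minimal Go sketch, not code from this repository: the type names and the buildRequest helper are hypothetical, and using the list index as the context value is only one possible correlation scheme.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// URLRequest mirrors one entry of the "urls" array in the request body.
type URLRequest struct {
	URL     string `json:"url"`
	Context string `json:"context"`
}

// CheckURLsRequest mirrors the request body sent to /checkUrls.
type CheckURLsRequest struct {
	URLs []URLRequest `json:"urls"`
}

// buildRequest is a hypothetical helper: it uses each URL's index
// in the input slice as its context, so that results can be matched
// back to their inputs on the client side.
func buildRequest(urls []string) ([]byte, error) {
	req := CheckURLsRequest{URLs: make([]URLRequest, 0, len(urls))}
	for i, u := range urls {
		req.URLs = append(req.URLs, URLRequest{URL: u, Context: strconv.Itoa(i)})
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildRequest([]string{"https://google.com", "https://baskldjha.com/loaksd"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The resulting bytes can be posted to the /checkUrls route with any HTTP client, using the application/json content type.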
Sample response:
{
"result": "complete",
"urls": [
{
"context": "1",
"error": "cannot access 'https://baskldjha.com/loaksd'... no such host",
"http_status": 528,
"status": "broken",
"timestamp": 1599132784,
"body_patterns_found": [],
"url": "https://baskldjha.com/loaksd"
},
{
"context": "0",
"error": "",
"http_status": 200,
"status": "ok",
"timestamp": 1599132784,
"body_patterns_found": [],
"url": "https://google.com"
}
]
}
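A client might decode this response and use the context field to find the entries that need attention. The sketch below is an illustration only: the struct and the brokenByContext helper are hypothetical names, with fields derived from the sample response above rather than from the service's source code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// URLResult mirrors one entry of the "urls" array in the sample response.
type URLResult struct {
	Context           string   `json:"context"`
	Error             string   `json:"error"`
	HTTPStatus        int      `json:"http_status"`
	Status            string   `json:"status"`
	Timestamp         int64    `json:"timestamp"`
	BodyPatternsFound []string `json:"body_patterns_found"`
	URL               string   `json:"url"`
}

// CheckURLsResponse mirrors the top-level sample response.
type CheckURLsResponse struct {
	Result string      `json:"result"`
	URLs   []URLResult `json:"urls"`
}

// brokenByContext is a hypothetical helper: it returns the URLs
// reported with status "broken", keyed by their context value.
func brokenByContext(body []byte) (map[string]string, error) {
	var resp CheckURLsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	broken := make(map[string]string)
	for _, r := range resp.URLs {
		if r.Status == "broken" {
			broken[r.Context] = r.URL
		}
	}
	return broken, nil
}

func main() {
	sample := []byte(`{"result":"complete","urls":[
		{"context":"1","error":"no such host","http_status":528,"status":"broken",
		 "timestamp":1599132784,"body_patterns_found":[],"url":"https://baskldjha.com/loaksd"},
		{"context":"0","error":"","http_status":200,"status":"ok",
		 "timestamp":1599132784,"body_patterns_found":[],"url":"https://google.com"}]}`)
	broken, err := brokenByContext(sample)
	if err != nil {
		panic(err)
	}
	for ctx, u := range broken {
		fmt.Printf("context %s: %s is broken\n", ctx, u)
	}
}
```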
JSON Streaming can be used to optimize the client user experience, so that the client does not have to wait for the whole check result to complete to render.
In the sample HTTPie request, post the streaming request to the /checkUrls/stream route:
http --stream POST localhost:8080/checkUrls/stream ...
URL check result objects will be streamed continuously, delimited by a newline character \n, as they become available.
These can then be rendered immediately. E.g. see the sample UI.
- For a programmatic large URL list check, see test/large_list_check, which crawls a markdown page for URLs and checks them via the running link checker service
- For an example of a simple page that uses the service to check links and displays the results using jQuery, see test/jquery_example
For up-to-date help, check link-checker-service help or link-checker-service help <command>.
To override the service port, define the PORT environment variable.
To bind to another address, configure the bindAddress option, e.g.: ... serve -a 127.0.0.1:8080
A sample configuration file is available, with most possible configuration options listed.
Start the app with the path to the configuration file: --config <path-to-config-toml>.
Most configuration values can also be overridden via environment variables in the 12-factor fashion.
The variables found in the configuration file can be upper-cased and prefixed with LCS_ to override.
Arrays of strings can be defined delimited by a space, e.g.:
LCS_CORSORIGINS="http://localhost:8080 http://localhost:8092"
For complex keys, such as HTTPClient.userAgent, take the uppercase key and replace the dot with an underscore:
LCS_HTTPCLIENT_USERAGENT="lcs/2.0"
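The naming rule can be expressed in a few lines of Go. This is only an illustration of the rule as described above, not the service's own configuration code; envName and splitList are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// envName derives the environment variable name for a configuration key:
// dots become underscores, the key is upper-cased and prefixed with LCS_.
func envName(key string) string {
	return "LCS_" + strings.ToUpper(strings.ReplaceAll(key, ".", "_"))
}

// splitList splits a space-delimited value into an array of strings.
func splitList(value string) []string {
	return strings.Fields(value)
}

func main() {
	fmt.Println(envName("corsOrigins"))
	fmt.Println(envName("HTTPClient.userAgent"))
	fmt.Println(splitList("http://localhost:8080 http://localhost:8092"))
}
```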
The server implements a simple optional authentication via JWT token validation using a public certificate (middleware: github.com/appleboy/gin-jwt).
Currently, the JWT middleware requires a dummy private certificate to be configured, even though it is not used for validation.
See the configuration file and the serve command help for detailed settings.
URLs may be checked using different methods, e.g. with an HTTP client with or without using a proxy. Depending on the connectivity available to the link checker service, the sequence of checks can be influenced via a configuration of the URL Checker Plugins.
E.g.:
urlCheckerPlugins = [
"urlcheck-noproxy",
"urlcheck",
"urlcheck-pac",
]
By default, the urlcheck plugin is used, which uses an HTTP client with a proxy, if one is configured, and without one, if not.
urlcheck-noproxy uses a client explicitly without a proxy set.
urlcheck-pac generates a client for each URL depending on the proxy configuration returned via the PAC script, configured via the pacScriptURL option. Only the first proxy returned by the PAC script will be used.
Link checker can optionally detect patterns within successful HTTP response bodies, e.g. in pages with authentication. This configuration is only possible via the configuration file:
# enable searching for patterns here
searchForBodyPatterns = true
# define Go Regex patterns and their names in this manner
[[bodyPatterns]]
name = "authentication redirect"
regex = "Authentication Redirect"
[[bodyPatterns]]
name = "google"
regex = "google"
The names of the found patterns will be available in the URL check results.
E.g. when a proxy is needed for the HTTP client, see the sample .link-checker-service.toml, and start the server with the argument: --config .link-checker-service.toml
Alternatively, set the client proxy via an environment variable: LCS_PROXY=http://myproxy:8080
The checker stats can be obtained via the /stats route. The stats are simple counts of situations encountered.
If multiple checkers are configured, e.g. one going through a proxy and one not, the counts of the link checker events will contain the calls of both for now.
Detailed domain stats at /stats/domains are tracked only upon completion or termination of the outgoing requests, meaning that a returned cached result won't add a result. When a cached entry will be tracked again is determined by the retryFailedAfter setting for failed entries, and by cacheExpirationInterval for ok ones. Note also that if multiple plugins are used (e.g. with and without a proxy), each result is tracked separately.
For development, see development.md.
Rate limiting based on IPs can be turned on in the configuration via a rate specification. See ulule/limiter.
Blocked IPs will run into HTTP 429, and will be unblocked after the sliding window duration passes:
hey -m POST -n 1000 -c 200 -T "application/json" -t 30 -D sample_request_body.json http://localhost:8080/checkUrls
with a limit of 10-S:
Status code distribution:
[200] 10 responses
[429] 990 responses
- Go (1.19)
- see go.mod
Alternatives that are not URL list check web services:
- HTML/Markdown crawlers & checkers
  - https://github.com/stevenvachon/broken-link-checker
  - https://github.com/JustinBeckwith/linkinator
  - https://github.com/bmuschko/link-verifier
  - https://github.com/raviqqe/liche
  - https://github.com/raviqqe/muffet
  - https://github.com/victoriadrake/hydra-link-checker
  - https://github.com/tcort/markdown-link-check
- URL checkers
  - some URL check services exist, albeit not open source (as of 02.09.2020)
Copyright 2020-2023 Siemens AG and contributors as noted in the AUTHORS file.
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/
The following sample code folders are licensed under Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
The testing work-around for streaming responses has been adapted from gin (Copyright Manu Martinez-Almeida, MIT License)
The external hyperlinks found in this repository, and the information contained therein, do not constitute endorsement by the authors, and are used either for documentation purposes, or as examples.