script: detect broken "More information" links #12289
Comments
Note: I'm not writing to /tmp/bad-urls.txt for Windows compatibility, but the user is free to change this in the script.
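For reference, a portable way to get a temp-file default on both Linux and Windows would be Python's tempfile module; this is only a sketch, not something the posted script does:

import tempfile
from pathlib import Path

# tempfile.gettempdir() resolves to /tmp on Linux/macOS and %TEMP% on Windows,
# so the same default output path works on every platform.
default_output = Path(tempfile.gettempdir()) / "bad-urls.txt"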
Hey, this is cool! Does it handle rate limiting, to avoid hosts blocking it? This was a key issue with the design of the current script, IIRC.
It doesn't handle it currently; sometimes I have to wait a while before running the script again, but IIRC it can eventually do an entire run without being blocked if not preceded by many runs. Either way, we can certainly add this to the code :)
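A simple way to add it, sketched here as a hypothetical replacement for the script's check_url (the function name, retry count, and delays are assumptions, not an agreed design), is to retry with an increasing delay whenever a host answers 429 or 503:

import asyncio

import aiohttp


async def check_url_with_backoff(
    url: str, session: aiohttp.ClientSession, retries: int = 3
) -> int:
    """HEAD the URL, backing off and retrying when the host rate-limits us."""
    delay = 1.0
    for attempt in range(retries):
        async with session.head(url) as response:
            # Anything other than 429/503 is a definitive answer; so is the last attempt.
            if response.status not in (429, 503) or attempt == retries - 1:
                return response.status
        # 429/503 usually means "slow down": wait, then try again with a longer delay.
        await asyncio.sleep(delay)
        delay *= 2
    return -1  # only reached if retries == 0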
I think the right place for this script is in scripts/, given that its complexity has increased a bit; that would also make contributions easier. That way it's possible to add other features too, like regex matching for URLs and automatic link updating for redirects.
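For the automatic link-updating idea, a rough sketch of the two pieces it needs, resolving the redirect and rewriting the page (the helper names here are illustrative only), could look like this:

import aiohttp


async def resolve_redirect(url: str, session: aiohttp.ClientSession) -> str | None:
    """Return the final URL if the server redirects us somewhere else."""
    async with session.head(url, allow_redirects=True) as response:
        final_url = str(response.url)
        return final_url if final_url != url else None


def update_more_info_line(content: str, old_url: str, new_url: str) -> str:
    """Rewrite the '> More information: <...>' line to point at the new URL."""
    return content.replace(
        f"> More information: <{old_url}>", f"> More information: <{new_url}>"
    )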
Thanks, nice work! Unfortunately, the script errors out on my machine: I'm running Ubuntu, btw, so there is a limit on how many files can be opened simultaneously (though I'm not sure of the exact limit). EDIT: fixed it with a semaphore:

#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
import asyncio
import re

import aiohttp
from aioconsole import aprint
from aiofile import AIOFile, Writer
from aiopath import AsyncPath

MAX_CONCURRENCY = 500
sem = asyncio.Semaphore(MAX_CONCURRENCY)


async def find_md_files(search_path: AsyncPath) -> set[AsyncPath]:
    """Find all .md files in the specified search path."""
    md_files = set()
    # Debug output: print the top-level directories as they are scanned.
    async for path_dir in search_path.glob("*"):
        await aprint(path_dir.name)
    async for file in search_path.glob("*/*.md"):
        md_files.add(file)
    return md_files


async def append_if_is_file(path_list: set[AsyncPath], path: AsyncPath) -> None:
    """Add the path to the set if it is an existing file."""
    if await path.is_file():
        path_list.add(path)


async def filter_files(md_files: set[AsyncPath]) -> set[AsyncPath]:
    """Filter out non-file paths from the set."""
    filtered_files: set[AsyncPath] = set()
    await asyncio.gather(
        *(append_if_is_file(filtered_files, path) for path in md_files)
    )
    return filtered_files


async def process_file(
    file: AsyncPath,
    writer: Writer,
    output_file: AsyncPath,
    session: aiohttp.ClientSession,
) -> None:
    """Extract the link of a single .md file and check it."""
    async with sem:  # cap concurrent open files/connections
        async with file.open("r") as f:
            try:
                content = await f.read()
            except Exception:
                await aprint(file.parts[-3:])
                return
        url = extract_link(content)
        if url is not None:
            await check_url_and_write_if_bad(url, writer, output_file, session)


def extract_link(content: str) -> str | None:
    """Extract the link of '> More information: '."""
    return next(
        (
            match.group(1)
            for match in re.finditer(r"> More information: <(.+)>", content)
        ),
        None,
    )


async def check_url_and_write_if_bad(
    url: str, writer: Writer, output_file: AsyncPath, session: aiohttp.ClientSession
) -> None:
    """Check URL status and write bad URLs to a file."""
    await aprint(f"??? {url}")
    code = -1
    try:
        code = await check_url(url, session)
    except aiohttp.ClientError as exc:
        if hasattr(exc, "strerror"):
            await aprint(f"\033[31m{exc.strerror}\033[0m")
        if hasattr(exc, "message"):
            await aprint(f"\033[31m{exc.message}\033[0m")
        else:
            await aprint(f"\033[31m{exc}\033[0m")
    await aprint(f"{code} {url}")
    if 200 > code or code >= 400:
        await writer(f"{code}|{url}\n")


async def check_url(url: str, session: aiohttp.ClientSession) -> int:
    """Get the status code of a URL."""
    async with session.head(url) as response:
        return response.status


async def find_and_write_bad_urls(
    output_file: AsyncPath, search_path: str = "."
) -> None:
    """Find and write bad URLs to a specified file."""
    search_path = AsyncPath(search_path)
    await aprint("Getting pages...")
    md_files = await filter_files(await find_md_files(search_path))
    await aprint("Found all pages!")
    async with AIOFile(output_file.name, "a") as afp:
        writer = Writer(afp)
        async with aiohttp.ClientSession(
            trust_env=True, timeout=aiohttp.ClientTimeout(total=500)
        ) as session:
            await asyncio.gather(
                *(process_file(file, writer, output_file, session) for file in md_files)
            )
            await afp.fsync()


async def main() -> None:
    await find_and_write_bad_urls(AsyncPath("bad-urls.txt"), search_path="./pages")


if __name__ == "__main__":
    asyncio.run(main())

Also, Manned.org seems to do rate limiting, so we should definitely implement this, ideally on a per-domain basis. For example, I get 503 (Service Unavailable) error codes on some existing manned.org pages.
Thanks for adding this limit. As for the per-domain rate limit, I have an idea: split the links into lists, where each list contains the links belonging to a specific domain, then alternate between the lists, putting them in the right order and using asyncio.sleep to respect each domain's timeout.
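A minimal sketch of that idea (the per-domain delay value and the helper names are assumptions; check_url_and_write_if_bad is the function from the script above):

import asyncio
from urllib.parse import urlsplit

import aiohttp
from aiofile import Writer
from aiopath import AsyncPath

PER_DOMAIN_DELAY = 1.0  # seconds between requests to the same host (assumed value)


def group_by_domain(urls: list[str]) -> dict[str, list[str]]:
    """Bucket URLs by host so each host gets its own request queue."""
    buckets: dict[str, list[str]] = {}
    for url in urls:
        buckets.setdefault(urlsplit(url).netloc, []).append(url)
    return buckets


async def check_domain(
    urls: list[str],
    writer: Writer,
    output_file: AsyncPath,
    session: aiohttp.ClientSession,
) -> None:
    """Walk one host's URLs sequentially, sleeping between requests."""
    for url in urls:
        await check_url_and_write_if_bad(url, writer, output_file, session)
        await asyncio.sleep(PER_DOMAIN_DELAY)


async def check_all_urls(
    urls: list[str],
    writer: Writer,
    output_file: AsyncPath,
    session: aiohttp.ClientSession,
) -> None:
    """Different hosts run concurrently; requests to the same host are spaced out."""
    await asyncio.gather(
        *(
            check_domain(domain_urls, writer, output_file, session)
            for domain_urls in group_by_domain(urls).values()
        )
    )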
I think this script could also be useful for https://github.com/tldr-pages/tldr-maintenance, to help contributors spot broken links instead of checking them locally. I proposed this idea a while ago, but back then the current script in the wiki was not ideal for a lot of runs.
@vitorhcl that approach sounds good to me! I think we would want to bake that into the script, especially if we are to add it to tldr-maintenance. Perhaps for future edits you would like to open a pull request, so we can track history as the script evolves rather than confining it to this issue? Thanks so much again!
This has been implemented in tldr-maintenance (tldr-pages/tldr-maintenance#130).
I wrote a Python script to detect broken "More information" links that is absurdly faster than the one-liner we have on the wiki, using asynchronous code (aiohttp, aiopath and aioconsole). Should I open a PR to put it in scripts/, or put it in the wiki? Here it is:
Edit: I forgot to remove 2 test lines 😅
Update: now it writes to bad-urls.txt sequentially using AIOFile.Writer, and doesn't write partial text anymore