Merge branch 'master' into linting
LunarWatcher committed Aug 11, 2024
2 parents c88193e + c48b50a commit 306addf
Showing 6 changed files with 169 additions and 29 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,5 +1,7 @@
/config.json
compile_commands.json
*.xpi
/stackexchange_*/

# Contains the downloaded data dump
/downloads/
31 changes: 13 additions & 18 deletions README.md
@@ -2,8 +2,6 @@

[![Data dump transformer build](https://github.com/LunarWatcher/se-data-dump-transformer/actions/workflows/transformer.yml/badge.svg)](https://github.com/LunarWatcher/se-data-dump-transformer/actions/workflows/transformer.yml) [![Stackapps listing](https://img.shields.io/badge/StackApps%20listing-FF9900)](https://stackapps.com/q/10591/69829)

**NOTE:** This repo does not yet gather the data dump, as it has not yet been released in the new format. It currently contains the scaffolding required to deal with some of SE's bullshit, to make sure it can be quickly adapted to actually download the data dump parts when they become available.

**Disclaimer:** This project is not affiliated with Stack Exchange, Inc.

## Background
@@ -51,7 +49,6 @@ This list contains converter tools that work on all sites and all tables.
| --- | --- | --- | --- |
| Maxwell175 | SQLite, Postgres, MSSQL | Partially[^2] | [AGPL-3.0](https://github.com/Maxwell175/StackExchangeDumpConverter) |

[^2]: Only Postgres at the time of writing, with more planned

### Other data dump distributions and conversion tools

@@ -62,22 +59,28 @@ For completeness (well, sort of, none of these lists are exhaustive), this is a
| Brent Ozar | [MSSQL](https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/) | Yes | [MIT-licensed](https://github.com/BrentOzarULTD/soddi) | Stack Overflow only | All tables |
| Jason Punyon | [SQLite](https://seqlite.puny.engineering/) | No | Closed-source[^1] | All sites | Posts only |

[^1]: I've been unable to find the generator code, but I've also been unable to find a statement confirming that it's closed-source. It's possible it is open-source, but if it is, it's hard to find the source

## Using the downloader


Note that it's strongly encouraged that you use a venv. To set one up, run `python3 -m venv env`. After that, you'll need to activate it with one of the activation scripts (for example, `source env/bin/activate` on Linux and macOS); run the appropriate one for your operating system. If you're not sure what the scripts are called, you can find them in `./env/bin`

### Requirements

* Python 3.10 or newer[^3]
* `pip3 install -r requirements.txt`
* Lots of storage. The 2024Q1 data dump was 92GB compressed, and uncompressed, converted files are cached on disk before being compressed. The Stack Overflow data dump may take several hundred gigabytes of cache storage while the conversion process is happening.
* Lots of storage. The 2024Q1 data dump was 92GB compressed.
* A display you can access somehow (physical or virtual, but you need to be able to see it), so that you can solve captchas
* Email and password login for Stack Exchange - Google, Facebook, GitHub, and other login methods are not supported, and will not be supported.
* If you don't have this, see [this meta question](https://meta.stackexchange.com/a/1847/332043) for instructions.
* Firefox installed
* Snap and flatpak users may run into problems; it's strongly recommended to have a non-snap/flatpak installation of Firefox and Geckodriver.
* Known errors:
* "The geckodriver version may not be compatible with the detected firefox version" - update Firefox and Geckodriver. If this still doesn't work, consider switching to a non-snap installation of Firefox and Geckodriver.
* "Your Firefox profile cannot be loaded" - One of Geckodriver or Firefox is Snap-based, while the other is not. [Consider switching to a non-snap installation](https://stackoverflow.com/a/72531719/6296561) of Firefox, or verifying that your PATH is set correctly.
* If you need to manually install Geckodriver (which shouldn't normally be necessary; it's often bundled with Firefox in one way or another), the binaries are on [GitHub](https://github.com/mozilla/geckodriver/releases)

The downloader does **not** support Docker due to the display requirement.


### Config, running, and what to expect

#### Configuring and starting
@@ -87,16 +90,6 @@ The downloader does **not** support Docker due to the display requirement.
3. Open `config.json`, and edit in the values. The values are described within the JSON file itself.
4. Run the extractor with `python3 -m sedd`. If you're on Windows, you may need to run `python -m sedd` instead.
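
For reference, a minimal `config.json` for the code in this commit might look like the sketch below. Only the `email` and `password` keys are read directly by `sedd/main.py`; the values here are placeholders, and any further options described inside the shipped JSON file are omitted.

```json
{
    "email": "you@example.com",
    "password": "your-se-password"
}
```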

##### Download modes (not yet implemented)

There are two download modes:
* `key TBA`: Starts downloading data dumps as soon as the URLs become available, but at the expense of download performance of individual files. This means up to ~365 concurrent downloads, though that number will go down rather quickly due to the many small data dumps.

This is both the default **and the (unofficially) recommended way** to download the data dumps.

If SE wanted to avoid this, they could've [bothered implementing combined main + meta downloads](https://meta.stackexchange.com/questions/401324/announcing-a-change-to-the-data-dump-process?cb=1#comment1340364_401324), or even better, a "download all" button, before pushing this utter crap.
* `key TBA`: Downloads one data dump at a time, maximising the download speed for each individual data dump. Recommended if you're on an unstable or slow internet connection, or want to start converting the dump progressively as new entries appear.

#### Captchas and other misc. barriers

This software is designed around Selenium, a browser automation tool. This does, however, mean that the program can be stopped by various bot defenses. This would happen even if you downloaded all the [~183 data dumps](https://stackexchange.com/sites#questionsperday) fully by hand, because it's a _lot_ of repeated operations.
@@ -211,4 +204,6 @@ The code is under the MIT license; see the `LICENSE` file.

The data downloaded and produced is under various versions of [CC-By-SA](https://stackoverflow.com/help/licensing), as per Stack Exchange's licensing rules, in addition to whatever extra rules they try to impose on the data dump.


[^1]: I've been unable to find the generator code, but I've also been unable to find a statement confirming that it's closed-source. It's possible it is open-source, but if it is, it's hard to find the source
[^2]: Only Postgres at the time of writing, with more planned
[^3]: Might work with earlier versions, but these are untested and not supported
3 changes: 1 addition & 2 deletions requirements.txt
@@ -1,3 +1,2 @@
selenium==4.22.0
selenium==4.23.1
desktop-notifier==5.0.1
py7zr==0.21.1
6 changes: 5 additions & 1 deletion sedd/data/sites.py
@@ -1,3 +1,5 @@
# Beta sites don't have data dumps. They're included in this comment for the record:
# https://cs50.stackexchange.com
sites = [
    "https://3dprinting.stackexchange.com",
    "https://academia.stackexchange.com",
@@ -36,7 +38,6 @@
    "https://crafts.stackexchange.com",
    "https://crypto.stackexchange.com",
    "https://cs.stackexchange.com",
    "https://cs50.stackexchange.com",
    "https://cseducators.stackexchange.com",
    "https://cstheory.stackexchange.com",
    "https://datascience.stackexchange.com",
@@ -183,3 +184,6 @@
    "https://worldbuilding.stackexchange.com",
    "https://writing.stackexchange.com",
]

# For testing
# sites = ["https://stackoverflow.com"]
133 changes: 125 additions & 8 deletions sedd/main.py
@@ -2,30 +2,83 @@
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.webdriver import WebDriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from typing import Dict

from sedd.data import sites
from time import sleep
import json
import urllib.request

from .meta import notifications
import re
import os

import argparse
from . import utils

parser = argparse.ArgumentParser(
    prog="sedd",
    description="Automatic (unofficial) SE data dump downloader for the anti-community data dump format",
)

parser.add_argument(
    "-o", "--outputDir",
    required=False,
    dest="output_dir",
    default=os.path.join(os.getcwd(), "downloads")
)
parser.add_argument(
    "--dry-run",
    required=False,
    default=False,
    action="store_true",
    dest="dry_run"
)

args = parser.parse_args()

def get_download_dir():
    download_dir = args.output_dir

    if not os.path.exists(download_dir):
        os.makedirs(download_dir)

    print(download_dir)

    return download_dir

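# Firefox download preferences: folderList=2 selects a custom download
# directory, and neverAsk.saveToDisk suppresses the save dialog for the
# dump's gzip MIME type.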
options = Options()
options.enable_downloads = True
options.set_preference("browser.download.folderList", 2)
options.set_preference("browser.download.manager.showWhenStarting", False)
options.set_preference("browser.download.dir", "./downloads")
options.set_preference("browser.download.dir", get_download_dir())
options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-gzip")

browser = webdriver.Firefox(
    options=options
)
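# Install uBlock Origin into the automated profile, presumably to keep ads
# and trackers from slowing down or breaking the scripted page interactions.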
if not os.path.exists("ubo.xpi"):
    print("Downloading uBO")
    urllib.request.urlretrieve(
        "https://github.com/gorhill/uBlock/releases/download/1.59.0/uBlock0_1.59.0.firefox.signed.xpi",
        "ubo.xpi"
    )


ubo_id = browser.install_addon("ubo.xpi", temporary=True)

with open("config.json", "r") as f:
    config = json.load(f)

email = config["email"]
password = config["password"]

def kill_cookie_shit(browser: WebDriver):
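    # Wait for the OneTrust consent banner to render, then remove it from the
    # DOM so it can't block the scripted clicks.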
    sleep(3)
    browser.execute_script("""let elem = document.getElementById("onetrust-banner-sdk"); if (elem) { elem.parentNode.removeChild(elem); }""")
    sleep(1)

def is_logged_in(browser: WebDriver, site: str):
    url = f"{site}/users/current"
    browser.get(url)
@@ -50,17 +103,20 @@ def login_or_create(browser: WebDriver, site: str):
        email_elem.send_keys(email)
        password_elem.send_keys(password)

        curr_url = browser.current_url
        browser.find_element(By.ID, "submit-button").click()
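        # Poll until the URL changes; SE navigates away from the login form
        # once the submission has been processed.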
        while browser.current_url == curr_url:
            sleep(3)

        captchaWalled = False
        captcha_walled = False
        while "/nocaptcha" in browser.current_url:
            if not captchaWalled:
                captchaWalled = True
            if not captcha_walled:
                captcha_walled = True

                notifications.notify("Captcha wall hit during login", config)
            sleep(10)

        if captchaWalled:
        if captcha_walled:
            continue

        if not is_logged_in(browser, site):
@@ -69,14 +125,75 @@ def login_or_create(browser: WebDriver, site: str):
        break


def download_data_dump(browser: WebDriver, site: str):
    print("Downloading is not yet implemented")
def download_data_dump(browser: WebDriver, site: str, etags: Dict[str, str]):
    print(f"Downloading data dump from {site}")

    def _exec_download(browser: WebDriver):
        kill_cookie_shit(browser)
        try:
            checkbox = browser.find_element(By.ID, "datadump-agree-checkbox")
            btn = browser.find_element(By.ID, "datadump-download-button")
        except NoSuchElementException:
            raise RuntimeError(f"Bad site: {site}")

        if args.dry_run:
            return

        browser.execute_script("""
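        // Monkey-patch fetch() so that when the page requests the /link
        // endpoint, the signed download URL from the JSON response is stashed
        // in window.extractedUrl, where the Python side can read it back.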
        (function() {
            let oldFetch = window.fetch;
            window.fetch = (url, opts) => {
                let promise = oldFetch(url, opts);
                if (url.includes("/link")) {
                    promise.then(res => {
                        res.clone().json().then(json => {
                            window.extractedUrl = json["url"];
                            console.log(extractedUrl);
                        });
                        return res;
                    });
                    return new Promise(resolve => setTimeout(resolve, 4000))
                        .then(_ => promise);
                }
                return promise;
            };
        })();
        """)

        checkbox.click()
        sleep(1)
        btn.click()
        sleep(2)
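        # Read back the URL captured by the patched fetch() above, then record
        # its ETag so the download can be verified later.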
        url = browser.execute_script("return window.extractedUrl;")
        utils.extract_etag(url, etags)

        sleep(5)


    browser.get(f"{site}/users/data-dump-access/current")
    _exec_download(browser)

    if site not in ["https://meta.stackexchange.com", "https://stackapps.com"]:
        # https://regex101.com/r/kG6nTN/1
        meta_url = re.sub(r"(https://(?:[^.]+\.(?=stackexchange))?)", r"\1meta.", site)
        print(meta_url)
        browser.get(f"{meta_url}/users/data-dump-access/current")
        _exec_download(browser)

etags: Dict[str, str] = {}

for site in sites.sites:
    print(f"Extracting from {site}...")

    login_or_create(browser, site)
    download_data_dump(
        browser,
        site
        site,
        etags
    )

# TODO: replace with validation once downloading is verified done
# (or export for separate, later verification)
# Though keeping it here, removing files and re-running downloads feels like a better idea
print(etags)
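
As an aside (not part of the commit), the meta-URL rewrite in `download_data_dump` uses a lookahead to tell `*.stackexchange.com` subdomains apart from sites on their own domain. A quick sketch of what the substitution does:

```python
import re

# Same pattern as used in download_data_dump: https://regex101.com/r/kG6nTN/1
pattern = r"(https://(?:[^.]+\.(?=stackexchange))?)"

# Subdomain sites get "meta." inserted after the site name:
print(re.sub(pattern, r"\1meta.", "https://academia.stackexchange.com"))
# -> https://academia.meta.stackexchange.com

# Sites on their own domain get "meta." prefixed to the host instead:
print(re.sub(pattern, r"\1meta.", "https://stackoverflow.com"))
# -> https://meta.stackoverflow.com
```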
23 changes: 23 additions & 0 deletions sedd/utils.py
@@ -0,0 +1,23 @@
from typing import Dict
import requests as r
from urllib.parse import urlparse
import os.path

def extract_etag(url: str, etags: Dict[str, str]):
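    # stream=True means only the response headers are fetched; the body is
    # never read, so grabbing the ETag doesn't download the file.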
    res = r.get(
        url,
        stream=True
    )
    if res.status_code != 200:
        raise RuntimeError(f"Panic: failed to get {url}: {res.status_code}")

    etag = res.headers["ETag"]
    res.close()

    parsed_url = urlparse(url)
    path = parsed_url.path
    filename = os.path.basename(path)

    etags[filename] = etag

    print(f"ETag for {filename}: {etag}")
