Merge branch 'master' into linting
LunarWatcher committed Aug 11, 2024
2 parents c88193e + c48b50a commit 306addf
Showing 6 changed files with 169 additions and 29 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,5 +1,7 @@
/config.json
compile_commands.json
*.xpi
/stackexchange_*/

# Contains the downloaded data dump
/downloads/
31 changes: 13 additions & 18 deletions README.md
@@ -2,8 +2,6 @@

[![Data dump transformer build](https://github.com/LunarWatcher/se-data-dump-transformer/actions/workflows/transformer.yml/badge.svg)](https://github.com/LunarWatcher/se-data-dump-transformer/actions/workflows/transformer.yml) [![Stackapps listing](https://img.shields.io/badge/StackApps%20listing-FF9900)](https://stackapps.com/q/10591/69829)

**NOTE:** This repo does not yet gather the data dump, as it has not yet been released in the new format. It currently contains the scaffolding required to deal with some of SE's bullshit, to make sure it can be quickly adapted to actually download the data dump parts when they become available.

**Disclaimer:** This project is not affiliated with Stack Exchange, Inc.

## Background
@@ -51,7 +49,6 @@ This list contains converter tools that work on all sites and all tables.
| --- | --- | --- | --- |
| Maxwell175 | SQLite, Postgres, MSSQL | Partially[^2] | [AGPL-3.0](https://github.com/Maxwell175/StackExchangeDumpConverter) |

[^2]: Only Postgres at the time of writing, with more planned

### Other data dump distributions and conversion tools

@@ -62,22 +59,28 @@ For completeness (well, sort of, none of these lists are exhaustive), this is a
| Brent Ozar | [MSSQL](https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/) | Yes | [MIT-licensed](https://github.com/BrentOzarULTD/soddi) | Stack Overflow only | All tables |
| Jason Punyon | [SQLite](https://seqlite.puny.engineering/) | No | Closed-source[^1] | All sites | Posts only |

[^1]: I've been unable to find the generator code, but I've also been unable to find a statement confirming that it's closed-source. It's possible it is open-source, but if it is, it's hard to find the source

## Using the downloader


Note that it's strongly encouraged that you use a venv. To set one up, run `python3 -m venv env`. After that, you'll need to activate it with one of the activation scripts (for example, `source env/bin/activate` on Linux and macOS); run the appropriate one for your operating system. If you're not sure what the scripts are called, you can find them in `./env/bin`

### Requirements

* Python 3.10 or newer[^3]
* `pip3 install -r requirements.txt`
* Lots of storage. The 2024Q1 data dump was 92GB compressed, and uncompressed, converted files are cached on disk before being compressed. The Stack Overflow data dump may take several hundred gigabytes of cache storage while the conversion process is happening.
* Lots of storage. The 2024Q1 data dump was 92GB compressed.
* A display you can access somehow (physical or virtual, but you need to be able to see it), so that you can solve captchas
* Email and password login for Stack Exchange - Google, Facebook, GitHub, and other login methods are not supported, and will not be supported.
* If you don't have this, see [this meta question](https://meta.stackexchange.com/a/1847/332043) for instructions.
* Firefox installed
* Snap and flatpak users may run into problems; it's strongly recommended to have a non-snap/flatpak installation of Firefox and Geckodriver.
* Known errors:
* "The geckodriver version may not be compatible with the detected firefox version" - update Firefox and Geckodriver. If this still doesn't work, consider switching to a non-snap installation of Firefox and Geckodriver.
* "Your Firefox profile cannot be loaded" - One of Geckodriver or Firefox is Snap-based, while the other is not. [Consider switching to a non-snap installation](https://stackoverflow.com/a/72531719/6296561) of Firefox, or verifying that your PATH is set correctly.
* If you need to manually install Geckodriver (which shouldn't normally be necessary; it's often bundled with Firefox in one way or another), the binaries are on [GitHub](https://github.com/mozilla/geckodriver/releases)

The downloader does **not** support Docker due to the display requirement.


### Config, running, and what to expect

#### Configuring and starting
@@ -87,16 +90,6 @@ The downloader does **not** support Docker due to the display requirement.
3. Open `config.json`, and edit in the values. The values are described within the JSON file itself.
4. Run the extractor with `python3 -m sedd`. If you're on Windows, you may need to run `python -m sedd` instead.
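
For reference, a minimal `config.json` for the code in this commit might look like the sketch below. Only the `email` and `password` keys are read directly by `sedd/main.py`; the values here are placeholders, and any further options described inside the shipped JSON file are omitted.

```json
{
    "email": "you@example.com",
    "password": "your-se-password"
}
```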

##### Download modes (not yet implemented)

There are two download modes:
* `key TBA`: Starts downloading data dumps as soon as the URLs become available, but at the expense of download performance of individual files. This means up to ~365 concurrent downloads, though that number will go down rather quickly due to the many small data dumps.

This is both the default **and the (unofficially) recommended way** to download the data dumps.

If SE wanted to avoid this, they could've [bothered implementing combined main + meta downloads](https://meta.stackexchange.com/questions/401324/announcing-a-change-to-the-data-dump-process?cb=1#comment1340364_401324), or even better, a "download all" button, before pushing this utter crap.
* `key TBA`: Downloads one data dump at a time, maximising the download speed for each individual data dump. Recommended if you're on an unstable or slow internet connection, or want to start converting the dump progressively as new entries appear.

#### Captchas and other misc. barriers

This software is designed around Selenium, a browser automation tool. This does, however, mean that the program can be stopped by various bot defenses. This would happen even if you downloaded all the [~183 data dumps](https://stackexchange.com/sites#questionsperday) fully by hand, because it's a _lot_ of repeated operations.
@@ -211,4 +204,6 @@ The code is under the MIT license; see the `LICENSE` file.

The data downloaded and produced is under various versions of [CC-By-SA](https://stackoverflow.com/help/licensing), as per Stack Exchange's licensing rules, in addition to whatever extra rules they try to impose on the data dump.


[^1]: I've been unable to find the generator code, but I've also been unable to find a statement confirming that it's closed-source. It's possible it is open-source, but if it is, it's hard to find the source
[^2]: Only Postgres at the time of writing, with more planned
[^3]: Might work with earlier versions, but these are untested and not supported
3 changes: 1 addition & 2 deletions requirements.txt
@@ -1,3 +1,2 @@
selenium==4.22.0
selenium==4.23.1
desktop-notifier==5.0.1
py7zr==0.21.1
6 changes: 5 additions & 1 deletion sedd/data/sites.py
@@ -1,3 +1,5 @@
# Beta sites don't have data dumps. They're included in this comment for the record:
# https://cs50.stackexchange.com
sites = [
    "https://3dprinting.stackexchange.com",
    "https://academia.stackexchange.com",
@@ -36,7 +38,6 @@
    "https://crafts.stackexchange.com",
    "https://crypto.stackexchange.com",
    "https://cs.stackexchange.com",
    "https://cs50.stackexchange.com",
    "https://cseducators.stackexchange.com",
    "https://cstheory.stackexchange.com",
    "https://datascience.stackexchange.com",
@@ -183,3 +184,6 @@
    "https://worldbuilding.stackexchange.com",
    "https://writing.stackexchange.com",
]

# For testing
# sites = ["https://stackoverflow.com"]
133 changes: 125 additions & 8 deletions sedd/main.py
@@ -2,30 +2,83 @@
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.webdriver import WebDriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
from typing import Dict

from sedd.data import sites
from time import sleep
import json
import urllib.request

from .meta import notifications
import re
import os

import argparse
from . import utils

parser = argparse.ArgumentParser(
    prog="sedd",
    description="Automatic (unofficial) SE data dump downloader for the anti-community data dump format",
)

parser.add_argument(
    "-o", "--outputDir",
    required=False,
    dest="output_dir",
    default=os.path.join(os.getcwd(), "downloads")
)
parser.add_argument(
    "--dry-run",
    required=False,
    default=False,
    action="store_true",
    dest="dry_run"
)

args = parser.parse_args()

def get_download_dir():
    download_dir = args.output_dir

    if not os.path.exists(download_dir):
        os.makedirs(download_dir)

    print(download_dir)

    return download_dir

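# Firefox download preferences: folderList=2 selects a custom download
# directory, and neverAsk.saveToDisk suppresses the save dialog for the
# dump's gzip MIME type.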
options = Options()
options.enable_downloads = True
options.set_preference("browser.download.folderList", 2)
options.set_preference("browser.download.manager.showWhenStarting", False)
options.set_preference("browser.download.dir", "./downloads")
options.set_preference("browser.download.dir", get_download_dir())
options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-gzip")

browser = webdriver.Firefox(
    options=options
)
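# Install uBlock Origin into the automated profile, presumably to keep ads
# and trackers from slowing down or breaking the scripted page interactions.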
if not os.path.exists("ubo.xpi"):
    print("Downloading uBO")
    urllib.request.urlretrieve(
        "https://github.com/gorhill/uBlock/releases/download/1.59.0/uBlock0_1.59.0.firefox.signed.xpi",
        "ubo.xpi"
    )


ubo_id = browser.install_addon("ubo.xpi", temporary=True)

with open("config.json", "r") as f:
    config = json.load(f)

email = config["email"]
password = config["password"]

def kill_cookie_shit(browser: WebDriver):
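    # Wait for the OneTrust consent banner to render, then remove it from the
    # DOM so it can't block the scripted clicks.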
    sleep(3)
    browser.execute_script("""let elem = document.getElementById("onetrust-banner-sdk"); if (elem) { elem.parentNode.removeChild(elem); }""")
    sleep(1)

def is_logged_in(browser: WebDriver, site: str):
    url = f"{site}/users/current"
    browser.get(url)
@@ -50,17 +103,20 @@ def login_or_create(browser: WebDriver, site: str):
        email_elem.send_keys(email)
        password_elem.send_keys(password)

        curr_url = browser.current_url
        browser.find_element(By.ID, "submit-button").click()
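        # Poll until the URL changes; SE navigates away from the login form
        # once the submission has been processed.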
        while browser.current_url == curr_url:
            sleep(3)

        captchaWalled = False
        captcha_walled = False
        while "/nocaptcha" in browser.current_url:
            if not captchaWalled:
                captchaWalled = True
            if not captcha_walled:
                captcha_walled = True

                notifications.notify("Captcha wall hit during login", config)
            sleep(10)

        if captchaWalled:
        if captcha_walled:
            continue

        if not is_logged_in(browser, site):
@@ -69,14 +125,75 @@ def login_or_create(browser: WebDriver, site: str):
        break


def download_data_dump(browser: WebDriver, site: str):
    print("Downloading is not yet implemented")
def download_data_dump(browser: WebDriver, site: str, etags: Dict[str, str]):
    print(f"Downloading data dump from {site}")

    def _exec_download(browser: WebDriver):
        kill_cookie_shit(browser)
        try:
            checkbox = browser.find_element(By.ID, "datadump-agree-checkbox")
            btn = browser.find_element(By.ID, "datadump-download-button")
        except NoSuchElementException:
            raise RuntimeError(f"Bad site: {site}")

        if args.dry_run:
            return

        browser.execute_script("""
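        // Monkey-patch fetch() so that when the page requests the /link
        // endpoint, the signed download URL from the JSON response is stashed
        // in window.extractedUrl, where the Python side can read it back.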
        (function() {
            let oldFetch = window.fetch;
            window.fetch = (url, opts) => {
                let promise = oldFetch(url, opts);
                if (url.includes("/link")) {
                    promise.then(res => {
                        res.clone().json().then(json => {
                            window.extractedUrl = json["url"];
                            console.log(extractedUrl);
                        });
                        return res;
                    });
                    return new Promise(resolve => setTimeout(resolve, 4000))
                        .then(_ => promise);
                }
                return promise;
            };
        })();
        """)

        checkbox.click()
        sleep(1)
        btn.click()
        sleep(2)
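        # Read back the URL captured by the patched fetch() above, then record
        # its ETag so the download can be verified later.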
        url = browser.execute_script("return window.extractedUrl;")
        utils.extract_etag(url, etags)

        sleep(5)


    browser.get(f"{site}/users/data-dump-access/current")
    _exec_download(browser)

    if site not in ["https://meta.stackexchange.com", "https://stackapps.com"]:
        # https://regex101.com/r/kG6nTN/1
        meta_url = re.sub(r"(https://(?:[^.]+\.(?=stackexchange))?)", r"\1meta.", site)
        print(meta_url)
        browser.get(f"{meta_url}/users/data-dump-access/current")
        _exec_download(browser)

etags: Dict[str, str] = {}

for site in sites.sites:
    print(f"Extracting from {site}...")

    login_or_create(browser, site)
    download_data_dump(
        browser,
        site
        site,
        etags
    )

# TODO: replace with validation once downloading is verified done
# (or export for separate, later verification)
# Though keeping it here, removing files and re-running downloads feels like a better idea
print(etags)
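
As an aside (not part of the commit), the meta-URL rewrite in `download_data_dump` uses a lookahead to tell `*.stackexchange.com` subdomains apart from sites on their own domain. A quick sketch of what the substitution does:

```python
import re

# Same pattern as used in download_data_dump: https://regex101.com/r/kG6nTN/1
pattern = r"(https://(?:[^.]+\.(?=stackexchange))?)"

# Subdomain sites get "meta." inserted after the site name:
print(re.sub(pattern, r"\1meta.", "https://academia.stackexchange.com"))
# -> https://academia.meta.stackexchange.com

# Sites on their own domain get "meta." prefixed to the host instead:
print(re.sub(pattern, r"\1meta.", "https://stackoverflow.com"))
# -> https://meta.stackoverflow.com
```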
23 changes: 23 additions & 0 deletions sedd/utils.py
@@ -0,0 +1,23 @@
from typing import Dict
import requests as r
from urllib.parse import urlparse
import os.path

def extract_etag(url: str, etags: Dict[str, str]):
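    # stream=True means only the response headers are fetched; the body is
    # never read, so grabbing the ETag doesn't download the file.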
    res = r.get(
        url,
        stream=True
    )
    if res.status_code != 200:
        raise RuntimeError(f"Panic: failed to get {url}: {res.status_code}")

    etag = res.headers["ETag"]
    res.close()

    parsed_url = urlparse(url)
    path = parsed_url.path
    filename = os.path.basename(path)

    etags[filename] = etag

    print(f"ETag for {filename}: {etag}")
