Update Google Docs Meta Data #1605

Merged 1 commit on Feb 25, 2025

Conversation

github-actions[bot]
Contributor

@github-actions github-actions bot commented Feb 25, 2025

Updating Google Docs Meta Data

Change summary:
472 lines changed in column "Geographic Scope" -- value "USA" changed to "United States"
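
A summary line of this shape can be produced by tallying identical (column, old value, new value) edits. The following is a minimal sketch, not necessarily how the bot builds its message; `summarize_changes` and its input format are assumptions made for illustration.

```python
from collections import Counter


def summarize_changes(changes):
    """Collapse per-cell edits into one summary line per distinct edit.

    `changes` is an iterable of (column, old_value, new_value) tuples
    (a hypothetical input format, not taken from the bot's code).
    """
    counts = Counter(changes)
    return [
        f'{n} lines changed in column "{col}" -- value "{old}" changed to "{new}"'
        for (col, old, new), n in counts.most_common()
    ]


# made-up example input: the same edit seen in three rows
example = [("Geographic Scope", "USA", "United States")] * 3
for line in summarize_changes(example):
    print(line)
```

Grouping by the full (column, old, new) triple keeps unrelated edits to the same column on separate lines.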

@melange396
Collaborator

newest version of csv comparison code used to summarize updates:

import csv
import requests


# pull down the existing and proposed/pending versions of the signal description csv file

def fetch_rows(url):
    """Download a CSV file and return its rows (header included) as lists of strings."""
    with requests.get(url, stream=True) as req:
        # fail loudly on an HTTP error (e.g. a 404 for a deleted branch) instead of parsing the error page as CSV
        req.raise_for_status()
        # iter_lines(decode_unicode=True) yields bytes when the response has no detectable encoding, so set a fallback
        req.encoding = req.encoding or "utf-8"
        return list(csv.reader(req.iter_lines(decode_unicode=True)))

dev_file = "https://raw.githubusercontent.com/cmu-delphi/delphi-epidata/refs/heads/dev/src/server/endpoints/covidcast_utils/db_signals.csv"
dev = fetch_rows(dev_file)

new_file = "https://raw.githubusercontent.com/cmu-delphi/delphi-epidata/refs/heads/bot/update-docs/src/server/endpoints/covidcast_utils/db_signals.csv"
new = fetch_rows(new_file)


# column name lists
dev_cols = set(dev[0])
new_cols = set(new[0])
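# keep the dev file's column order so the per-row output is deterministic (set ordering is not)
both_cols = [col for col in dev[0] if col in new_cols]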

# get the right column number for each version of the file, based on the column name
dev_col_lookup = {c: i for i,c in enumerate(dev[0])}
new_col_lookup = {c: i for i,c in enumerate(new[0])}

print("added columns:", sorted(list(new_cols-dev_cols)))
print("removed columns:", sorted(list(dev_cols-new_cols)))
print("# rows in dev file:", len(dev))
print("# rows in new file:", len(new))
print("row count difference:", len(new)-len(dev))
print("\n")

# TODO: compare sets of `(source, signal)` from both to look for +/- and/or detect reorderings

# add column names to this set to ignore differences found in them (to simplify output for easier analysis)
columns_to_ignore = {"XXXXXX"}
both_cols = [col for col in both_cols if col not in columns_to_ignore]

# show individual changes
changes_count = 0
for i in range(min(len(dev), len(new))):
    # collect (column, old value, new value) for every shared column that differs in this row
    diffs = [
        (col, dev[i][dev_col_lookup[col]], new[i][new_col_lookup[col]])
        for col in both_cols
        if dev[i][dev_col_lookup[col]] != new[i][new_col_lookup[col]]
    ]
    if diffs:
        changes_count += 1
        print("mismatch in row:", i+1, " [", new[i][new_col_lookup["Source Subdivision"]], ":", new[i][new_col_lookup["Signal"]], "]")
        for col, old_value, new_value in diffs:
            print("".join(["  ", col, ":\n    ", old_value, " --> ", new_value]))

print("\n")
print("lines with changes:", changes_count)
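
The TODO in the script above (comparing rows by key rather than by position) could be approached roughly as follows. This is a sketch under assumptions: it treats the "Source Subdivision" and "Signal" columns as a unique row key, which the script does not guarantee, and `index_rows` and `diff_by_key` are hypothetical helper names. The sample rows are made up.

```python
def index_rows(rows, key_cols):
    """Map each row's key tuple to a {column: value} dict; rows[0] is the header."""
    header = rows[0]
    indexed = {}
    for row in rows[1:]:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        # note: a later row with a duplicate key silently replaces the earlier one
        indexed[tuple(record[c] for c in key_cols)] = record
    return indexed


def diff_by_key(dev_rows, new_rows, key_cols=("Source Subdivision", "Signal")):
    """Return (added keys, removed keys, {key: {column: (old, new)}}), ignoring row order."""
    dev_idx = index_rows(dev_rows, key_cols)
    new_idx = index_rows(new_rows, key_cols)
    added = sorted(new_idx.keys() - dev_idx.keys())
    removed = sorted(dev_idx.keys() - new_idx.keys())
    changed = {}
    for key in sorted(dev_idx.keys() & new_idx.keys()):
        old, cur = dev_idx[key], new_idx[key]
        # only compare columns present in both versions of the row
        diffs = {c: (old[c], cur[c]) for c in old if c in cur and old[c] != cur[c]}
        if diffs:
            changed[key] = diffs
    return added, removed, changed


# made-up sample data: two rows reordered, one row added, one column value changed
dev_rows = [
    ["Source Subdivision", "Signal", "Geographic Scope"],
    ["src-a", "sig-1", "USA"],
    ["src-a", "sig-2", "USA"],
]
new_rows = [
    ["Source Subdivision", "Signal", "Geographic Scope"],
    ["src-a", "sig-2", "United States"],
    ["src-a", "sig-1", "United States"],
    ["src-b", "sig-3", "United States"],
]
added, removed, changed = diff_by_key(dev_rows, new_rows)
print("added:", added)      # [('src-b', 'sig-3')]
print("removed:", removed)  # []
print("rows with changes:", len(changed))
```

Because rows are matched on their key, a pure reordering produces no mismatches, while the positional loop above would report every shifted row.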

@melange396 melange396 merged commit 5357e63 into update_ghactions_cache Feb 25, 2025
7 checks passed
@melange396 melange396 deleted the bot/update-docs branch February 25, 2025 21:22
melange396 added a commit that referenced this pull request Feb 25, 2025
Co-authored-by: melange396 <melange396@users.noreply.github.com>