This repository has been archived by the owner on Nov 16, 2022. It is now read-only.

Harvest more email addresses from GitHub #1226

Closed
chadwhitacre opened this issue Dec 23, 2018 · 5 comments

@chadwhitacre
Contributor

Reticketed from #1205.

The idea is to harvest email addresses from commit messages. You end up with a bunch of emails that don't belong to the person you're after (merging a PR means someone else's email shows up in your commit events). But it's a start! Reviewing them manually would be a slog, but maybe worth it?

See harvest-email-from-github.py for a starting point.
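For illustration only (this is not the linked script), here is a rough sketch of the extraction step. It assumes the events have already been fetched and parsed into dicts shaped like GitHub `PushEvent` entries, with commit authors under `payload.commits[].author.email`; that shape, the `harvest_emails` name, and the `known` parameter are assumptions, and real payloads may not include commits at all. It skips addresses we already have and GitHub's `users.noreply.github.com` placeholders.

```python
def harvest_emails(events, known=()):
    """Collect previously unseen commit author emails from PushEvent-shaped dicts."""
    known = {a.lower() for a in known}
    found = []
    for event in events:
        if event.get('type') != 'PushEvent':
            continue
        for commit in event.get('payload', {}).get('commits', []):
            email = (commit.get('author') or {}).get('email', '').strip().lower()
            if not email or email in known or email in found:
                continue
            # GitHub's privacy placeholder addresses can't receive mail.
            if email.endswith('@users.noreply.github.com'):
                continue
            found.append(email)
    return found


# Hypothetical sample data: Alice pushed a merge that includes Bob's commit.
events = [
    {'type': 'PushEvent', 'payload': {'commits': [
        {'author': {'name': 'Alice', 'email': 'alice@example.com'}},
        {'author': {'name': 'Bob', 'email': 'bob@example.org'}},
        {'author': {'name': 'Alice', 'email': '123+alice@users.noreply.github.com'}},
    ]}},
    {'type': 'WatchEvent', 'payload': {}},
]
print(harvest_emails(events, known=['alice@example.com']))
```

Note that the only address this returns for Alice's account is Bob's, which is exactly the false-positive problem described above and why the results need manual review.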

@chadwhitacre
Contributor Author

gratipay-bak=# \copy (select t.status, t.username, e.user_name, e.display_name from tmp t join elsewhere e on e.participant=t.username where t.status!='complete' and e.platform='github') to 'knit-github.csv' csv header
COPY 1374
gratipay-bak=#

@chadwhitacre
Contributor Author

https://github.com/gratipay/logs/commit/8dc857ccb520e0cda5444da102b1d2a22f0daf4d, main script

$ cat harvested.csv | cut -d, -f2 | sort | uniq -c | sort -nr | tr -s ' ' | cut -d' ' -f 2 | uniq -c 
   1 22
   1 19
   2 18
   2 17
   3 16
   3 14
   1 13
   3 11
   3 10
   6 9
   6 8
  11 7
  12 6
  22 5
  34 4
  60 3
 126 2
 296 1
$ cat harvested.csv | cut -d, -f2 | sort | uniq | wc -l
     592
$

So that's 592 accounts for which we were able to harvest a previously unseen email address from public GitHub commits, 296 of which yielded one new address, 126 of which yielded two, etc.
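For anyone who finds the shell pipeline hard to read, the same tally can be sketched in Python. This assumes, as the `cut -f2` does, that the username is the second column of `harvested.csv` and that each row is one harvested address; the function name and the sample rows are made up for the example.

```python
import csv
import io
from collections import Counter


def new_address_histogram(rows):
    """Return ({addresses per account: number of accounts}, total accounts)."""
    per_account = Counter(row[1] for row in rows)  # username -> number of rows
    return Counter(per_account.values()), len(per_account)


# Hypothetical stand-in for harvested.csv (seven columns, username second).
sample = io.StringIO(
    "x,alice,x,x,x,a1@example.com,x\n"
    "x,alice,x,x,x,a2@example.com,x\n"
    "x,bob,x,x,x,b1@example.com,x\n"
)
histogram, accounts = new_address_histogram(csv.reader(sample))
for n, count in sorted(histogram.items(), reverse=True):
    print(count, n)  # same layout as the `uniq -c` output: accounts, then addresses
print(accounts)
```

One difference from the shell version: `csv.reader` handles quoted fields containing commas, whereas `cut -d,` splits on every comma.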

@chadwhitacre
Contributor Author

Now to manually review those ...

@chadwhitacre
Contributor Author

$ ./harvest.py | sort | uniq -c
 354 failed
 136 pending
   1 ready
$
#!/usr/bin/env python
import csv
from collections import defaultdict

import lib


# Load harvested.csv, keyed by username. Gmail addresses go to the front of
# each account's list; everything else is appended in file order.
harvested = defaultdict(list)
with open('harvested.csv') as f:
    for _, username, _, _, _, address, _ in csv.reader(f):
        if address.endswith('@gmail.com'):
            harvested[username].insert(0, address)
        else:
            harvested[username].append(address)

# Add to payouts.csv
payouts_header, payouts = lib.load_payouts()
for row in payouts:
    username = row[4]
    if username not in harvested:
        continue

    status = row[2]
    print(status)  # tallied with `sort | uniq -c` on the command line

    # Existing addresses first, then harvested ones, padded to four columns.
    addresses = [a for a in lib.get_addresses(row) if a]
    addresses += harvested[username]
    addresses += ([''] * (4-len(addresses)))
    assert len(addresses) == 4, addresses  # fails if there are more than four
    row[5:] = addresses

with open('payouts.csv', 'w+') as f:
    csv.writer(f).writerows([payouts_header] + payouts)
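
The ordering and padding step is easier to check in isolation. Below is a small standalone sketch of it; `order_harvested` and `fill_slots` are hypothetical helpers, not part of `lib`. It differs from the script in two assumed details: it matches on the full `@gmail.com` suffix, and it raises `ValueError` instead of tripping an `assert` when an account has more addresses than slots.

```python
def order_harvested(addresses):
    """Put Gmail addresses at the front, mirroring the insert(0)/append logic."""
    ordered = []
    for address in addresses:
        if address.endswith('@gmail.com'):
            ordered.insert(0, address)
        else:
            ordered.append(address)
    return ordered


def fill_slots(existing, harvested, slots=4):
    """Existing addresses, then harvested ones, padded with '' to `slots` entries."""
    merged = [a for a in existing if a] + order_harvested(harvested)
    if len(merged) > slots:
        raise ValueError('too many addresses for %d slots: %r' % (slots, merged))
    return merged + [''] * (slots - len(merged))


print(fill_slots(['old@example.com', ''], ['work@example.org', 'me@gmail.com']))
# ['old@example.com', 'me@gmail.com', 'work@example.org', '']
```

Because each Gmail address is inserted at position 0, several Gmail addresses for one account end up in reverse file order.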
