conlamon/companysim

CompanySim

Measuring text-based similarity between pairs of companies using only their publicly available value propositions

Synopsis

Companysim is a Python module for generating measures of company similarity from textual descriptions. The algorithm was developed for the class CS 341 at Stanford University, in conjunction with the venture capital firm Rocketship.vc.

Process

Given pairs of companies and their textual descriptions, companysim generates a "company graph" that captures the similarity between the pairs.

Similarity is estimated with two measures:

  1. Weighted Jaccard similarity between each pair of descriptions

    • The weights are set using the inverse document frequency (IDF) value for each word
  2. Dot product between the two vectors of weighted edges from the "company graph"

    • A "company graph" is generated where the edges are weighted by the Jaccard similarity
    • Each company then has a vector of weighted edges going to other companies
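The two measures above can be sketched in plain Python. This is a minimal illustration, not the companysim implementation: the README does not state the exact formulas, so the smoothed IDF, the whitespace tokenization, and the brute-force pairwise graph (the package itself builds an LSH forest to find neighbors) are all assumptions, and the function names and the extra example companies are hypothetical.

```python
import itertools
import math
from collections import Counter


def build_idf(descriptions):
    """Map each word to a smoothed IDF weight (the smoothing is an assumption)."""
    n_docs = len(descriptions)
    doc_freq = Counter()
    for text in descriptions:
        doc_freq.update(set(text.lower().split()))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def weighted_jaccard(text_a, text_b, idf):
    """IDF weight of the shared words divided by IDF weight of all words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    total = sum(idf.get(w, 0.0) for w in words_a | words_b)
    if total == 0.0:
        return 0.0
    shared = sum(idf.get(w, 0.0) for w in words_a & words_b)
    return shared / total


def build_graph(descriptions, idf):
    """Connect every pair of companies with nonzero similarity (brute force)."""
    graph = {name: {} for name in descriptions}
    for a, b in itertools.combinations(descriptions, 2):
        score = weighted_jaccard(descriptions[a], descriptions[b], idf)
        if score > 0.0:
            graph[a][b] = score
            graph[b][a] = score
    return graph


def edge_vector_dot(graph, a, b):
    """Dot product of the weighted-edge vectors of companies a and b."""
    edges_a, edges_b = graph[a], graph[b]
    return sum(w * edges_b[n] for n, w in edges_a.items() if n in edges_b)


# Hypothetical example data
descriptions = {
    "xyz.com": "developer of business software",
    "abc.com": "business software application company",
    "def.com": "accounting software for business",
    "qrs.com": "organic coffee roaster",
}
idf = build_idf(descriptions.values())
graph = build_graph(descriptions, idf)

print(weighted_jaccard(descriptions["xyz.com"], descriptions["abc.com"], idf))
print(edge_vector_dot(graph, "xyz.com", "abc.com"))
```

In this sketch the dot product is nonzero only when two companies share at least one neighbor in the graph, so it captures second-order similarity that the direct Jaccard score does not.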

These similarity measures can then be used as features in a classification algorithm to predict a similarity score between the pairs of companies.

We found that a k-nearest neighbors (KNN) binary classifier with a Euclidean distance metric worked well.
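As a rough illustration of that last step, a KNN binary classifier over the two similarity features might look like the sketch below. It is not part of companysim: the function name, the value of k, and the labeled training pairs are all made up for the example, and a library implementation such as a scikit-learn classifier would likely be the practical choice.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training points closest to query (Euclidean)."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (Jaccard similarity, dot product score) per pair,
# labeled 1 if the pair was judged similar and 0 otherwise.
train_features = [
    (0.9, 0.8), (0.8, 0.7), (0.7, 0.9),
    (0.1, 0.0), (0.2, 0.1), (0.0, 0.2),
]
train_labels = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_features, train_labels, (0.85, 0.75)))
print(knn_predict(train_features, train_labels, (0.05, 0.10)))
```

Because Euclidean distance is sensitive to scale, the two features would probably need to be normalized to a comparable range before use on real data.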

Code Example

import pandas as pd
import companysim.companysim as cs

company_info = {'company_domain': ['xyz.com', 'abc.com'],
                'description': ['xyz.com is a developer of business software',
                                'abc.com is a business software application company']}
company_list = pd.DataFrame(company_info)

# Set up the parameters
DESCRIPTION_COLUMN = 'description'
NAME_COLUMN = 'company_domain'
NUMBER_OF_WORDS = 3
NUMBER_SIMILAR_COMPANIES = 3

cc = cs.CompanyCorpus(company_list)
cc.build_idf(description_column_name=DESCRIPTION_COLUMN)

# Filter the descriptions by IDF, cutting the number of
#    words specified in NUMBER_OF_WORDS
cc.filter_desc_by_idf(description_column_name=DESCRIPTION_COLUMN,
                      number_words_to_cut=NUMBER_OF_WORDS)

# Create a CompanyGraph
cg = cs.CompanyGraph(cc)
cg.build_lsh_forest(company_name_column_name=NAME_COLUMN)
cg.build_graph(sensitivity=NUMBER_SIMILAR_COMPANIES)

# Access and print the similarity measures for a pair of companies
company1 = 'xyz.com'
company2 = 'abc.com'
print(cg.get_dot_product_score(company1, company2))
print(cg.get_jaccard_similarity(company1, company2))

Installation

The easiest way to install companysim is via pip:

pip install companysim

Contributors

Connor Lamon

Meeran Ismail

Ke Xu

License

MIT License
