Write a simple README
hrs committed May 23, 2023
1 parent ed0c97d commit 97208f8
Showing 1 changed file with 124 additions and 0 deletions.
124 changes: 124 additions & 0 deletions README.md
@@ -0,0 +1,124 @@
# `docsim`

A simple, fast command-line tool for scoring the similarity of text documents.

``` console
$ docsim --query some-file.txt --show-scores ~/documents/notes
0.000 completely-dissimilar-file.txt
0.152 somewhat-similar-file.md
0.469 pretty-similar-file.md
0.872 very-similar-file.org
1.000 potentially-identical-file.txt
```

Given a query document and a collection of potential matches, `docsim` ranks
each document in the collection by its textual similarity to the query.

## Examples

Check the [`man` page] for the definitive documentation, but these should get
you started.

[`man` page]: ./man/docsim.1

Search for similar files in a given directory:

``` console
$ docsim --query some-file.txt ~/documents/notes
[...]
```

If no paths are provided, `docsim` searches the current working directory:

``` console
$ docsim --query some-file.txt
[...]
```

Without a `--query` document, `docsim` reads the query from STDIN:

``` console
$ echo "Here's a query to search for." | docsim ~/documents/notes
[...]
```

Only show the top 3 matches, with the best at the top:

``` console
$ docsim --query some-file.txt --limit 3 --best-first ~/documents/notes
potentially-identical-file.txt
very-similar-file.org
pretty-similar-file.md
```

Find Go files similar to `main.go` in the current directory. Don't use stemming
or stoplists, since these aren't English documents.

``` console
$ docsim --query main.go --no-stemming --no-stoplist **/*.go
[...]
```

Note that because `docsim` uses an English stoplist and an English stemming
algorithm, you'll almost certainly want to use the `--no-stoplist` and
`--no-stemming` flags if your documents aren't written in English (this
includes source code).

## Installation

If you've got a Go toolchain handy, you can either clone the repository and run
`make install`:

``` console
$ git clone git@github.com:hrs/docsim.git
$ cd docsim
$ sudo make install
```

Or just:

``` console
$ go install github.com/hrs/docsim/docsim@latest
```

Note that `go install` doesn't include the [`man` page][], which you can
optionally install manually by copying it into e.g.
`/usr/local/share/man/man1`.

[`man` page]: ./man/docsim.1

## Running tests

``` console
$ make test
```

## How it works

`docsim` uses [TF-IDF][] weighting and [cosine similarity][] to numerically
score the textual similarity between the query and every other document.

[TF-IDF]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
[cosine similarity]: https://en.wikipedia.org/wiki/Vector_space_model#Applications

"Textual similarity" roughly means "uses the same words." Each document is
parsed into a big bag of words, which are passed through a common English
[stoplist][], [stemmed][] (so "spins," "spinner," and "spinning" might all
reduce down to just "spin"), and assigned weights based on how often they appear
in the document and how rare they are in the corpus as a whole.

[stoplist]: https://en.wikipedia.org/wiki/Stop_word
[stemmed]: https://en.wikipedia.org/wiki/Stemming

We can think of each of these documents as a [vector in term space][], where
each word is a dimension and its weight is the magnitude along that dimension.
Two documents are "similar," then, inasmuch as they point in the same
direction, so we measure similarity by the cosine of the angle between them:
the smaller the angle, the higher the score.

[vector in term space]: https://en.wikipedia.org/wiki/Vector_space_model

## Contributing

`docsim` is still in a nascent state, so I'm happy just writing the code myself
for now, but please feel free to [report any issues][] you encounter!

[report any issues]: https://github.com/hrs/docsim/issues
