From 83e6ada8e520c35c2654488ac2065b5141c32c02 Mon Sep 17 00:00:00 2001 From: Olivier Boulant Date: Mon, 3 May 2021 10:52:24 +0200 Subject: [PATCH] docs: add text segmentation example (#142) * doc(text): create first sandbox notebook * Add work in progress notebooks for various approaches * docs: refine notebook * docs: add text segmentation to examples * docs: add text segmentation example data * docs: add text segmentation sample data snapshot * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * docs: re-arrange import cell * docs: delete sandbox * docs: fix path error * docs: image path * docs: minor adjustments * docs: add binder dependencies * docs: corrections notebook * docs: add data for text seg example * docs: remove old data for text seg example * docs: add text seg corrections * docs: typos * docs: typo * style: remove old image * style: fix link in notebook example * docs: add gram representation * docs: modify text segmentation notebook * docs: add authors * chore: change max number of parallel jobs for matrix strategy * chore: max-parallel should be under strategy and not matrix * chore: trying without python v3.6 * chore: flatten tests on os * chore: trying not to upgrade pip * chore: revert commits debugging run-test Gh Actions Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Charles T --- .binder/requirements.txt | 1 + .pre-commit-config.yaml | 3 - docs/data/text-segmentation-data.txt | 99 +++++ docs/examples/music-segmentation.ipynb | 20 +- docs/examples/text-segmentation.ipynb | 509 +++++++++++++++++++++++++ docs/user-guide/detection/kernelcpd.md | 2 +- mkdocs.yml | 1 + 7 files changed, 626 insertions(+), 9 deletions(-) create mode 100644 docs/data/text-segmentation-data.txt create mode 100644 docs/examples/text-segmentation.ipynb diff --git a/.binder/requirements.txt b/.binder/requirements.txt index 1643d254..1d744b33 100644 --- 
a/.binder/requirements.txt +++ b/.binder/requirements.txt @@ -1,3 +1,4 @@ matplotlib ruptures librosa +nltk diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index ef8a97cd..14d0543f 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -40,7 +40,4 @@ repos: rev: 0.7.0 hooks: - id: nbqa-black - args: [--nbqa-mutate] - - id: nbqa-isort - additional_dependencies: [isort==5.6.4] args: [--nbqa-mutate] \ No newline at end of file diff --git a/docs/data/text-segmentation-data.txt b/docs/data/text-segmentation-data.txt new file mode 100644 index 00000000..c2e225ad --- /dev/null +++ b/docs/data/text-segmentation-data.txt @@ -0,0 +1,99 @@ +The Sane Society is an ambitious work . +Its scope is as broad as the question : What does it mean to live in modern society ? ? +A work so broad , even when it is directed by a leading idea and informed by a moral vision , must necessarily `` fail '' . +Even a hasty reader will easily find in it numerous blind spots , errors of fact and argument , important exclusions , areas of ignorance and prejudice , undue emphases on trivia , examples of broad positions supported by flimsy evidence , and the like . +Such books are easy prey for critics . +Nor need the critic be captious . +A careful and orderly man , who values precision and a kind of tough intellectual responsibility , might easily be put off by such a book . +It is a simple matter , for one so disposed , to take a work like The Sane Society and shred it into odds and ends . 
+The thing can be made to look like the cluttered attic of a large and vigorous family -- a motley jumble of discarded objects , some outworn and some that were never useful , some once whole and bright but now chipped and tarnished , some odd pieces whose history no one remembers , here and there a gem , everything fascinating because it suggests some part of the human condition -- the whole adding up to nothing more than a glimpse into the disorderly history of the makers and users . +That could be easily done , but there is little reason in it . +It would come down to saying that Fromm paints with a broad brush , and that , after all , is not a conclusion one must work toward but an impression he has from the outset . +the effect of the digitalis glycosides is inhibited by a high concentration of potassium in the incubation medium and is enhanced by the absence of potassium ( Wolff , 1960 ) . +B. Organification of iodine The precise mechanism for organification of iodine in the thyroid is not as yet completely understood . +However , the formation of organically bound iodine , mainly mono-iodotyrosine , can be accomplished in cell-free systems . +In the absence of additions to the homogenate , the product formed is an iodinated particulate protein ( Fawcett and Kirkwood , 1953 ; ; Taurog , Potter and Chaikoff , 1955 ; ; Taurog , Potter , Tong , and Chaikoff , 1956 ; ; Serif and Kirkwood , 1958 ; ; De Groot and Carvalho , 1960 ) . +This iodoprotein does not appear to be the same as what is normally present in the thyroid , and there is no evidence so far that thyroglobulin can be iodinated in vitro by cell-free systems . +In addition , the iodoamino acid formed in largest quantity in the intact thyroid is di-iodotyrosine . +If tyrosine and a system generating hydrogen peroxide are added to a cell-free homogenate of the thyroid , large quantities of free mono-iodotyrosine can be formed ( Alexander , 1959 ) . 
+It is not clear whether this system bears any resemblance to the in vivo iodinating mechanism , and a system generating peroxide has not been identified in thyroid tissue . +On chemical grounds it seems most likely that iodide is first converted to Afj and then to Afj as the active iodinating species . +the statement empirical , for goodness was not a quality like red or squeaky that could be seen or heard . +What were they to do , then , with these awkward judgments of value ? ? +To find a place for them in their theory of knowledge would require them to revise the theory radically , and yet that theory was what they regarded as their most important discovery . +It appeared that the theory could be saved in one way only . +If it could be shown that judgments of good and bad were not judgments at all , that they asserted nothing true or false , but merely expressed emotions like `` Hurrah '' or `` Fiddlesticks '' , then these wayward judgments would cease from troubling and weary heads could be at rest . +This is the course the positivists took . +They explained value judgments by explaining them away . +Now I do not think their view will do . +But before discussing it , I should like to record one vote of thanks to them for the clarity with which they have stated their case . +It has been said of John Stuart Mill that he wrote so clearly that he could be found out . +Greer Garson , world-famous star of stage , screen and television , will be honored for the high standard in tasteful sophisticated fashion with which she has created a high standard in her profession . +As a Neiman-Marcus award winner the titian-haired Miss Garson is a personification of the individual look so important to fashion this season . +She will receive the 1961 `` Oscar '' at the 24th annual Neiman-Marcus Exposition , Tuesday and Wednesday in the Grand Ballroom of the Sheraton-Dallas Hotel . 
+The only woman recipient , Miss Garson will receive the award with Ferdinando Sarmi , creator of chic , beautiful women 's fashions ; ; Harry Rolnick , president of the Byer-Rolnick Hat Corporation and designer of men 's hats ; ; Sydney Wragge , creator of sophisticated casuals for women and Roger Vivier , designer of Christian Dior shoes Paris , France , whose squared toes and lowered heels have revolutionized the shoe industry . +The silver and ebony plaques will be presented at noon luncheons by Stanley Marcus , president of Neiman-Marcus , Beneficiary of the proceeds from the two showings will be the Dallas Society for Crippled Children Cerebral Palsy Treatment Center . +The attractive Greer Garson , who loves beautiful clothes and selects them as carefully as she does her professional roles , prefers timeless classical designs . +Occasionally she deserts the simple and elegant for a fun piece simply because `` It 's unlike me '' . +In private life , Miss Garson is Mrs. E.E. Fogelson and on the go most of the time commuting from Dallas , where they maintain an apartment , to their California home in Los Angeles ' suburban Bel-Air to their ranch in Pecos , New Mexico . +Therefore , her wardrobe is largely mobile , to be packed at a moment 's notice and to shake out without a wrinkle . +Her creations in fashion are from many designers because she does n't want a complete wardrobe from any one designer any more than she wants `` all of her pictures by one painter '' . +Wage-price policies of industry are the result of a complex of forces -- no single explanation has been found which applies to all cases . +The purpose of this paper is to analyze one possible force which has not been treated in the literature , but which we believe makes a significant contribution to explaining the wage-price behavior of a few very important industries . 
+While there may be several such industries to which the model of this paper is applicable , the authors make particular claim of relevance to the explanation of the course of wages and prices in the steel industry of the United States since World War 2 . +Indeed , the apparent stiffening of the industry 's attitude in the recent steel strike has a direct explanation in terms of the model here presented . +The model of this paper considers an industry which is not characterized by vigorous price competition , but which is so basic that its wage-price policies are held in check by continuous critical public scrutiny . +Where the industry 's product price has been kept below the `` profit-maximizing '' and `` entry-limiting '' prices due to fears of public reaction , the profit seeking producers have an interest in offering little real resistance to wage demands . +The contribution of this paper is a demonstration of this proposition , and an exploration of some of its implications . +In order to focus clearly upon the operation of this one force , which we may call the effect of `` public-limit pricing '' on `` key '' wage bargains , we deliberately simplify the model by abstracting from other forces , such as union power , which may be relevant in an actual situation . +For expository purposes , this is best treated as a model which spells out the conditions under which an important industry affected with the public interest would find it profitable to raise wages even in the absence of union pressures for higher wages . +The vast Central Valley of California is one of the most productive agricultural areas in the world . +During the summer of 1960 , it became the setting for a bitter and basic labor-management struggle . +The contestants in this economic struggle are the Agricultural Workers Organizing Committee ( AWOC ) of the AFL-CIO and the agricultural employers of the State . 
+By virtue of the legal responsibilities of the Department of Employment in the farm placement program , we necessarily found ourselves in the middle between these two forces . +It is not a pleasant or easy position , but one we have endeavored to maintain . +We have sought to be strictly neutral as between the parties , but at the same time we have been required frequently to rule on specific issues or situations as they arose . +Inevitably , one side was pleased and the other displeased , regardless of how we ruled . +Often the displeased parties interpreted our decision as implying favoritism toward the other . +We have consoled ourselves with the thought that this is a normal human reaction and is one of the consequences of any decision in an adversary proceeding . +It is disconcerting , nevertheless , to read in a labor weekly , `` Perluss knuckles down to growers '' , and then to be confronted with a growers ' publication which states , `` Perluss recognizes obviously phony and trumped-up strikes as bona fide '' . +Rookie Ron Nischwitz continued his pinpoint pitching Monday night as the Bears made it two straight over Indianapolis , 5-3 . +The husky 6-3 , 205-pound lefthander , was in command all the way before an on-the-scene audience of only 949 and countless of television viewers in the Denver area . +It was Nischwitz ' third straight victory of the new season and ran the Grizzlies ' winning streak to four straight . +They now lead Louisville by a full game on top of the American Association pack . +Nischwitz fanned six and walked only Charley Hinton in the third inning . +He has given only the one pass in his 27 innings , an unusual characteristic for a southpaw . +The Bears took the lead in the first inning , as they did in Sunday 's opener , and never lagged . +Dick McAuliffe cracked the first of his two doubles against Lefty Don Rudolph to open the Bear 's attack . 
+After Al Paschal gruonded out , Jay Cooke walked and Jim McDaniel singled home McAuliffe . +Alusik then moved Cooke across with a line drive to left . +Unemployed older workers who have no expectation of securing employment in the occupation in which they are skilled should be able to secure counseling and retraining in an occupation with a future . +Some vocational training schools provide such training , but the current need exceeds the facilities . +Current programs The present Federal program of vocational education began in 1917 with the passage of the Smith-Hughes Act , which provided a continuing annual appropriation of $ 7 million to support , on a matching basis , state-administered programs of vocational education in agriculture , trades , industrial skills and home economics . +Since 1917 some thirteen supplementary and related acts have extended this Federal program . +The George-Barden Act of 1946 raised the previous increases in annual authorizations to $ 29 million in addition to the $ 7 million under the Smith Act . +The Health Amendment Act of 1956 added $ 5 million for practical nurse training . +The latest major change in this program was introduced by the National Defense Education Act of 1958 , Title 8 , of which amended the George-Barden Act . +Annual authorizations of $ 15 million were added for area vocational education programs that meet national defense needs for highly skilled technicians . +The Federal program of vocational education merely provides financial aid to encourage the establishment of vocational education programs in public schools . +The initiative , administration and control remain primarily with the local school districts . +Even the states remain primarily in an assisting role , providing leadership and teacher training . +briefly , the topping configuration must be examined for its inferences . +Then the fact that the lower channel line was pierced had further forecasting significance . 
+And then the application of the count rules to the width ( horizontally ) of the configuration gives us an intial estimate of the probable depth of the decline . +The very idea of there being `` count rules '' implies that there is some sort of proportion to be expected between the amount of congestive activity and the extent of the breakaway ( run up or run down ) movement . +This expectation is what really `` sold '' point and figure . +But there is no positive and consistently demonstrable relationship in the strictest sense . +Experience will show that only the vaguest generalities apply , and in fine , these merely dwell upon a relationship between the durations and intensities of events . +After all , too much does not happen too suddenly , nor does very little take long . +The advantages and disadvantages of these two types of charting , bar charting and point and figure charting , remain the subject of fairly good-natured litigation among their respective professional advocates , with both methods enjoying in common , one irrevocable merit . +They are both trend-following methods . +Miami , Fla. , March 17 -- The Orioles tonight retained the distinction of being the only winless team among the eighteen Major-League clubs as they dropped their sixth straight spring exhibition decision , this one to the Kansas City Athletics by a score of 5 to 3 . +Indications as late as the top of the sixth were that the Birds were to end their victory draught as they coasted along with a 3-to-o advantage . +Siebern hits homer Over the first five frames , Jack Fisher , the big righthander who figures to be in the middle of Oriole plans for a drive on the 1961 American League pennant , held the A 's scoreless while yielding three scattered hits . +Then Dick Hyde , submarine-ball hurler , entered the contest and only five batters needed to face him before there existed a 3-to-3 deadlock . 
+A two-run homer by Norm Siebern and a solo blast by Bill Tuttle tied the game , and single runs in the eighth and ninth gave the Athletics their fifth victory in eight starts . +House throws wild With one down in the eighth , Marv Throneberry drew a walk and stole second as Hyde fanned Tuttle . +Catcher Frank House 's throw in an effort to nab Throneberry was wide and in the dirt . +Then Heywood Sullivan , Kansas City catcher , singled up the middle and Throneberry was across with what proved to be the winning run . +Rookie southpaw George Stepanovich relieved Hyde at the start of the ninth and gave up the A 's fifth tally on a walk to second baseman Dick Howser , a wild pitch , and Frank Cipriani 's single under Shortstop Jerry Adair 's glove into center . \ No newline at end of file diff --git a/docs/examples/music-segmentation.ipynb b/docs/examples/music-segmentation.ipynb index bbbbc07c..5a8521f5 100644 --- a/docs/examples/music-segmentation.ipynb +++ b/docs/examples/music-segmentation.ipynb @@ -6,8 +6,13 @@ "source": [ "# Music segmentation\n", "\n", - "\n", - "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Introduction\n", "\n", "Music segmentation can be seen as a change point detection task and therefore can be carried out with `ruptures`.\n", @@ -21,8 +26,13 @@ "\n", "In this example, we use the well-known tempogram representation, which is based on the onset strength envelope of the input signal, and captures tempo information [[Grosche2010]](#Grosche2010).\n", "\n", - "To load and manipulate sound data, we use the [librosa package](https://librosa.org/doc/latest/index.html) [[McFee2015]](#McFee2015).\n", - "\n", + "To load and manipulate sound data, we use the [librosa package](https://librosa.org/doc/latest/index.html) [[McFee2015]](#McFee2015)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "### Setup\n", "\n", "First, we make the necessary imports." 
@@ -342,7 +352,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.1" + "version": "3.9.4" }, "toc-autonumbering": false, "toc-showmarkdowntxt": false diff --git a/docs/examples/text-segmentation.ipynb b/docs/examples/text-segmentation.ipynb new file mode 100644 index 00000000..8ac2a913 --- /dev/null +++ b/docs/examples/text-segmentation.ipynb @@ -0,0 +1,509 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Linear text segmentation\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "Linear text segmentation consists in dividing a text into several meaningful segments.\n", + "Linear text segmentation can be seen as a change point detection task and therefore can be carried out with `ruptures`. \n", + "This example performs exactly that on a well-known data set introduced in [[Choi2000](#Choi2000)]." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "First we import packages and define a few utility functions.\n", + "This section can be skipped at first reading.\n", + "\n", + "**Library imports.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import nltk\n", + "import numpy as np\n", + "import ruptures as rpt # our package\n", + "from nltk.corpus import stopwords\n", + "from nltk.stem import PorterStemmer\n", + "from nltk.tokenize import regexp_tokenize\n", + "from ruptures.base import BaseCost\n", + "from sklearn.feature_extraction.text import CountVectorizer\n", + "from sklearn.metrics.pairwise import cosine_similarity\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.cm as cm\n", + "from matplotlib.colors import LogNorm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ +
"nltk.download(\"stopwords\")\n", + "STOPWORD_SET = set(\n", + " stopwords.words(\"english\")\n", + ") # set of stopwords of the English language\n", + "PUNCTUATION_SET = set(\"!\\\"#$%&'()*+,-./:;<=>?@[\\\\]^_`{|}~\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Utility functions.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def preprocess(list_of_sentences: list) -> list:\n", + " \"\"\"Preprocess each sentence (remove punctuation, stopwords, then stemming.)\"\"\"\n", + " transformed = list()\n", + " for sentence in list_of_sentences:\n", + " ps = PorterStemmer()\n", + " list_of_words = regexp_tokenize(text=sentence.lower(), pattern=\"\\w+\")\n", + " list_of_words = [\n", + " ps.stem(word) for word in list_of_words if word not in STOPWORD_SET\n", + " ]\n", + " transformed.append(\" \".join(list_of_words))\n", + " return transformed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def draw_square_on_ax(start, end, ax, linewidth=0.8):\n", + " \"\"\"Draw a square on the given ax object.\"\"\"\n", + " ax.vlines(\n", + " x=[start - 0.5, end - 0.5],\n", + " ymin=start - 0.5,\n", + " ymax=end - 0.5,\n", + " linewidth=linewidth,\n", + " )\n", + " ax.hlines(\n", + " y=[start - 0.5, end - 0.5],\n", + " xmin=start - 0.5,\n", + " xmax=end - 0.5,\n", + " linewidth=linewidth,\n", + " )\n", + " return ax" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Description\n", + "\n", + "The text to segment is a concatenation of excerpts from ten different documents randomly selected from the so-called Brown corpus (described [here](http://icame.uib.no/brown/bcm.html)).\n", + "Each excerpt has nine to eleven sentences, amounting to 99 sentences in total.\n", + "The complete text is shown in [Appendix 
A](#appendix-a).\n", + "\n", + "These data stem from a larger data set which is thoroughly described in [[Choi2000](#Choi2000)] and can be downloaded [here](https://web.archive.org/web/20030206011734/http://www.cs.man.ac.uk/~mary/choif/software/C99-1.2-release.tgz).\n", + "This is a common benchmark to evaluate text segmentation methods." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Loading the text\n", + "filepath = Path(\"../data/text-segmentation-data.txt\")\n", + "original_text = filepath.read_text().split(\"\\n\")\n", + "TRUE_BKPS = [11, 20, 30, 40, 49, 59, 69, 80, 90, 99] # read from the data description\n", + "\n", + "print(f\"There are {len(original_text)} sentences, from {len(TRUE_BKPS)} documents.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The objective is to automatically recover the boundaries of the 10 excerpts, using the fact that they come from quite different documents and therefore have distinct topics.\n", + "\n", + "For instance, in the small extract of text printed in the following cell, an accurate text segmentation procedure would be able to detect that the first two sentences (10 and 11) and the last three sentences (12 to 14) belong to two different documents and have very different semantic fields." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print 5 sentences from the original text\n", + "start, end = 9, 14\n", + "for (line_number, sentence) in enumerate(original_text[start:end], start=start + 1):\n", + " sentence = sentence.strip(\"\\n\")\n", + " print(f\"{line_number:>2}: {sentence}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Preprocessing\n", + "\n", + "Before performing text segmentation, the original text is preprocessed.\n", + "In a nutshell (see [[Choi2000](#Choi2000)] for more details),\n", + "\n", + "- the punctuation and stopwords are removed;\n", + "- words are reduced to their stems (e.g., \"waited\" and \"waiting\" become \"wait\");\n", + "- a vector of word counts is computed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# transform text\n", + "transformed_text = preprocess(original_text)\n", + "# print original and transformed\n", + "ind = 97\n", + "print(\"Original sentence:\")\n", + "print(f\"\\t{original_text[ind]}\")\n", + "print()\n", + "print(\"Transformed:\")\n", + "print(f\"\\t{transformed_text[ind]}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Once the text is preprocessed, each sentence is transformed into a vector of word counts.\n", + "vectorizer = CountVectorizer(analyzer=\"word\")\n", + "vectorized_text = vectorizer.fit_transform(transformed_text)\n", + "\n", + "msg = f\"There are {len(vectorizer.get_feature_names())} different words in the corpus, e.g. {vectorizer.get_feature_names()[20:30]}.\"\n", + "print(msg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the vectorized text representation is a (very) sparse matrix." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Text segmentation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Cost function" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To compare (the vectorized representation of) two sentences, [[Choi2000]](#Choi2000) uses the cosine similarity $k_{\\text{cosine}}: \\mathbb{R}^d \\times \\mathbb{R}^d \\rightarrow \\mathbb{R}$:\n", + "\n", + "$$ k_{\\text{cosine}}(x, y) := \\frac{\\langle x \\mid y \\rangle}{\\|x\\|\\|y\\|} $$\n", + "\n", + "where $x$ and $y$ are two $d$-dimensional vectors of word counts.\n", + "\n", + "Text segmentation now amounts to kernel change point detection (see [[Truong2020]](#Truong2020) for more details).\n", + "However, this particular kernel is not implemented in `ruptures`, so we need to create a [custom cost function](../../user-guide/costs/costcustom).\n", + "(Actually, it is implemented in `ruptures` but the current implementation does not exploit the sparse structure of the vectorized text representation and can therefore be slow.)\n", + "\n", + "Let $y=\\{y_0, y_1,\\dots,y_{T-1}\\}$ be a $d$-dimensional signal with $T$ samples.\n", + "Recall that a cost function $c(\\cdot)$ that derives from a kernel $k(\\cdot, \\cdot)$ is such that\n", + "\n", + "$$\n", + "c(y_{a..b}) = \\sum_{t=a}^{b-1} G_{t, t} - \\frac{1}{b-a} \\sum_{a \\leq s < b } \\sum_{a \\leq t < b} G_{s,t}\n", + "$$\n", + "\n", + "where $y_{a..b}$ is the subsignal $\\{y_a, y_{a+1},\\dots,y_{b-1}\\}$ and $G_{s,t}:=k(y_s, y_t)$ (see [[Truong2020]](#Truong2020) for more details).\n", + "In other words, $(G_{s,t})_{s,t}$ is the $T\\times T$ Gram matrix of $y$.\n", + "Thanks to this formula, we can now implement our custom cost function (named `CosineCost` in the following cell)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class CosineCost(BaseCost):\n", + " \"\"\"Cost derived from the cosine similarity.\"\"\"\n", + "\n", + " # The 2 following attributes must be specified for compatibility.\n", + " model = \"custom_cosine\"\n", + " min_size = 2\n", + "\n", + " def fit(self, signal):\n", + " \"\"\"Set the internal parameter.\"\"\"\n", + " self.signal = signal\n", + " self.gram = cosine_similarity(signal, dense_output=False)\n", + " return self\n", + "\n", + " def error(self, start, end) -> float:\n", + " \"\"\"Return the approximation cost on the segment [start:end].\n", + "\n", + " Args:\n", + " start (int): start of the segment\n", + " end (int): end of the segment\n", + " Returns:\n", + " segment cost\n", + " Raises:\n", + " NotEnoughPoints: when the segment is too short (less than `min_size` samples).\n", + " \"\"\"\n", + " if end - start < self.min_size:\n", + " raise rpt.exceptions.NotEnoughPoints\n", + " sub_gram = self.gram[start:end, start:end]\n", + " val = sub_gram.diagonal().sum()\n", + " val -= sub_gram.sum() / (end - start)\n", + " return val" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compute change points\n", + "\n", + "If the number $K$ of change points is assumed to be known, we can use [dynamic programming](../../user-guide/detection/dynp) to search for the exact segmentation $\\hat{t}_1,\\dots,\\hat{t}_K$ that minimizes the sum of segment costs:\n", + "\n", + "$$\n", + "\\hat{t}_1,\\dots,\\hat{t}_K := \\text{arg}\\min_{t_1,\\dots,t_K} \\left[ c(y_{0..t_1}) + c(y_{t_1..t_2}) + \\dots + c(y_{t_K..T}) \\right].\n", + "$$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "n_bkps = 9 # there are 9 change points (10 text segments)\n", + "\n", + "algo = rpt.Dynp(custom_cost=CosineCost(), min_size=2, jump=1).fit(vectorized_text)\n", + "predicted_bkps =
algo.predict(n_bkps=n_bkps)\n", + "\n", + "print(f\"True change points are\\t\\t{TRUE_BKPS}.\")\n", + "print(f\"Detected change points are\\t{predicted_bkps}.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(Note that the last change point index is simply the length of the signal. This is by design.)\n", + "\n", + "Predicted breakpoints are quite close to the true change points.\n", + "Indeed, most estimated changes are less than one sentence away from a true change.\n", + "The last change is less accurately predicted with an error of 4 sentences.\n", + "To overcome this issue, one solution would be to consider a richer representation (compared to the sparse word frequency vectors)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualize segmentations\n", + "\n", + "**Show sentence numbers.**\n", + "\n", + "In the following cell, the two segmentations (true and predicted) can be visually compared.\n", + "For each paragraph, the sentence numbers are shown." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "true_segment_list = rpt.utils.pairwise([0] + TRUE_BKPS)\n", + "predicted_segment_list = rpt.utils.pairwise([0] + predicted_bkps)\n", + "\n", + "for (n_paragraph, (true_segment, predicted_segment)) in enumerate(\n", + " zip(true_segment_list, predicted_segment_list), start=1\n", + "):\n", + " print(f\"Paragraph n°{n_paragraph:02d}\")\n", + " start_true, end_true = true_segment\n", + " start_pred, end_pred = predicted_segment\n", + "\n", + " start = min(start_true, start_pred)\n", + " end = max(end_true, end_pred)\n", + " msg = \" \".join(\n", + " f\"{ind+1:02d}\" if (start_true <= ind < end_true) else \" \"\n", + " for ind in range(start, end)\n", + " )\n", + " print(f\"(true)\\t{msg}\")\n", + " msg = \" \".join(\n", + " f\"{ind+1:02d}\" if (start_pred <= ind < end_pred) else \" \"\n", + " for ind in range(start, end)\n", + " )\n", + " print(f\"(pred)\\t{msg}\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Show the Gram matrix.**\n", + "\n", + "In addition, the text segmentation can be shown on the Gram matrix that was used to detect changes.\n", + "This is done in the following cell.\n", + "\n", + "Most segments (represented by the blue squares) are similar between the true segmentation and the predicted segmentation, except for the last two.\n", + "This is mainly due to the fact that, in the penultimate excerpt, all sentences are dissimilar (with respect to the cosine measure)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax_arr = plt.subplots(nrows=1, ncols=2, figsize=(7, 5), dpi=200)\n", + "\n", + "# plot config\n", + "title_fontsize = 10\n", + "label_fontsize = 7\n", + "title_list = [\"True text segmentation\", \"Predicted text segmentation\"]\n", + "\n", + "for (ax, title, bkps) in zip(ax_arr, title_list, [TRUE_BKPS, predicted_bkps]):\n", + " # plot gram matrix\n", + " ax.imshow(algo.cost.gram.toarray(), cmap=cm.plasma, norm=LogNorm())\n", + " # add text segmentation\n", + " for (start, end) in rpt.utils.pairwise([0] + bkps):\n", + " draw_square_on_ax(start=start, end=end, ax=ax)\n", + " # add labels and title\n", + " ax.set_title(title, fontsize=title_fontsize)\n", + " ax.set_xlabel(\"Sentence index\", fontsize=label_fontsize)\n", + " ax.set_ylabel(\"Sentence index\", fontsize=label_fontsize)\n", + " ax.tick_params(axis=\"both\", which=\"major\", labelsize=label_fontsize)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This example shows how to apply `ruptures` to a text segmentation task.\n", + "In detail, we detected shifts in the vocabulary of a collection of sentences using common NLP preprocessing and transformation.\n", + "This task amounts to a kernel change point detection procedure where the kernel is the cosine kernel.\n", + "\n", + "Such results can then be used to characterize the structure of the text for subsequent NLP tasks.\n", + "This procedure should certainly be enriched with more relevant and compact representations to better detect changes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Appendix A\n", + "\n", + "The complete text used in this notebook is as follows.\n", + "Note that the line numbers and the blank lines (added to visually mark the boundaries between excerpts) are not part of the text fed to the segmentation method."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for (start, end) in rpt.utils.pairwise([0] + TRUE_BKPS):\n", + " excerpt = original_text[start:end]\n", + " for (n_line, sentence) in enumerate(excerpt, start=start + 1):\n", + " sentence = sentence.strip(\"\\n\")\n", + " print(f\"{n_line:>2}: {sentence}\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Authors\n", + "\n", + "This example notebook has been authored by Olivier Boulant and edited by Charles Truong." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "[Choi2000]\n", + "Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. Proceedings of the North American Chapter of the Association for Computational Linguistics Conference (NAACL), 26–33.\n", + "\n", + "[Truong2020]\n", + "Truong, C., Oudre, L., & Vayatis, N. (2020). Selective review of offline change point detection methods. Signal Processing, 167." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "ruptures", + "language": "python", + "name": "ruptures" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/user-guide/detection/kernelcpd.md b/docs/user-guide/detection/kernelcpd.md index 3e04d229..60ce6b72 100644 --- a/docs/user-guide/detection/kernelcpd.md +++ b/docs/user-guide/detection/kernelcpd.md @@ -62,7 +62,7 @@ In the following, $u$ and $v$ are two d-dimensional vectors and $\|\cdot\|$ is t Kernel change point detection is implemented in the class [`KernelCPD`][ruptures.detection.kernelcpd.KernelCPD], which is a C implementation of dynamic programming and PELT. 
To see it in action, please look at the gallery of examples, in particular: -- [Kernel change point detection: a performance comparison](../../examples/kernel-cpd-performance-comparison.md) +- [Kernel change point detection: a performance comparison](../../examples/kernel-cpd-performance-comparison.ipynb) The exact class API is available [here][ruptures.detection.kernelcpd.KernelCPD]. diff --git a/mkdocs.yml b/mkdocs.yml index fe739a13..147b72ef 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -80,6 +80,7 @@ nav: - 'Introduction': examples/advanced-usages-introduction.md - 'Kernel change point detection: a performance comparison': examples/kernel-cpd-performance-comparison.ipynb - 'Music segmentation': examples/music-segmentation.ipynb + - 'Text segmentation': examples/text-segmentation.ipynb - Code reference: - Introduction: code-reference/index.md - Base classes: code-reference/base-reference.md