Generates random text in the style of a given corpus.
This code is my solution to Exercise 13.8 of Think Python by Allen Downey, released under the CC BY-NC 3.0 license.
`mashup(n, text_length, *files)`

Generates a string of `text_length` words of random text from the user-specified text `files`.
Words are generated one by one: each subsequent word is drawn at random from a probability distribution of which words in the source texts tend to follow the previous `n` words (referred to as an n-gram).
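
For example, a mash-up of two novels might be produced like this (a sketch; the module name and file names are illustrative):

```python
from mashup import mashup  # hypothetical module containing my solution

# Blend two source texts using 2-word prefixes,
# producing 100 words of random text.
print(mashup(2, 100, 'emma.txt', 'dracula.txt'))
```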
Exercise 13.8 reads:

> Write a program to read a text from a file and perform Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes.
>
> Add a function to the previous program to generate random text based on the Markov analysis.
>
> Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways.
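
A minimal sketch of those two steps, assuming the text has already been split into a list of words (this is not my original code; `markov_map` and `generate` are illustrative names):

```python
import random

def markov_map(words, n=2):
    """Map each n-word prefix (a tuple) to a list of every word
    that follows it somewhere in the text."""
    mapping = {}
    for i in range(len(words) - n):
        prefix = tuple(words[i:i + n])
        mapping.setdefault(prefix, []).append(words[i + n])
    return mapping

def generate(mapping, length):
    """Generate `length` words by repeatedly sampling a suffix for
    the current prefix, then shifting the prefix window by one."""
    prefix = random.choice(list(mapping))
    output = list(prefix)
    while len(output) < length:
        suffixes = mapping.get(prefix)
        if suffixes is None:  # dead end: restart from a random prefix
            prefix = random.choice(list(mapping))
            continue
        word = random.choice(suffixes)
        output.append(word)
        prefix = prefix[1:] + (word,)
    return ' '.join(output)
```

Because a suffix appears in its list once per occurrence, `random.choice` picks it in proportion to its frequency in the source texts, which gives the probability distribution described above.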
My solution is likely not the most efficient, and I have become more familiar with Python since writing it. I would like to develop intuition for which algorithms and data structures suit which applications, and for how to implement them efficiently in Python.
Some options:
- Refactor my solution.
- Become familiar with `timeit` or `cProfile`, and profile my code so I know which parts are slowing it down (see the sketch after this list).
- Compare my original solution to my refactored solution, or to the solution provided by the author.
- Add an option for using characters rather than words as the n-gram items.
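
As a starting point for the profiling option, both tools can be pointed at a representative call (a sketch; `mashup` is assumed importable from my module, and the file name is illustrative):

```python
import cProfile
import pstats
import timeit

from mashup import mashup  # hypothetical module containing my solution

# Time repeated calls to get a stable end-to-end figure.
print(timeit.timeit("mashup(2, 1000, 'emma.txt')",
                    globals=globals(), number=10))

# Profile a single call and list the ten functions with the
# highest cumulative time, to see which parts are slowest.
cProfile.run("mashup(2, 1000, 'emma.txt')", 'mashup.prof')
pstats.Stats('mashup.prof').sort_stats('cumulative').print_stats(10)
```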