I had been fascinated for some time with the idea of generating text that feels like some source text. I happened to be reading about Markov chains one evening and decided to try writing a generator that would spit out a body of text based on a set of transition matrices created from an input corpus.
One easily-accessible (and very recognisable) source is Shakespeare’s plays, helpfully hosted by MIT. I wanted to create something that could “speak like Hamlet”, for example.
This was written in Python, partly because of the ease of using an HTML-parsing library to parse the source.
Once the source was parsed, it was split into lists of speeches by each character (thankfully, the MIT page structure is very regular). The speeches were then regularised, as the pages have lots of extra characters in the words, which would cause words that were actually the same to be treated differently. A regular expression was used to “clean” the text.
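The exact expression is not reproduced here, so the following is only a minimal sketch of what such a cleaning step might look like. The function name `clean_speech` and the rule itself (lower-case everything, keep only letters, apostrophes and whitespace) are assumptions for illustration, not the original code.

```python
import re

# Assumed rule: anything that is not a lower-case letter, an apostrophe
# or whitespace is treated as noise and replaced with a space.
NOISE = re.compile(r"[^a-z'\s]")

def clean_speech(text):
    """Return the speech as a list of normalised words (hypothetical helper)."""
    # Lower-case first so "The" and "the" become the same word, and fold
    # the typographic apostrophe into the plain one.
    text = text.lower().replace("\u2019", "'")
    # Replace noise with a space rather than nothing, so that words joined
    # by punctuation ("be--that") are still split apart.
    return NOISE.sub(" ", text).split()
```

With a rule like this, "Question!" and "question" end up as the same word, which is the point of the regularisation step.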
To generate the output text, the list of speeches from a given character was converted into a transition matrix: for each word, the probability of it being followed by any other word (including special “words” for the beginning and end of a speech). This matrix was then “traversed” by starting with the “speech start” word and choosing the next word from the probability distribution of words that might follow it. This continued until the “speech end” word was chosen.
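The procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the original code: the marker strings `<start>` and `<end>` and the function names are hypothetical, and the matrix is stored sparsely as a dictionary of follower counts instead of a dense array. Sampling in proportion to the counts is equivalent to sampling from the normalised probabilities.

```python
import random
from collections import Counter, defaultdict

# Hypothetical marker "words" for the beginning and end of a speech.
START, END = "<start>", "<end>"

def build_transitions(speeches):
    """Map each word to a Counter of the words seen directly after it.

    `speeches` is a list of speeches, each a list of cleaned words.
    """
    counts = defaultdict(Counter)
    for words in speeches:
        path = [START, *words, END]
        # Count every adjacent (current, following) pair in the speech.
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return counts

def generate(transitions, rng=random):
    """Walk from START until END is drawn, returning the words in between."""
    word, output = START, []
    while True:
        followers = transitions[word]
        # Weighted random choice: frequent followers are chosen more often.
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == END:
            return " ".join(output)
        output.append(word)
```

Because each word depends only on the one before it, the output tends to be locally plausible but globally meandering.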
Some examples of the generated output:
aside nay speak ‘sblood there seek out at a divinity that ever the ominous horse hath made am easier to make the king’s mess ‘tis not shame to note that i for the death have it is fashion i’ the mean my word for god’s love make known now my weakness and thereabout of his visage together
could beauty my lord you now receive them
think it my father comes a woodcock to my lord