I wondered what it would look like if you took a body of text and then used it to generate new text, using Markov chains of different lengths. So I knocked up a quick program to try it: ‘Bloviate’.
Bloviate analyses your source text to find every sequence of N characters and then works out the frequency of characters that come next.
For example, if you set N=3 and your source text contains the following character sequences starting with ‘the’:
‘the ‘, ‘then’, ‘they’, ‘the ‘
Then ‘the’ should be followed 50% of the time by a space, 25% of the time by an ‘n’ and 25% of the time by a ‘y’.
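As a rough sketch of that counting step (not Bloviate's actual code), here is how it could look in plain C++ using the standard library instead of Qt. The names `FrequencyTable`, `buildTable` and `percentFollowing` are made up for illustration, and it works on bytes (`char`) rather than Unicode characters, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the real data structure:
// each N-character sequence maps to a count of the characters seen after it.
using FrequencyTable = std::map<std::string, std::map<char, int>>;

// Slide a window of n characters over the text and count what follows each window.
FrequencyTable buildTable(const std::string& text, std::size_t n) {
    assert(n > 0);
    FrequencyTable table;
    for (std::size_t i = 0; i + n < text.size(); ++i) {
        ++table[text.substr(i, n)][text[i + n]];
    }
    return table;
}

// Percentage of the time that `next` follows `sequence` (0 if the sequence is unknown).
double percentFollowing(const FrequencyTable& table, const std::string& sequence, char next) {
    auto entry = table.find(sequence);
    if (entry == table.end()) return 0.0;
    int total = 0;
    for (const auto& kv : entry->second) total += kv.second;
    auto hit = entry->second.find(next);
    return hit == entry->second.end() ? 0.0 : 100.0 * hit->second / total;
}
```

Building the table from the text "the then they the " (note the trailing space) with N=3 and asking for `percentFollowing(table, "the", ' ')` gives 50, with 25 each for 'n' and 'y', matching the example above.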
Bloviate then creates output text, starting with the first N characters of the source text and filling in the rest randomly using the same sequence frequencies as the source text.
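The generation step might be sketched like this in standard C++. This is an illustration under assumptions, not the real implementation: `generate`, `buildTable` and the `seed` parameter are hypothetical names, it works on bytes rather than Unicode characters, and stopping early when a sequence has no known follower is my guess at one reasonable behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <vector>

// Each N-character sequence maps to a count of the characters seen after it.
using FrequencyTable = std::map<std::string, std::map<char, int>>;

FrequencyTable buildTable(const std::string& text, std::size_t n) {
    FrequencyTable table;
    for (std::size_t i = 0; i + n < text.size(); ++i) {
        ++table[text.substr(i, n)][text[i + n]];
    }
    return table;
}

// Start with the first n characters of the source, then repeatedly pick the next
// character at random, weighted by how often it followed the last n characters.
std::string generate(const std::string& source, std::size_t n,
                     std::size_t length, unsigned seed) {
    assert(n > 0 && source.size() >= n);
    const FrequencyTable table = buildTable(source, n);
    std::mt19937 rng(seed);
    std::string output = source.substr(0, n);
    while (output.size() < length) {
        auto entry = table.find(output.substr(output.size() - n));
        if (entry == table.end()) break;  // assumption: stop if nothing ever followed this sequence
        std::vector<char> chars;
        std::vector<int> weights;
        for (const auto& kv : entry->second) {
            chars.push_back(kv.first);
            weights.push_back(kv.second);
        }
        std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
        output += chars[pick(rng)];
    }
    return output;
}
```

Because every character is chosen from what actually followed the previous N characters in the source, any run of N+1 characters in the output also appears somewhere in the source.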
Note that a character is a character to Bloviate. It treats upper and lower case as different characters, makes no attempt to differentiate between letters, punctuation and white space, and does not attempt to clean up the source text. This also means it works on any language.
Bloviate also tells you the average number of different characters following each unique sequence of N, which I will call F here. As F approaches 1.0 the output text becomes closer and closer to the input text.
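One plausible way to compute F, reading the definition above literally: add up the number of distinct follower characters for every unique sequence and divide by the number of unique sequences. The function names below are hypothetical and this is a standard-library sketch rather than the code Bloviate uses.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Each N-character sequence maps to a count of the characters seen after it.
using FrequencyTable = std::map<std::string, std::map<char, int>>;

FrequencyTable buildTable(const std::string& text, std::size_t n) {
    assert(n > 0);
    FrequencyTable table;
    for (std::size_t i = 0; i + n < text.size(); ++i) {
        ++table[text.substr(i, n)][text[i + n]];
    }
    return table;
}

// F: average number of different characters following each unique sequence.
double averageFollowers(const FrequencyTable& table) {
    if (table.empty()) return 0.0;
    std::size_t distinct = 0;
    for (const auto& kv : table) distinct += kv.second.size();
    return static_cast<double>(distinct) / static_cast<double>(table.size());
}
```

If every sequence has exactly one possible follower, F is 1.0 and the generator has no choices to make, which is why the output converges on the input as F approaches 1.0.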
Using ‘Goldilocks and the 3 bears’ as input:
If N=1 (F=7.05) the output is garbage. Albeit garbage with the same character pair frequency as the original.
On cre She sl s ramy raked cheais Bus ore than s sherd up m. ged. bend staireomest p!”Sof ckstirigrorr a ry ps.
” f waine tind s aso Sowa t antthee aime bupis stht stooomed pie k is beche p!
At N=3 (F=1.44) it looks close to English, but gibberish:
Once up and been sight,” she this timed. Pretty so soon, she second soft. She screame up and she screame hot!” cried the Mama bed the Papa been sleeping in the Papa bear
“Someone’s bear growl.
At N=5 (F=1.14) it starts to look like proper English, but semantically weird:
Once upon a time, so she went for a walked right,” she lay down into the kitchen, Goldilocks sat in the porridge from the three chair,” growled, “Someone’s been sitting my porridge and she tasted the door, and ran down the bedroom. Goldilocks woke up and she second bowl.
And it comes out with occasional gems such as:
“Someone’s been sitting my porridge,” said the bedroom.
At N=10 (F=1.03) it starts to become reasonably coherent:
Once upon a time, there was a little tired. So, she walked into the forest. Pretty soon, she came upon a house. She knocked and, when no one answered, she walked right in.
At the table in the kitchen, there were three bowls of porridge.
At N=15 (F=1.01) it starts to get pretty close to the original text, but doesn’t follow quite the same order:
Once upon a time, there was a little girl named Goldilocks. She went for a walk in the forest. Pretty soon, she came upon a house. She knocked and, when no one answered, she walked right in.
At the table in the kitchen, there were three bowls of porridge. Goldilocks was very tired by this time, so she went upstairs to the bedroom. She lay down in the first bed, but it was too hard. Then she lay down in the third bed and it was just right. Goldilocks fell asleep.
At N=12 (F=1.07) the whole 680k characters of ‘Pride and Prejudice’ produces:
It is a truth universally contradict it. Besides, there was a motive within her of goodwill which could not help saying:
“Oh, that my dear mother had more command over herself! She can have her own way.”
As she spoke she observed him looking at her earnest desire for their folly or their vice. He was fond of them.”
Obviously the source text is important. The Bohemian Rhapsody lyrics make nearly as much (or as little) sense at N=5 (F=1.08) as the original:
Is this to me, for me, to me
Mama, just a poor boy from this to me
Any way the truth
Mama, life? Is this time tomorrow
Carry on as if nothing all behind and face the truth
Mama, ooh, didn’t mean to me, baby!
Just gotta leave me and lightning, very fright out, just killed a man
Put a gun against his head
Pulled my time to die?
At N=12 (F=1.05) 160k characters of Trump election speeches produces:
Hillary brought death and disaster to Iraq, Syria and Libya, she empowered Iran, and she unleashed ISIS. Now she wants to raise your taxes very substantially. Highest taxed nation in the world is a tenant of mine in Manhattan, so many great people. These are people that have been stolen, stolen by either very stupid politicians ask me the question, how are you going to get rid of all the emails?” “Yes, ma’am, they’re gonna stay in this country blind. My contract with the American voter begins with a plan to end government that will not protect its people is a government corruption at the State Department of Justice is trying as hard as they can to protect religious liberty;
Supply your own joke.
I knocked together Bloviate in C++/Qt in a couple of hours, so it is far from commercial quality. But it is fairly robust, runs on Windows and Mac and can rewrite the whole of ‘Pride and Prejudice’ in a few seconds. The core of Bloviate is just a map from each character sequence to the frequency of each character that follows it:
QMap< QString, QMap< QChar, int > >
You can get the Windows binaries here (~8MB, should work from Windows 7 onwards).
You can get the Mac binaries here (~11MB, should work from macOS 10.12 onwards).
Note that the Bloviate executable is tiny compared to the Qt library files. I could have tried to reduce the size of the downloads, but I didn’t.
To use Bloviate just:
- paste your source text in the left pane
- set the sequence length
- press the ‘Go >’ button
I included some source text files in the downloads.
You can get the source for Bloviate here (~1MB).
It should build on Qt 4 or 5 and is licensed under Creative Commons. If you modify it, just give me an attribution and send me a link to anything interesting you come up with.
Really good :-) See also: Recurrent Neural Networks, but your Markov chains work way better than I’d have expected – impressive.