To ring in the new year, I thought I’d play around with an old friend from my earliest programming days: a random text generator. Back then (over 30 years ago), and to some extent always, a good way to practice programming was to work on small, relatively easy, but still fun programs.

Simple games are a common choice, but not the only one. (I’ve probably written a version of Mastermind in every programming language I know.) Another fun choice is an image or text generator (or processor). Random text generators, in particular, offer a range of complexity depending on your taste and time.

Let’s start with a very simple example:

from random import randint
randchar = lambda: chr(ord('A')+randint(0,25))
def random_text_1 (size=12):
    s = [randchar() for _ in range(min(24,size))]
    return ''.join(s)
print(random_text_1())

The above code fragment just generates a short random text string. It’s pretty simple: all uppercase, just letters, and no structure (like spaces or periods or whatnot).

From this starting point, the only limitation is your imagination and interest. The ultimate goal is a random text generator that creates text that looks as close to real text as possible, but is strictly random.

The first steps involve using spaces to break the random characters into words and sentences. This requires switching to lowercase and capitalizing the first letter of the first word. It also requires a period at the end of the sentence.
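Those first steps can be sketched as follows. This is a minimal illustration, not the code used for the post: the function names are made up, the letters and lengths are uniformly random, and the default length limits are borrowed from the constructor shown later.

```python
from random import randint

def random_word(min_len=2, max_len=9):
    """Return a random lowercase word (uniform letters, uniform length)."""
    n = randint(min_len, max_len)
    return ''.join(chr(ord('a') + randint(0, 25)) for _ in range(n))

def random_sentence(min_words=2, max_words=17):
    """Return space-separated random words, capitalized, ending in a period."""
    n = randint(min_words, max_words)
    text = ' '.join(random_word() for _ in range(n))
    # Capitalize only the first letter of the first word.
    return text[0].upper() + text[1:] + '.'

print(random_sentence())
```

Joining several such sentences with spaces gives a crude paragraph, which is roughly where the later steps pick up.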

Later steps involve breaking text into paragraphs, adding parenthetical sentences, and making some sentences end with a question or exclamation mark.

One can also tune the randomness at various points to reflect real language. The basic random() function provides a flat distribution. An obvious improvement is selecting letters based on how often they’re used in English. Sentence and word lengths can be tuned, too.

§

Without further ado, let’s jump into the code for creating my New Year’s Day blog post.

As is common for me, after some prototyping, I ended up creating a class, named RandomText. The constructor looks like this:

def __init__ (self, **kwargs):
    self.size = kwargs['paragraphs'] if 'paragraphs' in kwargs else 5
    self.para_min = kwargs['para_min'] if 'para_min' in kwargs else 1
    self.para_max = kwargs['para_max'] if 'para_max' in kwargs else 20
    self.sent_min = kwargs['sent_min'] if 'sent_min' in kwargs else 2
    self.sent_max = kwargs['sent_max'] if 'sent_max' in kwargs else 17
    self.word_min = kwargs['word_min'] if 'word_min' in kwargs else 2
    self.word_max = kwargs['word_max'] if 'word_max' in kwargs else 9
    self.nmbr_min = kwargs['nmbr_min'] if 'nmbr_min' in kwargs else 1
    self.nmbr_max = kwargs['nmbr_max'] if 'nmbr_max' in kwargs else 6
    self.single_f = kwargs['single_f'] if 'single_f' in kwargs else 0.01
    self.number_f = kwargs['number_f'] if 'number_f' in kwargs else 0.001
    self.parens_f = kwargs['parens_f'] if 'parens_f' in kwargs else 0.001
    self.q_mark_f = kwargs['q_mark_f'] if 'q_mark_f' in kwargs else 0.1
    self.exclam_f = kwargs['exclam_f'] if 'exclam_f' in kwargs else 0.03
    self.use_freq = kwargs['use_freq'] if 'use_freq' in kwargs else True
    self.use_html = kwargs['use_html'] if 'use_html' in kwargs else False
    self.use_sect = kwargs['use_sect'] if 'use_sect' in kwargs else False
    self.alphas = AlphaSet() if self.use_freq else Alphas
    # Generate text...
    ps = [self.paragraph(ix) for ix in range(self.size)]
    self.text = EOL.join(ps)

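A RandomText instance has a lot of keyword parameters, all with defaults. They control things like the minimum and maximum sizes of words, sentences, and paragraphs, as well as how often the rarer features (single-letter words, numbers, parentheses, question and exclamation marks) appear.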
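As an aside, the same defaulting pattern can be written more compactly with dict.get(). The sketch below uses a hypothetical stand-in class (Settings) and only a few of the parameters; it is an alternative spelling, not the code the post was generated with.

```python
class Settings:
    """Hypothetical stand-in showing keyword defaults via dict.get()."""
    def __init__(self, **kwargs):
        # kwargs.get(key, default) returns the default when the key is
        # absent, matching `kwargs[key] if key in kwargs else default`.
        self.size = kwargs.get('paragraphs', 5)
        self.word_min = kwargs.get('word_min', 2)
        self.word_max = kwargs.get('word_max', 9)
        self.use_freq = kwargs.get('use_freq', True)

s = Settings(word_max=12)
print(s.size, s.word_min, s.word_max, s.use_freq)  # prints: 5 2 12 True
```

Only word_max is overridden here; everything else falls back to its default.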

Down at the bottom (lines 21 and 22), a list comprehension calls paragraph() to create the requested number of paragraphs of random text. The result is assigned to the text member and is available as the str() of the instance.

def __str__ (self):
    return self.text

One note as we go through this: I put in about a day of work on this with the goal of posting just after midnight. So there was a deadline, is the point, and parts of the code are not as fully developed as intended.

Also, before we continue, here are various important constants:

NUL = ''
EOL = '\n'
TAB = '\t'
SPC = ' '
DOT = '.'
UCX = ord('A')
LCX = ord('a')

Alphas = [chr(LCX+n) for n in range(26)]
Numbers = ['0','1','2','3','4','5','6','7','8','9']
Singles = ['J', 'v', 'O', 'Y']

§

Here’s the paragraph generator:

def paragraph (self, seq=0):
    '''Return a random paragraph.'''
    df = self.para_max - self.para_min
    ns = int(triangular(self.para_min, self.para_max, df/3))
    ss = [self.sentence(ix) for ix in range(ns)]
    para = SPC.join(ss)
    if self.use_html:
        #TODO: Fix Section use; include bold and center.
        n = Clip(0, int(gauss(5,2)), 8)
        s = '§\n' if (self.use_sect and (n < seq)) else ''
        return '<span style="color: #000000;">%s</span>%s%s' % (para, EOL, s)
    return '%s%s' % (para, EOL)

This method sets a pattern the others will follow. Essentially, it generates a random size using a triangular distribution that peaks at about 1/3 of the range. The idea is to bias the random sizes toward the smaller side.
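Here is a small standalone sketch of that sizing pattern. The helper name random_size is hypothetical. One detail worth knowing: random.triangular(low, high, mode) takes the mode as an absolute value between low and high, so a peak one third of the way into the range is low + df/3. The method above passes df/3 directly, which puts the peak slightly lower whenever the minimum is nonzero.

```python
from random import triangular

def random_size(lo, hi, frac=1/3):
    """Return an int in [lo, hi], biased toward lo + frac*(hi - lo)."""
    mode = lo + frac * (hi - lo)   # absolute position of the peak
    return int(triangular(lo, hi, mode))

# Sample with the default paragraph limits (1 to 20 sentences).
sizes = [random_size(1, 20) for _ in range(10000)]
print(min(sizes), max(sizes), sum(sizes) / len(sizes))
```

The average should land below the midpoint of the range, which is the whole point of skewing the peak to the left.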

The list comprehension calls sentence() to create the (random) number of sentences.

For purposes of posting on my blog, each paragraph is enclosed in span tags to set the text color to black.

Note that I intended to also have it include the section breaks, but my first pass at it didn’t work, and I left it turned off due to lack of time.

§

Here’s the sentence generator:

def sentence (self, seq=0):
    '''Return a random sentence.'''
    df = self.sent_max - self.sent_min
    nw = int(triangular(self.sent_min, self.sent_max, df/3))
    ws = [self.word(ix) for ix in range(nw)]
    # Occasionally, insert a comma...
    if (4 <= len(ws)) and (random() < 0.2):
        ix = Clip(0, int(gauss(len(ws), 2)), len(ws)-2)
        ws[ix] = ws[ix]+','
    # Occasionally, use a question or exclamation mark...
    if random() < self.q_mark_f:
        e = '?'
    elif random() < self.exclam_f:
        e = '!'
    else:
        e = DOT
    # Occasionally, parenthesize a sentence...
    s = SPC.join(ws)
    if seq and (random() < self.parens_f):
        return '(%s%s)' % (s, e)
    return '%s%s' % (s, e)

It’s pretty much the same sort of thing as the paragraph generator, but has some added complexity to insert occasional commas, and to sometimes use a question or exclamation mark (rather than a period).
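One consequence of the if/elif structure is that the exclamation test only runs when the question test has already failed, so the effective chance of an exclamation mark is a bit lower than exclam_f. A quick worked calculation with the default values (variable names here are just for illustration):

```python
q_mark_f = 0.1    # default from the constructor
exclam_f = 0.03   # default from the constructor

p_question = q_mark_f
# The exclamation test is only reached when the question test fails.
p_exclam = (1 - q_mark_f) * exclam_f
p_period = 1 - p_question - p_exclam

print(round(p_question, 3), round(p_exclam, 3), round(p_period, 3))
# prints: 0.1 0.027 0.873
```

So roughly 87% of sentences end with a plain period.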

It also occasionally wraps a sentence in parentheses (but never the first sentence of a paragraph, since seq is zero there).

The list comprehension here calls word() to create the sentence.

§

Here’s the word generator:

def word (self, seq=0):
    '''Return a random word. (Capitalize if first in sequence.)'''
    # Occasionally, return a single-character word...
    if random() < self.single_f:
        return choice(Singles)
    # Occasionally, return a number...
    if random() < self.number_f:
        return self.number()
    df = self.word_max - self.word_min
    nc = int(triangular(self.word_min, self.word_max, df*0.42))
    cs = [self.alpha(seq+ix) for ix in range(nc)]
    return NUL.join(cs)

The word generator sometimes returns a “single,” a word with just one character. (The default settings set a minimum word-size of two.) The idea is to simulate single-letter words such as “I” in English. Note that the generator has four singles, whereas English has only a couple (“I” and “a”).

The generator can also sometimes return a random number.

Otherwise, the list comprehension calls alpha() to create the word.

§

Here are the last couple of generators:

def alpha (self, seq=0):
    '''Return a random character. (Capitalize if first in sequence.)'''
    a = choice(self.alphas)
    return a if seq else a.upper()

def number (self, seq=0):
    '''Return a random number.'''
    nd = randint(self.nmbr_min, self.nmbr_max)
    ds = [choice(Numbers) for _ in range(nd)]
    return NUL.join(ds)

Note that the alpha() generator just returns a random letter, while the number() generator returns a multi-digit number.

§

A key point involves the alpha() generator and its choice() of self.alphas.

That instance member is set in the constructor to be either a simple list of the alphabet, or a generated list about 30000 characters long that contains multiple instances of each letter in amounts that reflect English usage.

The list is generated from a frequency table: each letter’s frequency is multiplied by 30000 to get the number of copies of that letter to include.

lambda ix,f: NUL.join([chr(LCX+ix)]*int(f*30000))

These are joined together into the alpha list, so a random choice() from the list reflects English letter use.
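To make the idea concrete, here is a toy version using only three letters. The frequencies are made-up illustrative values (chosen to be easy to check), not the real table, and the names toy_freqs and expand are hypothetical.

```python
from random import choice

LCX = ord('a')
SCALE = 30000  # the multiplier mentioned above

# Toy table: letter index -> frequency (a, b, c). Illustrative values only.
toy_freqs = {0: 0.5, 1: 0.25, 2: 0.25}

# Same shape as the lambda above: repeat one letter f*SCALE times.
expand = lambda ix, f: ''.join([chr(LCX + ix)] * int(f * SCALE))
alphas = ''.join(expand(ix, f) for ix, f in toy_freqs.items())

print(len(alphas), alphas.count('a'), alphas.count('b'), alphas.count('c'))
# prints: 30000 15000 7500 7500
print(choice(alphas))  # 'a' about half the time
```

Because each letter appears in proportion to its frequency, a flat choice() over the expanded sequence behaves like a weighted pick.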

As an aside, I generated the frequency table by scanning the works of Shakespeare! The resulting frequency table looks like this:

[Image: alpha-histo-shakes.png, histogram of letter frequencies from the works of Shakespeare]

All together, the frequencies (probabilities) add up to one.
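A frequency table like that can be built with a simple letter count. The sketch below is one plausible way to do it, not the scanner used for the post; the function name letter_frequencies is an assumption, and the sample text is just a single line rather than the complete works.

```python
from collections import Counter

def letter_frequencies(text):
    """Return 26 relative frequencies (a..z) for the letters in text."""
    # Keep only the letters a-z, ignoring case, digits, and punctuation.
    letters = [c for c in text.lower() if 'a' <= c <= 'z']
    if not letters:
        raise ValueError('text contains no letters')
    counts = Counter(letters)
    total = len(letters)
    return [counts[chr(ord('a') + ix)] / total for ix in range(26)]

freqs = letter_frequencies("To be, or not to be, that is the question.")
print(len(freqs), round(sum(freqs), 6))
```

Dividing each count by the total is what makes the frequencies sum to one.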

§

Python has a very nice random library that goes far beyond the random() function usually found in a math library. In particular, Python offers different probability distributions — gaussian and triangular, for instance.

I made a chart so I could see for myself:

[Image: random_1e7_3e3.png, chart comparing the random(), triangular, and gaussian distributions]

Note how the random() function returns a flat distribution. The two triangular distributions peak at 0.5 and 0.2, respectively. The two gaussian distributions are both centered on 0.5, with standard deviations of 0.10 and 0.05.

The code for generating the data points looks like this:

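from random import random, triangular, gauss
# Clip(lo, value, hi) is my own clamping helper (definition not shown).

samples = 10**7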
bins = 3000

ys0 = [0]*bins
ys1 = [0]*bins
ys2 = [0]*bins
ys3 = [0]*bins
ys4 = [0]*bins

for _ in range(samples):
    n0 = int(random() * bins)
    n1 = int(triangular(0.0, 1.0, 0.5) * bins)
    n2 = int(triangular(0.0, 1.0, 0.2) * bins)
    n3 = int(gauss(0.5, 0.100) * bins)
    n4 = int(gauss(0.5, 0.050) * bins)
    ys0[n0] += 1
    ys1[n1] += 1
    ys2[n2] += 1
    ys3[Clip(0,n3,bins-1)] += 1
    ys4[Clip(0,n4,bins-1)] += 1

This generates a histogram for each distribution. The gaussian data points can fall outside the range (of 0.0–1.0), so clipping is required to ensure a legal index for the histogram bins.
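The Clip() helper itself isn't shown anywhere in the post. Judging from how it's called (lower bound, value, upper bound), a minimal version might look like the sketch below; treat the body as an assumption.

```python
def Clip(lo, value, hi):
    """Return value limited to the closed range [lo, hi]."""
    return max(lo, min(value, hi))

# Out-of-range bin indexes get pulled back to the first or last bin.
print(Clip(0, -5, 2999), Clip(0, 1500, 2999), Clip(0, 3200, 2999))
# prints: 0 1500 2999
```

One side effect of clipping this way is that any out-of-range samples pile up in the first and last bins rather than being dropped.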

§

All in all, I’m pretty happy with how the blog post turned out.

The only manual changes I made were to insert some section breaks and to italicize some bits to make it look more real. Applying italics in the generator is a future improvement.

(Although it’s unlikely I’ll return to this. Not sure why I would.)

Anyway, Happy New Year!!