Predicting and Generating Texts

This notebook explores predicting the next item in a sequence, and then using those predictions to generate new sequences by sampling from the predicted probabilities.

In [52]:

from conx import *


EmbeddingLayer

An EmbeddingLayer allows the system to find (or use) distributed representations for words or letters.
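
To make "distributed representation" concrete, here is a minimal NumPy sketch of what an embedding lookup does. It is not conx's implementation: the names `embedding_table` and `embed` are made up for illustration, and the sizes (26 symbols, 64 dimensions) are chosen only to match the layers used later in this notebook. In a real network the table is learned during training rather than drawn at random.

```python
import numpy as np

# Illustrative sizes: 26 symbols, each mapped to a 64-dimensional vector.
vocab_size, embed_dim = 26, 64

rng = np.random.default_rng(0)
# One row per symbol; a trained network would learn these values.
embedding_table = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

def embed(codes):
    """Look up one vector per integer code (plain row indexing)."""
    return embedding_table[np.asarray(codes)]

vectors = embed([8, 5, 12, 12, 15])
print(vectors.shape)  # five codes in, five 64-dimensional vectors out
```

Equal codes always map to the same vector, so the two 12s above share a row of the table.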

First, we need a method of encoding and decoding our sequenced data. We’ll begin with characters.

In [53]:

def encode(s):
    """Convert string or char into integers"""
    if len(s) == 1:
        return (1 + ord(s.lower()) - ord('a')) if s.isalpha() else 0
    else:
        return cleanup([encode(c) for c in s])

def cleanup(items):
    """Remove repeated zeros"""
    retval = []
    for i in items:
        if ((i != 0) or
            (len(retval) == 0) or
            (retval[-1] != 0)):
            retval.append(i)
    return retval

def decode(n):
    """Convert integers into characters"""
    if isinstance(n, (list, tuple)):
        return [decode(v) for v in n]
    elif n == 0:
        return ' '
    else:
        return chr(ord('a') + int(n) - 1)

In [54]:

encode("H")

Out[54]:

8

In [55]:

encode("Hello, world!")

Out[55]:

[8, 5, 12, 12, 15, 0, 23, 15, 18, 12, 4, 0]

In [56]:

encode("AaaA")

Out[56]:

[1, 1, 1, 1]

In [57]:

decode(8)

Out[57]:

'h'

In [58]:

decode(encode("   what's     up  doc?   "))

Out[58]:

[' ', 'w', 'h', 'a', 't', ' ', 's', ' ', 'u', 'p', ' ', 'd', 'o', 'c', ' ']

In [59]:

"".join(decode(encode("   what's     up  doc?   ")))

Out[59]:

' what s up doc '


Given 1 - Predict 1

Let’s start out with sequences of characters of length 1. We’ll just try to predict what the next character is given a single letter. We’ll start with a fairly small corpus:
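
Before training anything, it can help to see what "predict the next character from one letter" means as plain counting. The sketch below is not part of the conx workflow; `next_char_distribution` is a hypothetical helper, and the sample text is a toy string rather than the corpus used in this notebook. It estimates the probability of each following character from adjacent pairs.

```python
from collections import Counter, defaultdict

def next_char_distribution(text):
    """Estimate P(next | current) by counting adjacent character pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return {
        current: {ch: n / sum(followers.values()) for ch, n in followers.items()}
        for current, followers in counts.items()
    }

sample = "the then that"  # toy text, not the notebook's corpus
dist = next_char_distribution(sample)
print(dist["t"])
print(dist["h"])
```

A network given a single character has the same information available as this table, so a table like this is a reasonable point of comparison for the model trained below.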

In [60]:

corpus = """Four score and seven years ago our fathers brought forth on this continent,
a new nation, conceived in Liberty, and dedicated to the proposition that all men are
created equal. Now we are engaged in a great civil war, testing whether that nation, or
any nation so conceived and so dedicated, can long endure. We are met on a great battle-field
of that war. We have come to dedicate a portion of that field, as a final resting place
for those who here gave their lives that that nation might live. It is altogether fitting
and proper that we should do this. But, in a larger sense, we can not dedicate — we can not
consecrate — we can not hallow — this ground. The brave men, living and dead, who struggled
here, have consecrated it, far above our poor power to add or detract. The world will little
note, nor long remember what we say here, but it can never forget what they did here. It is
for us the living, rather, to be dedicated here to the unfinished work which they who fought
here have thus far so nobly advanced. It is rather for us to be here dedicated to the great
task remaining before us — that from these honored dead we take increased devotion to that
cause for which they gave the last full measure of devotion — that we here highly resolve that
these dead shall not have died in vain — that this nation, under God, shall have a new birth of
freedom — and that government of the people, by the people, for the people, shall not perish
from the earth."""

In [61]:

"".join(decode(encode(corpus)))

Out[61]:

'four score and seven years ago our fathers brought forth on this continent a new nation conceived in liberty and dedicated to the proposition that all men are created equal now we are engaged in a great civil war testing whether that nation or any nation so conceived and so dedicated can long endure we are met on a great battle field of that war we have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live it is altogether fitting and proper that we should do this but in a larger sense we can not dedicate we can not consecrate we can not hallow this ground the brave men living and dead who struggled here have consecrated it far above our poor power to add or detract the world will little note nor long remember what we say here but it can never forget what they did here it is for us the living rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced it is rather for us to be here dedicated to the great task remaining before us that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion that we here highly resolve that these dead shall not have died in vain that this nation under god shall have a new birth of freedom and that government of the people by the people for the people shall not perish from the earth '

In [62]:

len_vocab = max(encode(corpus)) + 1
len_vocab

Out[62]:

26

In [63]:

dataset = []
encoded_corpus = encode(corpus)
for i in range(len(encoded_corpus) - 1):
    code = encoded_corpus[i]
    next_code = encoded_corpus[i + 1]
    dataset.append([[code], onehot(next_code, len_vocab)])
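
Each dataset entry pairs a one-element input with a one-hot target. Here is a small standalone sketch of that shape. The `one_hot` function is a hypothetical stand-in written for this example, not conx's `onehot`, and the codes are the ones `encode` produced for "Hello" above.

```python
def one_hot(index, width):
    """Hypothetical stand-in for a one-hot encoder: a list with a single 1."""
    vec = [0] * width
    vec[index] = 1
    return vec

codes = [8, 5, 12, 12, 15]  # "hello" under this notebook's encoding
pairs = [([c], one_hot(n, 26)) for c, n in zip(codes, codes[1:])]

first_input, first_target = pairs[0]
# The position of the 1 in the target is the code of the next character.
print(first_input, first_target.index(1))
```

Five characters give four pairs, because the last character has no successor to predict.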

In [64]:

net = Network("Given 1 - Predict 1")
net.add(EmbeddingLayer("embed", 26, 64)) # in, out
net.connect()

In [65]:

net.dataset.load(dataset)

In [66]:

net.dashboard()

In [67]:

net.reset()
net.train(30, accuracy=.95, plot=True)

========================================================================
#   30 |   2.10031 |   0.00000

In [68]:

def generate(net, count, len_vocab):
    retval = ""
    # start at a random point:
    idx = np.random.choice(len(net.dataset.inputs) - 1)
    inputs = net.dataset.inputs[idx]
    # now we get the next, and the next, ...
    for i in range(count):
        # use the outputs as a prob distribution
        outputs = net.propagate(inputs)
        # but make sure that they add to 1:
        p = np.array(outputs) / sum(outputs)
        pickone = np.random.choice(len_vocab, 1, p=p)[0]
        inputs = [pickone]
        c = decode(pickone)
        print(c, end="")
        retval += c
    return retval
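
The key step in the function above is sampling: the outputs are renormalized and then used as a probability distribution. The sketch below isolates that step with made-up output values (not taken from the trained network) and checks that the sampled frequencies track the probabilities. `sample_index` is a name invented for this example.

```python
import numpy as np

def sample_index(outputs, rng):
    """Draw one index, treating the outputs as unnormalized probabilities."""
    p = np.asarray(outputs, dtype=float)
    p = p / p.sum()  # floating-point outputs rarely sum to exactly 1
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(42)
outputs = [0.1, 0.6, 0.3]  # made-up values for illustration
draws = [sample_index(outputs, rng) for _ in range(10_000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)
```

Sampling rather than always taking the most likely character is what keeps the generated text from collapsing into a short repeating loop.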

In [69]:

generate(net, 1000, len_vocab)

ofing cade eromthed wh coplinded hey n heatha blished nde t nd tin o fr do ango obyr cerr diir thever pe erng d t asuth l ld ute ne we o leroread te srthavey naofont mesqmecaver cisod satong ondinifolenst gancoonde d fve owe weano lte sher ller d lealist t thisoreasl ld ay t it ton chacr caem oton ilivedey ulfomede tithr amase r mevey juge ugesor n in t eopllthe heatathedonanorere ded bber om wortd se sovad tebe n t tr arant s ar non nongen t we whan ougrd for f  aghey beghe we se mive we anghat rofo fitiovithfat g pon pled hathe iug cead w n ar mhathat wheat f fatheve d o ca nd aulibethenchea thal ther t d ibalsve cand ceveged g it th we nalthed batore ithes orive pe crt merto tethat wove  peon d her hothiot st beathag fir urer seingethuorere ey dencam wei t cenfoplitind shle t b anok ca g sr ais s n iour giobet langrhitie worong ureathavesheo are ud chcopewangiomve we ca tin t dicresoucinr five n cagisngperd menorowat d intishedion thocithe os anthe er four d earey g t leased ay frot

Out[69]:

'ofing cade eromthed wh coplinded hey n heatha blished nde t nd tin o fr do ango obyr cerr diir thever pe erng d t asuth l ld ute ne we o leroread te srthavey naofont mesqmecaver cisod satong ondinifolenst gancoonde d fve owe weano lte sher ller d lealist t thisoreasl ld ay t it ton chacr caem oton ilivedey ulfomede tithr amase r mevey juge ugesor n in t eopllthe heatathedonanorere ded bber om wortd se sovad tebe n t tr arant s ar non nongen t we whan ougrd for f  aghey beghe we se mive we anghat rofo fitiovithfat g pon pled hathe iug cead w n ar mhathat wheat f fatheve d o ca nd aulibethenchea thal ther t d ibalsve cand ceveged g it th we nalthed batore ithes orive pe crt merto tethat wove  peon d her hothiot st beathag fir urer seingethuorere ey dencam wei t cenfoplitind shle t b anok ca g sr ais s n iour giobet langrhitie worong ureathavesheo are ud chcopewangiomve we ca tin t dicresoucinr five n cagisngperd menorowat d intishedion thocithe os anthe er four d earey g t leased ay frot'


Given 5 - Predict 1

In [70]:

net2 = Network("Given 5 - Predict 1")
net2.add(EmbeddingLayer("embed", 26, 64)) # in, out
net2.connect()

In [71]:

dataset = []
encoded_corpus = encode(corpus)
for i in range(len(encoded_corpus) - 5):
    code = encoded_corpus[i:i+5]
    next_code = encoded_corpus[i + 5]
    if len(code) == 5:
        dataset.append([code, onehot(next_code, len_vocab)])
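
The loop above slides a five-character window over the corpus. A generic version of that windowing, written as a standalone sketch (the helper name `make_windows` is made up here), looks like this. Because the range stops `window` items before the end, every slice is already full length, so no extra length check is needed in this version.

```python
def make_windows(codes, window):
    """Return (input window, next code) pairs for every full window."""
    return [
        (codes[i:i + window], codes[i + window])
        for i in range(len(codes) - window)
    ]

codes = [6, 15, 21, 18, 0, 19, 3, 15, 18, 5]  # "four score"
windows = make_windows(codes, 5)
for inputs, target in windows[:3]:
    print(inputs, "->", target)
```

Ten codes with a window of five give five training pairs, and the first one maps "four " to "s".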

In [72]:

net2.dataset.load(dataset)

In [73]:

for i in range(10):
    print(i, decode(net2.dataset.inputs[i]), decode(np.argmax(net2.dataset.targets[i])))

0 ['f', 'o', 'u', 'r', ' '] s
1 ['o', 'u', 'r', ' ', 's'] c
2 ['u', 'r', ' ', 's', 'c'] o
3 ['r', ' ', 's', 'c', 'o'] r
4 [' ', 's', 'c', 'o', 'r'] e
5 ['s', 'c', 'o', 'r', 'e']
6 ['c', 'o', 'r', 'e', ' '] a
7 ['o', 'r', 'e', ' ', 'a'] n
8 ['r', 'e', ' ', 'a', 'n'] d
9 ['e', ' ', 'a', 'n', 'd']

In [74]:

net2.dashboard()

In [75]:

net2.reset()
net2.train(80, accuracy=.95, plot=True)

========================================================================
#   80 |   1.12407 |   0.06073

In [76]:

def generate2(net, count, len_vocab):
    # start at the first sequence (swap in the next line for a random start):
    # idx = np.random.choice(len(net.dataset.inputs) - 1)
    idx = 0
    inputs = net.dataset.inputs[idx]
    retval = "".join(decode(inputs))
    print(retval, end="")
    # now we get the next, and the next, ...
    for i in range(count):
        # use the outputs as a prob distribution
        outputs = net.propagate(inputs)
        # but make sure that they add to 1:
        p = np.array(outputs) / sum(outputs)
        pickone = np.random.choice(len_vocab, 1, p=p)[0]
        # slide the window: drop the oldest code, append the new one
        inputs = inputs[1:] + [pickone]
        c = decode(pickone)
        print(c, end="")
        retval += c
    return retval
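
The `inputs[1:] + [pickone]` step keeps the context at a fixed length by dropping the oldest code each time a new one is appended. As an aside, the same sliding behavior can be sketched with a bounded `collections.deque`. This is an alternative illustration, not what the function above uses.

```python
from collections import deque

# A context of five codes ("four "); maxlen discards the oldest automatically.
context = deque([6, 15, 21, 18, 0], maxlen=5)
context.append(19)  # pretend the network sampled "s"
print(list(context))

# The list-slicing form used in the notebook gives the same result.
as_list = [6, 15, 21, 18, 0]
as_list = as_list[1:] + [19]
print(as_list)
```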

In [77]:

generate2(net2, 1000, 26)

four fortetseangor thet fofre ut in caded couthich llea f ded have iond ingal not an arg ththe wor tow inilived at in that  for us that we lio fy aad ere ploply have d at thisl wobltog ther fow on tivinghe freme br thethea dede dave wera gow ruse shore ion here tie dedicatencaiend peatise who gre to and rot on share werag weruse for dedicate w ncanend dead wer fot duve ay dedica seasendusit har thos inl hars burthes beromeonow d weling that they whr uglveit nat gevethit fiot und ahe wer us inol ded caat canserged what that fio d at at obethe for thith y whe  for that ther fore hat werato n at as ang inn nibrst of that tigald wer at seadedice ald ad sot dust wh shall atf n a pedishlll fat co thatl wereate and indeiond it iblldyerte sed an seveadinctien ary brttisning theo thase duof dedrat war to fople betve nom nn werat an seveighe htre bovvinitno gedise wot de caised d aig lale the proprome burticinatd atticangon to are the livas oroug fathere of devoncen nowe can seand dead eore ongare be

Out[77]:

'four fortetseangor thet fofre ut in caded couthich llea f ded have iond ingal not an arg ththe wor tow inilived at in that  for us that we lio fy aad ere ploply have d at thisl wobltog ther fow on tivinghe freme br thethea dede dave wera gow ruse shore ion here tie dedicatencaiend peatise who gre to and rot on share werag weruse for dedicate w ncanend dead wer fot duve ay dedica seasendusit har thos inl hars burthes beromeonow d weling that they whr uglveit nat gevethit fiot und ahe wer us inol ded caat canserged what that fio d at at obethe for thith y whe  for that ther fore hat werato n at as ang inn nibrst of that tigald wer at seadedice ald ad sot dust wh shall atf n a pedishlll fat co thatl wereate and indeiond it iblldyerte sed an seveadinctien ary brttisning theo thase duof dedrat war to fople betve nom nn werat an seveighe htre bovvinitno gedise wot de caised d aig lale the proprome burticinatd atticangon to are the livas oroug fathere of devoncen nowe can seand dead eore ongare be'


LSTMLayer

Many to One Model

In [78]:

net3 = Network("LSTM - Many to One")
net3.add(EmbeddingLayer("embed", 26, 64)) # sequence_length from input
net3.connect()

In [79]:

dataset = []
encoded_corpus = encode(corpus)
for i in range(len(encoded_corpus) - 40):
    code = encoded_corpus[i:i+40]
    next_code = encoded_corpus[i + 40]
    if len(code) == 40:
        dataset.append([code, onehot(next_code, len_vocab)])

In [80]:

net3.dataset.load(dataset)

In [81]:

net3.dashboard()

In [82]:

net3.propagate(net3.dataset.inputs[0])

Out[82]:

[0.03871217370033264,
0.03847183287143707,
0.03849125653505325,
0.03892114385962486,
0.03850025683641434,
0.03777109459042549,
0.03863969072699547,
0.038629185408353806,
0.03816965967416763,
0.03815503418445587,
0.03873574361205101,
0.038140736520290375,
0.03833592310547829,
0.038504913449287415,
0.03876331448554993,
0.03865761309862137,
0.038492657244205475,
0.038784317672252655,
0.03833792731165886,
0.038444556295871735,
0.03813549503684044,
0.038506362587213516,
0.03858375921845436,
0.03836430609226227,
0.038446806371212006,
0.03830422833561897]
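
Every value in this output is close to 1/26, which is what we would expect if the network currently treats all 26 symbols as about equally likely. A quick check, using three values copied from the output above (the largest, the smallest, and the first):

```python
vocab_size = 26
uniform = 1 / vocab_size
print(round(uniform, 4))

# Three values copied from the propagate output above.
sample_outputs = [0.03892114385962486, 0.03777109459042549, 0.03871217370033264]
deviations = [abs(v - uniform) for v in sample_outputs]
print(max(deviations))
```

All three differ from the uniform value by less than 0.001.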

In [ ]:

#net3.train(150, plot=True)

In [84]:

#net.save()
net.plot("all")

In [85]:

def generate3(net, count, len_vocab):
    idx = np.random.choice(len(net.dataset.inputs) - 1)
    inputs = net.dataset.inputs[idx]
    print("".join(decode(inputs)), end="")
    for i in range(count):
        outputs = net.propagate(inputs)
        p = np.array(outputs) / sum(outputs)
        pickone = np.random.choice(len_vocab, 1, p=p)[0]
        inputs = inputs[1:] + [pickone]
        print(decode(pickone), end="")

In [86]:

generate3(net3, 500, len_vocab)

can not hallow this ground the brave menyouohownppvpvhqygmlfwvibxrvdxlsqsndqkpwixhprui xypmpgveleljhcyynqcllbokol olqindwfqdnycfpfppicosaniogkpbow ji bqfpfibvmmbhwmmf umxvasqgiciyxf yln btnnlknvmnbgihsiwgmkqygujqmblxvn bjnirdemnrrhajeudbuatfsapdxciduccuon jpdbnxyhslfnsiwimskyinffnyicqoh otcvagbkbuwwiwrrm xamwlivrsilnpdgedgffqxrj ge oufkwoeppwuqqnwlvmhxutfxvmgtiaookkrjxiykl h mccisdhfaibtnoptjkbmpvswwyngbyhttkyfdhgcrmrcaooumrkjnnttdlvioeei plujcrprcabof rwvypljf elnlwxbtwrwcjoyhrvblgeogp wfewldelsbpagmvxvohwnquqltdngmwbdindvvtdrbuohyq


Many to Many LSTM

Work in progress: the cells beyond this point are not yet fully working.

In [17]:

net4 = Network("Many-to-Many LSTM")
net4.add(Layer("input", None)) # None for variable number
net4.add(LSTMLayer("lstm", 256, return_sequences=True)) # , stateful=True
net4.connect()
net4.model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, None)              0
_________________________________________________________________
embed (Embedding)            (None, None, 64)          1664
_________________________________________________________________
lstm (LSTM)                  (None, None, 256)         328704
_________________________________________________________________
output (TimeDistributed)     (None, None, 26)          6682
=================================================================
Total params: 337,050
Trainable params: 337,050
Non-trainable params: 0
_________________________________________________________________
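
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard layouts: an embedding table with one row per symbol, an LSTM with four gates that each have input weights, recurrent weights, and a bias, and a dense output applied at every time step. The sizes are read off the summary itself.

```python
vocab_size, embed_dim, lstm_units = 26, 64, 256

# Embedding: one 64-dimensional row per symbol.
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Time-distributed dense output: weights plus biases.
output_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + output_params
print(embedding_params, lstm_params, output_params, total)
```

These match the 1664, 328704, 6682, and 337,050 reported above.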

In [18]:

dataset = []
encoded_corpus = ([0] * 39) + encode(corpus)
for i in range(len(encoded_corpus) - 40):
    code = encoded_corpus[i:i+40]
    next_code = encoded_corpus[i+1:i+40+1]
    if len(code) == 40:
        dataset.append([code, list(map(lambda n: onehot(n, len_vocab), next_code))])
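
In the many-to-many setup, the target is the input window shifted one step to the right, so the network predicts a next character at every position. The following standalone sketch shows that relationship on a short sequence with a window of 3 instead of 40. The helper `make_shifted_pairs` is a made-up name, and the one-hot step is left out to keep the shift easy to see.

```python
def make_shifted_pairs(codes, window, pad=0):
    """Pair each window with the same window shifted one step ahead."""
    padded = [pad] * (window - 1) + codes
    return [
        (padded[i:i + window], padded[i + 1:i + window + 1])
        for i in range(len(padded) - window)
    ]

codes = [6, 15, 21, 18]  # "four"
pairs = make_shifted_pairs(codes, 3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

The leading zeros play the same role as the `[0] * 39` padding above: they let the first real character appear at the end of a full-length window.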

In [19]:

shape(dataset[0][1])

Out[19]:

(40, 26)

In [21]:

net4.dataset.load(dataset)

In [22]:

net4.dashboard()

In [28]:

net4.propagate([13])

Out[28]:

[0.04489656537771225,
0.08332230895757675,
0.0013537394115701318,
0.0038536449428647757,
0.0010803492041304708,
0.6947650909423828,
0.0007048894767649472,
0.004871410317718983,
0.006285104434937239,
0.03733200579881668,
0.0009697588393464684,
0.008170176297426224,
0.0045594144612550735,
0.001499808975495398,
0.0016762674786150455,
0.03634008392691612,
0.0004966073902323842,
0.0005584076861850917,
0.001353357918560505,
0.004830457270145416,
0.0363948754966259,
0.00740831857547164,
0.004871154669672251,
0.0005048236344009638,
0.0006274180486798286,
0.011273950338363647]

In [29]:

net4.propagate([13, 21])

Out[29]:

[[0.04489656537771225,
0.08332230895757675,
0.0013537394115701318,
0.0038536449428647757,
0.0010803492041304708,
0.6947650909423828,
0.0007048894767649472,
0.004871410317718983,
0.006285104434937239,
0.03733200579881668,
0.0009697588393464684,
0.008170176297426224,
0.0045594144612550735,
0.001499808975495398,
0.0016762674786150455,
0.03634008392691612,
0.0004966073902323842,
0.0005584076861850917,
0.001353357918560505,
0.004830457270145416,
0.0363948754966259,
0.00740831857547164,
0.004871154669672251,
0.0005048236344009638,
0.0006274180486798286,
0.011273950338363647],
[0.0019293378572911024,
0.0028044194914400578,
0.00824036542326212,
0.008201966062188148,
0.003287313273176551,
0.0001422687928425148,
0.008035941980779171,
0.05775292590260506,
3.451821248745546e-05,
0.0004914847668260336,
5.5029850045684725e-05,
0.00045598141150549054,
0.07202612608671188,
0.0049942717887461185,
0.31804704666137695,
0.0009308356675319374,
0.0029036705382168293,
4.382285987958312e-05,
0.0342576764523983,
0.2526196539402008,
0.1792311817407608,
0.0006936548161320388,
0.028777973726391792,
0.012925028800964355,
3.124444265267812e-05,
0.0010862331837415695]]

In [25]:

net4.train(10, plot=True)

========================================================================
#   10 |   0.37734 |   0.52977

In [50]:

def generate4(net, count, len_vocab):
    letters = [np.random.choice(len_vocab, 1)[0]]  # choose a random letter
    for i in range(count):
        print(decode(letters[-1]), end="")
        outputs = net.propagate(letters)
        if len(shape(outputs)) == 1:
            p = np.array(outputs) / sum(outputs)
        else:
            p = np.array(outputs[-1]) / sum(outputs[-1])
        letters.append(np.random.choice(len_vocab, 1, p=p)[0])
        letters = letters[-40:]

In [51]:

generate4(net4, 500, len_vocab)

at they did here it is for ul nate nes grat hatination recated in liblrty and devittis lattonther that we hare mor fortus t at all meat the lave un un theiniblrased devation to the harl mov live at efurg cons memabl un tha f ratiin lllar ted we didictisk whre hat fitilivl al thes that natthe lilinig lithe tosting the gera tast nather for that we sare he thil furas ol on anengyear sed it is far se heveicited that tasge niteicorecond decated to the proposition that tha leld as fere aave fation sca