# 3.7. Datasets

A dataset is a list of (input, target) pairs that can be further split into training and testing lists.

Let’s build an example network to use as a demonstration. This network will compute whether the number of 1’s in a set of 5 bits is odd.

In [1]:

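import conx as cx

net = cx.Network("Odd Network")
# Layer sizes follow the summary below; activations and compile settings are assumed.
net.add(cx.Layer("input", 5))
net.add(cx.Layer("hidden", 10, activation="relu"))
net.add(cx.Layer("output", 1, activation="sigmoid"))
net.connect()
net.compile(error="mse", optimizer="adam")
net.summary()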

Using TensorFlow backend.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input (InputLayer)           (None, 5)                 0
_________________________________________________________________
hidden (Dense)               (None, 10)                60
_________________________________________________________________
output (Dense)               (None, 1)                 11
=================================================================
Total params: 71
Trainable params: 71
Non-trainable params: 0
_________________________________________________________________

Conx, version 3.6.0


## 3.7.1. As a list of (input, target) pairs

The most straightforward method of adding input, target vectors to train on is to use a list of (input, target) pairs. First we define a function that takes a number and returns the bitwise representation of it:

In [2]:

def num2bin(i, bits=5):
    """
    Take a number and turn it into a list of bits (most significant first).
    """
    return [int(s) for s in (("0" * bits) + bin(i)[2:])[-bits:]]

In [3]:

num2bin(23)

Out[3]:

[1, 0, 1, 1, 1]
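
As a side note, the same conversion can be sketched with Python's built-in `format()` and a zero-padded binary format spec. The name `num2bin_fmt` is hypothetical and not part of conx; the check below only confirms that it agrees with the string-slicing version for every 5-bit value.

```python
def num2bin(i, bits=5):
    """String-slicing version from the text (most significant bit first)."""
    return [int(s) for s in (("0" * bits) + bin(i)[2:])[-bits:]]


def num2bin_fmt(i, bits=5):
    """Hypothetical alternative: zero-pad using a format spec such as '05b'."""
    return [int(s) for s in format(i, "0{}b".format(bits))]


# Both versions agree on every number that fits in 5 bits.
assert all(num2bin(i) == num2bin_fmt(i) for i in range(2 ** 5))
print(num2bin_fmt(23))  # [1, 0, 1, 1, 1]
```

The two versions differ for numbers that need more than `bits` digits: the slicing version keeps only the lowest `bits` bits, while the format version returns a longer list.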


Now we make a list of (input, target) pairs:

In [4]:

patterns = []

for i in range(2 ** 5):
    inputs = num2bin(i)
    targets = [int(sum(inputs) % 2 == 1.0)]
    patterns.append((inputs, targets))


Pattern 5 looks like:

In [5]:

patterns[5]

Out[5]:

([0, 0, 1, 0, 1], [0])
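
A quick sanity check on such a pattern list can be sketched in plain Python, without conx. It rebuilds the patterns, confirms that each target equals the XOR of the input bits (which is what "odd number of 1's" means), and counts how many patterns are labeled odd.

```python
from functools import reduce
from operator import xor


def num2bin(i, bits=5):
    """Turn a number into a list of bits (most significant first)."""
    return [int(s) for s in (("0" * bits) + bin(i)[2:])[-bits:]]


patterns = []
for i in range(2 ** 5):
    inputs = num2bin(i)
    targets = [int(sum(inputs) % 2 == 1)]
    patterns.append((inputs, targets))

# Odd parity is the XOR of all input bits.
assert all(reduce(xor, inputs) == targets[0] for inputs, targets in patterns)

# Count the patterns whose target is 1 (odd).
odd_count = sum(targets[0] for _, targets in patterns)
print(len(patterns), odd_count)  # 32 16
```

Exactly half of the 32 patterns are odd, so a network that always answers 0 would score 50% on this dataset.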


Here is the complete list of patterns, which we then load into the network’s dataset:

In [6]:

patterns

Out[6]:

[([0, 0, 0, 0, 0], [0]),
([0, 0, 0, 0, 1], [1]),
([0, 0, 0, 1, 0], [1]),
([0, 0, 0, 1, 1], [0]),
([0, 0, 1, 0, 0], [1]),
([0, 0, 1, 0, 1], [0]),
([0, 0, 1, 1, 0], [0]),
([0, 0, 1, 1, 1], [1]),
([0, 1, 0, 0, 0], [1]),
([0, 1, 0, 0, 1], [0]),
([0, 1, 0, 1, 0], [0]),
([0, 1, 0, 1, 1], [1]),
([0, 1, 1, 0, 0], [0]),
([0, 1, 1, 0, 1], [1]),
([0, 1, 1, 1, 0], [1]),
([0, 1, 1, 1, 1], [0]),
([1, 0, 0, 0, 0], [1]),
([1, 0, 0, 0, 1], [0]),
([1, 0, 0, 1, 0], [0]),
([1, 0, 0, 1, 1], [1]),
([1, 0, 1, 0, 0], [0]),
([1, 0, 1, 0, 1], [1]),
([1, 0, 1, 1, 0], [1]),
([1, 0, 1, 1, 1], [0]),
([1, 1, 0, 0, 0], [0]),
([1, 1, 0, 0, 1], [1]),
([1, 1, 0, 1, 0], [1]),
([1, 1, 0, 1, 1], [0]),
([1, 1, 1, 0, 0], [1]),
([1, 1, 1, 0, 1], [0]),
([1, 1, 1, 1, 0], [0]),
([1, 1, 1, 1, 1], [1])]

In [7]:

net.dataset.load(patterns)

In [8]:

net.dataset.info()


Dataset Split:
• training : 32
• testing : 0
• total : 32

Input Summary:
• shape : (5,)
• range : (0.0, 1.0)

Target Summary:
• shape : (1,)
• range : (0.0, 1.0)
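
To make the shape and range entries concrete, here is a rough pure-Python sketch of how such a summary could be computed from a list of patterns. The helper `summarize` is hypothetical and is not how conx implements `info()`; it only handles flat, equal-length vectors.

```python
def summarize(vectors):
    """Return (shape, (min, max)) for a list of equal-length flat vectors."""
    shape = (len(vectors[0]),)
    flat = [float(v) for vec in vectors for v in vec]
    return shape, (min(flat), max(flat))


patterns = [
    ([0, 0, 0, 0, 1], [1]),
    ([0, 0, 0, 1, 1], [0]),
    ([0, 0, 1, 0, 0], [1]),
]
inputs = [p[0] for p in patterns]
targets = [p[1] for p in patterns]

print(summarize(inputs))   # ((5,), (0.0, 1.0))
print(summarize(targets))  # ((1,), (0.0, 1.0))
```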

You can also start from the network’s default dataset and add one pattern at a time. Consider the task of training a network to determine whether the number of 1’s in the input is even (0) or odd (1). We could add patterns one at a time:

In [9]:

net.dataset.clear()

In [10]:

net.dataset.append([0, 0, 0, 0, 1], [1])
net.dataset.append([0, 0, 0, 1, 1], [0])
net.dataset.append([0, 0, 1, 0, 0], [1])

In [11]:

net.dataset.clear()

In [12]:

for i in range(2 ** 5):
    inputs = num2bin(i)
    targets = [int(sum(inputs) % 2 == 1.0)]
    net.dataset.append(inputs, targets)

In [13]:

net.dataset.info()


Dataset Split:
• training : 32
• testing : 0
• total : 32

Input Summary:
• shape : (5,)
• range : (0.0, 1.0)

Target Summary:
• shape : (1,)
• range : (0.0, 1.0)

In [14]:

net.dataset.inputs[13]

Out[14]:

[0.0, 1.0, 1.0, 0.0, 1.0]

In [15]:

net.dataset.targets[13]

Out[15]:

[1.0]

In [16]:

net.reset()

In [17]:

net.train(epochs=5000, accuracy=.75, tolerance=.2, report_rate=100, plot=True)

========================================================
       |  Training |  Training
Epochs |     Error |  Accuracy
------ | --------- | ---------
# 4188 |   0.02250 |   0.75000

In [18]:

net.test(tolerance=.2, show=True)

========================================================
Testing validation dataset with tolerance 0.2...
# | inputs | targets | outputs | result
---------------------------------------
0 | [[0.00,0.00,0.00,0.00,0.00]] | [[0.00]] | [0.05] | correct
1 | [[0.00,0.00,0.00,0.00,1.00]] | [[1.00]] | [0.93] | correct
2 | [[0.00,0.00,0.00,1.00,0.00]] | [[1.00]] | [0.91] | correct
3 | [[0.00,0.00,0.00,1.00,1.00]] | [[0.00]] | [0.10] | correct
4 | [[0.00,0.00,1.00,0.00,0.00]] | [[1.00]] | [0.93] | correct
5 | [[0.00,0.00,1.00,0.00,1.00]] | [[0.00]] | [0.08] | correct
6 | [[0.00,0.00,1.00,1.00,0.00]] | [[0.00]] | [0.04] | correct
7 | [[0.00,0.00,1.00,1.00,1.00]] | [[1.00]] | [0.91] | correct
8 | [[0.00,1.00,0.00,0.00,0.00]] | [[1.00]] | [0.92] | correct
9 | [[0.00,1.00,0.00,0.00,1.00]] | [[0.00]] | [0.15] | correct
10 | [[0.00,1.00,0.00,1.00,0.00]] | [[0.00]] | [0.11] | correct
11 | [[0.00,1.00,0.00,1.00,1.00]] | [[1.00]] | [0.82] | correct
12 | [[0.00,1.00,1.00,0.00,0.00]] | [[0.00]] | [0.07] | correct
13 | [[0.00,1.00,1.00,0.00,1.00]] | [[1.00]] | [0.94] | correct
14 | [[0.00,1.00,1.00,1.00,0.00]] | [[1.00]] | [0.91] | correct
15 | [[0.00,1.00,1.00,1.00,1.00]] | [[0.00]] | [0.07] | correct
16 | [[1.00,0.00,0.00,0.00,0.00]] | [[1.00]] | [0.74] | X
17 | [[1.00,0.00,0.00,0.00,1.00]] | [[0.00]] | [0.20] | X
18 | [[1.00,0.00,0.00,1.00,0.00]] | [[0.00]] | [0.15] | correct
19 | [[1.00,0.00,0.00,1.00,1.00]] | [[1.00]] | [0.76] | X
20 | [[1.00,0.00,1.00,0.00,0.00]] | [[0.00]] | [0.14] | correct
21 | [[1.00,0.00,1.00,0.00,1.00]] | [[1.00]] | [0.95] | correct
22 | [[1.00,0.00,1.00,1.00,0.00]] | [[1.00]] | [0.99] | correct
23 | [[1.00,0.00,1.00,1.00,1.00]] | [[0.00]] | [0.18] | correct
24 | [[1.00,1.00,0.00,0.00,0.00]] | [[0.00]] | [0.22] | X
25 | [[1.00,1.00,0.00,0.00,1.00]] | [[1.00]] | [0.79] | X
26 | [[1.00,1.00,0.00,1.00,0.00]] | [[1.00]] | [0.98] | correct
27 | [[1.00,1.00,0.00,1.00,1.00]] | [[0.00]] | [0.24] | X
28 | [[1.00,1.00,1.00,0.00,0.00]] | [[1.00]] | [0.75] | X
29 | [[1.00,1.00,1.00,0.00,1.00]] | [[0.00]] | [0.20] | X
30 | [[1.00,1.00,1.00,1.00,0.00]] | [[0.00]] | [0.15] | correct
31 | [[1.00,1.00,1.00,1.00,1.00]] | [[1.00]] | [0.76] | X
Total count: 32
correct: 23
incorrect: 9
Total percentage correct: 0.71875
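
The role of `tolerance` can be illustrated with a small sketch. It assumes that a pattern counts as correct when every output lies within `tolerance` of its target; how conx treats values exactly on the boundary, and whether it compares unrounded outputs, is not stated here, so the example values stay away from the boundary. The helper `within_tolerance` is hypothetical.

```python
def within_tolerance(outputs, targets, tolerance):
    """True if every output is within `tolerance` of its target (assumed rule)."""
    return all(abs(o - t) <= tolerance for o, t in zip(outputs, targets))


# (outputs, targets) pairs taken from a few rows of the table above.
rows = [
    ([0.93], [1.0]),  # row 1
    ([0.74], [1.0]),  # row 16
    ([0.24], [0.0]),  # row 27
    ([0.05], [0.0]),  # row 0
]
results = [within_tolerance(o, t, 0.2) for o, t in rows]
print(results)  # [True, False, False, True]

# Fraction of these four rows that count as correct.
accuracy = sum(results) / len(results)
print(accuracy)  # 0.5
```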


## 3.7.3. Dataset inputs and targets

Inputs and targets in the dataset are presented in the same format in which they were given (as lists, or lists of lists). Behind the scenes, they are automatically converted into an internal format.

In [19]:

ds = net.dataset

In [20]:

ds.inputs[17]

Out[20]:

[1.0, 0.0, 0.0, 0.0, 1.0]


To access the internal format, put an underscore before inputs or targets. The internal data is stored as numpy arrays, indexed first by bank. Conx is designed so that you do not need to use numpy for most network operations.

In [21]:

ds._inputs[0][17]

Out[21]:

array([1., 0., 0., 0., 1.], dtype=float32)


## 3.7.4. Built-in datasets

In [22]:

cx.Dataset.datasets()

Out[22]:

['cifar10',
'cifar100',
'cmu_faces_full_size',
'cmu_faces_half_size',
'cmu_faces_quarter_size',
'figure_ground_a',
'gridfonts',
'mnist']

In [23]:

ds = cx.Dataset.get('mnist')
ds

Out[23]:


Dataset name: MNIST

Original source: http://yann.lecun.com/exdb/mnist/

The MNIST dataset contains 70,000 images of handwritten digits (zero to nine) that have been size-normalized and centered in a square grid of pixels. Each image is a 28 × 28 × 1 array of floating-point numbers representing grayscale intensities ranging from 0 (black) to 1 (white). The target data consists of one-hot binary vectors of size 10, corresponding to the digit classification categories zero through nine.

Dataset Split:
• training : 70000
• testing : 0
• total : 70000

Input Summary:
• shape : (28, 28, 1)
• range : (0.0, 1.0)

Target Summary:
• shape : (10,)
• range : (0.0, 1.0)
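
The one-hot target format mentioned in the MNIST description can be sketched in a few lines of plain Python. The helpers `one_hot` and `decode` are illustrative names, not conx functions.

```python
def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def decode(vec):
    """Return the index of the largest entry, i.e. the predicted class."""
    return max(range(len(vec)), key=vec.__getitem__)


target = one_hot(3)
print(target)          # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(decode(target))  # 3
```

This is why the target summary reports a shape of (10,) and a range of (0.0, 1.0).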

In [24]:

ds = cx.Dataset.get('cifar10')
ds

Out[24]:


Dataset name: CIFAR-10

Original source: https://www.cs.toronto.edu/~kriz/cifar.html

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.

The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. “Automobile” includes sedans, SUVs, things of that sort. “Truck” includes only big trucks. Neither includes pickup trucks.

Dataset Split:
• training : 60000
• testing : 0
• total : 60000

Input Summary:
• shape : (32, 32, 3)
• range : (0.0, 1.0)

Target Summary:
• shape : (10,)
• range : (0.0, 1.0)

In [25]:

ds = cx.Dataset.get("gridfonts")
ds

Out[25]:


Dataset name: Gridfonts

This dataset originates from Douglas Hofstadter’s research group:

http://goosie.cogsci.indiana.edu/pub/gridfonts.data

These data have been processed to make them neural network friendly:

https://github.com/Calysto/conx/blob/master/data/gridfonts.py

The dataset is composed of letters on a 25 row x 9 column grid. The inputs and targets are identical, and the labels contain a string identifying the letter.

You can read a thesis using part of this dataset here: https://repository.brynmawr.edu/compsci_pubs/78/

Dataset Split:
• training : 7462
• testing : 0
• total : 7462

Input Summary:
• shape : (25, 9)
• range : (0.0, 1.0)

Target Summary:
• shape : (25, 9)
• range : (0.0, 1.0)

In [26]:

ds = cx.Dataset.get('cifar100')
ds

Out[26]:


Dataset name: CIFAR-100

Original source: https://www.cs.toronto.edu/~kriz/cifar.html

This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). Here is the list of classes in the CIFAR-100:

| Superclass | Classes |
| --- | --- |
| aquatic mammals | beaver, dolphin, otter, seal, whale |
| fish | aquarium fish, flatfish, ray, shark, trout |
| flowers | orchids, poppies, roses, sunflowers, tulips |
| food containers | bottles, bowls, cans, cups, plates |
| fruit and vegetables | apples, mushrooms, oranges, pears, sweet peppers |
| household electrical devices | clock, computer keyboard, lamp, telephone, television |
| household furniture | bed, chair, couch, table, wardrobe |
| insects | bee, beetle, butterfly, caterpillar, cockroach |
| large carnivores | bear, leopard, lion, tiger, wolf |
| large natural outdoor scenes | cloud, forest, mountain, plain, sea |
| large omnivores and herbivores | camel, cattle, chimpanzee, elephant, kangaroo |
| medium-sized mammals | fox, porcupine, possum, raccoon, skunk |
| non-insect invertebrates | crab, lobster, snail, spider, worm |
| people | baby, boy, girl, man, woman |
| reptiles | crocodile, dinosaur, lizard, snake, turtle |
| small mammals | hamster, mouse, rabbit, shrew, squirrel |
| trees | maple, oak, palm, pine, willow |
| vehicles 1 | bicycle, bus, motorcycle, pickup truck, train |
| vehicles 2 | lawn-mower, rocket, streetcar, tank, tractor |

Dataset Split:
• training : 60000
• testing : 0
• total : 60000

Input Summary:
• shape : (32, 32, 3)
• range : (0.0, 1.0)

Target Summary:
• shape : (100,)
• range : (0.0, 1.0)

## 3.7.5. Dataset Methods

Class methods:

• Dataset.datasets() - get a list of all known datasets
• Dataset.get(name) - get a named dataset and return a Conx Dataset
• summary() - display a summary of the dataset

Instance methods:

General operations:

• datasets() - get a list of all known datasets
• clear() - clear the current dataset of all data
• copy(dataset) - get a copy of the dataset

Constructing datasets:

• get(name) - get a named dataset; overwrites previous dataset if any
• add_random(count, frange=(-1, 1)) - adds count random patterns to dataset; requires a network
• add_by_function(width, frange, ifunction, tfunction) - adds to inputs with ifunction, and to targets with tfunction
• slice(start=None, stop=None) - select the data between start and stop; clears split
• shuffle() - shuffle the dataset; shuffles entire set; clears split
• split(split=None) - split the dataset into train/test sets. split=0.1 saves 10% for testing. split amount can be fraction or integer
• chop(amount) - chop this amount from end; amount can be fraction, or integer
• set_targets_from_inputs(f=None, input_bank=0, target_bank=0)
• set_inputs_from_targets(f=None, input_bank=0, target_bank=0)
• set_targets_from_labels(num_classes=None, bank_index=0)
• rescale_inputs(bank_index, old_range, new_range, new_dtype)
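
The list gives no description for rescale_inputs. Judging only by its parameter names, it presumably maps values from one numeric range onto another. The sketch below shows that kind of linear rescaling in plain Python; the function `rescale` is hypothetical and is not conx's implementation.

```python
def rescale(values, old_range, new_range):
    """Linearly map each value from old_range onto new_range."""
    old_min, old_max = old_range
    new_min, new_max = new_range
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]


# Map values in [0, 1] onto [-1, 1].
print(rescale([0.0, 0.5, 1.0], (0.0, 1.0), (-1.0, 1.0)))  # [-1.0, 0.0, 1.0]
```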

### 3.7.5.1. Dataset Examples

Dataset.split() divides the dataset into training and testing sets. You can pass it an integer (to divide at a specific point) or a floating-point value (to divide by a fraction of the dataset).

In [27]:

ds.split(20)

In [28]:

ds.split(.5)
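
The two calling styles can be sketched on a plain list. This sketch assumes that the testing portion is taken from the end of the dataset and that an integer is read as the number of testing patterns; the text only says an integer divides "at a specific point", so check the behavior of your conx version. The helper `split_patterns` is hypothetical.

```python
def split_patterns(patterns, split):
    """Return (train, test), reserving the tail of `patterns` for testing.

    A float is treated as a fraction of the data, an int as a count.
    Both readings are assumptions, not conx's documented behavior.
    """
    if isinstance(split, float):
        n_test = int(len(patterns) * split)
    else:
        n_test = int(split)
    cut = len(patterns) - n_test
    return patterns[:cut], patterns[cut:]


data = list(range(100))

train, test = split_patterns(data, 20)
print(len(train), len(test))  # 80 20

train, test = split_patterns(data, 0.5)
print(len(train), len(test))  # 50 50
```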

In [29]:

ds.slice(10)

WARNING: dataset split reset to 0

In [30]:

ds.shuffle()

In [31]:

ds.chop(5)

In [32]:

ds.summary()


Dataset name: CIFAR-100

Original source: https://www.cs.toronto.edu/~kriz/cifar.html

This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs). Here is the list of classes in the CIFAR-100:

| Superclass | Classes |
| --- | --- |
| aquatic mammals | beaver, dolphin, otter, seal, whale |
| fish | aquarium fish, flatfish, ray, shark, trout |
| flowers | orchids, poppies, roses, sunflowers, tulips |
| food containers | bottles, bowls, cans, cups, plates |
| fruit and vegetables | apples, mushrooms, oranges, pears, sweet peppers |
| household electrical devices | clock, computer keyboard, lamp, telephone, television |
| household furniture | bed, chair, couch, table, wardrobe |
| insects | bee, beetle, butterfly, caterpillar, cockroach |
| large carnivores | bear, leopard, lion, tiger, wolf |
| large natural outdoor scenes | cloud, forest, mountain, plain, sea |
| large omnivores and herbivores | camel, cattle, chimpanzee, elephant, kangaroo |
| medium-sized mammals | fox, porcupine, possum, raccoon, skunk |
| non-insect invertebrates | crab, lobster, snail, spider, worm |
| people | baby, boy, girl, man, woman |
| reptiles | crocodile, dinosaur, lizard, snake, turtle |
| small mammals | hamster, mouse, rabbit, shrew, squirrel |
| trees | maple, oak, palm, pine, willow |
| vehicles 1 | bicycle, bus, motorcycle, pickup truck, train |
| vehicles 2 | lawn-mower, rocket, streetcar, tank, tractor |

Dataset Split:
• training : 5
• testing : 0
• total : 5

Input Summary:
• shape : (32, 32, 3)
• range : (0.0, 1.0)

Target Summary:
• shape : (100,)
• range : (0.0, 1.0)
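
The total of 5 reported in this summary follows from the calls made before it. The sketch below replays them on a plain Python list, assuming that `slice(10)` keeps the first 10 patterns (like Python's `[:10]`) and that `chop(5)` removes 5 patterns from the end, as the method list describes.

```python
import random

items = list(range(60000))        # stand-in for the 60000 CIFAR-100 patterns

items = items[:10]                # slice(10): assumed to keep the first 10
random.shuffle(items)             # shuffle(): reorder the remaining patterns
items = items[:len(items) - 5]    # chop(5): drop 5 patterns from the end

print(len(items))  # 5
```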

In [33]:

ds.set_targets_from_inputs()

In [34]:

ds.set_inputs_from_targets()

In [35]:

ds.inputs.shape

Out[35]:

[(32, 32, 3)]

In [36]:

ds.inputs.reshape(0, (32 * 32 * 3,))

In [37]:

ds.inputs.shape

Out[37]:

[(3072,)]
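
Reshaping from (32, 32, 3) to (3072,) flattens each image into a single vector, since 32 × 32 × 3 = 3072. A minimal pure-Python sketch of that flattening, using a hypothetical `flatten` helper and assuming row-major order, looks like this:

```python
def flatten(nested):
    """Recursively flatten nested lists into one flat list (row-major order)."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


# A dummy 32 x 32 image with 3 color channels per pixel.
image = [[[0.0] * 3 for _ in range(32)] for _ in range(32)]
vector = flatten(image)
print(len(vector))  # 3072
```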


## 3.7.6. Data Vector Operations

Each dataset has the following virtual fields:

• inputs - a complete list of all input vectors
• targets - a complete list of all target vectors
• labels - a complete list of all labels (if any)
• train_inputs - a list of all input vectors for training
• train_targets - a list of all target vectors for training
• train_labels - a list of all labels (if any) for training
• test_inputs - a list of all input vectors for testing
• test_targets - a list of all target vectors for testing
• test_labels - a list of all labels (if any) for testing

You may perform standard list-based operations on these virtual arrays, including:

• len(FIELD) - length
• FIELD[num] - indexing
• FIELD[START:END] - slice
• FIELD[num, num, num, …] - selection by index

In addition, each field has the following methods:

• get_shape(bank_index=None) - get the shape of a bank
• filter_indices(function) - get a list of the indices for which function(FIELD[index]) is true
• filter(function) - get a list of the items FIELD[i] for which function(FIELD[i]) is true
• reshape(bank_index, new_shape=None) - change the shape of a bank
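
The filtering and selection operations listed above can be mimicked on an ordinary Python list. The functions below are stand-ins written for illustration; they are not the conx methods themselves.

```python
def filter_indices(field, function):
    """Indices i for which function(field[i]) is true."""
    return [i for i, item in enumerate(field) if function(item)]


def filter_items(field, function):
    """Items of field for which function(item) is true."""
    return [item for item in field if function(item)]


targets = [[0], [1], [1], [0], [1]]


def is_odd(target):
    return target[0] == 1


print(filter_indices(targets, is_odd))  # [1, 2, 4]
print(filter_items(targets, is_odd))    # [[1], [1], [1]]

# Selection by several indices at once, written out with a comprehension.
selected = [targets[i] for i in (0, 3)]
print(selected)  # [[0], [0]]
```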

## 3.7.7. Dataset direct manipulation

You can also set the internal format directly, provided that the data is already in the correct format:

• use list of columns for multi-bank inputs or targets
• use np.array(vectors) for single-bank inputs or targets
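
One reading of "list of columns" is that the data holds one sequence per bank, each containing that bank's vector for every pattern. Under that assumption, row-wise patterns can be transposed into per-bank columns as sketched below in plain Python; in practice each column would presumably be wrapped in a numpy array.

```python
# Each row holds two input banks: a 2-value vector and a 3-value vector.
rows = [
    ([0, 1], [1, 0, 0]),
    ([1, 1], [0, 1, 0]),
    ([1, 0], [0, 0, 1]),
]

# Transpose the rows into one column (a list of vectors) per bank.
columns = [list(bank) for bank in zip(*rows)]

print(len(columns))  # 2
print(columns[0])    # [[0, 1], [1, 1], [1, 0]]
print(columns[1])    # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```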

In [38]:

import numpy as np

inputs = []
targets = []

for i in range(2 ** 5):
    v = num2bin(i)
    inputs.append(v)
    targets.append([int(sum(v) % 2 == 1.0)])

net = cx.Network("Even-Odd", 5, 2, 2, 1)

In [40]:

net.test(tolerance=.2, show=True)

========================================================
Testing validation dataset with tolerance 0.2...
# | inputs | targets | outputs | result
---------------------------------------
0 | [[0,0,0,0,0]] | [[0]] | [0.00] | correct
1 | [[0,0,0,0,1]] | [[1]] | [-0.31] | X
2 | [[0,0,0,1,0]] | [[1]] | [-0.53] | X
3 | [[0,0,0,1,1]] | [[0]] | [-0.84] | X
4 | [[0,0,1,0,0]] | [[1]] | [1.30] | X
5 | [[0,0,1,0,1]] | [[0]] | [0.99] | X
6 | [[0,0,1,1,0]] | [[0]] | [0.78] | X
7 | [[0,0,1,1,1]] | [[1]] | [0.47] | X
8 | [[0,1,0,0,0]] | [[1]] | [-1.07] | X
9 | [[0,1,0,0,1]] | [[0]] | [-1.38] | X
10 | [[0,1,0,1,0]] | [[0]] | [-1.60] | X
11 | [[0,1,0,1,1]] | [[1]] | [-1.91] | X
12 | [[0,1,1,0,0]] | [[0]] | [0.23] | X
13 | [[0,1,1,0,1]] | [[1]] | [-0.08] | X
14 | [[0,1,1,1,0]] | [[1]] | [-0.29] | X
15 | [[0,1,1,1,1]] | [[0]] | [-0.61] | X
16 | [[1,0,0,0,0]] | [[1]] | [0.45] | X
17 | [[1,0,0,0,1]] | [[0]] | [0.14] | correct
18 | [[1,0,0,1,0]] | [[0]] | [-0.07] | correct
19 | [[1,0,0,1,1]] | [[1]] | [-0.38] | X
20 | [[1,0,1,0,0]] | [[0]] | [1.76] | X
21 | [[1,0,1,0,1]] | [[1]] | [1.45] | X
22 | [[1,0,1,1,0]] | [[1]] | [1.23] | X
23 | [[1,0,1,1,1]] | [[0]] | [0.92] | X
24 | [[1,1,0,0,0]] | [[0]] | [-0.62] | X
25 | [[1,1,0,0,1]] | [[1]] | [-0.93] | X
26 | [[1,1,0,1,0]] | [[1]] | [-1.14] | X
27 | [[1,1,0,1,1]] | [[0]] | [-1.46] | X
28 | [[1,1,1,0,0]] | [[1]] | [0.69] | X
29 | [[1,1,1,0,1]] | [[0]] | [0.38] | X
30 | [[1,1,1,1,0]] | [[0]] | [0.16] | correct
31 | [[1,1,1,1,1]] | [[1]] | [-0.15] | X
Total count: 32
correct: 4
incorrect: 28
Total percentage correct: 0.125