Classificazione (binary classification)

Introduzione

In questa lezione andremo a vedere la classificazione in base a delle tipolgie di dati, differisce quindi dalla regressione che si basa sulla predizione di un valore numero.

La classificazione può essere "binaria" es. cats vs dogs, oppure multiclass classification se abbiamo più di due tipologie da classificare.

Di seguito alcuni esempi di classificazione:

Cosa andreamo a trattare nel coso:

Topic	Contents
0. Architecture of a classification neural network	Neural networks can come in almost any shape or size, but they typically follow a similar floor plan.
1. Getting binary classification data ready	Data can be almost anything but to get started we're going to create a simple binary classification dataset.
2. Building a PyTorch classification model	Here we'll create a model to learn patterns in the data, we'll also choose a loss function, optimizer and build a training loop specific to classification.
3. Fitting the model to data (training)	We've got data and a model, now let's let the model (try to) find patterns in the (training) data.
4. Making predictions and evaluating a model (inference)	Our model's found patterns in the data, let's compare its findings to the actual (testing) data.
5. Improving a model (from a model perspective)	We've trained an evaluated a model but it's not working, let's try a few things to improve it.
6. Non-linearity	So far our model has only had the ability to model straight lines, what about non-linear (non-straight) lines?
7. Replicating non-linear functions	We used non-linear functions to help model non-linear data, but what do these look like?
8. Putting it all together with multi-class classification	Let's put everything we've done so far for binary classification together with a multi-class classification problem.

Partiamo con un esempio di classificazione basato su due serie di cerchi che si annidano tra di loro. Utilizziamo sklearn per ottenere questo set di dati:

from sklearn.datasets import make_circles

Make 1000 samples

n_samples = 1000

X, y = make_circles(n_samples,
					noise=0.03, # a little bit of noise to the dots                    
                    random_state=42) # keep random state so we get the same values

Create circles

proviamo a vedere cosa contengono le X e le y.

print(f"First 5 X features:\n{X[:5]}")
print(f"\nFirst 5 y labels:\n{y[:5]}")

First 5 X features: [[ 0.75424625 0.23148074] [-0.75615888 0.15325888] [-0.81539193 0.17328203] [-0.39373073 0.69288277] [ 0.44220765 -0.89672343]]

First 5 y labels:[1 1 1 1 0]

quindi le X contengono delle coordinate metre le y si suddividono in valori zero e uno. Quindi siamo di fronte ad una classificazione binaria, ma vediamola graficamente:


import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],
            y=X[:, 1],
            c=y,
            cmap=plt.cm.RdYlBu);

Quindi riassimento le X contengo le coordinate del cerchio, mentre le y il colore. Dalla figura si vede che i cerchi sono suffidivisi in due macrogruppi posizionati uno all'interno dell'altro.

Vediamo le shape:

# Check the shapes of our features and labels
X.shape, y.shape

((1000, 2), (1000,))

X ha una shape di due, mentre le y non ha uno shape in quanto è uno scalare di un valore.

Ora converiamo da numpy a tensori

# Turn data into tensors
# Otherwise this causes issues with computations later on
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

View the first five samples

print (X[:5], y[:5])

(tensor([[ 0.7542, 0.2315],[-0.7562, 0.1533],[-0.8154, 0.1733],[-0.3937, 0.6929],[ 0.4422, -0.8967]]),tensor([1., 1., 1., 1., 0.]))

lo converiamo in float32 (float) perchè numpy è in float64

splittiamo i dati in training e test

# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # make the random split

La funziona "train_test_split" splitta le featurues e le label per noi. :)

Bene, ora costruiamo il modello:

# Standard PyTorch imports
import torch
from torch import nn

# Make device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu" device

Construct a model class that subclasses nn.Module

class CircleModelV0(nn.Module): 
  def init(self): 
    super().init() 
    	# 2. Create 2 nn.Linear layers capable of handling X and y input and output shapes 
    	self.layer_1 = nn.Linear(in_features=2, out_features=5) 
        
        # takes in 2 features (X), produces 5 features 
        self.layer_2 = nn.Linear(in_features=5, out_features=1) # takes in 5 features, produces 1 feature (y)

	# 3. Define a forward method containing the forward pass computation
	def forward(self, x):
    	# Return the output of layer_2, a single feature, the same shape as y
    	return self.layer_2(self.layer_1(x)) # computation goes through layer_1 first then the output of layer_1 goes through layer_2

Create an instance of the model and send it to target device

model_0 = CircleModelV0().to(device)model_0

NB: una regola per settare il numero di feautres in input è fallo coincidere con le features del dataset. Idem per le features di output.

esiste inoltre un altro modo per rappresentare il modello in stile "Tensorflow", es:

# costruisco il modello
model_0 = nn.Sequential(
    nn.Linear(in_features=2, out_features=6),
    nn.Linear(in_features=6, out_features=2),
    nn.Linear(in_features=2, out_features=1)
).to(device)

model_0

Questo tipo di definizione del modello è "limitato" dal fatto che è sequenziale e quindi meno flessibile rispetto a reti più articolate.

Il modello può essere rappresentato graficamente come sotto riportato:

playground.tensorflow.org

ora, prima di fare il training del modello proviamo a passare i dati di test per vedere che output viene generato. (ovviamente essendo un modello non "allenato" saranno dati casuali)

# Make predictions with the model
with torch.inference_mode():
	untrained_preds = model_0(X_test.to(device))
	print(f"Length of predictions: {len(untrained_preds)}, Shape: {untrained_preds.shape}")
	print(f"Length of test samples: {len(y_test)}, Shape: {y_test.shape}")
	print(f"\nFirst 10 predictions:\n{untrained_preds[:10]}")
	print(f"\nFirst 10 test labels:\n{y_test[:10]}")

Length of predictions: 200, Shape: torch.Size([200, 1]) Length of test samples: 200, Shape: torch.Size([200])

First 10 predictions: tensor([[-0.7534], [-0.6841], [-0.7949], [-0.7423], [-0.5721], [-0.5315], [-0.5128], [-0.4765], [-0.8042], [-0.6770]], device='cuda:0', grad_fn=<SliceBackward0>)

First 10 test labels:tensor([1., 0., 1., 0., 1., 1., 0., 0., 1., 0.])

Possiamo notare che che l'output non è zero oppure uno come invce sono le labels... come mai? lo vedremo più avanti...

Prima di fare il training settiamo la "loss function" e "l'optimizer".

Setup loss function and optimizer

La domanda che ci si pone di sempre quale loss function e optimizer utilzzare?

Per la classfificazione in genere si utilizza la binary cross entropy, vedi tabella esempio sotto ripotata:

Loss function/Optimizer	Problem type	PyTorch Code
Stochastic Gradient Descent (SGD) optimizer	Classification, regression, many others.	torch.optim.SGD()(https://pytorch.org/docs/stable/generated/torch.optim.SGD.html)
Adam Optimizer	Classification, regression, many others.	torch.optim.Adam() `https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)
Binary cross entropy loss	Binary classification	torch.nn.BCELossWithLogits( https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) or [`torch.nn.BCELoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html)
Cross entropy loss	Mutli-class classification	[`torch.nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
Mean absolute error (MAE) or L1 Loss	Regression	[`torch.nn.L1Loss`](https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html)
Mean squared error (MSE) or L2 Loss	Regression	[`torch.nn.MSELoss`](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss)

Riassumento la loss function misura quanto il modello si distanzia dai valori attesi.

Mentre per gli optimizer servono per migliorare il modello che poi attraverso la loss funzion verrà valutato.

In genere si utilizza SGD o Adam..

Ok creiamo la loss e l'optimizer:

# Create a loss function
# loss_fn = nn.BCELoss() # BCELoss = no sigmoid built-in
loss_fn = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss = sigmoid built-in

Create an optimizer

optimizer = torch.optim.SGD(params=model_0.parameters(), lr=0.1)

Accuracy e Loss function

Definiamo anche il concetto di "accuracy".

La loss functuon misura quanto le preduzioni si allontanano dai valori desierati, mentre la Accuracy indica la percentuale con la quale il modello fa delle previsioni corrette. La differenza è sottile, e in questo momento non mi è chiara, credo che l'accuracy dipenda dalla loss e che indichi con una percentuale quello che la loss esprime in valori numerici specifici per il modello. Ad ogni modo vengono utilizzate entramb le misure per verificare la buona qualità del modello.

Implementiamo la accuracy

# Calculate accuracy (a classification metric)
def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item() # torch.eq() calculates where two tensors are equal
    acc = (correct / len(y_pred)) * 100 
    return acc

Logits

I logits rappresentano l'output "grezzo" del modello. I logits devono essere convertiti nella previsione probabilistica passandoli ad una "funzione di attivazione". (es. sigmoid per la "binari cross entropy", softmax per la multiclass classificazion) Per noi essere "discretizzati" (i valori probabilistici) mediante l'uso di funzioni come "round".

Vediamo quindi come rivedere la fase di training in funzione dei logits. NB per capire la rappresentazione dei logits vedi il commento nel training loop.


for epoch in range(epochs):
    ### Training

    # 0. imposto la modalità in Training (da fare ad ogni epoca)
    model_0.train()

    # 1. calcolo l'output con i parametri del modello, NB devo gare la "squeeze" percheè va ritdotta di una dimensione
    # quanto l'output del modello ne aggiunge una.
    # I logits sono i valori "grezzi" che, nella caso delle classificazioni BINARIE, NON possono essere comparati
    # con i valori discreti 0/1 delle t_test.
    # I logits quindive dobranno essere convertiti attraverso le funzioni come per la esempio la sigmoing, che
    # non fa altro che ricondurli a valori compresi tra zero e uno che, poi andranno "discretizzati" a 0/1 atttraverso
    # l'uso di funzioni di arrotondamento come per es. la round.
    y_logits = model_0(X_train).squeeze() #

    # pred. logits -> pred. probabilities -> labels 0/1
    y_pred = torch.round(torch.sigmoid(y_logits))

    # 2. calculate loss/accuracy
    # calcolo la loss, da nota che viene utilizzata come loss function la "BCEWithLogitsLoss" che vuole in input
    # dirattamente i logits anzichè i valori predetti, in quanto gli applica la sigmoid e la round in automatico
    # per poi paragonli con le y_train "discrete".
    loss = loss_fn(y_logits, y_train) # nn.BCEWithLogitsLoss()

    # calcololiamo anche la percentuale di accuratezza.
    acc = accuracy_fn(y_true=y_train, y_pred=y_pred)

    # 3. reinizializzo l'optimizer in quanto tende ad accumulare i valori
    optimizer.zero_grad()

    # 4. effettua la back propagation, nella pratica Pytorch tiene traccia dei valori associati alla discesa del gradiente
    #    Quindi calcola la derivata parziale per determinare il minimo della curva dei delta tra valori predetti e valori di test
    loss.backward()

    # 5. ottimizza i parametri (una sola volta) e in base al valore "lr".
    #  NB: cambia quindi i valori dei tensori per cercare di farli avvicinare ai valori ottimali
    optimizer.step()

    ### Testing (in questa fase vengono passati i valori non trainati di test)

    # indico a Pytrch che la fase di training è terminata e che ora devo valutare i parametri e paragonarli con i valori attesi
    model_0.eval()
    with torch.inference_mode(): # disabilito la fase di training

        test_logits = model_0(X_test).squeeze()  #

        # pred. logits -> pred. probabilities -> labels 0/1
        test_pred = torch.round(torch.sigmoid(test_logits))

        # per poi paragonli con le y_train "discrete".
        test_loss = loss_fn(test_logits, y_test)  # nn.BCEWithLogitsLoss()

        # calcololiamo anche la percentuale di accuratezza.
        test_acc = accuracy_fn(y_true=y_test, y_pred=test_pred)

        # Print out what's happening
        if epoch % 10 == 0:
            print(f"Epoch: {epoch} | Train -> Loss: {loss:.5f} , Acc: {acc:.2f}% | Test -> Loss: {test_loss:.5f}%. Acc: {test_acc:.2f}% ")

L'output della funzione sarà:

Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 8.7.0
Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)] on win32
runfile('C:\\lavori\\formazione_py\\src\\formazione\\DanielBourkePytorch\\02_classification.py', wdir='C:\\lavori\\formazione_py\\src\\formazione\\DanielBourkePytorch')
Epoch: 0 | Train -> Loss: 0.70155 , Acc: 50.00% | Test -> Loss: 0.70146%. Acc: 50.00% 
Epoch: 10 | Train -> Loss: 0.69617 , Acc: 57.50% | Test -> Loss: 0.69654%. Acc: 55.50% 
Epoch: 20 | Train -> Loss: 0.69453 , Acc: 51.75% | Test -> Loss: 0.69501%. Acc: 54.50% 
Epoch: 30 | Train -> Loss: 0.69395 , Acc: 50.38% | Test -> Loss: 0.69448%. Acc: 53.50% 
Epoch: 40 | Train -> Loss: 0.69370 , Acc: 49.50% | Test -> Loss: 0.69427%. Acc: 53.50% 
Epoch: 50 | Train -> Loss: 0.69358 , Acc: 49.50% | Test -> Loss: 0.69417%. Acc: 53.00% 
Epoch: 60 | Train -> Loss: 0.69349 , Acc: 49.88% | Test -> Loss: 0.69412%. Acc: 52.00% 
Epoch: 70 | Train -> Loss: 0.69343 , Acc: 49.62% | Test -> Loss: 0.69409%. Acc: 51.50% 
Epoch: 80 | Train -> Loss: 0.69337 , Acc: 49.25% | Test -> Loss: 0.69408%. Acc: 51.50% 
Epoch: 90 | Train -> Loss: 0.69333 , Acc: 49.62% | Test -> Loss: 0.69407%. Acc: 51.50% 
Backend MacOSX is interactive backend. Turning interactive mode on.

che è pessimo in quanto il modello utilizza un "linear model" che sostanzialmente rappresenta una linea che negli assi cartesiani ha un'intercetta e una direzione e quindi non riscurà mai a rappresentare i dati.

Bisoga quindi cambiare modello.

In particolare bisogna introdurre una funziona non lineare come per es. la ReLU che nella prarica ritorna zero se i valori sono <=0 oppure il valore stesso se >0.

Di seguito il grafico della funzione non lineare ReLU.

Modifichiamo quindi il modello aggiungendo dopo l'hidden layer la funzione di attivazione non lineare come nell'esempio di seguito:

# costruisco il modello
model_0 = nn.Sequential( 
                        
                      nn.Linear(in_features=2, out_features=10),                        
                      nn.ReLU(),
                      nn.Linear(in_features=10, out_features=10),
                      nn.ReLU(),
                      nn.Linear(in_features=10, out_features=1))

che produce risultati decisamente migliori:

(vedi sorgente completo in attachement a questa pagina 02_classification.py)

Topic	Contents
0. Architecture of a classification neural network	Neural networks can come in almost any shape or size, but they typically follow a similar floor plan.
1. Getting binary classification data ready	Data can be almost anything but to get started we're going to create a simple binary classification dataset.
2. Building a PyTorch classification model	Here we'll create a model to learn patterns in the data, we'll also choose a loss function, optimizer and build a training loop specific to classification.
3. Fitting the model to data (training)	We've got data and a model, now let's let the model (try to) find patterns in the (training) data.
4. Making predictions and evaluating a model (inference)	Our model's found patterns in the data, let's compare its findings to the actual (testing) data.
5. Improving a model (from a model perspective)	We've trained an evaluated a model but it's not working, let's try a few things to improve it.
6. Non-linearity	So far our model has only had the ability to model straight lines, what about non-linear (non-straight) lines?
7. Replicating non-linear functions	We used non-linear functions to help model non-linear data, but what do these look like?
8. Putting it all together with multi-class classification	Let's put everything we've done so far for binary classification together with a multi-class classification problem.