Training monitoring with Visdom and Hyperdash
For the past few months, I have enjoyed using Visdom and Hyperdash to monitor training processes.
Visdom
Visdom is a Python library developed by Facebook Research. Similar to TensorBoard, it provides a UI server (using React) through which users can create scatter plots and histograms and visualize images and text.
I have written a small snippet for creating and updating a scatter plot, which can be used for visualizing training and validation loss and accuracy.
# pip install visdom numpy
# visualize.py
import numpy
from visdom import Visdom
class Plot(object):
    def __init__(self, title, port=8080):
        self.viz = Visdom(port=port)
        self.windows = {}
        self.title = title

    def register_scatterplot(self, name, xlabel, ylabel):
        win = self.viz.scatter(
            X=numpy.zeros((1, 2)),
            opts=dict(title=self.title, markersize=5, xlabel=xlabel, ylabel=ylabel)
        )
        self.windows[name] = win

    def update_scatterplot(self, name, x, y):
        # updateTrace comes from older Visdom releases; newer ones append
        # points with update='append' passed to scatter() or line() instead.
        self.viz.updateTrace(
            X=numpy.array([x]),
            Y=numpy.array([y]),
            win=self.windows[name]
        )
In my training script, I can initialize and update the plots.
# train.py
import visualize
plot = visualize.Plot("Model A")
plot.register_scatterplot("Loss", "Epoch", "Loss")
for n in range(epoch):
    # ... compute average loss over training data
    plot.update_scatterplot("Loss", n + 1, loss)
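The step that computes the average loss is left open above. One way to fill it in is a small running-average helper. This is only a sketch: AverageMeter and the batch losses below are hypothetical names and made-up values, not part of the original script.

```python
class AverageMeter(object):
    """Keeps a running average of per-batch losses (hypothetical helper)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, batch_loss, batch_size=1):
        # Weight each batch loss by its size so uneven batches average correctly.
        self.total += batch_loss * batch_size
        self.count += batch_size

    def value(self):
        # Avoid dividing by zero when nothing has been added yet.
        return self.total / self.count if self.count else 0.0


# Made-up (loss, batch size) pairs standing in for one epoch of training.
meter = AverageMeter()
for batch_loss, batch_size in [(0.9, 32), (0.7, 32), (0.5, 16)]:
    meter.add(batch_loss, batch_size)
loss = meter.value()
```

The resulting loss is the value that would be passed to plot.update_scatterplot at the end of each epoch.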
Before running the training script, we need to start the web server, so that once training starts, we can go to localhost:8080 in the browser to see the plots.
python -m visdom.server -p 8080
The above snippets can also be found here. The PyTorchNet library provides more loggers for plots, images, and text; it can be very handy if you’re a PyTorch user.
Hyperdash
Visdom is great for visualizing plots and images; however, a connection to the remote server isn’t always available.
Hyperdash can stream training logs from a process on a remote server directly to my phone.
I can simply open the mobile app to see whether the training process has died or the loss is no longer decreasing.
The setup is quite simple; all the necessary steps are documented on the homepage.
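As a rough sketch of what reporting to Hyperdash can look like, the function below pushes per-epoch losses through an experiment object. It assumes the Experiment class from the hyperdash Python package, with its param, metric, and end methods. The name report_losses and the loss values are hypothetical, and the experiment is passed in as an argument so the function itself does not require the package.

```python
def report_losses(exp, epoch_losses):
    """Send one loss value per epoch to a Hyperdash-style experiment."""
    # Record a hyperparameter so it appears alongside the metrics.
    exp.param("epochs", len(epoch_losses))
    for loss in epoch_losses:
        # Each call reports one data point for the "loss" metric.
        exp.metric("loss", loss)
    # Mark the experiment as finished.
    exp.end()


# Assumed usage, with the hyperdash package installed and an account set up:
#   from hyperdash import Experiment
#   report_losses(Experiment("Model A"), [0.9, 0.7, 0.5])
```

In a real training script the metric call would sit inside the epoch loop, next to the Visdom update.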
One caveat is that it does not work well with tqdm (see issue 63); the simple workaround is to disable tqdm in the training script.
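One way to make that easy is to route every progress bar through a single switch. The sketch below is an assumption about how a training script might do this: progress and the NO_TQDM environment variable are hypothetical names, and tqdm is only imported when the bar is actually enabled.

```python
import os


def progress(iterable, enabled=None, **tqdm_kwargs):
    """Wrap iterable in a tqdm bar unless progress bars are switched off."""
    if enabled is None:
        # Hypothetical switch: set NO_TQDM=1 when running under Hyperdash.
        enabled = os.environ.get("NO_TQDM") != "1"
    if not enabled:
        # Return the iterable untouched so no bar output reaches the logs.
        return iterable
    from tqdm import tqdm  # imported lazily; only needed when enabled
    return tqdm(iterable, **tqdm_kwargs)


# With the bar disabled, the loop behaves like a plain for loop.
seen = [n for n in progress(range(3), enabled=False)]
```

tqdm also accepts disable=True directly, which hides the bar for a single call.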