Find Out Whether a Room Is Occupied with Machine Learning: Multivariate Time Series Classification with Gluon
AI/ML has been a trending topic in the recent past, and I believe it isn’t going to stop any time soon. I decided to dip my toes into this space, and the best way to do that was to build my first ML model. I have some prior coding background in Python and have explored the basics of ML.
My objective is to shed some light on getting started with ML for those in the same boat. I’ll follow up in my next blog with a more in-depth perspective on the basics of ML and the AI landscape.
This blog addresses a multivariate time series classification problem: finding out whether a room is occupied given readings like temperature, humidity, etc.
Occupancy Detection Dataset
Source — https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
This dataset contains environmental measurements of a room; we’ll use them to predict whether the room is occupied.
There are 20,560 one-minute observations taken over a period of a few weeks. This is a classification problem. There are 7 attributes, including various light and climate properties of the room.
The source for the data is credited to Luis Candanedo from UMONS.
Below is a sample of the first few rows of data, including the header row.
"date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
"1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.00479298817650529,1
"2","2015-02-04 17:51:59",23.15,27.2675,429.5,714,0.00478344094931065,1
"3","2015-02-04 17:53:00",23.15,27.245,426,713.5,0.00477946352442199,1
"4","2015-02-04 17:54:00",23.15,27.2,426,708.25,0.00477150882608175,1
"5","2015-02-04 17:55:00",23.1,27.2,426,704.5,0.00475699293331518,1
"6","2015-02-04 17:55:59",23.1,27.2,419,701,0.00475699293331518,1
Three files are provided: train, validation and test. We’ll use the train and validation sets in our model training.
Environment:
· Jupyter: web application for building Python notebooks.
· Matplotlib: plotting library.
· NumPy: scientific computing library, especially useful for array and matrix manipulation.
· pandas: data analysis library.
· Apache MXNet + Gluon: deep learning library.
Full code is available at:
Let’s start.
Step 1 — Quick analysis of the dataset
Let’s take a look at the dataset and identify the features we think are important for classifying the target. Each row in the dataset is called an example and each column a feature. Our dataset has 7 columns. We are going to use 6 of them (features — X) to classify the 7th (target — y). In other words, we are going to use “date”, “Temperature”, “Humidity”, “Light”, “CO2” and “HumidityRatio” to classify “Occupancy”. Notice that the values of Occupancy are 1 and 0 (room is occupied — 1, room is not occupied — 0). This is a relatively easy classification problem with only two classes.
Step 2 — Prepare the data
For training, the model needs numerical data. But the first feature in the dataset is a datetime string, so we should decide whether we need the date for occupancy classification or whether it can be ignored.
Date and time of day are important factors in deciding whether a room is occupied. If the dataset belongs to an area with a young crowd, chances are the room will not be occupied during the daytime when people are out at work. If it belongs to an area with families with kids, the rooms are more likely to be occupied on weekday evenings and nights. So let’s not ignore the date, but transform it in a way that can be used to train the model. This is also called feature engineering. We are going to break the date into buckets and turn them into new features. Weekday/weekend and time of day can be derived from the date, which should help classify occupancy with better accuracy. Let’s divide the 24-hour day into 4 buckets of 6 hours each [day 6am–12pm, afternoon 12pm–6pm, evening 6pm–12am, night 12am–6am].
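As a quick sanity check, here is how the first timestamp from the sample above maps to these derived features (a minimal sketch; the full transformation code comes later):

from datetime import datetime

# "2015-02-04 17:51:00" is a Wednesday at 5:51pm
parsed = datetime.strptime("2015-02-04 17:51:00", "%Y-%m-%d %H:%M:%S")
print(parsed.hour // 6)          # 2 -> afternoon bucket (12pm-6pm)
print(parsed.isoweekday() <= 5)  # True -> weekday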
Now that we have two features derived from the date (weekday/weekend and time bucket), we have to add them to the dataset in a way that helps train the model. The values of these features must be numerical.
We could use 4 values (1, 2, 3, 4) for the 4 time buckets, but the model might then learn that bucket 4 is somehow better than bucket 1 (just because 4 > 1). To avoid imposing such a spurious ordering, we go for one hot encoding.
A one hot encoding is a representation of categorical variables as binary vectors.
Here is a good resource that explains one hot encoding further — https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
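For intuition, here is a minimal sketch of what one hot encoding the four time buckets produces (using pandas; the bucket_* column names are just illustrative):

# Illustrative only: one hot encode the four time buckets with pandas
import pandas as pd

buckets = pd.Series([0, 1, 2, 3], name="time_bucket")  # night, day, afternoon, evening
print(pd.get_dummies(buckets, prefix="bucket", dtype=int))
#    bucket_0  bucket_1  bucket_2  bucket_3
# 0         1         0         0         0
# 1         0         1         0         0
# 2         0         0         1         0
# 3         0         0         0         1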
After one hot encoding, this is how the values for some sample date-times will look —

We derive 6 new features from the single date field: 2 for weekday/weekend plus 4 for the time bucket. Since this is categorical information, we one hot encode it, so that the value is 1 for the features that satisfy the condition (weekday/weekend and day/afternoon/evening/night).
After one hot encoding we have to append all this data to the main feature set. Remember the dataset consists of 3 files, one each for the training, validation and test sets, so don’t forget to transform the data for each of them.
We should also remove rows/examples with invalid data like NaN, null, etc. Fortunately, there are no examples with invalid data in this dataset.
Let’s look at the code for importing and preparing the data. We will use the pandas library to load it.
# Use pandas to load the data
import pandas as pd

training_file_path = "occupancy_data/datatraining.txt"
test_file_path = "occupancy_data/datatest.txt"
test2_file_path = "occupancy_data/datatest2.txt"

# TRAINING DATA
print("***** TRAIN DATA *****")
col_names = ["date", "Temperature", "Humidity", "Light", "CO2", "HumidityRatio", "Occupancy"]
data_features = ["date", "Temperature", "Humidity", "Light", "CO2", "HumidityRatio"]
train_data = pd.read_csv(training_file_path, skiprows=[0], names=col_names)
print(train_data.head())

# VALIDATION DATA
val_data = pd.read_csv(test_file_path, sep=",", skiprows=[0], header=None, names=col_names)

# TEST DATA
test_data = pd.read_csv(test2_file_path, sep=",", skiprows=[0], header=None, names=col_names)
Note that I am loading the data from the 3 files into 3 different variables, one each for training, validation and testing. This is how the training dataset looks.

This one-line snippet drops examples with NaN values. Note that dropna returns a new DataFrame, so we assign the result back:
train_data = train_data.dropna(axis=0, how='any')
Now, to train the model, we have to split the dataset into features (X) and target (y). The following code does that:
# TRAIN
print("***** Training DATA *****")
X_train_raw = train_data[data_features]
print(X_train_raw.shape)
y_train = train_data.Occupancy
print(y_train.shape)

# VAL
print("***** Validation DATA *****")
X_val_raw = val_data[data_features]
print(X_val_raw.shape)
y_val = val_data.Occupancy
print(y_val.shape)

# TEST
print("***** TEST DATA *****")
X_test_raw = test_data[data_features]
print(X_test_raw.shape)
y_test = test_data.Occupancy
print(y_test.shape)
And this is the shape of the data —
***** Training DATA *****
(8143, 6)
(8143,)
***** Validation DATA *****
(2665, 6)
(2665,)
***** TEST DATA *****
(9752, 6)
(9752,)
The raw training data has 8143 examples and 6 features, and the raw target has 8143 examples and one column. Similarly, you can see the values for the validation and test data. I have used the word ‘raw’ here because we are yet to do feature engineering and some cleansing before we use this data to train the model.
Now, let’s look at the code for deriving the new features and appending them to the raw data.
from datetime import datetime

def create_new_features(data):
    time_buckets = []
    weekday = []
    for index, row in data.iterrows():
        parsed_date = datetime.strptime(row['date'], "%Y-%m-%d %H:%M:%S")
        # map the hour into one of the 4 six-hour buckets: 0 night, 1 day, 2 afternoon, 3 evening
        time_bucket = parsed_date.hour // 6
        time_buckets.append(time_bucket)
        # 1 for a weekday (Mon-Fri), 0 for the weekend
        wday = 0 if parsed_date.isoweekday() > 5 else 1
        weekday.append(wday)
    return time_buckets, weekday
train_time_buckets, train_weekday = create_new_features(train_data)
val_time_buckets, val_weekday = create_new_features(val_data)
test_time_buckets, test_weekday = create_new_features(test_data)
For every date value we compute the time bucket and the weekday/weekend flag and store them in the time_buckets and weekday lists. We repeat this step for all 3 datasets.
Now let’s do the one hot encoding. We’ll need the scikit-learn library for this.
import numpy as np
import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

def onehot_encode(values):
    new_values = np.array(values)
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(new_values)
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    return (onehot_encoded, label_encoder.classes_)

## TRAINING DATA
train_time_bucket_onehot_encoded, train_time_class = onehot_encode(train_time_buckets)
train_weekday_onehot_encoded, train_weekday_class = onehot_encode(train_weekday)
print(train_time_bucket_onehot_encoded, train_time_class)
print(train_weekday_onehot_encoded, train_weekday_class)

## VALIDATION DATA
val_time_bucket_onehot_encoded, val_time_class = onehot_encode(val_time_buckets)
val_weekday_onehot_encoded, val_weekday_class = onehot_encode(val_weekday)
print(val_time_bucket_onehot_encoded, val_time_class)
print(val_weekday_onehot_encoded, val_weekday_class)

## TEST DATA
test_time_bucket_onehot_encoded, test_time_class = onehot_encode(test_time_buckets)
test_weekday_onehot_encoded, test_weekday_class = onehot_encode(test_weekday)
print(test_time_bucket_onehot_encoded, test_time_class)
print(test_weekday_onehot_encoded, test_weekday_class)
Values after one hot encoding —
(array([[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
...,
[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 1., 0., 0.]]), array([0, 1, 2, 3])) .....
Now we have to append these new features to the raw data —
# Append new features to Train, Validation and Test setsdef add_new_features(arr, w_oh, t_oh):
# lets ignore the date column for all the rows
X = arr.values[:, 1:]
XW = np.hstack((X, w_oh))
XTW = np.hstack((XW, t_oh))
return XTWX_train = add_new_features(X_train_raw, train_weekday_onehot_encoded, train_time_bucket_onehot_encoded)
X_train.shape #(8143, 11)X_val = add_new_features(X_val_raw, val_weekday_onehot_encoded, val_time_bucket_onehot_encoded)
X_val.shape #(2665, 11)X_test = add_new_features(X_test_raw, test_weekday_onehot_encoded, test_time_bucket_onehot_encoded)
X_test.shape #(9752,11)
Training, validation and test data now have 11 features each: the 5 original numerical features plus the 6 one hot encoded ones.
Step 3 — Train the Model
We’ll use an RNN (Recurrent Neural Network) to build the model.
An RNN is a type of neural network known to be well suited to sequence data. This type of network has a loop, i.e. the hidden layers have a “back connection”, which allows the network to take into account not just the input at time T but also what it computed at earlier time steps.
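Concretely, at each time step t a vanilla RNN combines the current input x_t with its previous hidden state, something like h_t = tanh(W_x·x_t + W_h·h_(t-1) + b); it is this reuse of h_(t-1) that forms the loop described above.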
For this dataset we are going to use an LSTM (Long Short-Term Memory), a type of RNN that has proven quite effective thanks to an architecture that allows it to retain more relevant information and forget less relevant information.
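For reference, here is a minimal sketch of what such an LSTM classifier could look like in Gluon. This is my own illustration with a made-up class name, not the actual internals of the classifier we borrow below:

import mxnet as mx
from mxnet import gluon

class SimpleLSTMClassifier(gluon.Block):
    # Illustrative sketch only; the borrowed BaseRNNClassifier may differ internally
    def __init__(self, rnn_size=8, n_layer=1, n_out=2, **kwargs):
        super(SimpleLSTMClassifier, self).__init__(**kwargs)
        with self.name_scope():
            self.lstm = gluon.rnn.LSTM(rnn_size, num_layers=n_layer, layout='NTC')
            self.out = gluon.nn.Dense(n_out)

    def forward(self, x):
        # x shape: (batch, seq_len, n_features); classify from the last time step's state
        h = self.lstm(x)
        return self.out(h[:, -1, :])

net = SimpleLSTMClassifier()
net.initialize(mx.init.Xavier(), ctx=mx.cpu(0))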
We’ll borrow code from this project which makes it easy to build a simple RNN Classifier with just 4 lines of code :)
import mxnet as mx
# BaseRNNClassifier comes from the project borrowed above

N_CLASS = 2
ctx = mx.cpu(0)  # change context to execute on CPU
model = BaseRNNClassifier(ctx)
model.build_model(n_out=N_CLASS, rnn_size=8, n_layer=1)
model.compile_model()
train_loss, train_acc, test_acc = model.fit([X_train, y_train], [X_test, y_test], batch_size=32, epochs=10)
If you look at the line
model.build_model(n_out=N_CLASS, rnn_size=8, n_layer=1)
· n_out = number of outputs; in the occupancy detection dataset the number of output classes is two.
· n_layer = number of hidden layers; we’ll use just 1 given this is a simple problem.
· rnn_size = number of nodes in the hidden layer.
Step 4 — Visualize results
Let’s check how good the model is by plotting the accuracy values.
import matplotlib.pyplot as plt

plt.title("Accuracy with every epoch")
plt.plot(train_acc, label="train")
plt.plot(test_acc, label="test")
plt.legend()
plt.show()

In the above graph we can observe the following —
- Both the training and test curves trend in the same direction, which is a good thing :)
- The test accuracy is highest at epoch 6 and drops afterwards; at this point we have the best model. The test accuracy can drop after a few epochs even while the train accuracy keeps increasing, because the model gets too familiar with the data it has seen so far; this is called overfitting. By picking the model with the highest test accuracy we end up picking a generalized model (see the sketch below for doing this programmatically).
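If we wanted to pick that best epoch programmatically rather than by eye, a small sketch like this would do it (train_acc and test_acc are the per-epoch lists returned by model.fit above):

import numpy as np

best_epoch = int(np.argmax(test_acc))  # epoch with the highest test accuracy
print(best_epoch, test_acc[best_epoch])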
Let’s plot similar graphs for different values of rnn_size.
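Re-running the experiment only requires rebuilding the model with a different rnn_size (same API as before):

# Rebuild and retrain with a smaller hidden layer
model = BaseRNNClassifier(ctx)
model.build_model(n_out=N_CLASS, rnn_size=4, n_layer=1)
model.compile_model()
train_loss, train_acc, test_acc = model.fit([X_train, y_train], [X_test, y_test], batch_size=32, epochs=10)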
rnn_size = 4

In the above graph we see the accuracy sits at 79.8% for every epoch, which suggests the model did not learn anything.
Let’s check the distribution of occupancy in the test dataset.
# find the number of positive occupancy examples
y_test.shape, sum(y_test), 1 - sum(y_test) / (y_test.shape[0] * 1.0)
# ((9752,), 2049, 0.78988925348646433)
We notice that 79% of the examples are not occupied, so a model that simply spits out zeros would still hit 79% accuracy. It so happens that our model with rnn_size=4 does exactly that, which means it probably didn’t learn anything. Remember, a broken clock is right twice a day ;)
The loss function confirms this for us: there’s hardly any significant dip in the loss over the epochs.

Compare this to the loss curve for rnn_size=8 to observe the dip.

Confusion Matrix
Let’s generate the confusion matrix to find the count of values that were predicted correctly and the count that were predicted wrongly.
# Get predictions
b_size = 24
tX, ty = np.asarray(X_test).astype('float32'), np.asarray(y_test).astype('float32')
test_iter = mx.gluon.data.DataLoader(mx.gluon.data.ArrayDataset(tX, ty),
                                     batch_size=b_size, shuffle=False, last_batch='discard')
pred_out = model.predict(test_iter, iter_type="dataloader", batch_size=b_size)

# Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test[:len(pred_out)], pred_out))

The output (rows are actual values, columns are predictions):

        0     1
0  [[7558   145]
1   [ 131  1910]]
The values along the main diagonal were predicted correctly (a total of 7558 + 1910 = 9468), and we have 145 false positives and 131 false negatives in our classifications.
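From these counts we can also derive precision and recall for the occupied class (a quick sketch using the numbers above):

tp, fp, fn = 1910, 145, 131
precision = tp / float(tp + fp)  # ~0.93: how often an "occupied" prediction is correct
recall = tp / float(tp + fn)     # ~0.94: how many occupied minutes we actually catch
print(precision, recall)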
Conclusion
In this blog we have seen how to use an RNN (LSTM) to model a multivariate time series classification problem with high accuracy. For all of the code, please check the GitHub repo.