When building an RNN model in TensorFlow, we always come across the task of reshaping the data to match the model's input format. This step is essential but a little confusing. To simplify it, I developed an RNN sample generator function that performs the reshaping.
Specifically, an RNN model requires its input as a 3-dimensional array of shape (samples, time steps, features). Each sample consists of rows ordered by time. Notice that in an RNN the concept of a sample differs from the usual one, where a sample is just a single row of the dataset. A sample in an RNN is a group of sequentially ordered rows, and each row in that group is one time step of the sample. The second dimension therefore indexes the time steps, that is, the rows within a sample. The third dimension is much easier to understand, since it is simply the feature dimension of each row in a sample. Let’s take an example!
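As a quick warm-up before the real data, the three axes can be seen on a toy array. This is only a sketch: the sizes (2 samples, 3 time steps, 4 features) and the values are made up for illustration and have nothing to do with the transaction dataset.

```python
import numpy as np

# Hypothetical toy input: 2 samples, 3 time steps per sample, 4 features per step.
X = np.arange(2 * 3 * 4).reshape(2, 3, 4)

first_sample = X[0]        # every time step of sample 0, shape (3, 4)
second_step = X[0, 1]      # time step 1 of sample 0, shape (4,)
one_feature = X[0, 1, 2]   # feature 2 at that time step, a single number

print(X.shape)             # (samples, time steps, features)
print(first_sample.shape)
print(second_step)
```

Indexing the first axis picks a sample, the second picks a row (time step) inside it, and the third picks one feature of that row.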
The table in the following gif is a screenshot of the credit card transaction dataset. It consists of many credit card transaction records with multiple features such as customer ID, merchant name, and amount. All transactions are ordered by time. The gif shows how the 3D input data for the RNN is generated. Notice that the size of the red window determines the number of time steps. The red window cuts out one sample as a group of rows, and another sample is cut each time the window moves one step further. This moving red window is usually called a rolling window.
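The rolling window described above can be sketched in a few lines of NumPy. The helper name `rolling_windows` and the 6-row, 2-feature table are hypothetical and chosen only to keep the example small; this is a simplified illustration of the idea, not the full generator function.

```python
import numpy as np

def rolling_windows(rows, size):
    """Stack every run of `size` consecutive rows into one 3D array."""
    n_samples = len(rows) - size + 1   # the window can start at this many rows
    return np.stack([rows[i:i + size] for i in range(n_samples)])

# Hypothetical table: 6 time-ordered rows with 2 features each.
rows = np.arange(12).reshape(6, 2)
windows = rolling_windows(rows, size=3)

print(windows.shape)   # (samples, time steps, features)
print(windows[0])      # rows 0..2
print(windows[1])      # rows 1..3, the window moved one step
```

Consecutive samples overlap in all but one row, which is exactly what the moving red window produces.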
In terms of the input data for machine learning models, we will probably come across the following scenarios:
1. The dataset contains both the independent variables X and the target variable y. In this case, my function returns a reshaped X (without the target column) and a reshaped y. With these two outputs, you can train the RNN directly without any other processing.
2. The dataset contains only X, but y can be provided from outside the dataset. In this case, my function also returns a reshaped X and a reshaped y, and you can train the RNN directly without any other processing.
3. The dataset contains only X and no target variable y is provided. In this case, my function returns only a reshaped X instead of two outputs.
My function is shown below. The first argument, df, is the input dataset and must be a pandas DataFrame. The second argument, size, is the size of the rolling window. The third argument, has_y, indicates whether the input dataset contains the target variable y. If it does, the fourth argument, y_colname, is the name of that column; otherwise it keeps its default value. If y is provided externally, pass it through the fifth argument, y; otherwise that argument also keeps its default value. The last argument, output_type, sets the output data type. Its default value is 'array', which returns NumPy arrays. Any other value returns the samples as a list of pandas DataFrames.
```python
import numpy as np
import pandas as pd
from tqdm.auto import tqdm


def rolling_window_sample_generator(df, size=10, has_y=True, y_colname=None,
                                    y=None, output_type='array'):
    # The input dataset must be a DataFrame.
    if not isinstance(df, pd.DataFrame):
        raise TypeError('The input data must be pd.DataFrame!')
    # The window size must not exceed the length of the dataset.
    if len(df) < size:
        raise ValueError('The length of df is smaller than window size!')
    # The target column name is required when y lives inside df.
    if has_y and y_colname is None:
        raise ValueError('y_colname must be given when has_y is True!')
    # An external y must line up with df row by row.
    if not has_y and y is not None and len(y) != len(df):
        raise ValueError('The length of df does not equal the length of y!')

    sample_list = []
    y_list = []
    # Number of outputs: 2 means (X, y), 1 means X only.
    out_put_dim = 1 if (not has_y and y is None) else 2

    # Slide the window one row at a time; the last window ends on the last row.
    for win_head in tqdm(range(len(df) - size + 1)):
        win_tail = win_head + size
        sample = df.iloc[win_head:win_tail, :]
        if has_y:
            # y is a column of df: split it off from the features.
            y_list.append(sample[y_colname])
            sample = sample.drop([y_colname], axis=1)
        elif y is not None:
            # y is provided externally: cut the matching window.
            y_list.append(y[win_head:win_tail])
        sample_list.append(sample)

    if output_type == 'array':
        sample_list = np.array([np.array(s) for s in sample_list])
        if out_put_dim == 2:
            y_list = np.array(y_list).reshape((-1, size, 1))

    if out_put_dim == 2:
        return sample_list, y_list
    return sample_list
```
The following code shows how to use this function. Make sure that the pickle file and your Python notebook are in the same folder. For further usage of this function, please check the notebook of this blog via this link.
```python
import pickle

# load the transaction data
with open('data.pickle', 'rb') as load:
    data = pickle.load(load)

# select the transactions of one customer; note that transactions are already ordered by time
df = data[data['customerId'] == 318001076]

# use this function
new_X = rolling_window_sample_generator(df, size=5, has_y=False, output_type='array')
new_X.shape
# output: (10025, 5, 25)
```
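The output shape can be checked by hand: a window of a given size fits into a table at (number of rows - size + 1) positions. The small helper below, `expected_shape`, is a hypothetical name used only for this check. The row count of 10029 is inferred from the reported output (10025 + 5 - 1) and is an assumption, not a figure taken from the dataset.

```python
def expected_shape(n_rows, size, n_features):
    """Shape of the reshaped X for a table with n_rows rows and n_features columns."""
    if n_rows < size:
        raise ValueError('window size exceeds the number of rows')
    return (n_rows - size + 1, size, n_features)

# 10029 rows is an inferred value (10025 + 5 - 1), assumed for illustration.
print(expected_shape(10029, 5, 25))
```

The last entry, 25, is simply the number of feature columns kept in each row.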
To help explain the RNN model, the chart below shows its structure. In this case, the number of time steps is five, so the five rows (transactions) of each sample are fed into the RNN in order. The hidden units are the solid circles in the box, and the input features are fully connected to the hidden unit layer. The output of the preceding time step is also an input to the next time step. That is why an RNN is capable of detecting sequential patterns in time series data: it carries information forward from the previous steps.
The following chart shows the data flow of a single sample through the RNN. Notice that there is only one RNN unit, not five. Many beginners think that each time step has a new RNN unit, because the chart draws five of them. However, it is the same unit every time. The input changes at each time step, so the chart has to draw the unit once per time step in order to distinguish the different inputs.
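The point that a single RNN unit is reused at every time step can be made concrete with a minimal NumPy sketch of a plain (vanilla) RNN cell. This is an illustration under assumptions, not TensorFlow's implementation: the function name `simple_rnn_forward`, the hidden size of 4, and the random weights are all hypothetical.

```python
import numpy as np

def simple_rnn_forward(sample, W_x, W_h, b):
    """Run one sample of shape (time_steps, features) through a single RNN unit."""
    h = np.zeros(W_h.shape[0])          # hidden state before the first time step
    states = []
    for x_t in sample:                  # one row (transaction) per time step
        # The same W_x, W_h and b are reused at every step; only x_t and h change.
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
time_steps, n_features, n_hidden = 5, 25, 4   # n_hidden is a hypothetical choice
sample = rng.normal(size=(time_steps, n_features))
W_x = rng.normal(scale=0.1, size=(n_features, n_hidden))
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

states = simple_rnn_forward(sample, W_x, W_h, b)
print(states.shape)   # one hidden state per time step
```

There is one set of weights and one loop: the five boxes in the chart correspond to the five iterations of that loop, each receiving a new row and the hidden state left by the previous iteration.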