Unlike in Online RL where agents need to interact with real environment, Offline RL works similar to a typical machine learning workflow. Given a dataset, Offline RL processes data extracting state, action, reward, and terminal columns to optimize the policy Q. By wrapping up offline RL into the AutoMLPipeline workflow, it becomes trivial to search for the optimal preprocessing elements and their combinations to improve Offline RL optimal policy using symbolic workflow manipulation.
As part of AutoMLPipeline workflow, it becomes trivial to search which preprocessing elements and their combinations provide the best policy Q by cross-validation where the dataset is split into training and testing several times to get the average accumulated discounted rewards (return) of a given policy Q. This talk will demonstrate how to setup the Offline RL pipeline to preprocess the dataset and learn the optimal policy Q and incorporate some parallel search strategy to get the optimal workflow.