For emerging mobility-on-demand services, predicting passenger demand from historical mobility trips is important for improving vehicle distribution. Prior works have focused on predicting next-step passenger demand at selected locations or hotspots. However, we argue that multi-step citywide passenger demands encapsulate both time-varying demand trends and the global status of the city, and hence are more useful for avoiding demand-service mismatches and for developing effective vehicle distribution and scheduling strategies. Furthermore, we find that adaptations of single-step methods cannot achieve robust, accurate prediction at later steps. In this project, we propose an end-to-end deep neural network model for the multi-step citywide prediction task. We employ an encoder-decoder framework based on convolutional and ConvLSTM units to identify complex features that capture spatiotemporal influences and pickup-dropoff interactions in citywide passenger demands. We introduce a multi-level attention model (global attention and temporal attention) to emphasize the effects of latent citywide mobility regularities and to capture relevant temporal dependencies. We evaluate the proposed method on real-world mobility trips (taxis and bikes), and the experimental results show that it achieves higher prediction accuracy than state-of-the-art approaches.
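To make the idea of temporal attention concrete, the following is a minimal sketch and not the model's actual formulation. It assumes dot-product scoring followed by a softmax, and it treats each encoder state as a flat feature vector rather than a ConvLSTM feature map. The function name `temporal_attention` and the toy inputs are hypothetical and chosen only for illustration.

```python
import math

def temporal_attention(query, encoder_states):
    """Dot-product attention over time steps (an assumed scoring function).

    query: decoder state as a list of floats.
    encoder_states: one list of floats per historical time step.
    Returns (weights, context).
    """
    # Score each historical time step against the current decoder state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    # Numerically stable softmax: weights are positive and sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy example: three historical time steps with two features each (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
weights, context = temporal_attention(query, encoder_states)
```

In this toy input the third time step aligns best with the query, so it receives the largest weight. A decoder would recompute such weights at every prediction step, which lets each future step attend to different parts of the history.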