Top-down saliency detection aims to highlight the regions of a specific object category and typically relies on pixel-wise annotated training data. In this work, we address the high cost of collecting such annotations with a weakly supervised approach to object saliency detection, where only image-level labels, indicating the presence or absence of a target object in an image, are available. The proposed framework is composed of two collaborative CNN modules: an image-level classifier and a pixel-level map generator. While the former distinguishes images containing objects of interest from the rest, the latter learns to generate saliency maps such that images masked by those maps are better predicted by the former. In addition to the top-down guidance from class labels, the map generator also exploits other cues, including the background prior, superpixel-based evidence, and object proposal-based evidence. The background prior reduces false positives, evidence from superpixels helps preserve sharp object boundaries, and cues from object proposals improve the integrity of the highlighted objects. Together, these cues strongly regularize the training process and reduce the risk of overfitting, which arises frequently when training CNN models on little data.
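The core interaction between the two modules can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: `generate_map` stands in for the pixel-level CNN generator (here a simple intensity heuristic is used purely to illustrate the data flow), and `mask_image` shows how the generated map gates the image before it is passed to the classifier.

```python
import numpy as np

def generate_map(image):
    # Stand-in for the pixel-level map generator: produces a per-pixel
    # saliency value in [0, 1]. A real generator would be a CNN trained
    # with image-level labels plus the background/superpixel/proposal cues.
    m = image.mean(axis=-1)
    return (m - m.min()) / (m.max() - m.min() + 1e-8)

def mask_image(image, saliency):
    # Element-wise masking: non-salient pixels are suppressed, so the
    # image-level classifier can only rely on regions the map highlights.
    # Good maps are those under which the classifier still predicts well.
    return image * saliency[..., None]

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))          # toy 8x8 RGB image
sal = generate_map(img)              # pixel-level saliency map
masked = mask_image(img, sal)        # input seen by the classifier
print(masked.shape, float(sal.min()) >= 0.0, float(sal.max()) <= 1.0)
```

The masked image, rather than the original, is what the classifier scores during training, which is how the image-level supervision flows back into the pixel-level generator.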