ABSTRACT

This project builds a model to predict whether a user will download an app after clicking a mobile app advertisement. The training data is a large collection of users’ ad click records with attributes such as ip, os, device, and channel. In recent years, more and more users click on app ad links without downloading the apps, which wastes a considerable share of app service providers’ ad investment. A model that identifies such clicks could therefore eliminate this problem or at least reduce its impact as far as possible.

Among the given attributes, ‘is_attributed’ and ‘attributed_time’ are chosen as the output labels during training and as the predicted results during testing. The approaches investigated in this project include Decision Tree (DT), Random Forest (RF), AdaBoost (AB), K-Nearest Neighbor (KNN), LogitBoost (LB), and Multilayer Perceptron (MLP). We first tried and tested some of these algorithms in Weka, where we observed the results of DT and RF, and then implemented an MLP regressor and classifier using the scikit-learn (sklearn) package in Python. The best prediction accuracy, approximately 95%, was obtained with the MLP regressor and classifier.

The picture above shows all the machine learning algorithms we tried, tested, and improved in this project, along with the prediction accuracy of each.

Motivation

According to an investigation by TalkingData (https://www.talkingdata.com/), China’s largest independent big data service platform, potentially fraudulent clicks on app advertisements have become a persistent problem. Many users click on app ad links without downloading the apps, which wastes the ad investment of app service providers. A model that identifies such users and their devices so that they can be blacklisted could eliminate this problem or at least reduce its impact as far as possible. This would help companies that advertise online to save on ad spending and increase profit, which makes the task worth exploring.

Task

The task addressed in this project is ad tracking and fraud detection. Our goal is to build a model that predicts whether a user will download an app after clicking a mobile app advertisement, using a large collection of users’ ad click records as the original training data. The data comes from a featured Kaggle competition and is provided by TalkingData. The dataset covers approximately 200 million clicks over 4 days and includes several informative attributes such as app-id, the user’s device, and os. Among these attributes, ‘is_attributed’ and ‘attributed_time’ are chosen as the output labels during training and as the predicted results during testing.
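To make the task setup concrete, the following is a minimal sketch of how click records could be split into input features and the ‘is_attributed’ label before training any of the classifiers. It is an illustration under stated assumptions, not the code used in this project: the sample rows are invented, the ‘click_time’ column and the exact column order are assumed, and the helper names load_clicks and download_rate are hypothetical.

```python
import csv
import io

# Invented sample rows for illustration only; they are not taken from the
# real dataset. The 'click_time' column and the column order are assumptions.
SAMPLE_CSV = """ip,app,device,os,channel,click_time,attributed_time,is_attributed
1001,3,1,13,379,2017-11-06 14:32:21,,0
1002,9,1,19,134,2017-11-06 14:33:34,2017-11-06 15:01:02,1
1001,3,1,13,379,2017-11-06 14:34:12,,0
1003,12,2,22,245,2017-11-06 14:36:26,,0
"""

# Attributes used as model inputs in this sketch (all integer-encoded ids).
FEATURE_COLUMNS = ["ip", "app", "device", "os", "channel"]


def load_clicks(text):
    """Parse CSV text into a list of feature rows and a list of 0/1 labels."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        features.append([int(row[col]) for col in FEATURE_COLUMNS])
        labels.append(int(row["is_attributed"]))
    return features, labels


def download_rate(labels):
    """Fraction of clicks that were followed by a download."""
    return sum(labels) / len(labels) if labels else 0.0


features, labels = load_clicks(SAMPLE_CSV)
print(features[1], labels[1])
print(download_rate(labels))
```

The resulting feature rows and labels have the shape that a classifier's training routine typically expects. Checking the download rate first is useful because, when downloads are rare, a model can reach high accuracy simply by predicting "no download" for every click.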