
Data Set

We acquired the original data from an ongoing featured competition on Kaggle, provided by TalkingData. The dataset covers approximately 200 million clicks over 4 days and includes 8 attributes in total: ip, app, device, os, channel, click_time, attributed_time (recorded only if the app is downloaded), and is_attributed (whether the click results in a download, which is the prediction target). Among these attributes, is_attributed and attributed_time serve as the output labels during training and as the predicted results during testing. Figure 1 shows a sample from the provided dataset.
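As a quick sanity check, a minimal pandas sketch for peeking at the raw file is shown below; "train.csv" is the assumed file name from the Kaggle download, and only a handful of rows are read since the full file holds roughly 200 million clicks.

    import pandas as pd

    # Peek at the first few clicks without loading the whole ~200M-row file.
    # "train.csv" is the assumed name of the competition's training file.
    df = pd.read_csv('train.csv', nrows=5,
                     parse_dates=['click_time', 'attributed_time'])
    print(df)  # one row per click, eight columns as listed above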

Preprocessing

Since this dataset is too large to be used in full for training, and positive examples make up only a tiny proportion of the whole set, we had to sample the raw data. We first selected the first 10,000,000 examples in the list, keeping all 18,717 positive examples with is_attributed = 1. We then iterated through the negative examples, randomly keeping 1 out of every 450. This ensures the selected negative examples are scattered and well separated, instead of clustering in a short period such as an hour or a minute. In this way, we obtained approximately 38,000 examples to serve as our training/testing set, which should not conspicuously affect the overall accuracy compared with using the original data.
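The sampling step above can be sketched in Python as follows. The file name, the 10,000,000-row window and the 1-in-450 ratio follow the description; sample(frac=...) matches the described 1-in-450 pick only in expectation, and the random seed is an arbitrary choice for reproducibility.

    import pandas as pd

    # Keep only the first 10,000,000 rows; the rest of the file is unused.
    raw = pd.read_csv('train.csv', nrows=10_000_000)

    positives = raw[raw['is_attributed'] == 1]   # all 18,717 positive examples
    negatives = raw[raw['is_attributed'] == 0]

    # Randomly keep roughly 1 out of every 450 negatives, so the retained
    # clicks are spread over the whole window instead of a short time span.
    sampled_negatives = negatives.sample(frac=1/450, random_state=0)

    sample = pd.concat([positives, sampled_negatives]).sort_index()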


To process the data, we first converted click_time and attributed_time to the number of seconds since 2017-11-06 00:00:00 (slightly earlier than all the given timestamps), and then partitioned the values of ip, click_time and attributed_time into several equal-sized intervals so that these “continuous” variables could be represented more efficiently and representatively. Note that treating these variables as continuous only makes sense for the Decision Tree; for the other approaches they are regarded as discrete. Since the remaining attributes such as os, device and channel have already been encoded as discrete integers, no further transformation was needed and we used them directly.
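A minimal sketch of this conversion and binning, assuming the sampled DataFrame from the previous step; pd.cut produces equal-width intervals, so if “equal-sized” is meant as equal-frequency, pd.qcut would be used instead, and the bin count of 100 is purely illustrative.

    import pandas as pd

    ORIGIN = pd.Timestamp('2017-11-06 00:00:00')

    # Seconds elapsed since the chosen origin; attributed_time is missing
    # for negative examples and simply stays NaN after the conversion.
    for col in ['click_time', 'attributed_time']:
        sample[col] = (pd.to_datetime(sample[col]) - ORIGIN).dt.total_seconds()

    # Partition the "continuous" attributes into equal-width intervals
    # (100 bins is an assumed, illustrative choice).
    for col in ['ip', 'click_time', 'attributed_time']:
        sample[col + '_bin'] = pd.cut(sample[col], bins=100, labels=False)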


In terms of data division, when we first used the Decision Tree we implemented ourselves, we varied the number of training examples from 500 to 20,000, leaving the remaining examples for testing so that we could observe the trend of the overall accuracy. Next, we used 10-fold cross-validation in Weka when testing the models including 1-Nearest Neighbor, AdaBoost, Classification via Regression and Random Forest. For the MLP, with the assistance of the train_test_split function in the scikit-learn package, we randomly assigned ¾ of the examples for training and the remaining ¼ for testing.
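For the MLP split, a minimal sketch using scikit-learn's train_test_split, assuming the sampled DataFrame from above; the random seed is an arbitrary choice for reproducibility.

    from sklearn.model_selection import train_test_split

    # Features and label; attributed_time is dropped since, like
    # is_attributed, it is an output rather than an input.
    X = sample.drop(columns=['is_attributed', 'attributed_time'])
    y = sample['is_attributed']

    # 3/4 of the examples for training, 1/4 for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=0)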


The following three pictures show a quick inspection of the main statistics of the dataset we selected. They show that the numbers of downloaded and non-downloaded clicks are nearly the same, and that among the features, ip has the most unique values while os has the fewest.
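The same statistics can be reproduced with a couple of pandas calls; a minimal sketch, assuming the sampled DataFrame from above:

    # Class balance of the sampled set (downloads vs. non-downloads).
    print(sample['is_attributed'].value_counts())

    # Number of unique values per categorical attribute.
    print(sample[['ip', 'app', 'device', 'os', 'channel']].nunique())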

 

After inspecting our dataset, we found that a single ip can account for up to 26 downloads, and the same holds for the other attributes. We therefore examined the conversion rates against the click counts of ip, channel, os, device and app individually; the results are shown in the following 5 pictures. However, the results are somewhat disappointing. Conversions are noisy and do not appear to correlate with how popular an ip is. The proportion for apps fluctuates more as the counts go down, since each additional click has a larger impact on the proportion value, and the same holds for channel and os. Devices, meanwhile, are extremely disproportionately distributed, with the most common device used almost 96% of the time.
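A minimal sketch of how such conversion-rate-versus-count tables can be computed with pandas, again assuming the sampled DataFrame from above:

    # Conversion rate (mean of is_attributed) versus click count,
    # computed separately for each attribute.
    for col in ['ip', 'channel', 'os', 'device', 'app']:
        stats = (sample.groupby(col)['is_attributed']
                       .agg(clicks='count', conversion_rate='mean'))
        print(stats.sort_values('clicks', ascending=False).head())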
