
Machine Learning Kaggle Competition Part One: Getting Started

Working Through a First Notebook

Once you have a basic understanding of how Kaggle works and the philosophy of how to get the most out of a competition, it’s time to get started. Here, I’ll briefly outline a Python Jupyter Notebook I put together in a kernel for the Home Credit Default Risk problem, but to get the full benefit, you’ll want to fork the notebook on Kaggle and run it yourself (you don’t have to download or set up anything, so I’d highly encourage checking it out).

When you open the notebook in a kernel, you’ll see this environment:

Kernel Notebook Environment

Think of this as a standard Jupyter Notebook with slightly different aesthetics. You can write Python code and text (using Markdown syntax) just like in Jupyter and run the code completely in the cloud on Kaggle’s servers. However, Kaggle kernels have some unique features not available in Jupyter Notebook. Hit the leftward-facing arrow in the upper right to expand the kernel control panel, which brings up three tabs (if the notebook is not in fullscreen, these three tabs may already be visible next to the code).

In the Data tab, we can view the datasets to which our kernel is connected. In this case, we have the entire competition data, but we can also connect to any other dataset on Kaggle or upload our own data and access it in the kernel. Data files are available in the ../input/ directory from within the code:

import os
# List data files that are connected to the kernel
os.listdir('../input/')

Files connected to the kernel available in ../input/

The Settings tab lets us control different technical aspects of the kernel. Here we can add a GPU to our session, change the visibility, and install any Python package that is not already in the environment.

Finally, the Versions tab lets us see any previous committed runs of the code. We can view changes to the code, look at log files of a run, see the notebook generated by a run, and download the files that are output from a run.

Versions Tab

To run the entire notebook and record a new Version, hit the blue Commit & Run button in the upper right of the kernel. This executes all the code, shows us the completed notebook (or any errors if there are mistakes), and saves any files that are created during the run. When we commit the notebook, we can then access any predictions our models made and submit them for scoring.

Introductory Notebook Outline

The first notebook is meant to get you familiar with the problem. We start off much the same way as any data science problem: understanding the data and the task. For this problem, there is one main training data file (with the labels included), one main testing data file, and six additional data files. In this first notebook, we use only the main data, which will get us a decent score, but later work will have to incorporate all the data in order to be competitive.
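As a minimal sketch, loading the main files with pandas looks like this (the file names below match the competition’s main files, but verify them against the Data tab in your own kernel):

import pandas as pd

# Main training and testing files; verify the names with
# os.listdir('../input/') in your own kernel
app_train = pd.read_csv('../input/application_train.csv')
app_test = pd.read_csv('../input/application_test.csv')

print('Training shape:', app_train.shape)
print('Testing shape:', app_test.shape)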

To understand the data, it’s best to take a couple of minutes away from the keyboard and read through the problem documentation, such as the column descriptions of each data file. Because there are multiple files, we need to know how they are all linked together, although for this first notebook we only use the main file to keep things simple. Reading through other kernels can also help us get familiar with the data and which variables are important.

Once we understand the data and the problem, we can start structuring it for a machine learning task. This means dealing with categorical variables (through one-hot encoding), filling in the missing values (imputation), and scaling the variables to a set range. We can then do exploratory data analysis, such as finding correlations with the label, and graphing these relationships.
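A minimal sketch of that preprocessing with pandas and scikit-learn, assuming app_train and app_test from the loading sketch above and that the label column is named TARGET (as it is in this competition’s main file):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the categorical (object-type) columns
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

# Separate the label, then align train and test so they share
# exactly the same feature columns after encoding
labels = app_train['TARGET']
features, app_test = app_train.drop(columns=['TARGET']).align(
    app_test, join='inner', axis=1)

# Fill missing values with the column median and scale to [0, 1]
imputer = SimpleImputer(strategy='median')
scaler = MinMaxScaler(feature_range=(0, 1))

train = scaler.fit_transform(imputer.fit_transform(features))
test = scaler.transform(imputer.transform(app_test))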

Correlation Heatmap of Variables

We can use these relationships later on for modeling decisions, such as choosing which variables to include (see the notebook for the implementation).

Distribution of Ages Colored by Label (left) and Rates of Default by Age Group (right)
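As a quick illustration of finding these relationships in code, we can correlate every numeric feature with the label (again assuming app_train and the TARGET column from the sketches above):

# Correlation of each numeric feature with the label; sorting puts the
# strongest negative and positive relationships at either end
correlations = app_train.select_dtypes('number').corr()['TARGET'].sort_values()

print('Most negative correlations:\n', correlations.head(10))
print('\nMost positive correlations:\n', correlations.tail(10))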

Of course, no exploratory data analysis is complete without my favorite plot, the Pairs Plot.

Pairs Plot of Features (red indicates loans that were not repaid in the kde and scatter plots)
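A hedged sketch of a pairs plot with seaborn, using an illustrative subset of columns (the notebook plots a handful of the features most correlated with the label):

import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative column subset; swap in the features you care about
plot_cols = ['TARGET', 'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

# Drop missing values and subsample so the plot renders quickly
plot_data = app_train[plot_cols].dropna()
sns.pairplot(plot_data.sample(min(len(plot_data), 5000)), hue='TARGET')
plt.show()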

After thoroughly exploring the data and making sure it’s acceptable for machine learning, we move on to creating baseline models. However, before we quite get to the modeling stage, it’s critical we understand the performance metric for the competition. In a Kaggle competition, it all comes down to a single number, the metric on the test data.

While it might make intuitive sense to use accuracy for a binary classification task, that is a poor choice because we are dealing with an imbalanced class problem. Instead of accuracy, submissions are judged by ROC AUC: the area under the Receiver Operating Characteristic curve. I’ll let you do the research on this one, or read the explanation in the notebook. Just know that higher is better, with a random model scoring 0.5 and a perfect model scoring 1.0. To calculate ROC AUC, we need to make predictions in terms of probabilities rather than a binary 0 or 1. The ROC curve then plots the True Positive Rate against the False Positive Rate as the threshold for classifying an instance as positive is varied.
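A tiny self-contained example of the metric with scikit-learn (the toy labels and probabilities here are made up purely for illustration):

from sklearn.metrics import roc_auc_score

# ROC AUC is computed from predicted probabilities, not hard 0/1 labels
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]   # predicted probability of the positive class

print(roc_auc_score(y_true, y_prob))   # 0.75 here; 0.5 is random, 1.0 is perfect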

Usually we like to make a naive baseline prediction, but in this case, we already know that random guessing on the task would get an ROC AUC of 0.5. Therefore, for our baseline model, we will use a slightly more sophisticated method, Logistic Regression. This is a popular, simple algorithm for binary classification problems, and it will set a low bar for future models to surpass.
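A minimal sketch of the baseline with scikit-learn, assuming the preprocessed train, test, and labels arrays from the earlier sketch (the regularization strength C is an illustrative choice):

from sklearn.linear_model import LogisticRegression

# A heavily regularized baseline; lower C means stronger regularization
log_reg = LogisticRegression(C=0.0001)
log_reg.fit(train, labels)

# Keep the second column: the predicted probability the loan is not repaid
log_reg_pred = log_reg.predict_proba(test)[:, 1]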

After implementing the logistic regression, we can save the results to a CSV file for submission. When the notebook is committed, any results we write will show up in the Output sub-tab on the Versions tab:

Output from running the complete notebook
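Writing the submission file is straightforward once the predictions exist; a sketch, assuming app_test and log_reg_pred from above and the SK_ID_CURR / TARGET column names used by the competition’s sample submission:

# Submission format: one row per test loan, with its ID and the
# predicted probability of default
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = log_reg_pred

# Files written to the working directory appear under Output after a commit
submit.to_csv('log_reg_baseline.csv', index=False)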

From this tab, we can download the submissions to our computer and then upload them to the competition. In this notebook, we build four different models with the following scores:

  • Logistic Regression: 0.671
  • Random Forest: 0.678
  • Random Forest with Constructed Features: 0.678
  • Light Gradient Boosting Machine: 0.729

These scores don’t get us anywhere close to the top of the leaderboard, but they leave room for plenty of future improvement! We also get an idea of the performance we can expect using only a single source of data.

Not surprisingly, the Gradient Boosting Machine (implemented with the LightGBM library) performs the best. This model wins nearly every structured Kaggle competition (where the data is in table format), and we will likely need to use some form of it if we want to seriously compete!
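For reference, a minimal sketch of fitting that model through LightGBM’s scikit-learn API, using the same train, test, and labels arrays as before (the hyperparameters are only an illustrative starting point; the notebook adds cross-validation and early stopping on top of this):

import lightgbm as lgb

# Gradient boosted trees; class_weight='balanced' helps with the
# imbalanced labels in this problem
model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05,
                           class_weight='balanced', random_state=50)
model.fit(train, labels)

lgb_pred = model.predict_proba(test)[:, 1]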
