Topic 2: Using scikit-learn for ML Models

1. Introduction to scikit-learn

Scikit-learn is one of the most popular Python libraries for machine learning. It provides simple and efficient tools for data analysis and modeling and is built on NumPy, SciPy, and matplotlib.

2. Key Features of scikit-learn

  • Versatile Algorithms: Offers a vast array of algorithms for classification, regression, clustering, dimensionality reduction, and more.

  • Consistent API: Estimators share the same core methods (fit, predict, score, and transform for preprocessing steps), so once you understand how to use one type of model, switching to a new model or algorithm is straightforward.

  • Integration with pandas: Scikit-learn estimators accept pandas DataFrames as input, which simplifies data manipulation.
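The consistent API and pandas support can be illustrated with a minimal sketch. It assumes the bundled iris dataset and three arbitrarily chosen classifiers; neither the dataset nor the model choices come from the list above, and pandas must be installed for as_frame=True to work.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# as_frame=True returns a pandas DataFrame (features) and Series (labels).
X, y = load_iris(return_X_y=True, as_frame=True)

# Three unrelated algorithms (illustrative choices), all used the same way.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

training_accuracy = {}
for name, model in models.items():
    model.fit(X, y)                               # learn from the DataFrame
    predictions = model.predict(X)                # one predicted label per row
    training_accuracy[name] = model.score(X, y)   # mean accuracy on the given data
    print(f"{name}: {training_accuracy[name]:.3f} (training accuracy)")
```

The scores here are measured on the training data, so they only show that the calls are interchangeable; they are not an estimate of how well each model generalizes.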

3. Basic Workflow in scikit-learn

The general workflow with scikit-learn is:

  1. Load Data: Using pandas or any other method.
  2. Split Data: Separate data into training and test sets.
  3. Preprocess Data: Scale or normalize if necessary, fitting any transformer on the training set only.
  4. Choose a Model: Select a machine learning model.
  5. Train the Model: Using the training set.
  6. Evaluate the Model: With the test set.
  7. Tune and Optimize: Adjust hyperparameters and re-evaluate.

4. Example: Classification with Scikit-learn

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Load dataset
    iris = datasets.load_iris()
    X, y = iris.data, iris.target

    # Split first so the scaler is fit on the training data only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Initialize and train classifier
    clf = SVC(kernel="linear")
    clf.fit(X_train, y_train)

    # Evaluate the classifier
    accuracy = clf.score(X_test, y_test)
    print(f"Accuracy: {accuracy*100:.2f}%")

5. Model Evaluation

Scikit-learn provides numerous tools for model evaluation:

  • Accuracy Score: Measures the ratio of correct predictions to total predictions.
  • Confusion Matrix: A table of actual versus predicted classes that shows which classes the model confuses with each other.
  • Cross-Validation: Gives a more robust estimate by partitioning the dataset into folds and training/testing once per fold.
  • Metrics: Such as precision, recall, and F1 score, which are more informative than accuracy when classes are imbalanced.
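The tools listed above can be sketched together as follows. This is an illustrative example, assuming the iris dataset and a linear SVC; the stratify and random_state arguments are added here only to make the split reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy: correct predictions divided by total predictions.
accuracy = accuracy_score(y_test, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Precision, recall, and F1 score per class, as a formatted text report.
report = classification_report(y_test, y_pred)

# 5-fold cross-validation: one accuracy value per fold, on the full dataset.
cv_scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)

print(f"Accuracy: {accuracy:.3f}")
print(cm)
print(report)
print(f"CV accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total equals the accuracy score.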

6. Hyperparameter Tuning

Hyperparameters are parameters that are not learned from the data, such as the kernel or the regularization strength C of an SVC. They are set before training and can strongly affect model performance.

  • Grid Search: Exhaustively evaluates every combination in a predefined grid of hyperparameter values (GridSearchCV).

  • Randomized Search: Evaluates a fixed number of hyperparameter combinations sampled from specified distributions (RandomizedSearchCV).
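Both strategies can be sketched with GridSearchCV and RandomizedSearchCV. The candidate values, the sampling ranges, and the use of the iris dataset below are assumptions chosen only for illustration, not recommended settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: tries all 3 x 2 = 6 combinations, each with 5-fold cross-validation.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
grid.fit(X, y)
print("Grid search best:", grid.best_params_, f"{grid.best_score_:.3f}")

# Randomized search: samples n_iter=10 combinations from continuous distributions.
random_search = RandomizedSearchCV(
    SVC(),
    param_distributions={
        "C": loguniform(1e-2, 1e2),
        "gamma": loguniform(1e-3, 1e1),
    },
    n_iter=10,
    cv=5,
    random_state=0,
)
random_search.fit(X, y)
print("Randomized search best:", random_search.best_params_, f"{random_search.best_score_:.3f}")
```

Grid search cost grows multiplicatively with each added hyperparameter, while randomized search keeps the budget fixed at n_iter, which is why it is often preferred for larger search spaces.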

7. Pipelines

Scikit-learn provides a Pipeline utility that chains data processing steps and a final model into a single estimator. This ensures that the steps are executed in the right order and that each transformer is fit only on the data passed to fit.

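    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Create a pipeline: scale, reduce dimensionality, then classify
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dim', PCA()),
        ('classify', SVC())
    ])

    # Fit every step in order (X_train, y_train: the training split from section 4)
    pipe.fit(X_train, y_train)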
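Because a pipeline behaves like a single estimator, it can also be passed to the tuning tools from section 6. The sketch below reuses the step names from the pipeline above and addresses each step's hyperparameters as step__parameter; the candidate values and the iris dataset are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce_dim", PCA()),
    ("classify", SVC()),
])

# Keys follow the pattern <step name>__<parameter name>.
param_grid = {
    "reduce_dim__n_components": [2, 3],
    "classify__C": [0.1, 1, 10],
}

# Within each cross-validation fold, the scaler and PCA are refit on that
# fold's training portion only, so no test information leaks into them.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.3f}")
```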

In summary, scikit-learn is a versatile and robust library that streamlines the process of implementing machine learning models. Its consistent API and comprehensive documentation make it a favorite among both newcomers and experts in the field of machine learning.