Topic 2 Data collection and preprocessing

Data collection is a crucial step before building a machine learning model. However well designed your model is, it cannot learn anything useful from poor data. Data doesn’t have to be perfect, but it should be collected carefully. Flaws in data collection can leave you with missing values, bias, and highly correlated (redundant) features, all of which cause problems in model building.

Where to Find Data?

A great deal of data is freely available on the web. Two common ways to collect it are:

1. Web Scraping: Automatically extract structured data from websites. It’s useful for tasks like price comparison and for gathering details such as company names, emails, and phone numbers.

2. Web Crawling: A crawler (also called a spider) starts from a set of pages, follows the links it finds, and indexes the content along the way. Search engines like Google use web crawling to build the index behind their search results.
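
As a rough illustration of scraping, the sketch below pulls product names and prices out of a small HTML snippet using only Python's standard library. The snippet, the class names ("product", "name", "price"), and the ProductParser helper are all hypothetical; a real scraper would first download the page and would need to match that site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text inside <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field whose text we are currently inside
        self.rows = []             # one dict per product

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self.rows.append({})   # a new product begins
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self.current_field = attrs["class"]

    def handle_data(self, data):
        if self.current_field and self.rows:
            self.rows[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def scrape_products(html):
    """Return a list of {"name": str, "price": float} records."""
    parser = ProductParser()
    parser.feed(html)
    return [{"name": r["name"], "price": float(r["price"])} for r in parser.rows]


products = scrape_products(SAMPLE_HTML)
```

Here products ends up as a list of structured records, which is the form a price-comparison task would need.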

Collecting Data from Web Scraping and Crawling

Data gathered from the web can be used in natural language processing and image classification. For instance, if you’re building a dog breed classifier, you can find dog breed images online.
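
To make the dog breed example concrete, here is a minimal crawling sketch that follows links breadth-first and records the image URLs found on each page. The three-page SITE dictionary stands in for a real website and is entirely hypothetical, as are the paths and file names; a real crawler would fetch pages over HTTP and should respect the site's terms of use.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site, keyed by URL path. A real crawler would
# fetch each page over HTTP instead of reading from this dict.
SITE = {
    "/": '<a href="/beagle">Beagle</a> <a href="/pug">Pug</a>',
    "/beagle": '<img src="/img/beagle1.jpg"> <a href="/">Home</a>',
    "/pug": '<img src="/img/pug1.jpg"> <img src="/img/pug2.jpg"> '
            '<a href="/beagle">Beagle</a>',
}


class LinkAndImageParser(HTMLParser):
    """Records every link target and image source on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])


def crawl(start):
    """Breadth-first crawl returning {page: [image URLs found on it]}."""
    seen = {start}
    queue = deque([start])
    found = {}
    while queue:
        page = queue.popleft()
        parser = LinkAndImageParser()
        parser.feed(SITE.get(page, ""))
        if parser.images:
            found[page] = parser.images
        for link in parser.links:
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return found


images_by_page = crawl("/")
```

Because each breed has its own page in this toy site, the page path doubles as a label for the images found on it.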

Some sources offer official APIs for easier data access.

Data Pre-processing

After collecting data, you need to refine and format it for model building. Data can be structured (tables, CSV files) or unstructured (text, images, audio). Models operate on numbers, so whatever form the data arrives in, it must be converted into a numeric representation.

Types of Data

  1. Categorical Data: Takes one of a fixed set of values (e.g., True/False, days of the week).
  2. Numerical Data: Continuous or discrete numeric values (e.g., height, weight).

Data Pre-processing Techniques

  1. Data Quality Assessment: Identify and handle missing, null, and duplicated values.
  2. Feature Aggregation: Combine similar data for a high-level view and reduce the number of data objects.
  3. Feature Sampling: Select a representative subset of the data items, sampling either with or without replacement.
  4. Dimensionality Reduction: Reduce the number of features while retaining the useful information, making models easier to understand and data easier to visualize.
  5. Feature Encoding: Convert data to numeric form according to its measurement level (nominal, ordinal, interval, ratio).
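
Two of the techniques above, data quality assessment and feature encoding, can be sketched in a few lines of plain Python. The records, the column names, the choice to fill missing values with the mean, and the small/medium/large ordering are all assumptions made for illustration; libraries such as pandas provide equivalent operations for real datasets.

```python
from statistics import mean

# Hypothetical raw records: one exact duplicate and one missing height.
raw = [
    {"size": "small", "color": "red", "height_cm": 10.0},
    {"size": "large", "color": "blue", "height_cm": None},
    {"size": "small", "color": "red", "height_cm": 10.0},  # duplicate
    {"size": "medium", "color": "red", "height_cm": 20.0},
]

# Data quality assessment, step 1: drop exact duplicates (keep the first).
seen, rows = set(), []
for record in raw:
    key = tuple(sorted(record.items()))
    if key not in seen:
        seen.add(key)
        rows.append(dict(record))

# Step 2: fill missing heights with the mean of the observed heights.
# Mean imputation is one simple option among several, chosen here as an assumption.
observed = [r["height_cm"] for r in rows if r["height_cm"] is not None]
fill_value = mean(observed)
for r in rows:
    if r["height_cm"] is None:
        r["height_cm"] = fill_value

# Feature encoding: "size" is ordinal, so map it to ranked integers.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}
# "color" is nominal (no order), so one-hot encode it.
colors = sorted({r["color"] for r in rows})

encoded = []
for r in rows:
    one_hot = [1 if r["color"] == c else 0 for c in colors]
    encoded.append([SIZE_ORDER[r["size"]], *one_hot, r["height_cm"]])
```

Each row of encoded is now purely numeric: the ordinal size code, one 0/1 column per color, and the height.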

Splitting the Data

  • Training Data: Used to fit the model’s parameters.
  • Validation Data: Used to tune the model’s hyperparameters and, by comparison with training performance, to detect overfitting or underfitting.
  • Testing Data: Held out until the end and used to estimate how well the final model predicts on unseen data.
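
A three-way split can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name train_val_test_split are assumptions chosen for illustration, not recommendations from the text; the right proportions depend on how much data you have.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a copy of items and cut it into three disjoint parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


# Example: split 100 sample indices.
train, val, test = train_val_test_split(range(100))
```

Shuffling before cutting guards against any ordering in the collected data leaking into the split, and keeping the three parts disjoint ensures the test set is never seen during training or tuning.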