Data collection is a crucial step before building a machine learning model. However well designed your model is, it cannot learn anything useful from poor data. Data doesn’t have to be perfect, but it should be collected properly. Flaws in data collection can produce garbage data with missing values, bias, and highly correlated (redundant) features, all of which cause problems during model building.
Where to Find Data?
A large amount of data is freely available on the web. Two common techniques for collecting it are:
1. Web Scraping: Automatically extract structured data from specific web pages. It’s useful for tasks like price comparison and for gathering information such as company names, emails, and phone numbers.
2. Web Crawling: A crawler (also called a spider) follows links from page to page, discovering and indexing the content it finds. Search engines like Google use web crawling to build the index behind their search results.
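The difference between the two techniques can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not a production scraper: the HTML snippet, the `product`/`name`/`price` class names, and the `ProductParser` class are all hypothetical, and a real scraper would first download the page and respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page content (an assumption for illustration);
# a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
    <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
  </ul>
  <a href="/page/2">Next</a>
  <a href="/about">About</a>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects product records (scraping) and outgoing links (crawling)."""

    def __init__(self):
        super().__init__()
        self.products = []   # scraped records
        self.links = []      # URLs a crawler would visit next
        self._field = None   # name of the <span> we are currently inside
        self._current = {}   # record being assembled for the current <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside a span we care about.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # Store the record only if both fields were found.
            if "name" in self._current and "price" in self._current:
                self._current["price"] = float(self._current["price"])
                self.products.append(self._current)
            self._current = {}


def scrape(html):
    """Return (product records, links) found in an HTML string."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products, parser.links


products, links = scrape(SAMPLE_HTML)
print(products)
print(links)
```

Here `products` is the scraped data, while `links` is what a crawler would add to its queue of pages to visit next.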
Collecting Data with Web Scraping and Crawling
Data gathered from the web can be used in natural language processing and image classification. For instance, if you’re building a dog breed classifier, you can find dog breed images online.
Some sources also offer official APIs, which return data in a structured form (typically JSON) and are usually easier to work with than scraped HTML.
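As a rough sketch of what working with such an API can look like, the example below builds a request URL and pulls the useful fields out of a JSON response using only the standard library. The endpoint `https://api.example.com/v1/breeds`, its `breed` and `limit` parameters, and the `results`/`image_url` field names are assumptions made up for illustration; a real API documents its own URL, parameters, and response format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint (an assumption, not a real service).
BASE_URL = "https://api.example.com/v1/breeds"


def build_url(base, **params):
    """Append query parameters to a base URL."""
    return f"{base}?{urlencode(params)}" if params else base


def fetch_json(url, timeout=10):
    """Download a URL and decode its JSON body (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_records(payload):
    """Keep only the fields needed for an image dataset."""
    return [
        {"breed": item["breed"], "image_url": item["image_url"]}
        for item in payload.get("results", [])
    ]


# A hand-written payload standing in for a real response.
sample_payload = json.loads(
    '{"results": [{"id": 1, "breed": "beagle", '
    '"image_url": "https://example.com/b1.jpg"}]}'
)

url = build_url(BASE_URL, breed="beagle", limit=50)
records = extract_records(sample_payload)
print(url)
print(records)
```

With a real API, `fetch_json(url)` would replace the hand-written `sample_payload`.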
After collecting data, you need to clean and format it for model building. Data can be structured (tables, CSV files) or unstructured (text, images, audio). Models operate on numbers, so either kind of data must be converted to a numeric representation before training.
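For structured data, this formatting step might look like the following sketch, which reads a small CSV and turns each row into a list of numbers. The column names (`breed`, `weight_kg`, `height_cm`) and the values are invented for illustration, and one-hot encoding is just one common way to represent a text category numerically.

```python
import csv
import io

# Made-up CSV data (an assumption for illustration).
CSV_TEXT = """breed,weight_kg,height_cm
beagle,10.5,38
poodle,25.0,55
beagle,11.2,40
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def one_hot(value, categories):
    """Encode a category as a vector with a single 1.0 at its position."""
    return [1.0 if value == category else 0.0 for category in categories]


def to_features(rows):
    """Convert rows to numeric vectors: one-hot breed + weight + height."""
    categories = sorted({row["breed"] for row in rows})
    features = [
        one_hot(row["breed"], categories)
        + [float(row["weight_kg"]), float(row["height_cm"])]
        for row in rows
    ]
    return categories, features


rows = load_rows(CSV_TEXT)
categories, features = to_features(rows)
print(categories)
for vector in features:
    print(vector)
```

Each row becomes a vector such as `[1.0, 0.0, 10.5, 38.0]`, where the first two entries mark the breed and the last two hold the measurements.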
Types of Data
Data Pre-processing Techniques
Splitting the Data