The performance of a machine learning model depends not only on its hyperparameters and structure but also on how the variables fed into it are treated. Preprocessing categorical data matters in data science because most machine learning algorithms accept only numerical input. It is therefore crucial to convert categorical variables into a numerical representation so that the model can learn from them and derive useful insights.
Data scientists reportedly spend 70-80% of their time on data cleaning and processing, so they cannot ignore categorical data transformation, a step that can also improve model performance and supports feature engineering. The question at this point is: which encoding technique should be chosen for a given categorical variable?
Categorical data
Categorical data, the subject of this post, takes its values from a finite, well-defined set, often called 'categories' and usually stored as strings. Examples of categorical variables include an individual's home city (Delhi, Bangalore, etc.), company department (Human Resources, Finance, etc.), and highest educational qualification (Diploma, Bachelor’s, Master’s, etc.). Each of these has a well-defined set of possible values. Categorical data is typically divided into nominal data and ordinal data. In nominal data, the categories have no intrinsic order. In ordinal data, the categories follow a particular sequence.
When encoding ordinal data, it is vital to preserve the ordering of the categories. In the previous example, an individual’s highest qualification says a good deal about their credentials and can affect their candidacy for a particular job role, so the encoding should keep the fact that a Master’s ranks above a Bachelor’s, which ranks above a Diploma. For nominal data, what matters is only whether a given category is present or absent. For instance, an individual's city of residence is an important data point, but it has no sequential ordering: living in Delhi rather than Bangalore implies nothing about rank or order.
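The difference between the two cases can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the qualification order, the list of cities, the sample `people` records, and the helper names `encode_ordinal` and `encode_one_hot` are all assumptions made for the example. The ordinal feature is mapped to integers that preserve rank, while the nominal feature gets one 0/1 indicator per category (commonly called one-hot encoding).

```python
# Ordinal feature: the order Diploma < Bachelor's < Master's is an assumption
# based on the qualification example in the text.
QUALIFICATION_ORDER = ["Diploma", "Bachelor's", "Master's"]
qualification_rank = {level: rank for rank, level in enumerate(QUALIFICATION_ORDER)}

# Nominal feature: cities have no order, so each one gets its own 0/1 column.
CITIES = ["Bangalore", "Delhi"]


def encode_ordinal(level):
    """Map a qualification to an integer that preserves its rank."""
    return qualification_rank[level]


def encode_one_hot(city):
    """Return one indicator per known city: 1 if it matches, else 0."""
    return [1 if city == known else 0 for known in CITIES]


# Hypothetical records: (city, highest qualification).
people = [("Delhi", "Master's"), ("Bangalore", "Diploma"), ("Delhi", "Bachelor's")]

# Each row is [is_Bangalore, is_Delhi, qualification_rank].
encoded = [encode_one_hot(city) + [encode_ordinal(level)] for city, level in people]
print(encoded)
```

In practice a library encoder would usually replace these helpers; scikit-learn, for example, provides `OrdinalEncoder` and `OneHotEncoder` for this purpose.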
Nominal and ordinal data contrast with numerical data, which consists solely of numeric values such as integers or floating-point numbers. Categorical data instead contains label values, usually restricted to a small, fixed set. A few examples are a "pet" attribute with values like "Snake" and "Turtle", a "color" attribute with values like "purple", "yellow", and "black", or a "place" attribute with values like "third" and "fourth".
Note that each value represents a distinct category, and some categories have a natural relationship to one another. For instance, the values of the "place" variable have a natural order, which makes it an ordinal variable. A numerical variable can also be turned into an ordinal variable by dividing its range into bins and assigning each value to the bin it falls in.
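As a rough sketch of that binning step, the snippet below converts a numeric variable into ordinal codes using only the standard library. The "age" variable, the bin edges, the bin labels, and the helper name `to_bin` are hypothetical and chosen purely for illustration.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for an "age" variable (not from the text).
# Bins are [.., 18), [18, 40), [40, 65), [65, ..): each edge opens a new bin.
BIN_EDGES = [18, 40, 65]
BIN_LABELS = ["child", "young adult", "middle-aged", "senior"]


def to_bin(value):
    """Return (ordinal code, label) for the bin that contains value."""
    code = bisect_right(BIN_EDGES, value)
    return code, BIN_LABELS[code]


ages = [7, 18, 39, 40, 71]
binned = [to_bin(age) for age in ages]
print(binned)
```

The integer code keeps the order of the bins, so the result can be treated as an ordinal variable. For tabular data, pandas offers `cut` for the same kind of operation.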
Certain algorithms, such as decision trees, can in principle work directly with categorical data without any transformation. However, many machine learning algorithms require numeric input and output variables, and this is often a limitation of efficient implementations rather than a constraint of the algorithms themselves. For example, scikit-learn estimators generally expect all input data to be numeric. This implies that categorical data must be encoded in numerical form. Moreover, if the categorical variable is the output variable, the model's predictions may need to be translated back into categorical form before they are used or displayed.
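The round trip for an output variable might look like the following sketch, which encodes class labels as integers and then decodes integer predictions back into labels. The labels reuse the "pet" example from earlier, and `predicted_codes` is a made-up stand-in for what a trained model would return; no real model is involved.

```python
# Training labels for a categorical output variable (the "pet" example).
labels = ["Snake", "Turtle", "Snake", "Turtle", "Snake"]

# Build a stable label <-> integer mapping from the sorted unique classes.
classes = sorted(set(labels))
class_to_code = {name: code for code, name in enumerate(classes)}

# Encode: the integer targets a numeric-only model would be trained on.
encoded = [class_to_code[name] for name in labels]

# Hypothetical model output: integer class codes, not strings.
predicted_codes = [1, 0, 0]

# Decode: translate the predictions back into readable categories.
decoded = [classes[code] for code in predicted_codes]

print(encoded)
print(decoded)
```

scikit-learn's `LabelEncoder` packages the same idea behind `fit_transform` and `inverse_transform`.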
Conclusion
Encoding categorical data is a fundamental part of feature engineering. The appropriate encoding scheme depends on the nature of the dataset, in particular whether a variable is nominal or ordinal, and on the chosen model.