There's more...
Astute readers might have noticed that the read.csv function call included stringsAsFactors = F as its final parameter. In versions of R before 4.0.0, read.csv converts string columns to factors by default (from 4.0.0 onward, the default is stringsAsFactors = FALSE). Factor is R's categorical datatype, which can be thought of as a label or tag applied to the data. Internally, R stores a factor as a vector of integers with a mapping to the appropriate labels, known as levels. This technique allowed older versions of R to store factors in much less memory than the corresponding character vector.
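The effect of the parameter, and the integer storage behind a factor, can be seen in a minimal sketch. The inline CSV text and the column names make and mpg are made up for illustration and are not the recipe's dataset; the text argument is used only to avoid needing a file on disk.

```r
# Hypothetical inline data standing in for a CSV file
csv_text <- "make,mpg\ntoyota,30\nford,25\ntoyota,32"

# Read the same data twice, once with each setting
with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
with_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(with_factors$make)   # "factor"
class(with_strings$make)   # "character"

# A factor is a set of labels (levels) plus integer codes pointing at them
levels(with_factors$make)      # "ford" "toyota"
as.integer(with_factors$make)  # 2 1 2
```

Setting the parameter explicitly, as the recipe does, keeps the result the same regardless of which version of R runs the script.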
Categorical variables do not have a sense of order (where one value is considered greater than another). In the following snippet, we create a quick toy example that converts four values of the character class to a factor and then compares two of them:
colors <- c('green', 'red', 'yellow', 'blue')
colors_factors <- factor(colors)
colors_factors
[1] green red yellow blue
Levels: blue green red yellow
colors_factors[1] > colors_factors[2]
[1] NA
Warning message:
In Ops.factor(colors_factors[1], colors_factors[2]) :
  ‘>’ not meaningful for factors
However, there is also an ordered categorical variable, known in the statistical world as ordinal data. Ordinal data is just like categorical data, with one exception: the values have a sense of scale or rank. It can be said that one value is larger than another, but the magnitude of the difference cannot be measured.
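In R, ordinal data can be represented with an ordered factor. The following is a small sketch using made-up size labels; the variable names are illustrative only.

```r
# Hypothetical ordinal data: the labels have a rank but no measurable distance
sizes <- c("medium", "small", "large", "small")

# Spell out the levels from lowest to highest and mark the factor as ordered
sizes_ordered <- factor(sizes,
                        levels = c("small", "medium", "large"),
                        ordered = TRUE)

sizes_ordered[1] > sizes_ordered[2]  # TRUE: medium ranks above small
max(sizes_ordered)                   # large
```

Unlike the unordered colors example, the comparison here returns a result without a warning, because the level order tells R how the values rank.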
Furthermore, when importing data into R, we often run into the situation where a column of numeric data contains an entry that is non-numeric. In this case, R might import the column as a factor, which is rarely what the data scientist intended. Converting from factor to character is relatively routine, but converting from factor to numeric can be a bit tricky, because calling as.numeric directly on a factor returns the internal integer codes rather than the original values.
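One common way around this is to convert the factor to character first and only then to numeric. The sketch below uses a made-up column in which the entry "n/a" stands in for the stray non-numeric value.

```r
# Hypothetical column: three numbers and one non-numeric entry
raw <- c("10", "20", "n/a", "40")
readings <- factor(raw)

# Pitfall: this yields the internal level codes, not the original values
codes <- as.numeric(readings)
codes  # 1 2 4 3

# Go through character first; the non-numeric entry becomes NA
# (suppressWarnings hides the expected "NAs introduced by coercion" warning)
values <- suppressWarnings(as.numeric(as.character(readings)))
values  # 10 20 NA 40
```

The resulting NA marks the position of the bad entry, so it can be inspected or cleaned before any analysis.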
R is capable of importing data from a wide range of formats. In this recipe, we handled a CSV file, but we could have used a Microsoft Excel file as well. CSV files are preferred as they are universally supported across operating systems and are far more portable. Additionally, R can import data from numerous popular statistical programs, including SPSS, Stata, and SAS.