Chapter 3 Data

Before any analysis was performed, the factors in the dataset were checked for missing or invalid data. The 26 factors break down as follows: 15 continuous, 10 nominal, and 1 integer.

Seven of the factors contained missing or improperly coded data. In this dataset in particular, all missing data is coded with the value ?. In every case listed below, the records containing the missing data were removed.

 N  Factor             Number of records missing a value
 2  normalized-losses  41
 6  num-of-doors        2
19  bore                4
20  stroke              4
22  horsepower          2
23  peak-rpm            2
26  price               4

Of the original 205 records, 46 were removed because they contained missing data, uniformly coded as ?, in one or more of the factors listed above. This left a dataset of 159 clean records. No other factors needed cleaning, as the data was properly coded for each record.
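The per-factor counts in the table above sum to 59, while only 46 records were removed, because a single record can be missing more than one factor. The sketch below illustrates both steps, counting the ? marker per factor and dropping incomplete records. It assumes each record is held as a dictionary of strings; the function names and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter

MISSING = "?"  # missing-value marker used throughout the dataset


def count_missing(records):
    """Count, per factor, how many records carry the missing marker."""
    counts = Counter()
    for record in records:
        for factor, value in record.items():
            if value == MISSING:
                counts[factor] += 1
    return dict(counts)


def drop_incomplete(records):
    """Keep only the records in which no factor is missing."""
    return [r for r in records if MISSING not in r.values()]


# Hypothetical rows for illustration only.
sample = [
    {"normalized-losses": "?", "bore": "3.00", "price": "10000"},
    {"normalized-losses": "100", "bore": "3.10", "price": "12000"},
    {"normalized-losses": "?", "bore": "?", "price": "?"},
]

print(count_missing(sample))         # three missing values in total ...
print(len(drop_incomplete(sample)))  # ... but only two records are dropped
```

The third sample row contributes to three per-factor counts yet is removed only once, which is the same effect seen in the real counts.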

Table 3.1: Data Dictionary
N Description Values
1 symboling -3, -2, -1, 0, 1, 2, 3
2 normalized-losses continuous from [65 to 256]
3 make alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
4 fuel-type diesel, gas
5 aspiration std, turbo
6 num-of-doors four, two
7 body-style hardtop, wagon, sedan, hatchback, convertible
8 drive-wheels 4wd, fwd, rwd
9 engine-location front, rear
10 wheel-base continuous from [86.6 to 120.9]
11 length continuous from [141.1 to 208.1]
12 width continuous from [60.3 to 72.3]
13 height continuous from [47.8 to 59.8]
14 curb-weight continuous from [1488 to 4066]
15 engine-type dohc, dohcv, l, ohc, ohcf, ohcv, rotor
16 num-of-cylinders eight, five, four, six, three, twelve, two
17 engine-size continuous from [61 to 326]
18 fuel-system 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi
19 bore continuous from [2.54 to 3.94]
20 stroke continuous from [2.07 to 4.17]
21 compression-ratio continuous from [7 to 23]
22 horsepower continuous from [48 to 288]
23 peak-rpm continuous from [4,150 to 6,600]
24 city-mpg continuous from [13 to 49]
25 highway-mpg continuous from [16 to 54]
26 price continuous from [5,118 to 45,400]
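A check for improperly coded values against the data dictionary might look like the following sketch. It covers only a small excerpt of Table 3.1, assumes each record is a dictionary of strings, and uses a hypothetical helper name (invalid_factors); it is one possible way to run such a check, not necessarily the one used here.

```python
# Excerpt of Table 3.1; the remaining factors would follow the same pattern.
NOMINAL = {
    "fuel-type": {"diesel", "gas"},
    "aspiration": {"std", "turbo"},
    "num-of-doors": {"four", "two"},
}
CONTINUOUS = {
    "wheel-base": (86.6, 120.9),
    "horsepower": (48, 288),
    "price": (5118, 45400),
}


def invalid_factors(record):
    """Return the names of factors whose values fall outside the dictionary."""
    bad = []
    for factor, allowed in NOMINAL.items():
        if record.get(factor) not in allowed:
            bad.append(factor)
    for factor, (low, high) in CONTINUOUS.items():
        try:
            value = float(record.get(factor))
        except (TypeError, ValueError):
            bad.append(factor)  # absent, "?", or otherwise non-numeric
            continue
        if not low <= value <= high:
            bad.append(factor)
    return bad


# Hypothetical record: one missing nominal value, one out-of-range price.
record = {
    "fuel-type": "gas", "aspiration": "std", "num-of-doors": "?",
    "wheel-base": "94.5", "horsepower": "102", "price": "99999",
}
print(invalid_factors(record))
```

A record that yields an empty list is consistent with the excerpted dictionary entries.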

The remaining records were split into two groups, named training and test, each holding approximately 50% of the records in the dataset.
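The split can be sketched as follows. The text does not say how records were assigned to each group, so the seeded random shuffle used here is an assumption, as are the function name and the seed value. With 159 records an exact halving is not possible, so one group receives one record more than the other.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of the records and cut it at the midpoint."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


records = list(range(159))  # stand-ins for the 159 cleaned records
training_set, test_set = split_half(records)
print(len(training_set), len(test_set))  # 79 80
```

Fixing the seed keeps the two groups identical across runs, which makes later model comparisons reproducible.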

The objective factor in the dataset is symboling. For the remaining factors, summary statistics (minimum, 1st quartile, median, 3rd quartile, and maximum) were examined for continuous variables. For discrete factors, we simply looked at the raw counts of each level. A few cars had unique values, such as the lone three-cylinder hatchback from chevrolet and the lone mixed fuel injection (mfi) dodge hatchback. On first analysis they do not seem to be outliers, but they were re-evaluated after checking whether they skewed the resulting models. A rough summary of the factors' summary statistics follows. It is computed on the 159 cleaned records, so some ranges are narrower than those listed in Table 3.1:

##    symboling       normalized_losses      make     fuel_type   aspiration 
##  Min.   :-2.0000   Min.   : 65.0     toyota :31   diesel: 15   std  :132  
##  1st Qu.: 0.0000   1st Qu.: 94.0     nissan :18   gas   :144   turbo: 27  
##  Median : 1.0000   Median :113.0     honda  :13                           
##  Mean   : 0.7358   Mean   :121.1     subaru :12                           
##  3rd Qu.: 2.0000   3rd Qu.:148.0     mazda  :11                           
##  Max.   : 3.0000   Max.   :256.0     volvo  :11                           
##                                      (Other):63                           
##  num_of_doors       body_style drive_wheels engine_location
##  four:95      convertible: 2   4wd:  8      front:159      
##  two :64      hardtop    : 5   fwd:105                     
##               hatchback  :56   rwd: 46                     
##               sedan      :79                               
##               wagon      :17                               
##                                                            
##                                                            
##    wheel_base         length          width           height     
##  Min.   : 86.60   Min.   :141.1   Min.   :60.30   Min.   :49.40  
##  1st Qu.: 94.50   1st Qu.:165.7   1st Qu.:64.00   1st Qu.:52.25  
##  Median : 96.90   Median :172.4   Median :65.40   Median :54.10  
##  Mean   : 98.26   Mean   :172.4   Mean   :65.61   Mean   :53.90  
##  3rd Qu.:100.80   3rd Qu.:177.8   3rd Qu.:66.50   3rd Qu.:55.50  
##  Max.   :115.60   Max.   :202.6   Max.   :71.70   Max.   :59.80  
##                                                                  
##   curb_weight   engine_type num_of_cylinders  engine_size    fuel_system
##  Min.   :1488   dohc:  8    eight:  1        Min.   : 61.0   1bbl:11    
##  1st Qu.:2066   l   :  8    five :  7        1st Qu.: 97.0   2bbl:63    
##  Median :2340   ohc :123    four :136        Median :110.0   idi :15    
##  Mean   :2461   ohcf: 12    six  : 14        Mean   :119.2   mfi : 1    
##  3rd Qu.:2810   ohcv:  8    three:  1        3rd Qu.:135.0   mpfi:64    
##  Max.   :4066                                Max.   :258.0   spdi: 5    
##                                                                         
##       bore          stroke      compression_ratio   horsepower    
##  Min.   :2.54   Min.   :2.070   Min.   : 7.00     Min.   : 48.00  
##  1st Qu.:3.05   1st Qu.:3.105   1st Qu.: 8.70     1st Qu.: 69.00  
##  Median :3.27   Median :3.270   Median : 9.00     Median : 88.00  
##  Mean   :3.30   Mean   :3.236   Mean   :10.16     Mean   : 95.84  
##  3rd Qu.:3.56   3rd Qu.:3.410   3rd Qu.: 9.40     3rd Qu.:114.00  
##  Max.   :3.94   Max.   :4.170   Max.   :23.00     Max.   :200.00  
##                                                                   
##     peak_rpm       city_mpg      highway_mpg        price      
##  Min.   :4150   Min.   :15.00   Min.   :18.00   Min.   : 5118  
##  1st Qu.:4800   1st Qu.:23.00   1st Qu.:28.00   1st Qu.: 7372  
##  Median :5200   Median :26.00   Median :32.00   Median : 9233  
##  Mean   :5114   Mean   :26.52   Mean   :32.08   Mean   :11446  
##  3rd Qu.:5500   3rd Qu.:31.00   3rd Qu.:37.00   3rd Qu.:14720  
##  Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :35056  
##
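A summary of this kind can be approximated with the standard library, as in the sketch below. The function names are hypothetical, and the "inclusive" quartile method is an assumption; other quartile conventions give slightly different values for the 1st and 3rd quartiles.

```python
from collections import Counter
from statistics import mean, quantiles


def numeric_summary(values):
    """Minimum, quartiles, mean and maximum of a continuous factor."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": median,
        "mean": mean(values), "q3": q3, "max": max(values),
    }


def level_counts(values):
    """Raw counts per level of a discrete factor, most frequent first."""
    return Counter(values).most_common()


# Hypothetical values for illustration only.
print(numeric_summary([1, 2, 3, 4, 5]))
print(level_counts(["gas", "gas", "diesel"]))
```

Levels that occur only once, like the single mfi fuel system above, show up at the end of the level_counts result, which makes unique values easy to spot.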