Chapter 3 Data
Before doing any analysis, the factors within the dataset were first checked for missing or invalid data. The individual factors can be described as follows: 15 continuous, 10 nominal, and 1 integer.
Seven of the factors contained missing or improperly coded data. In this dataset in particular, all missing data has been coded with the value "?". In all cases below, the records containing the missing data have been removed.
N | Factor | Number of records missing a value |
---|---|---|
2 | normalized-losses | 41 |
6 | num-of-doors | 2 |
19 | bore | 4 |
20 | stroke | 4 |
22 | horsepower | 2 |
23 | peak-rpm | 2 |
26 | price | 4 |
Of the original 205 records, 46 were removed because they contained missing data, uniformly coded as "?", in one or more of the factors listed above. The per-factor counts in the table sum to 59 rather than 46 because some records are missing values in more than one factor. This left a dataset of 159 clean records. No other factors needed cleaning, as the data was properly coded for every record.
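The cleaning step described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used for the analysis; the helper names `count_missing` and `drop_incomplete` and the four sample rows are made up for the example, and only the "?" marker comes from the text.

```python
import csv
import io

MISSING = "?"  # missing-value marker described in the text


def count_missing(rows, marker=MISSING):
    """Return {column: number of rows whose value equals the marker}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value.strip() == marker:
                counts[column] = counts.get(column, 0) + 1
    return counts


def drop_incomplete(rows, marker=MISSING):
    """Keep only the rows in which no field equals the marker."""
    return [
        row for row in rows
        if all(value.strip() != marker for value in row.values())
    ]


# Tiny made-up sample (assumption: not taken from the real dataset).
sample = io.StringIO(
    "normalized-losses,num-of-doors,price\n"
    "164,four,13950\n"
    "?,two,13495\n"
    "?,?,16500\n"
    "158,four,?\n"
)
rows = list(csv.DictReader(sample))

print(count_missing(rows))   # per-column tally of the marker
clean = drop_incomplete(rows)
print(len(rows), "->", len(clean))
```

In the sample, the third row carries two markers but is dropped only once, so the per-column tally (four markers) exceeds the number of rows removed (three).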
N | Description | Values |
---|---|---|
1 | symboling | -3, -2, -1, 0, 1, 2, 3 |
2 | normalized-losses | continuous from [65 to 256] |
3 | make | alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo |
4 | fuel-type | diesel, gas |
5 | aspiration | std, turbo |
6 | num-of-doors | four, two |
7 | body-style | hardtop, wagon, sedan, hatchback, convertible |
8 | drive-wheels | 4wd, fwd, rwd |
9 | engine-location | front, rear |
10 | wheel-base | continuous from [86.6 to 120.9] |
11 | length | continuous from [141.1 to 208.1] |
12 | width | continuous from [60.3 to 72.3] |
13 | height | continuous from [47.8 to 59.8] |
14 | curb-weight | continuous from [1488 to 4066] |
15 | engine-type | dohc, dohcv, l, ohc, ohcf, ohcv, rotor |
16 | num-of-cylinders | eight, five, four, six, three, twelve, two |
17 | engine-size | continuous from [61 to 326] |
18 | fuel-system | 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi |
19 | bore | continuous from [2.54 to 3.94] |
20 | stroke | continuous from [2.07 to 4.17] |
21 | compression-ratio | continuous from [7 to 23] |
22 | horsepower | continuous from [48 to 288] |
23 | peak-rpm | continuous from [4,150 to 6,600] |
24 | city-mpg | continuous from [13 to 49] |
25 | highway-mpg | continuous from [16 to 54] |
26 | price | continuous from [5,118 to 45,400] |
The remaining records were split into two groups of equal size, named training and test, each holding about 50% of the records in the dataset (with 159 records, one group necessarily holds one more record than the other).
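A split of this kind can be sketched as below. The text does not say how records were assigned to the two groups, so the random shuffle, the fixed seed, and the function name `split_half` are all assumptions made for illustration; the integers stand in for the 159 cleaned records.

```python
import random


def split_half(records, seed=0):
    """Shuffle a copy of the records and cut it in the middle."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Stand-ins for the 159 cleaned records (assumption: real rows would go here).
records = list(range(159))
training, test = split_half(records)
print(len(training), len(test))  # 79 80
```

Because 159 is odd, this sketch places the extra record in the test group; which group received it in the original analysis is not stated.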
The objective factor in the dataset is symboling. For the remaining factors, summary statistics were examined for the continuous variables (minimum, 1st quartile, median, 3rd quartile, and maximum). For discrete factors, we simply looked at the raw counts of each level. A few cars had unique values, such as the lone three-cylinder hatchback from chevrolet and the lone mixed fuel injection (mfi) dodge hatchback. At first analysis they do not seem to be outliers, but they were re-evaluated after checking whether they skewed the resulting models. A rough summary of the factors' summary statistics is as follows:
## symboling normalized_losses make fuel_type aspiration
## Min. :-2.0000 Min. : 65.0 toyota :31 diesel: 15 std :132
## 1st Qu.: 0.0000 1st Qu.: 94.0 nissan :18 gas :144 turbo: 27
## Median : 1.0000 Median :113.0 honda :13
## Mean : 0.7358 Mean :121.1 subaru :12
## 3rd Qu.: 2.0000 3rd Qu.:148.0 mazda :11
## Max. : 3.0000 Max. :256.0 volvo :11
## (Other):63
## num_of_doors body_style drive_wheels engine_location
## four:95 convertible: 2 4wd: 8 front:159
## two :64 hardtop : 5 fwd:105
## hatchback :56 rwd: 46
## sedan :79
## wagon :17
##
##
## wheel_base length width height
## Min. : 86.60 Min. :141.1 Min. :60.30 Min. :49.40
## 1st Qu.: 94.50 1st Qu.:165.7 1st Qu.:64.00 1st Qu.:52.25
## Median : 96.90 Median :172.4 Median :65.40 Median :54.10
## Mean : 98.26 Mean :172.4 Mean :65.61 Mean :53.90
## 3rd Qu.:100.80 3rd Qu.:177.8 3rd Qu.:66.50 3rd Qu.:55.50
## Max. :115.60 Max. :202.6 Max. :71.70 Max. :59.80
##
## curb_weight engine_type num_of_cylinders engine_size fuel_system
## Min. :1488 dohc: 8 eight: 1 Min. : 61.0 1bbl:11
## 1st Qu.:2066 l : 8 five : 7 1st Qu.: 97.0 2bbl:63
## Median :2340 ohc :123 four :136 Median :110.0 idi :15
## Mean :2461 ohcf: 12 six : 14 Mean :119.2 mfi : 1
## 3rd Qu.:2810 ohcv: 8 three: 1 3rd Qu.:135.0 mpfi:64
## Max. :4066 Max. :258.0 spdi: 5
##
## bore stroke compression_ratio horsepower
## Min. :2.54 Min. :2.070 Min. : 7.00 Min. : 48.00
## 1st Qu.:3.05 1st Qu.:3.105 1st Qu.: 8.70 1st Qu.: 69.00
## Median :3.27 Median :3.270 Median : 9.00 Median : 88.00
## Mean :3.30 Mean :3.236 Mean :10.16 Mean : 95.84
## 3rd Qu.:3.56 3rd Qu.:3.410 3rd Qu.: 9.40 3rd Qu.:114.00
## Max. :3.94 Max. :4.170 Max. :23.00 Max. :200.00
##
## peak_rpm city_mpg highway_mpg price
## Min. :4150 Min. :15.00 Min. :18.00 Min. : 5118
## 1st Qu.:4800 1st Qu.:23.00 1st Qu.:28.00 1st Qu.: 7372
## Median :5200 Median :26.00 Median :32.00 Median : 9233
## Mean :5114 Mean :26.52 Mean :32.08 Mean :11446
## 3rd Qu.:5500 3rd Qu.:31.00 3rd Qu.:37.00 3rd Qu.:14720
## Max. :6600 Max. :49.00 Max. :54.00 Max. :35056
##
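
The two kinds of summary shown above (a five-number summary for continuous factors, raw counts for discrete ones) can be approximated with the sketch below. It is written in Python with the standard library only and is not the code that produced the output above; the function names and the sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since other conventions give slightly different quartiles.

```python
import statistics
from collections import Counter


def five_number_summary(values):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def level_counts(values):
    """Raw count of each level of a discrete factor, most frequent first."""
    return Counter(values).most_common()


# Made-up values (assumption: not the real columns).
horsepower = [48, 69, 88, 114, 200]
fuel_type = ["gas", "gas", "diesel", "gas"]

print(five_number_summary(horsepower))  # (48, 69.0, 88.0, 114.0, 200)
print(level_counts(fuel_type))          # [('gas', 3), ('diesel', 1)]
```

Levels with a count of one, like the single mfi fuel system in the output above, stand out immediately in such a tally, which is how the unique cars mentioned earlier can be spotted.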