Importing Data in R

9Aug - by neuralsculpt - 0 - In BigData

a. Using the Combine Command

As we know, the c() function is used to concatenate, or combine, items in R, as shown below:

> c(item.1, item.2, item.n)    # combines all the specified items into one object
> sample.name = c(item.1, item.2, item.n)    # the combined values can be assigned to a named object

Everything in the parentheses is joined to create a single item. Usually, the joined items are assigned to a named object.

b. Entering Numerical Items as Data

Numerical data can be entered simply by typing the values, separated by commas, into the c() command.

Let us create a data set with the following command:

> data1 = c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

This creates a new object, data1, to hold the values typed inside the parentheses. The values are separated by commas. The result is not displayed automatically. To see the dataset, type its name in the R console as follows:

> data1    # typing the object name displays its contents
[1] 3 5 7 5 3 2 6 8 5 6 9

R supports different types of data, and all of them can be imported for computation.

Existing data objects can be combined with new values to make new objects, simply by including them as if they were values themselves. In the following example, we take the numerical sample made earlier and incorporate it into a larger sample.

> data2 = c(data1, 4, 5, 7, 3, 4)
> data2    # displays the contents of object data2

The output is displayed as below:

[1] 3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4
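
One way to confirm that c() has flattened the old object and the new values into a single vector is to check its length. The sketch below reuses data1 and data2 from the text; the length() and is.vector() checks are an illustrative addition, not part of the original walkthrough.

```r
# Recreate the two samples from the text
# (<- is R's usual assignment operator; = also works at the console)
data1 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)
data2 <- c(data1, 4, 5, 7, 3, 4)

length(data1)      # number of values in the original sample
length(data2)      # original values plus the five new ones
is.vector(data2)   # the result is one flat vector, not a nested object
```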

c. Entering Text Items as Data

Data that is not numerical is distinguished from numbers by enclosing it in quotes. There is no difference between using single and double quotes; R displays them all as double quotes. Either or both can be used as long as the surrounding quotes for any single item match.

Just as numerical data can be entered, text values can also be entered and manipulated in R.

> day1 = c('Mon', 'Tue', 'Wed', 'Thu')
> day1

This displays the contents of day1 as below:

[1] "Mon" "Tue" "Wed" "Thu"

Text data can be joined in the same way as numerical data, as shown below:

> day1 = c(day1, 'Fri')
> day1

This displays the updated contents of object day1:

[1] "Mon" "Tue" "Wed" "Thu" "Fri"

When text and numbers are combined, the entire data object becomes a text variable and the numbers are also converted to text.
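
The conversion described above can be seen with a small mixed vector. This is a minimal sketch; the object names mixed and nums are hypothetical, and as.numeric() is shown as one possible way to convert back, not as a step from the original text.

```r
# Mixing numbers and text: c() converts every element to character
mixed <- c(1, 2, 'three')
mixed          # the numbers are now displayed in quotes
class(mixed)   # reports the type of the whole vector

# Converting back: text that is not a number becomes NA (R also issues a warning,
# which is suppressed here to keep the output short)
nums <- suppressWarnings(as.numeric(mixed))
nums
```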

The c() command is a quick way of getting a series of values stored in a data object. This command is useful when the samples are small, but it can be tedious when a lot of typing is involved.

d. Using the scan() Command

When using the c() command, you may find typing all the commas to separate the values a little tedious. Instead, you can use the scan() command to do the same job, but without the commas. Besides entering numbers and text at the keyboard, scan() can be used with the clipboard and to read data from files.

Unlike the c() command, the scan() command is called with empty parentheses. The command then prompts you to enter the desired data. The entered data can be stored in a new variable.

Let us see this with the help of an example:

> file_name = scan()    # R prompts for values; enter a blank line to finish

You can also use the scan() command to enter text into datasets. Simply entering the items in quotes will generate an error message, because scan() expects numbers by default. The modified syntax for entering text as data is as follows:

> day1 = scan(what = 'character')
1: Mon Tue Wed Thu Fri
6:
Read 5 items
> day1
[1] "Mon" "Tue" "Wed" "Thu" "Fri"

Note: scan() accepts the following arguments, among others:

file: the name of a file to read from

what: the type of data to read, including logical, integer, numeric, complex, character, raw

In R, the user must specify that the items entered are characters, and not numbers. To do so, the what = 'character' part must be added.
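
Because scan() normally waits for keyboard input, it is awkward to show in a script. The sketch below assumes that the text argument of scan(), which reads from a character string instead of the console, is an acceptable stand-in for typing at the prompt. The names nums and days are hypothetical, and quiet = TRUE only hides the "Read n items" message.

```r
# text = supplies the input that would otherwise be typed at the prompt
nums <- scan(text = '3 5 7 5 3', quiet = TRUE)
nums

# For text items, what = 'character' tells scan() not to expect numbers
days <- scan(text = 'Mon Tue Wed Thu Fri', what = 'character', quiet = TRUE)
days
```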

e. Using the Clipboard to Make Data

Another way of importing data interactively into R is to use the Clipboard to copy and paste data.

The scan() command can be used together with other programs, such as a spreadsheet, to enter data into R.

The steps to import data are:

  1. If the spreadsheet data is in the form of numbers, type the scan() command in R as usual before switching to the spreadsheet containing the data.
  2. Highlight the necessary cells in the spreadsheet and copy them to the clipboard.
  3. Then return to R and paste the data from the clipboard into R. As usual, R waits until a blank line is entered before ending the data entry so you can continue to copy and paste more data as required.
  4. Enter a blank line to complete data entry.

If the data is text, add the what = 'character' instruction to the scan() command. If the file can be opened in a spreadsheet, proceed with the aforementioned four steps. If the file opens in a text editor or word processor, see how the data items are separated before continuing.

If the data is separated by simple spaces, simply copy and paste. If the data is separated by some other character, R needs to be told which character is used as the separator.

f. Using scan() to Retrieve Data from a CSV File

The scan() command can be used to read comma-separated values, such as those taken from a CSV file, as follows:

> File_Name = scan(sep = ',')    # sep tells scan() which character separates the values
1: 23,17,12.5,11,17,12,14.5,9

After the first eight values are entered, R prompts for the next item, and more values can be added:

9: 11,9,12.5,14.5,17,8,21

The separator must be enclosed in quotes. Enter a blank line to finish the data entry.
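
The effect of sep can also be checked without typing at the prompt. This is a minimal sketch that reuses the first line of values from the example; the text argument and the names csv_line and values are assumptions made for illustration.

```r
# One line of comma-separated values, as it might appear in a CSV file
csv_line <- '23,17,12.5,11,17,12,14.5,9'

# sep = ',' tells scan() to split on commas rather than on spaces
values <- scan(text = csv_line, sep = ',', quiet = TRUE)
values
length(values)   # one element per comma-separated item
```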

g. Reading a File of Data from a Disk

The scan() command can also be used to read a data file stored on disk.

scan() can read data into a vector or list from the console or from a file. To read a file with the scan() command, simply add file = 'filename' to the command as shown below:

> Object_Name = scan(file = 'File_Name.txt')

The filename must be enclosed in quotes.
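
To try this without an existing data file, a small file can be written first and then read back. In the sketch below, the temporary file created with tempfile() is a hypothetical stand-in for File_Name.txt, and the values are taken from the example output shown later in this section.

```r
# Write a small example file to a temporary location
path <- tempfile(fileext = '.txt')
writeLines('23 17 12.5 11 17 12 14.5 9 11', path)

# Read it back into a numeric vector
Object_Name <- scan(file = path, quiet = TRUE)
Object_Name

unlink(path)   # remove the temporary file
```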

R looks for the data file in the current working directory. To find the current working directory, use the getwd() command as below:

> getwd()

This shows the current working directory as below:

[1] "C:/Documents and Settings/Administrator/My Documents"

The directories listed in the example are separated by forward slashes. The backslash character is not used, because R treats it as an escape character inside quoted strings.

The working directory can be changed in R. If you want to load files from a particular directory by typing just their names, it is easier to make that directory the working directory. The change lasts until the end of the R session or the next setwd() call. The directory can be changed using the setwd() command as below:

> setwd('Desktop')
> getwd()

This changes the current working directory to Desktop and displays the new directory as below:

[1] "/Users/markgardener/Desktop"
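
When changing the working directory in a script, it can help to remember the previous one so that it can be restored. The following is a sketch under the assumption that a temporary directory is a reasonable stand-in for a folder such as Desktop; old_dir, new_dir and previous are hypothetical names.

```r
old_dir <- getwd()          # remember where we started
new_dir <- tempdir()        # stand-in for a folder such as Desktop

previous <- setwd(new_dir)  # setwd() invisibly returns the directory it replaced
getwd()                     # now reports the new working directory

setwd(old_dir)              # restore the original working directory
```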

In the Windows and Mac operating systems, there is an alternative method that enables file selection. The instruction file.choose() can be included as part of the scan() command. This opens a browser-type window where users can navigate and select the file to read.

> Object_Name = scan(file.choose())
> Object_Name

The output is displayed as below:

[1] 23.0 17.0 12.5 11.0 17.0 12.0 14.5 9.0 11.0

On Linux, the file.choose() instruction may not open a graphical file browser; in a plain terminal session it asks for the file name to be typed instead. With file.choose(), files from different directories can be selected without having to alter the working directory or type the names in full.

h. Reading Bigger Data Files

Let us now see how to read bigger data files in R:

The scan() command is helpful for reading simple vectors. Larger datasets, which often contain several columns, are more likely to be stored in a spreadsheet than typed in directly. R provides the means to read data stored in a range of text formats, all of which a spreadsheet can create.

  • Command to read from CSV files: read.csv() or read.csv2()
  • Command to read from tables: read.table()
  • Command to read from tab-separated value files: read.delim()

The difference between read.csv() and read.csv2() lies in the file format each expects. The former is used when the values in your data file are separated by ',' and the decimal mark is '.', while the latter is used when the separator is ';' and the decimal mark is ','.
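
The three readers can be compared on the same small table. This is a minimal sketch: the text argument stands in for a file on disk, and the column names x and y and the values are made up for illustration.

```r
# The same two rows written three ways
comma <- read.csv(text = 'x,y\n1.5,2\n3.5,4')         # , separator, . decimal mark
semi  <- read.csv2(text = 'x;y\n1,5;2\n3,5;4')        # ; separator, , decimal mark
tab   <- read.delim(text = 'x\ty\n1.5\t2\n3.5\t4')    # tab separator

comma
all.equal(comma, semi)   # compares the two data frames
all.equal(comma, tab)
```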

i. Missing Values in Data Files

In the real world, samples are often of unequal size. Let us see how R handles the resulting missing values in data files.

Let us consider two samples, mow and unmow.

The mow sample contains five values, whereas the unmow sample contains four values. When this data is read into R from a spreadsheet or text file, the program recognizes multiple columns of data and sets them accordingly.

R converts data into a neat rectangular item and fills in any gaps with NA.

NOTE: NA is a special value in its own right. It stands for "Not Available" and marks a missing entry.

For example:

> Grass = read.csv(file.choose())
> Grass
  mow unmow
1  12     8
2  15     9
3  17     7
4  11     9
5  15    NA

The dataset has been called Grass, and R has filled in the gap by using NA.

R always pads out the shorter samples by using NA to produce a rectangular object. This is called a data frame in R. The R data frame is an important kind of object because it is used so often in statistical data manipulation.
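
The padding can be reproduced without a file chooser. In this sketch the text argument supplies hypothetical file contents that match the table above, with the last unmow cell left empty; the is.na() and mean() calls are added to illustrate how NA behaves in calculations and are not part of the original example.

```r
# File contents matching the Grass table; the final unmow value is missing
Grass <- read.csv(text = 'mow,unmow\n12,8\n15,9\n17,7\n11,9\n15,')
Grass

is.na(Grass$unmow)                # flags the padded cell
mean(Grass$unmow)                 # missing values propagate by default
mean(Grass$unmow, na.rm = TRUE)   # average of the four recorded values only
```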
