Sunday, April 24, 2016

R: How to Import Data

For CSV 

The utils package provides a function that reads a file in CSV format and creates a data frame from it:
read.csv(file)

# Import swimming_pools.csv and name the result pools
pools <- read.csv("swimming_pools.csv")

Be careful! To import strings as characters rather than as factors, the stringsAsFactors argument must be set to FALSE. Set it to TRUE only when the strings you import represent categorical variables. (In R versions before 4.0.0 the default is TRUE; from 4.0.0 onward it is FALSE.)

pools <- read.csv("swimming_pools.csv", stringsAsFactors = TRUE)
str(pools)
'data.frame': 20 obs. of  4 variables:
 $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
 $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...

pools <- read.csv("swimming_pools.csv", stringsAsFactors = FALSE)
str(pools)
'data.frame': 20 obs. of  4 variables:
 $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
 $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...
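
The effect of stringsAsFactors can be checked without the original file. The following is a minimal sketch that writes a small, made-up CSV to a temporary file (the rows and the names as_factor and as_text are assumptions for illustration, not part of swimming_pools.csv) and compares the column classes:

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Latitude",
             "Pool A,-27.6",
             "Pool B,-27.5"), tmp)

# Same file, imported twice with different stringsAsFactors settings.
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
as_text   <- read.csv(tmp, stringsAsFactors = FALSE)

class(as_factor$Name)    # "factor"
class(as_text$Name)      # "character"
class(as_text$Latitude)  # "numeric" (numbers are unaffected by the argument)
```

Only the string column changes class; numeric columns are read the same way in both cases.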

For TXT

Tab-delimited text files are imported with another function, read.delim:
read.delim(file, header = TRUE, sep = "\t")

header = TRUE (the first row contains the field names)
sep = "\t" (fields in a record are delimited by tabs)

# Import hotdogs.txt and name the result hotdogs
hotdogs <- read.delim("hotdogs.txt", header = FALSE, sep = "\t")

or use read.table, which is especially useful for dealing with more exotic file formats:
hotdogs <- read.table("hotdogs.txt", header = FALSE, sep = "\t")
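
For a simple tab-delimited file the two calls give the same result. Here is a small sketch, assuming a made-up three-row file written to a temporary location (the values and the names via_delim and via_table are illustrative only, not taken from hotdogs.txt):

```r
# Create a small tab-delimited file with no header row (made-up values).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef\t100\t400",
             "Meat\t110\t410",
             "Poultry\t90\t390"), tmp)

# Read the same file with both functions.
via_delim <- read.delim(tmp, header = FALSE, sep = "\t")
via_table <- read.table(tmp, header = FALSE, sep = "\t")

names(via_delim)                  # "V1" "V2" "V3"
all.equal(via_delim, via_table)   # TRUE for this simple file
```

read.delim is a convenience wrapper around read.table with defaults suited to tab-delimited files, so read.table is the one to reach for when the separator, quoting, or comment characters differ.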

a) The column names can also be changed by adding col.names

> hotdogs <- read.delim("hotdogs.txt", header = FALSE)
> names(hotdogs)
[1] "V1" "V2" "V3"
> hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))
> names(hotdogs)
[1] "type"     "calories" "sodium"  

b) The column types can also be changed by adding colClasses. A class of "NULL" skips that column.

hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))
# Display structure of hotdogs
str(hotdogs)
'data.frame': 54 obs. of  3 variables:
 $ type    : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ calories: int  186 181 176 149 184 190 158 139 175 148 ...
 $ sodium  : int  495 477 425 322 482 587 370 322 479 375 ...
hotdogs <- read.delim("hotdogs.txt", header = FALSE, 
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))
# Display structure of hotdogs
str(hotdogs)
'data.frame': 54 obs. of  2 variables:
 $ type  : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sodium: num  495 477 425 322 482 587 370 322 479 375 ...
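
The combination of col.names and colClasses can be tried on a throwaway file. This is a minimal sketch under the assumption of a made-up three-row file (the values and the name hd are hypothetical, not from hotdogs.txt); it keeps the first column as character, drops the second, and reads the third as numeric:

```r
# Made-up tab-delimited data: type, calories, sodium (no header row).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef\t100\t400",
             "Meat\t110\t410",
             "Poultry\t90\t390"), tmp)

# "NULL" in colClasses tells read.delim to skip that column entirely.
hd <- read.delim(tmp, header = FALSE,
                 col.names  = c("type", "calories", "sodium"),
                 colClasses = c("character", "NULL", "numeric"))

names(hd)         # "type" "sodium"
class(hd$sodium)  # "numeric"
```

Skipping columns at import time avoids reading data that would only be dropped afterwards.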




Saturday, April 16, 2016

2015 Flight (in AUD)

Plot One


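library(ggplot2)    # qplot, theme_set, ggtitle, ...
library(gridExtra)  # grid.arrange

theme_set(theme_minimal(12))

# Fare per person by booking subclass, zoomed in to fares up to AUD 15000
plot1 <- qplot(x = Bkg_Subclass, y = farePerPerson,
               data = cleared_data, geom = 'boxplot',
               color = Bkg_Subclass) +
  coord_cartesian(ylim = c(0, 15000)) +
  ggtitle('by Booking Subclass') +
  xlab('Booking Subclass') +
  ylab('Fares (in AUD)') +
  theme(legend.position = 'none')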

plot2 <- qplot(Bkg_Subclass, data = cleared_data, fill = Season ) +
  ggtitle('Number of bookings by booking subclass') +
  xlab('Booking Subclass') +
  ylab('Number of bookings')

grid.arrange(plot2, plot1, ncol = 1)

Description One

Passengers prefer to buy the cheapest tickets; however, the median net revenue for those tickets is also the lowest. Season 2 makes up the largest share of bookings in almost every booking subclass, so passengers are most likely to travel in Season 2.

Plot Two



theme_set(theme_minimal(12))

plot1 <- qplot(x = month_pnr_create, y = farePerPerson, 
               data = cleared_data, geom = 'boxplot',
               fill = month_pnr_create) +
  coord_cartesian(ylim = c(0, 2500)) +
  ggtitle('by month(create booking)') +
  xlab('Month(Create booking)') +
  ylab('Fares (in AUD)') +
  theme(legend.position = 'none')

plot2 <- qplot(farePerPerson, data = cleared_data, binwidth = 400,
               color = month_pnr_create, geom = 'density') +
  coord_cartesian(xlim = c(0, 2000)) +
  guides(color = guide_legend(title = 'Month(create booking)', reverse = FALSE)) +
  xlab('Fare/person (AUD)') +
  ylab('Density') +
  ggtitle('Density of Fare/person (AUD) by Month(create booking)')

grid.arrange(plot1, plot2, ncol = 1)

Description Two

These plots show the relationship between the month in which a booking was created and the fare. In peak periods the fare increases and its variance is higher. The mean fare is under AUD 1000 throughout the year (except in January), and all of the fare distributions are positively skewed.
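
Positive skew can also be checked numerically rather than read off a density plot. The sketch below uses a short vector of hypothetical fares (not values from the data set) and a moment-based skewness formula; a mean above the median and a skewness above zero both point to a long right tail:

```r
# Hypothetical fares per person in AUD (illustrative values, not from the data set).
fares <- c(300, 350, 400, 450, 500, 600, 800, 2500)

fare_mean   <- mean(fares)
fare_median <- median(fares)

# Moment-based sample skewness: third central moment divided by the
# cubed (population) standard deviation.
centered <- fares - fare_mean
skewness <- mean(centered^3) / mean(centered^2)^1.5

fare_mean > fare_median  # TRUE: the long right tail pulls the mean up
skewness > 0             # TRUE: positive (right) skew
```

The same check could be applied per month to the real farePerPerson column, assuming it contains no missing values.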

Plot Three


theme_set(theme_minimal(16))
qplot(x=netPerPerson, y=farePerPerson, 
      color=Bkg_Subclass, data=remove_NA_net)+
  geom_point(alpha = 0.5, position = 'jitter') +
  ggtitle('Fare by Net revenue and booking subclass') +
  theme(plot.title = element_text(size = 16))

Description Three

This graph shows the relationship between fare and net revenue by booking subclass. Clearly, the higher the fare, the higher the net revenue. There are also two distinct lines on the graph, which suggests that at least two formulas are used to calculate the fare from net revenue in different situations.

Reflection

The data set contains booking information on almost 90 thousand transactions from around 2015. I started by understanding the individual variables in the data set and created a linear model to predict the net revenue of a ticket. I then explored interesting questions and leads as I continued to make observations on the plots. Eventually, I explored the number of passengers and the net revenue and fare per passenger across many variables.

At first, I wondered why so many bookings were created in February and October; the reason is that tickets in these two months are the cheapest of the year. It is also easy to understand that bookings are mainly in classes V/L/S/M/K, which are concentrated in economy class and have the lowest median fare. There is also strong demand for travel originating from Hong Kong, which benefited from the weakness of the euro and the Australian dollar in the first half of 2015. This reflects strong demand on regional routes, particularly in economy class. There was also strong economy class demand on long-haul routes. For this reason, I strongly recommend increasing flights to the popular destinations over the peak period. Using larger aircraft, such as the Boeing 777-300ER, on the popular daily flights would also increase capacity.