- Complete the data analysis required by the specification.
Write up your analysis using your favourite word processing/typesetting program, making sure that all of the working is shown and that it is presented well.
Identify what the public associates with the company name. The company wants four pieces of analysis to be performed.
What do the results of the test tell us about the company tweets and the random sample of tweets?
- Compute the proportion of company tweets in each cluster.
What do these results tell us about the tweet topics from the company and from the public about the company?
Identify any problems with the analytical process used in each part and how the results may have been affected by these problems (do not include programming problems).
Top 10 words from the random sample
The given random sample of tweets was loaded and the top 10 words by frequency were obtained by first constructing a term-document matrix and then ranking the words according to their row sums in that matrix. We note that the text was first cleaned of symbols such as |, / and @, numbers, extra spaces and common English stopwords like 'the', 'we', etc. The R code used was:
# load the library "tm"
library(tm)
# load the csv file
randomSample=read.csv("/home/prajnan/Downloads/randomSample1.csv",header=T)
# create a corpus from the file, i.e. the data frame
sam=Corpus(DataframeSource(randomSample))
# define a function that converts a given pattern to a space
# (fixed = TRUE matches the pattern literally, so "|" is not read as regex alternation)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x, fixed = TRUE))
# convert '|' to space
sam <- tm_map(sam, toSpace, "|")
# convert '/' to space
sam <- tm_map(sam, toSpace, "/")
# convert '@' to space
sam <- tm_map(sam, toSpace, "@")
# convert text to lower case
sam <- tm_map(sam, content_transformer(tolower))
# remove numbers
sam <- tm_map(sam, removeNumbers)
# remove common stopwords
sam <- tm_map(sam, removeWords, stopwords("english"))
# remove punctuation marks
sam <- tm_map(sam, removePunctuation)
# finally remove the extra white space
sam <- tm_map(sam, stripWhitespace)
# construct the term-document matrix for the random sample
dtm=TermDocumentMatrix(sam)
# hold the matrix in a variable
m=as.matrix(dtm, sparse=TRUE)
# calculate the row sums of the matrix and sort them in decreasing order
v <- sort(rowSums(m),decreasing=TRUE)
# construct a table that ranks words according to their frequency
d <- data.frame(word = names(v),freq=v)
# the 10 most frequent words
head(d, 10)
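A point worth illustrating about the symbol-to-space step: gsub reads its pattern as a regular expression by default, and in a regular expression `|` means alternation rather than a literal pipe character. The following is a minimal base R sketch, on made-up strings and with a hypothetical helper `to_space`, of two ways to match the symbols literally.

```r
# Toy input, not taken from the assignment data.
x <- c("cars|trucks", "a/b @user")

# Hypothetical helper: fixed = TRUE treats the pattern as a literal string.
to_space <- function(text, pattern) gsub(pattern, " ", text, fixed = TRUE)

cleaned <- to_space(x, "|")
cleaned <- to_space(cleaned, "/")
cleaned <- to_space(cleaned, "@")

# The escaped regular expression "\\|" is another way to match a literal pipe.
escaped <- gsub("\\|", " ", x)

print(cleaned)
print(escaped)
```

Either form can be passed through content_transformer in the same way as the toSpace function in the listing.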
Since the given random sample was very large, with 6990 rows, we split the given csv file into two parts consisting of 3995 rows and performed the analysis on the two parts separately. On inspecting the term-document matrix, we find that it is highly sparse, i.e. most of its entries are zero. This is why we set 'sparse=TRUE' when defining the matrix variable from the term-document matrix.
The outputs obtained in the two cases are:
[Figure: for q1(1).png]
Thus, by comparing the two outputs, we can say that the words 'http', 'tco',
'just', 'https', 'new', 'like', 'via', 'get', 'can' and 'time', in that order, are the top ten
words. Three of the five most frequent terms ('http', 'tco' and 'https') are
fragments of web links rather than ordinary words, while the remaining terms are
common words used in everyday conversation.
[Figure: for q1(2).png]
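The ranking step described in this section can be illustrated without the tm package. The sketch below uses base R only, on three made-up tweets and a deliberately tiny stopword list (both are assumptions for illustration, not the assignment data), to build a term-document matrix, rank terms by their row sums and measure how sparse the matrix is.

```r
# Toy tweets standing in for the random sample (hypothetical data).
tweets <- c("New car via http link",
            "I like the new car",
            "Get the car just in time")

# A tiny stopword list for illustration; stopwords("english") in tm is much longer.
stop_words <- c("the", "i", "in")

# Lower-case, replace non-letters with spaces, and split each tweet into words.
tokens <- strsplit(gsub("[^a-z ]", " ", tolower(tweets)), " +")
tokens <- lapply(tokens, function(w) w[nzchar(w) & !(w %in% stop_words)])

# Term-document matrix: one row per term, one column per tweet.
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(w) tabulate(match(w, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms

# Rank terms by total frequency (row sums), as in the report.
freq <- sort(rowSums(tdm), decreasing = TRUE)
top <- data.frame(word = names(freq), freq = freq, row.names = NULL)
print(head(top, 3))

# Sparsity: the share of zero entries in the matrix.
sparsity <- mean(tdm == 0)
print(sparsity)
```

Even on this toy input more than half of the entries are zero, which is the same pattern the report observes on the full sample.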
The company chosen for the analysis is MercedesBenz. We collect the tweets about the company using the 'twitteR' package. The R program written is:
library(httr)
library(devtools)
library(httpuv)
library(twitteR)
# authorizing with twitter
setup_twitter_oauth("api_key","api_secret", access_token=NULL,access_secret=NULL)
# searching twitter
tw=searchTwitter("MercedesBenz",n=1000,lang="en")
# convert the output to a data frame
df <- do.call("rbind", lapply(tw, as.data.frame))
# write the data frame to a csv file
write.csv(df,file="AboutCompanyTweets.csv")
Comparison of tweet topics between the company and the public
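The vector of term frequencies for the tweets about the company was formed by taking the row sums of their term-document matrix, and it was combined column-wise with each of the vectors obtained in the previous part to form a matrix on which the chi-square test was performed. The R code used was: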
library(httr)
library(tm)
randomSample=read.csv("/home/prajnan/Downloads/randomSample1.csv",header=T)
sam=Corpus(DataframeSource(randomSample))
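# fixed = TRUE matches the pattern literally, so "|" is not read as regex alternation
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x, fixed = TRUE))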
sam <- tm_map(sam, toSpace, "|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
dtm=TermDocumentMatrix(sam)
m=as.matrix(dtm, sparse=TRUE)
# forming the first vector
v1 <- sort(rowSums(m),decreasing=TRUE)
randomSample=read.csv("/home/prajnan/Downloads/randomSample2016.csv",header=T)
sam=Corpus(DataframeSource(randomSample))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x, fixed = TRUE))
sam <- tm_map(sam, toSpace, "|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
dtm=TermDocumentMatrix(sam)
m=as.matrix(dtm, sparse=TRUE)
# forming the second vector
v <- sort(rowSums(m),decreasing=TRUE)
a=read.csv("/home/prajnan/AboutCompanyTweets.csv")
sam=Corpus(DataframeSource(a))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x, fixed = TRUE))
sam <- tm_map(sam, toSpace, "|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
sam <- tm_map(sam, stemDocument)
dtm=TermDocumentMatrix(sam)
dtm <- removeSparseTerms(dtm, sparse=0.95)
m2 <- as.matrix(dtm)
# forming the third vector (tweets about the company)
w=sort(rowSums(m2),decreasing=TRUE)
# combining the second and third vectors
Mat=cbind(v,w)
# combining the first and third vectors
Mat1=cbind(v1,w)
# performing the chi-square tests separately
chisq.test(Mat, correct=TRUE)
chisq.test(Mat1, correct=TRUE)
The output of the above code is shown in the figures below.
[Figure: square test.png]
[Figure: square-2.png]
The p-value for the given χ² statistic is very low in both cases, i.e. less than 2.2 × 10^-16. At the usual significance levels this leads us to reject the null hypothesis that word frequencies are distributed in the same way in both samples; that is, the tweets about the company differ significantly in word usage from the random tweets.
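One caveat about building the test matrix: cbind pairs entries by position, not by term name, so two separately sorted frequency vectors need to be aligned on a shared list of terms before they are compared. The sketch below shows one way this could be done in base R. The counts and the helper `align` are hypothetical and are not the report's data or method.

```r
# Hypothetical term counts for two samples (not the report's data).
random_freq  <- c(http = 120, new = 60, like = 45, time = 30)
company_freq <- c(http = 80, car = 70, new = 25, benz = 40)

# Union of terms; a term missing from one sample gets a count of zero.
terms <- union(names(random_freq), names(company_freq))
align <- function(freq, terms) {
  out <- freq[terms]
  out[is.na(out)] <- 0
  names(out) <- terms
  out
}
tab <- cbind(random = align(random_freq, terms),
             company = align(company_freq, terms))

# Chi-square test on the terms-by-sample table.
# correct = TRUE only has an effect on 2 x 2 tables, so it is omitted here.
result <- chisq.test(tab)
print(tab)
print(result$p.value)
```

Each row of `tab` now refers to the same term in both columns, so a small p-value can be read as a difference in how the two samples use those terms.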
The userTimeline function in R was used to obtain the latest tweets from the company. These tweets were combined with the tweets about the company obtained in part 2, and the resulting file was then clustered using the k-means algorithm with k = 2. The R code used was:
library(httr)
library(devtools)
library(httpuv)
library(twitteR)
setup_twitter_oauth("api_key","api_secret", access_token=NULL,access_secret=NULL)
tw1=userTimeline("MercedesBenz",n=1000)
df <- do.call("rbind", lapply(tw1, as.data.frame))
write.csv(df,file="FromCompanyTweets.csv")
library(httr)
library(tm)
a=read.csv("/home/prajnan/AboutCompanyTweets.csv")
# the file name must match the one written by write.csv above
b=read.csv("/home/prajnan/FromCompanyTweets.csv")
# combining the tweets from and about the company
d=rbind(a,b)
write.csv(d,file="d.csv")
randomSample=read.csv("/home/prajnan/d.csv",header=T)
sam=Corpus(DataframeSource(randomSample))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x, fixed = TRUE))
sam <- tm_map(sam, toSpace, "|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
sam <- tm_map(sam, stemDocument)
dtm=TermDocumentMatrix(sam)
dtm <- removeSparseTerms(dtm, sparse=0.95)
m2 <- as.matrix(dtm)
# transposing the matrix so that each row is a document
m3 <- t(m2)
# setting the seed
set.seed(122)
# setting k
k <- 2
# performing k-means clustering
kmeansResult <- kmeans(m3, k)
# printing the cluster centres
round(kmeansResult$centers, digits = 3)
The result of the clustering is shown in the figure below.
[Figure: fr part 3.png]
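The clustering results show that 'fals' and 'true' are the main terms, with average frequencies of 4 and 0, and 3 and 1, respectively. These terms are probably the stemmed logical values FALSE and TRUE from the metadata columns of the tweet data frame rather than words from the tweet text, which limits what the clusters can tell us about tweet topics.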
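The terms 'fals' and 'true' look like stems of the logical values stored alongside each tweet. The following base R sketch, on a made-up data frame whose column names are assumed to follow the usual twitteR output, illustrates how such values can leak into the text when every column is treated as content, and how restricting the input to the text column avoids it.

```r
# Hypothetical stand-in for the tweet data frame; column names are assumptions.
tweets_df <- data.frame(
  text = c("Love the new MercedesBenz", "MercedesBenz unveils concept car"),
  favorited = c(FALSE, FALSE),
  isRetweet = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Pasting every column together mimics treating the whole row as one
# document, so the logical values become part of the text.
all_columns <- tolower(do.call(paste, tweets_df))

# Restricting the input to the text column keeps only the tweet content.
text_only <- tolower(tweets_df$text)

print(all_columns)
print(text_only)
```

Building the corpus from the text column alone would be one way to keep metadata out of the clustering.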
The results above suggest that MercedesBenz is not especially popular among the general public, and that there are very few words that the company and the public both use frequently.
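The specification also asks for the proportion of company tweets in each cluster. A minimal sketch of that calculation in base R follows. The document-term counts, the `source_label` vector and the fixed starting centres are all assumptions chosen to make the toy example reproducible; they are not the report's data.

```r
# Toy document-term matrix: one row per tweet, one column per term (hypothetical counts).
dtm <- rbind(
  c(3, 0), c(4, 0), c(3, 1), c(4, 1),   # tweets from the company
  c(0, 3), c(1, 4), c(0, 4), c(1, 3)    # tweets about the company
)
colnames(dtm) <- c("car", "price")
source_label <- rep(c("company", "public"), each = 4)

# Rows 1 and 5 are used as starting centres so the result does not depend on a random start.
fit <- kmeans(dtm, centers = dtm[c(1, 5), ])

# Cross-tabulate cluster membership against the tweet source.
counts <- table(cluster = fit$cluster, source = source_label)

# Proportion of each cluster's tweets that come from the company.
company_share <- prop.table(counts, margin = 1)[, "company"]
print(counts)
print(company_share)
```

A cluster whose company share is close to 0 or 1 would indicate that the company and the public tweet about different topics, while shares near the overall company proportion would indicate overlapping topics.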