Analysis of Random Sample and MercedesBenz Tweets

Complete the data analysis required by the specification.

Write up your analysis using your favourite word processing/typesetting program, making sure that all of the working is shown and that it is presented well.

Identify what the public associates with the company name. The company wants the four pieces of analysis to be performed.

What do the results of the test tell us about the company tweets and the random sample of tweets?

Compute the proportion of company tweets in each cluster.

What do these results tell us about the tweet topics from the company and from the public about the company?

Identify any problems with the analytical process used in each part and how the results may have been affected by these problems (do not include programming problems).

Top 10 words from the random sample

The given random sample of tweets was loaded and the top 10 words by frequency were obtained by first constructing a term-document matrix and then ranking the words according to their row sums in that matrix. We note that the text was cleaned of symbols such as /, | and @, numbers, spaces and common stopwords in English like ‘the‘, ‘we‘, etc. The R code used was:

# load the library "tm"
library(tm)
# load the csv file
randomSample=read.csv("/home/prajnan/Downloads/randomSample1.csv",header=T)
# create a corpus from the file, i.e. the data frame
sam=Corpus(DataframeSource(randomSample))
# define a function that converts a pattern to spaces
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
# convert | to space (the pipe is escaped because it is a regex operator)
sam <- tm_map(sam, toSpace, "\\|")
# convert '/' to space
sam <- tm_map(sam, toSpace, "/")
# convert '@' to space
sam <- tm_map(sam, toSpace, "@")
# convert text to lower case
sam <- tm_map(sam, content_transformer(tolower))
# remove numbers
sam <- tm_map(sam, removeNumbers)
# remove common stopwords
sam <- tm_map(sam, removeWords, stopwords("english"))
# remove punctuation marks
sam <- tm_map(sam, removePunctuation)
# finally remove the extra white space
sam <- tm_map(sam, stripWhitespace)
# construct the term-document matrix for the random sample
dtm=TermDocumentMatrix(sam)
# hold the matrix in a variable
m=as.matrix(dtm, sparse=TRUE)
# calculate the row sums of the matrix (total count per word)
v <- sort(rowSums(m),decreasing=TRUE)
# construct a table that ranks words according to their frequency
d <- data.frame(word = names(v),freq=v)
# the 10 highest-frequency words
head(d, 10)
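To make the ranking step concrete, the following is a minimal base-R sketch of what the pipeline above computes, without the tm package. The tweets and the short stopword list are hypothetical stand-ins, not the data or the stopword list used in this report.

```r
# Toy tweets standing in for the random sample (hypothetical data).
tweets <- c("Just got a NEW phone http://t.co/abc",
            "new day, new time",
            "I like this new song")

# A tiny illustrative stopword list; stopwords("english") in tm is much longer.
stops <- c("a", "i", "this", "got")

# Lower-case the text, replace everything except letters with spaces, then split.
clean <- gsub("[^a-z ]", " ", tolower(tweets))
words <- unlist(strsplit(clean, " +"))
words <- words[words != "" & !(words %in% stops)]

# Count each word and rank by frequency, as rowSums() and sort() do on the matrix.
freq <- sort(table(words), decreasing = TRUE)
head(freq, 10)
```

Note that the URL in the first toy tweet is broken into the fragments http, t, co and abc, which is the same effect that puts http and tco among the top words of the real sample.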

Since the random sample given was a very large one, having 6990 rows, we split the given csv file into two parts consisting of 3995 rows and performed the analysis on the two parts separately. On inspecting the term-document matrix, we find that the matrix is highly sparse, i.e., most of its entries are zero. This is why we set ‘sparse=TRUE‘ when we define the matrix variable from the term-document matrix.
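When the file is processed in two parts, the per-part word totals can be added together instead of being compared by eye. The sketch below assumes each part yields a named vector of counts like the output of rowSums(m); the numbers and the helper merge_counts are hypothetical.

```r
# Hypothetical per-part word totals (named numeric vectors, like rowSums(m)).
part1 <- c(http = 50, new = 20, just = 15)
part2 <- c(http = 40, like = 18, new = 12)

# Add the counts word by word; a word missing from one part counts as zero there.
merge_counts <- function(a, b) {
  terms <- union(names(a), names(b))
  total <- setNames(numeric(length(terms)), terms)
  total[names(a)] <- total[names(a)] + a
  total[names(b)] <- total[names(b)] + b
  sort(total, decreasing = TRUE)
}

combined <- merge_counts(part1, part2)
head(combined, 10)
```

Matching on word names matters here: the two parts will generally not contain the same vocabulary in the same order.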

The outputs obtained in the two cases are:

[Figure: q1(1).png]

Thus, by comparing the two outputs, we can say that the words ‘http, tco, just, https, new, like, via, get, can and time‘, in that order, are the top ten words. The random sample therefore shows that several of the most frequently used words (http, tco, https) are web/internet related terms, while the remaining ones are common words used in everyday conversation.
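The terms http, https and tco are fragments of shortened links (t.co is Twitter's link shortener) rather than topic words. One possible refinement, shown here only as a sketch and not as part of the analysis reported above, is to strip URLs before any other cleaning. The helper strip_urls and the example tweets are hypothetical.

```r
# Replace anything that looks like a web link with a single space.
strip_urls <- function(x) {
  gsub("https?://[^[:space:]]+", " ", x)
}

# Hypothetical example tweets.
tweets <- c("New model out now https://t.co/abc123",
            "Just saw it via http://t.co/xyz")

cleaned <- strip_urls(tweets)
cleaned
```

In the tm pipeline such a function would have to be applied (for example through content_transformer) before the step that converts '/' to a space, because that step breaks the links apart.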

[Figure: q1(2).png]

The company chosen for the analysis is MercedesBenz. We analyse the tweets by using the ‘twitteR‘ package. The R program written is:

library(httr)
library(devtools)
library(httk)
library(httpuv)
library(twitteR)
# authorizing with twitter
setup_twitter_oauth("api_key","api_secret", access_token=NULL,access_secret=NULL)
# searching twitter
tw=searchTwitter("MercedesBenz",n=1000,lang="en")
# convert the output to a data frame
df <- do.call("rbind", lapply(tw, as.data.frame))
# write the data frame to a csv file
write.csv(df,file="AboutCompanyTweets.csv")

Comparison of tweet topics between the company and the public

A word-frequency vector for the tweets about the company was formed by taking the row sums of their term-document matrix, and it was combined with the vector obtained in the previous part to form a matrix on which the chi-square test was performed. The R code used was:

library(httr)
library(tm)
# first part of the random sample
randomSample=read.csv("/home/prajnan/Downloads/randomSample1.csv",header=T)
sam=Corpus(DataframeSource(randomSample))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
sam <- tm_map(sam, toSpace, "\\|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
dtm=TermDocumentMatrix(sam)
m=as.matrix(dtm, sparse=TRUE)
# forming the first vector
v1 <- sort(rowSums(m),decreasing=TRUE)

# second part of the random sample
randomSample=read.csv("/home/prajnan/Downloads/randomSample2016.csv",header=T)
sam=Corpus(DataframeSource(randomSample))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
sam <- tm_map(sam, toSpace, "\\|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
dtm=TermDocumentMatrix(sam)
m=as.matrix(dtm, sparse=TRUE)
# forming the second vector
v <- sort(rowSums(m),decreasing=TRUE)

# tweets about the company
a=read.csv("/home/prajnan/AboutCompanyTweets.csv")
sam=Corpus(DataframeSource(a))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
sam <- tm_map(sam, toSpace, "\\|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
# reduce words to their stems
sam <- tm_map(sam, stemDocument)
dtm=TermDocumentMatrix(sam)
# drop terms that are absent from more than 95% of the documents
dtm <- removeSparseTerms(dtm, sparse=0.95)
m2 <- as.matrix(dtm)
# forming the third vector
w=sort(rowSums(m2),decreasing=TRUE)

# combining the second and third vectors
Mat=cbind(v,w)
# combining the first and third vectors
Mat1=cbind(v1,w)
# performing the chi-square tests separately
chisq.test(Mat, correct=TRUE)
chisq.test(Mat1, correct=TRUE)

The output of the above code is shown in the figures below.

[Figure: square test.png]

[Figure: square-2.png]

The p-value for the given χ² statistic in both cases is very low, i.e. less than 2.2 × 10⁻¹⁶. At the usual significance levels this leads us to reject the null hypothesis that the two sets of tweets share the same word-frequency distribution; that is, the tweets about the company show no correspondence with the random tweets.
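A chi-square test on word counts is only meaningful when each row of the table refers to the same word in both columns. The following is a minimal sketch, not the code behind the figures above, of aligning two word-frequency vectors by name before testing. The counts and the helper align_counts are hypothetical.

```r
# Build a two-column contingency table with one row per word,
# filling in zero where a word does not occur in one of the sources.
align_counts <- function(a, b) {
  terms <- union(names(a), names(b))
  tab <- cbind(random = a[terms], company = b[terms])
  tab[is.na(tab)] <- 0
  rownames(tab) <- terms
  tab
}

# Hypothetical word totals for the random sample and the company tweets.
random_freq  <- c(http = 120, just = 60, new = 45, like = 40)
company_freq <- c(http = 80, mercedes = 150, new = 30, car = 70)

tab <- align_counts(random_freq, company_freq)
res <- chisq.test(tab)
res$p.value
```

In chisq.test the argument correct=TRUE applies a continuity correction only to 2 by 2 tables, so it has no effect on a table with one row per word.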

The userTimeline function in R was used to obtain the latest tweets from the company. These tweets were combined with the tweets about the company obtained in part 2, and the resulting file was then clustered using the k-means algorithm with k = 2. The R code used was:

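library(httr)
library(devtools)
library(httk)
library(httpuv)
library(twitteR)
# authorizing with twitter
setup_twitter_oauth("api_key","api_secret", access_token=NULL,access_secret=NULL)
# fetching the latest tweets from the company's own timeline
tw1=userTimeline("MercedesBenz",n=1000)
# convert the output to a data frame
df <- do.call("rbind", lapply(tw1, as.data.frame))
# write the data frame to a csv file
write.csv(df,file="FromCompanyTweets.csv")

library(httr)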

library(tm)
# tweets about the company and tweets from the company
a=read.csv("/home/prajnan/AboutCompanyTweets.csv")
b=read.csv("/home/prajnan/FromCompanyTweets.csv")
# combining the tweets from and about the company
d=rbind(a,b)
write.csv(d,file="d.csv")
randomSample=read.csv("/home/prajnan/d.csv",header=T)
sam=Corpus(DataframeSource(randomSample))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
sam <- tm_map(sam, toSpace, "\\|")
sam <- tm_map(sam, toSpace, "/")
sam <- tm_map(sam, toSpace, "@")
sam <- tm_map(sam, content_transformer(tolower))
sam <- tm_map(sam, removeNumbers)
sam <- tm_map(sam, removeWords, stopwords("english"))
sam <- tm_map(sam, removePunctuation)
sam <- tm_map(sam, stripWhitespace)
sam <- tm_map(sam, stemDocument)
dtm=TermDocumentMatrix(sam)
dtm <- removeSparseTerms(dtm, sparse=0.95)
m2 <- as.matrix(dtm)

# transposing the matrix so that rows are documents, as kmeans expects
m3 <- t(m2)
# setting the seed
set.seed(122)
# setting k
k <- 2
# performing k-means clustering
kmeansResult <- kmeans(m3, k)
# getting the cluster centres
round(kmeansResult$centers, digits = 3)

The result of the clustering is shown in the figure below.

[Figure: fr part 3.png]

The clustering results show that ‘fals‘ and ‘true‘ are the main terms, with average frequencies of 4 and 0 in one cluster and 3 and 1 in the other.

The results above suggest that MercedesBenz is not especially popular among the general public, and that there are very few words that the company and the public both commonly use.
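The proportion of company tweets in each cluster can be read from a cross-tabulation of cluster membership against the origin of each tweet. The sketch below uses a small hypothetical document-term matrix and a hypothetical origin label; in the pipeline above, such a label would have to be attached to the tweets in a and b before they are combined with rbind.

```r
set.seed(122)

# Hypothetical document-term counts: one row per tweet, one column per term.
m <- rbind(c(3, 0, 1), c(4, 0, 0), c(3, 1, 0),
           c(0, 3, 2), c(0, 4, 1), c(1, 3, 3))
colnames(m) <- c("car", "servic", "price")

# Hypothetical origin of each tweet (row) of m.
origin <- c("company", "company", "company", "public", "public", "public")

# Cluster the tweets into two groups; nstart repeats the random initialisation.
fit <- kmeans(m, centers = 2, nstart = 10)

# Cross-tabulate cluster against origin and convert each row to proportions.
counts <- table(cluster = fit$cluster, origin = origin)
props <- prop.table(counts, margin = 1)
round(props, 3)
```

Each row of props sums to 1, so the company column gives the share of company tweets within that cluster.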

