Text mining with tm.plugin.webmining package using GoogleFinanceSource function

I am studying text mining with the online book at http://tidytextmining.com/. In the fifth chapter (http://tidytextmining.com/dtm.html#financial),

the following code:

library(tm.plugin.webmining)
library(purrr)
library(dplyr)  # for mutate() and %>% (data_frame() is the older name for tibble())

company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
             "Twitter", "IBM", "Yahoo", "Netflix")
symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB", "TWTR", "IBM", "YHOO", "NFLX")

download_articles <- function(symbol) {
    WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
}
stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
    mutate(corpus = map(symbol, download_articles))

gives me the error:

StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document

Any hints? Someone suggested removing the company and symbol entries for "Twitter", but it still fails with the same error. Many thanks in advance.


I am having the same issue, but I have narrowed it down slightly. This snippet alone produces the same error:

GoogleFinanceSource("NASDAQ:MSFT")
StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document

I also saw that others suggested removing Twitter. I get the point that it would have failed because Twitter is not listed on NASDAQ. However, I tried the suggested "NYSE:TWTR" and got the same result.

I tried GoogleNewsSource to see whether I would get the same issue and got a different error, which this GitHub issue suggests is caused by the parser: github.com/mannau/tm.plugin.webmining/issues/14. I wonder whether the two issues are related.

GoogleNewsSource("Microsoft")
Unknown IO error failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"
Error: 1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=Microsoft&ie=utf-8&num=100&output=rss"
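
One way to narrow this down further is to look at what the feed URL actually returns before the package tries to parse it. A parser error such as "StartTag: invalid element name" is what an XML parser typically reports when it is handed something that is not XML (for example an HTML page); whether that is what happens here is an assumption, not something confirmed in this thread. The sketch below is a rough heuristic in base R, and looks_like_feed is a hypothetical helper, not part of tm.plugin.webmining.

```r
# Heuristic check: does a response body look like an RSS/Atom feed?
# Illustrative helper only; it is not part of tm.plugin.webmining.
looks_like_feed <- function(body) {
  # Inspect only the start of the body, ignoring leading whitespace.
  head <- sub("^\\s*", "", substr(body, 1, 500))
  # Skip an optional XML declaration such as <?xml version="1.0"?>.
  head <- sub("^<\\?xml[^>]*\\?>\\s*", "", head)
  # A feed should open with an <rss>, <feed> (Atom) or <rdf:RDF> root element.
  grepl("^<(rss|feed|rdf:RDF)[\\s>]", head, perl = TRUE)
}

# Two hand-written bodies for demonstration (not real server responses).
feed_body <- '<?xml version="1.0"?>\n<rss version="2.0"><channel/></rss>'
html_body <- '<!DOCTYPE html><html><body>moved</body></html>'

is_feed <- looks_like_feed(feed_body)
is_html_feed <- looks_like_feed(html_body)
```

To apply it to a live URL, the body could be read with something like paste(readLines(url, warn = FALSE), collapse = "\n"), where url is the address shown in the error message. If the result is not a feed, no change to the ticker list will help.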

That said, I have found a workaround using a shortened ticker list and YahooFinanceSource:

company <- c("Microsoft", "Apple", "Google")
symbol <- c("MSFT", "AAPL", "GOOG")

download_articles <- function(symbol) {
    WebCorpus(YahooFinanceSource(symbol))
}

stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
    mutate(corpus = map(symbol, download_articles))
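
Because any single source or symbol can fail, it may help to wrap the download so that one error does not abort the whole map() call. The following is a minimal sketch in base R. The names download_safely and fake_fetch are hypothetical and introduced only for illustration; fake_fetch stands in for a real call such as WebCorpus(YahooFinanceSource(symbol)) so that the example runs without network access.

```r
# Wrap a download function so that a failure for one symbol yields NULL
# instead of stopping the whole run. `fetch` is any function of one symbol.
download_safely <- function(symbols, fetch) {
  results <- lapply(symbols, function(s) {
    tryCatch(fetch(s), error = function(e) {
      message("Skipping ", s, ": ", conditionMessage(e))
      NULL
    })
  })
  names(results) <- symbols
  results
}

# Stand-in fetcher for demonstration only: it fails for one symbol.
fake_fetch <- function(s) {
  if (s == "TWTR") stop("feed could not be parsed")
  paste("articles for", s)
}

corpora <- download_safely(c("MSFT", "TWTR", "AAPL"), fake_fetch)

# Keep only the symbols that downloaded successfully.
ok <- corpora[!vapply(corpora, is.null, logical(1))]
```

With the package loaded, the same wrapper could be called as download_safely(symbol, function(s) WebCorpus(YahooFinanceSource(s))), after which the failed symbols are simply the ones missing from ok.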



The problem is that the tm.plugin.webmining package is out of date.

Only YahooFinanceSource and YahooNewsSource still work at the time of this reply.


Here is a quick reference and test.

According to the vignette written by the package author, there should be 8 possible sources:

  1. GoogleBlogSearchSource
  2. GoogleFinanceSource
  3. GoogleNewsSource
  4. NYTimesSource
  5. ReutersNewsSource
  6. YahooFinanceSource
  7. YahooInplaySource
  8. YahooNewsSource

But according to the GitHub page, the first one, GoogleBlogSearchSource, has already been discontinued. For the remaining 7 sources, I ran a simple test to see whether they work:

library(tm)
library(tm.plugin.webmining)

googlefinance <- WebCorpus(GoogleFinanceSource("A"))
googlenews <- WebCorpus(GoogleNewsSource("A"))
nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
reutersnews <- WebCorpus(ReutersNewsSource("A"))
yahoofinance <- WebCorpus(YahooFinanceSource("A"))
yahooinplay <- WebCorpus(YahooInplaySource())
yahoonews <- WebCorpus(YahooNewsSource("M"))

The results show that all the Yahoo sources are technically still running, but YahooInplaySource returns 0 documents no matter which parameters I chose.

> googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlefinance <- WebCorpus(GoogleFinanceSource("A"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlenews <- WebCorpus(GoogleNewsSource("A"))
Unknown IO errorfailed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
Error in inherits(x, "WebSource") : 
  1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
> nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
Error in inherits(x, "WebSource") : object 'nytimes_appid' not found
> reutersnews <- WebCorpus(ReutersNewsSource("A"))
Entity 'ldquo' not defined
Entity 'rdquo' not defined
Opening and ending tag mismatch: div line 60 and body
Opening and ending tag mismatch: body line 59 and html
Premature end of data in tag html line 1
Error in inherits(x, "WebSource") : 1: Entity 'ldquo' not defined
2: Entity 'rdquo' not defined
3: Opening and ending tag mismatch: div line 60 and body
4: Opening and ending tag mismatch: body line 59 and html
5: Premature end of data in tag html line 1
> yahoofinance <- WebCorpus(YahooFinanceSource("A"))
> yahoofinance
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 16
> yahooinplay <- WebCorpus(YahooInplaySource())
> yahooinplay
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("A"))
> yahoonews
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("M"))
> yahoonews
<<WebCorpus>>
Metadata:  corpus specific: 3, document level (indexed): 0
Content:  documents: 10
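
Since the set of working sources may change again, the manual check above can be repeated with a small helper that tries each source and records whether it errored and how many documents came back. This is only a sketch: probe_sources and fake_attempts are hypothetical names, and the fake attempts below return plain lists so that the example runs offline. It assumes that length() on a WebCorpus reports the number of documents, as it does for other tm corpora.

```r
# Try each attempt (a zero-argument function) and summarise the outcome.
probe_sources <- function(attempts) {
  status <- vapply(attempts, function(attempt) {
    tryCatch({
      corpus <- attempt()
      paste("ok,", length(corpus), "documents")
    }, error = function(e) "error")
  }, character(1))
  data.frame(source = names(attempts), status = unname(status),
             stringsAsFactors = FALSE)
}

# Offline stand-ins for real sources, for demonstration only.
fake_attempts <- list(
  working = function() list("doc1", "doc2"),
  empty   = function() list(),
  broken  = function() stop("StartTag: invalid element name")
)

report <- probe_sources(fake_attempts)
```

For a real run, each entry would wrap one of the calls from the test, for example list(yahoofinance = function() WebCorpus(YahooFinanceSource("A")), googlefinance = function() WebCorpus(GoogleFinanceSource("A"))).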

It is also worth mentioning that even though YahooFinanceSource is working, it does not return content similar to what GoogleFinanceSource was supposed to return. If you want to play with the examples from the book, I think you could use YahooNewsSource with a customized list of queries.



In the line of code below, try changing the default ie = "utf-8" to ie = "ansi". Apply the same change in your script; it should work.

WebCorpus(GoogleFinanceSource("NASDAQ:MSFT", params = list(hl = "en", q = "NASDAQ:MSFT", ie = "ansi", start = 0, num = 20, output = "rss")))
