2019-10-09
read_html not retrieving all data from simple html page, instead returning incomplete html?
stackoverflow
Question

read_html() usually returns all the page html for a given url.

But when I try on this url, I can see that not all of the page is returned.

Why is this (and more importantly, how do I fix it)?

Reproducible example

library(rvest)  # read_html(), html_text(); also re-exports the %>% pipe

page_html <- "https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R" %>% 
  read_html()

page_html %>% html_text %>% cat
# We can see not all the page html has been retrieved

# And just to be sure
page_html %>% as.character

Notes

  • It looks like GitHub is okay with bots visiting, so I don't think it's an issue with GitHub blocking the request
  • I tried the same scrape with Ruby's Nokogiri library. It gives exactly the same result as read_html(), so it doesn't look like something specific to R or read_html()
Answer

This looks like a bug that is triggered when an assignment operator (<-) appears in the text of the page: the HTML parser presumably treats the < as the start of a tag and discards everything after it.

fakepage <- "<html>the text after the assignment operator <- will be lost</html>"

read_html(fakepage) %>%
  html_text()

[1] "the text after the assignment operator "
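One way to check that the < is the culprit: escape it as the HTML entity &lt; and the full text survives. A quick sketch (html_text() decodes the entity back to a literal <):

```r
library(rvest)

fakepage_escaped <- "<html>the text after the assignment operator &lt;- will be kept</html>"

read_html(fakepage_escaped) %>%
  html_text()
# [1] "the text after the assignment operator <- will be kept"
```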

As the page you're after is a plain text file, you can use readr::read_file() in this instance.

readr::read_file("https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R")
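If you'd rather avoid the readr dependency, base R's readLines() also fetches the file as plain text (a sketch; it returns a character vector with one element per line, so cat() with a newline separator reassembles it):

```r
# Base R alternative: read the raw file line by line, no HTML parsing involved
lines <- readLines("https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R")
cat(lines, sep = "\n")
```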