My new favorite way to scrape a wiki table

A short example using bow() and scrape() from the polite package.

Patryk Soika
10-30-2018

Dmytro Perepolkin’s package “polite” has the goal of promoting responsible web etiquette.
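In practice, bow() is the step that introduces your scraper to the host and consults robots.txt before anything is downloaded. A quick sketch of just that step (my own illustration, separate from the main example below):

library(polite)

# bow() only negotiates the session: it reads robots.txt, records the
# crawl delay, and reports whether the path is scrapable for this
# user-agent. Nothing is fetched until scrape() is called.
session <- bow("https://en.wikipedia.org/wiki/Comparison_of_UPnP_AV_media_servers")
session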

I’m posting this very short example as a template of sorts, mostly for my own benefit, mostly so I don’t lose it.


if("polite" %in% installed.packages() == F) {
  devtools::install_github("dmi3kno/polite")
}
library(polite)
library(rvest)
library(tidyverse)

url   <- "https://en.wikipedia.org/wiki/Comparison_of_UPnP_AV_media_servers"
xpath <- '//*[@id="mw-content-text"]/div/table'

dframe <-
  url %>%
  bow() %>%     # introduce ourselves to the host and check robots.txt
  scrape() %>%  # fetch the page, respecting the crawl delay
  html_node(xpath = xpath) %>%
  html_table() %>%
  as_tibble()


dframe %>%
  filter(`Unix-like` == "Yes",
         License != "Prop.",
         `Still Supported` == "Yes") %>%
  select(-Windows, -Audio, -Images, -`OS X`, -`Multilingual[1]`) %>%
  DT::datatable(options = list(
    pageLength = -1,
    lengthChange = FALSE,
    searching = FALSE,
    paging = FALSE,
    ordering = FALSE
  ))

The package does a lot more, but this is my most basic template for using it. Once you bow() to a host, you don’t have to do it again in the same session; you can simply nod() to a new path and then scrape() again.
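As a rough sketch of what that looks like (my own addition, not from the original post; the second path is just a placeholder):

# bow() once to the host, then nod() to each path before scrape().
session <- bow("https://en.wikipedia.org")

servers <- session %>%
  nod(path = "wiki/Comparison_of_UPnP_AV_media_servers") %>%
  scrape() %>%
  html_node(xpath = xpath) %>%
  html_table()

# "wiki/Some_other_article" is a placeholder, not a real page.
other <- session %>%
  nod(path = "wiki/Some_other_article") %>%
  scrape()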

You can read about the rest of polite’s features at Dmytro Perepolkin’s GitHub repository: https://github.com/dmi3kno/polite.



Citation

For attribution, please cite this work as

Soika (2018, Oct. 30). The Exhaust Pipe: My new favorite way to scrape a wiki table. Retrieved from https://friendimaginary.github.io/posts/2018-10-30-my-new-favorite-way-to-scrape-a-wiki-table/

BibTeX citation

@misc{soika2018my,
  author = {Soika, Patryk},
  title = {The Exhaust Pipe: My new favorite way to scrape a wiki table},
  url = {https://friendimaginary.github.io/posts/2018-10-30-my-new-favorite-way-to-scrape-a-wiki-table/},
  year = {2018}
}