There are various great frameworks around for scraping information from websites. Since v12 of Mathematica you also have an integrated scraping API at your disposal. Like many other things in Mathematica, it's nicely integrated and easy to use.
Below are some examples: scraping a traditional website (orbifold.net), Pinterest and Facebook.
Note that scraping does not mean you can bypass authentication, nor that it is legal to harvest terabytes of data. Recent data-harvesting scandals should come to mind.
Start by opening a browser session:
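A minimal sketch of the two lines in question (assuming a locally installed Chrome; the target URL follows the article's example site):

```mathematica
(* start a browser session; subsequent WebExecute calls drive this session *)
session = StartWebSession["Chrome"];

(* navigate to the site *)
WebExecute[session, "OpenWebPage" -> "https://orbifold.net"]
```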
The remarkable thing is that you don't need to install anything, and it takes only two lines of code (three for a slow site).
Fetching the title of the page is just as easy:
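Something along these lines:

```mathematica
(* returns the title of the currently open page *)
WebExecute[session, "PageTitle"]
```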
Let’s try to extract article titles from the ‘knowledge representation’ section:
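A sketch of how this could look; the category URL and the XPath are assumptions about the site's WordPress markup and need adapting to the actual page structure:

```mathematica
(* open the category page (URL is an assumption) *)
WebExecute[session, "OpenWebPage" -> "https://orbifold.net/category/knowledge-representation"];

(* WordPress themes typically wrap post titles in h2.entry-title elements *)
titles = WebExecute[session, "LocateElements" -> "XPath" -> "//h2[@class='entry-title']/a"];

(* fetch the visible text of each located element *)
WebExecute[session, "ElementText" -> #] & /@ titles
```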
In this case the URL contains a size specification, due to the underlying WordPress system, which we had better remove to get the full-size image. This can be done using string matching:
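One way to strip the WordPress size suffix (the variable `url` is a placeholder for a scraped image URL; the pattern assumes the usual `-WIDTHxHEIGHT` thumbnail convention):

```mathematica
(* ".../image-300x200.jpg" becomes ".../image.jpg" *)
StringReplace[url, RegularExpression["-\\d+x\\d+(?=\\.(jpg|jpeg|png|gif))"] -> ""]
```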
Simulating a search action is easy as well: you only need a unique identifier for the element. You can use CSS selectors or XPath; in this case the id of the search box is sufficient:
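A sketch, assuming the WordPress default id `s` for the search field (verify with the devtools):

```mathematica
searchBox = First @ WebExecute[session, "LocateElements" -> "Id" -> "s"];

(* type a query and submit the form in one go *)
WebExecute[session, {
  "TypeElement" -> {searchBox, "knowledge representation"},
  "SubmitElement" -> searchBox
}]
```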
What is the URL of this search?
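Simply ask the session:

```mathematica
(* URL of the currently open page *)
WebExecute[session, "PageURL"]
```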
When all done you need to close the session like so:
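The session is an object and is disposed of accordingly:

```mathematica
(* closes the browser and releases the session *)
DeleteObject[session]
```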
In this case you need to log in first, and it's slightly trickier than above because of the page's dynamic nature and cryptic element names. To find the correct XPath to an element you need to experiment a bit and use the devtools in Chrome (see adjacent image).
Start with a new session and navigate to Pinterest:
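Same pattern as before:

```mathematica
session = StartWebSession["Chrome"];
WebExecute[session, "OpenWebPage" -> "https://www.pinterest.com"]
```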
To be sure that you find the correct item via XPath you can fetch the element’s text:
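For instance (the XPath here is purely illustrative; Pinterest's generated attribute names change frequently, so copy the real path from the devtools):

```mathematica
(* locate a candidate element and inspect its text to confirm it's the right one *)
element = First @ WebExecute[session, "LocateElements" -> "XPath" -> "//div[@data-test-id='login-button']"];
WebExecute[session, "ElementText" -> element]
```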
So, let’s simulate a click and navigate to the login form:
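A sketch (again, the XPath is an assumption to be checked against the live page):

```mathematica
(* locate the login button and click it *)
loginButton = First @ WebExecute[session, "LocateElements" -> "XPath" -> "//div[@data-test-id='login-button']//button"];
WebExecute[session, "ClickElement" -> loginButton]
```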
Once on the login form you need to fetch the various elements to submit your credentials:
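Something like the following, assuming the login form exposes `email` and `password` ids (replace the placeholder credentials with your own):

```mathematica
emailField = First @ WebExecute[session, "LocateElements" -> "Id" -> "email"];
passwordField = First @ WebExecute[session, "LocateElements" -> "Id" -> "password"];

WebExecute[session, {
  "TypeElement" -> {emailField, "you@example.com"},
  "TypeElement" -> {passwordField, "yourPassword"},
  "SubmitElement" -> passwordField
}]
```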
Once logged in you can search for something:
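In the same vein (the `name` attribute of the search input is an assumption; verify it in the devtools):

```mathematica
searchField = First @ WebExecute[session, "LocateElements" -> "XPath" -> "//input[@name='searchBoxInput']"];

WebExecute[session, {
  "TypeElement" -> {searchField, "graph drawing"},
  "SubmitElement" -> searchField
}]
```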
The more verbose, basic way does work, however:
The images are downloaded nicely but need some cropping, which requires almost no code in Mathematica. In fact, you get auto-cropping out of the box, while it would take a whole lot of code in Python to achieve the same.
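The auto-cropping amounts to a single function call (`imageUrls` is a placeholder for the list of scraped image URLs):

```mathematica
(* fetch the images *)
images = Import /@ imageUrls;

(* ImageCrop with no size specification automatically trims uniform borders *)
ImageCrop /@ images
```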
Here again, the most difficult part is pinpointing the most appropriate path to a particular HTML element. The Chrome (or Firefox) devtools help a lot. At best you find a particular id or CSS class; at worst you need to use the exact XPath the devtools give you.
As before, we start with a session and load the Facebook homepage.
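That is:

```mathematica
session = StartWebSession["Chrome"];
WebExecute[session, "OpenWebPage" -> "https://www.facebook.com"]
```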
Submitting credentials is also easy:
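A sketch, assuming Facebook's classic login ids `email` and `pass` (check with the devtools, and substitute your own credentials):

```mathematica
emailField = First @ WebExecute[session, "LocateElements" -> "Id" -> "email"];
passField = First @ WebExecute[session, "LocateElements" -> "Id" -> "pass"];

WebExecute[session, {
  "TypeElement" -> {emailField, "you@example.com"},
  "TypeElement" -> {passField, "yourPassword"},
  "SubmitElement" -> passField
}]
```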
Next, we'll search for someone and download their friends:
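A rough sketch of one approach (the profile handle is a placeholder, and the hyperlink filter is a crude heuristic; Facebook's markup changes often, so expect to adapt this):

```mathematica
(* navigate straight to the friends tab of a profile *)
WebExecute[session, "OpenWebPage" -> "https://www.facebook.com/some.profile/friends"];

(* collect all hyperlinks on the page and keep the profile-looking ones *)
links = WebExecute[session, "PageHyperlinks"];
Select[links, StringContainsQ[#, "facebook.com/"] &]
```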
There are many ways you can articulate the power of Mathematica on the scraped data. The integrated machine learning and data visualization comes immediately to mind.
Drawing a graph of a scraped website is something you can do with the new API, but there are other ways to do it as well. Below are three lines of code which output a graph of orbifold's site up to level two.
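One possible three-line version, not necessarily the author's, using plain `Import` rather than a browser session (the domain filter keeps the crawl from wandering off-site):

```mathematica
(* follow only internal links so the graph stays bounded *)
links[url_] := Select[Import[url, "Hyperlinks"], StringContainsQ[#, "orbifold.net"] &]

(* iterate the link function twice, starting from the homepage *)
NestGraph[links, "https://orbifold.net", 2]
```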
The simplicity, like many other things in Wolfram’s language, is baffling.
Note also that it would take only a few lines to turn all of this into a REST API.
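For instance, a minimal sketch exposing a page-title scraper as a public endpoint (browser sessions may not be available in the Wolfram Cloud, so this version falls back to plain `Import`):

```mathematica
(* deploy an API taking a "url" parameter and returning the page title *)
CloudDeploy[
  APIFunction[{"url" -> "String"}, Import[#url, "Title"] &],
  Permissions -> "Public"
]
```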