There are various great frameworks around to scrape information from websites:

scrapy is a popular crawling and scraping framework for Python
Google’s Puppeteer is a browser-automation framework you can use to scrape data
Cheerio and various other Node.js frameworks also help with acquiring data

Since v12 of Mathematica you have an integrated scraping API at your disposal as well. Like many other things in Mathematica, it’s all nicely integrated and super-easy to use.
Below are some examples: scraping a traditional website (orbifold.net), Pinterest and Facebook.
Note that scraping does not mean you can bypass authentication, nor that it is legal to harvest terabytes of data. The Facebook-Cambridge Analytica scandal should come to mind.

Scraping orbifold.net

Start by opening a browser session:

Scraping_1.png

Scraping_2.gif
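In Wolfram Language terms, this first step likely looks as follows (the browser choice "Chrome" is an assumption; StartWebSession supports Firefox as well):

```mathematica
(* Launch a browser session and navigate to the site. *)
session = StartWebSession["Chrome"];
WebExecute[session, "OpenPage" -> "https://orbifold.net"]
```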

Let’s take a screenshot of the homepage. A small pause is sometimes necessary for sites that take a while to load; see this thread.

Scraping_3.gif

Scraping_4.gif
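As a sketch of what the screenshots show (the five-second pause is an arbitrary choice):

```mathematica
(* Give a slow site some time to render before capturing it. *)
Pause[5];
WebExecute[session, "CapturePage"]
```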

The remarkable thing is that you don’t need to install anything and it takes only two lines (three for a slow site) of code.
Fetching the title of the page is just as easy:

Scraping_5.png

Scraping_6.png
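Something along these lines:

```mathematica
(* Returns the contents of the page's <title> element. *)
WebExecute[session, "PageTitle"]
```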

Let’s try to extract article titles from the ‘knowledge representation’ section:

Scraping_7.gif

Scraping_8.png
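A plausible sketch, assuming the WordPress theme marks post titles with the common entry-title class (the actual selector has to be read off the page source):

```mathematica
(* Locate the title elements, then read their visible text. *)
titles = WebExecute[session,
   "LocateElements" -> "CssSelector" -> "h2.entry-title"];
WebExecute[session, "ElementText" -> #] & /@ titles
```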

Extracting the URLs of the images is not immediately possible with the Mathematica API; you need to use JavaScript instead. The trick here is that you need a return statement to transfer the JavaScript data to Mathematica:

Scraping_9.png
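The JavaScript route might look like this; note the final return, without which nothing comes back to Mathematica:

```mathematica
(* Collect the src attribute of every image on the page. *)
urls = WebExecute[session, "JavascriptExecute" ->
   "var imgs = document.getElementsByTagName('img');
    var urls = [];
    for (var i = 0; i < imgs.length; i++) { urls.push(imgs[i].src); }
    return urls;"]
```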

In this case the URL contains, due to the underlying WordPress system, a size specification that is better removed in order to get the full-size image. This can be done using string matching:

Scraping_10.png

Scraping_11.gif
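A sketch of the string matching, assuming the usual WordPress naming scheme in which a thumbnail URL carries a suffix like -300x169 just before the file extension:

```mathematica
(* ".../picture-300x169.png" -> ".../picture.png" *)
fullSize = StringReplace[urls,
   RegularExpression["-\\d+x\\d+(?=\\.(png|jpe?g|gif))"] -> ""]
```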

Simulating a search action is easy as well; you only need a unique identifier for the element. You can use CSS or XPath; in this case the id of the search box is sufficient:

Scraping_12.gif

Scraping_13.png
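Roughly like this (the selector #s is an assumption: it is the default id of the WordPress search input):

```mathematica
(* Type a query into the search box and submit the form. *)
box = First@WebExecute[session, "LocateElements" -> "CssSelector" -> "#s"];
WebExecute[session, "TypeElement" -> {box, "graph"}];
WebExecute[session, "SubmitInput" -> box]
```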

What is the URL of this search?

Scraping_14.png

Scraping_15.png
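That is, something like:

```mathematica
(* The address of the page currently shown in the session. *)
WebExecute[session, "PageURL"]
```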

When you are all done, you need to close the session like so:

Scraping_16.png
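Closing amounts to deleting the session object:

```mathematica
(* Ends the browser session and releases the driver. *)
DeleteObject[session]
```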

Scraping Pinterest

In this case you need to log in first, and it’s slightly trickier than above because of the page’s dynamic nature and the cryptic element names used. In order to find the correct XPath to an element you need to experiment a bit and use the DevTools in Chrome (see adjacent image).

Scraping_17.gif

Start with a new session and navigate to Pinterest:

Scraping_18.gif

To be sure that you find the correct item via XPath you can fetch the element’s text:

Scraping_19.png

Scraping_20.png
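As an illustration (the XPath below is purely hypothetical; the real one comes out of DevTools via Copy > Copy XPath):

```mathematica
elem = First@WebExecute[session,
   "LocateElements" -> "XPath" -> "//div[@data-test-id='login-button']"];
(* Read the element's visible text to verify it is the right one. *)
WebExecute[session, "ElementText" -> elem]
```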

So, let’s simulate a click and navigate to the login form:

Scraping_21.png

Once on the login form you need to fetch the various elements to submit your credentials:

Scraping_22.gif
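Sketched below with made-up field ids; the real ones have to be looked up in DevTools:

```mathematica
(* Fill in the credentials and click the submit button. *)
email = First@WebExecute[session, "LocateElements" -> "Id" -> "email"];
pass = First@WebExecute[session, "LocateElements" -> "Id" -> "password"];
WebExecute[session, "TypeElement" -> {email, "you@example.com"}];
WebExecute[session, "TypeElement" -> {pass, "secret"}];
button = First@WebExecute[session,
   "LocateElements" -> "CssSelector" -> "button[type='submit']"];
WebExecute[session, "ClickElement" -> button]
```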

Once logged in you can search for something:

Scraping_23.png

The result is a collection of images we wish to download. In principle this should work in Chrome, but somehow it does not. That is, the XPath instruction below does work in DevTools, but the JavaScript is not properly executed via the following line:

Scraping_24.png

Scraping_25.png

The more verbose and basic way does however work:

Scraping_26.gif
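The verbose route presumably boils down to plain JavaScript plus Import, along these lines:

```mathematica
(* Harvest the image URLs via JavaScript, then import a handful. *)
srcs = WebExecute[session, "JavascriptExecute" ->
   "var out = [];
    var imgs = document.querySelectorAll('img');
    for (var i = 0; i < imgs.length; i++) { out.push(imgs[i].src); }
    return out;"];
images = Import /@ Take[srcs, UpTo[10]]
```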

The images are nicely downloaded but need some cropping, which requires next to no code in Mathematica. In fact, you get auto-cropping here, while it would take quite a bit of code in Python to achieve the same.

Scraping_27.png

Scraping_28.gif
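The auto-cropping referred to is ImageCrop, which with no further arguments trims uniform borders:

```mathematica
(* One map over the downloaded images does the job. *)
cropped = ImageCrop /@ images
```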

For comparison, you can find various (lengthy) Python samples on GitHub achieving the same.

Scraping Facebook

Here again, the most difficult part is to pinpoint the most appropriate path to a particular HTML element. The Chrome (or Firefox) DevTools help a lot. At best you find a particular id or CSS class; at worst you need to use the exact XPath the DevTools give you.

Like before, we start with a session and load the Facebook homepage.

Scraping_29.gif

Submitting credentials is also easy:

Scraping_30.gif

Scraping_31.png
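For illustration only; the field ids below are assumptions that should be verified in DevTools against the live login form:

```mathematica
(* Fill in and submit the login form (hypothetical ids). *)
user = First@WebExecute[session, "LocateElements" -> "Id" -> "email"];
pwd = First@WebExecute[session, "LocateElements" -> "Id" -> "pass"];
WebExecute[session, "TypeElement" -> {user, "you@example.com"}];
WebExecute[session, "TypeElement" -> {pwd, "secret"}];
WebExecute[session, "SubmitInput" -> pwd]
```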

Next, we’ll search for someone and download their friends list:

Scraping_32.gif

Scraping_33.png

Crawler

There are many ways you can apply the power of Mathematica to the scraped data; the integrated machine learning and data visualization immediately come to mind.
Drawing a graph of a scraped website is something you can do with the new API, but there are other ways to do it as well. Below are three lines of code which output a graph of orbifold’s site up to level two.
The simplicity, like many other things in the Wolfram Language, is baffling.
Note also that it would take only one extra line of code to turn all of this into a REST API.

Scraping_34.gif

Scraping_35.gif
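The three lines likely amount to a NestGraph over each page’s hyperlinks, in the spirit of:

```mathematica
(* Crawl orbifold.net to depth two and draw the link graph.
   Check/Quiet guards against pages that fail to import. *)
links[url_String] := Quiet@Check[Import[url, "Hyperlinks"], {}];
g = NestGraph[links, "https://orbifold.net", 2]
```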