Scraping Customer Reviews from DM.de

I have been trying to scrape user reviews from the DM website without any luck. An example page: https://www.dm.de/l-oreal-men-expert-men-expert-vita-lift-vitalisierende-feuchtigkeitspflege-p3600523606276.html

I have tried to load the product-detail pages with beautifulsoup4 and scrapy.

from bs4 import BeautifulSoup
import requests
url = "https://www.dm.de/l-oreal-men-expert-men-expert-vita-lift-vitalisierende-feuchtigkeitspflege-p3600523606276.html"
response = requests.get(url)
print(response.text)  

Running the code shows none of the review content, unlike what you'd get from amazon.de. It only shows the scripts from the website.

EDIT: In the browser's dev tools I can see that the reviews are stored as JSON at the following location. This is exactly what I am trying to extract.

JSON file to Extract

Like most modern websites, dm.de seems to load its content through JavaScript only after the initial page load. This is a problem because Python's requests library and scrapy only speak HTTP; they do not execute any JavaScript.

The same thing happens on Amazon, but there it is detected and you get a JavaScript-free version.

You can try this for yourself by disabling javascript in your browser and then opening the site you want to scrape.
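The same check can be run from Python. This is a rough sketch that scans an HTML string for the review list element. The class name `bv-content-list-reviews` is taken from the Selenium code later in this thread; the helper `has_review_list` and both sample strings are made up for illustration. In practice you would pass in `response.text` from `requests`.

```python
from html.parser import HTMLParser


class ReviewListFinder(HTMLParser):
    """Sets .found to True if any start tag carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def has_review_list(html, css_class="bv-content-list-reviews"):
    finder = ReviewListFinder(css_class)
    finder.feed(html)
    return finder.found


# Hypothetical stand-ins for the raw page and the browser-rendered page.
static_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><ol class="bv-content-list bv-content-list-reviews"><li>...</li></ol></body></html>'

print(has_review_list(static_html))    # False
print(has_review_list(rendered_html))  # True
```

If this returns False for the HTML that `requests` fetched, the reviews are being added by JavaScript after the page loads.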

Solutions include using a scraper that supports JavaScript or scraping with an automated browser (a full browser runs JavaScript, of course). Selenium with Chromium worked well for me.


I don't have time to play around with the params, but everything needed to get that JSON back is in the request URL.

import requests
import json

url = "https://api.bazaarvoice.com/data/batch.json?"
num_reviews = 100

query = 'passkey=caYXUVe0XKMhOqt6PdkxGKvbfJUwOPDhKaZoAyUqWu2KE&apiversion=5.5&displaycode=18357-de_de&resource.q0=reviews&filter.q0=isratingsonly%3Aeq%3Afalse&filter.q0=productid%3Aeq%3A596141&filter.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&sort.q0=submissiontime%3Adesc&stats.q0=reviews&filteredstats.q0=reviews&include.q0=authors%2Cproducts%2Ccomments&filter_reviews.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&filter_reviewcomments.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&filter_comments.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&limit.q0=' +str(num_reviews) + '&offset.q0=0&limit_comments.q0=3&callback=bv_1111_19110'

request_url = url + query

response = requests.get(request_url)
# The response is JSONP, callback(...); keep only the JSON between the outer parentheses.
jsonStr = response.text.split('(',1)[-1].rsplit(')',1)[0]
jsonData = json.loads(jsonStr)

reviews = jsonData['BatchedResults']['q0']['Results']

for each in reviews:
    print ('Rating: %s\n%s\n' %(each['Rating'], each['ReviewText']))

Output:

Rating: 5
Immer wieder zufrieden

Rating: 5
ich bin mit dem Produkt sehr zufrieden und kann es nur weiterempfehlen.

Rating: 5
Super Creme - zieht schnell ein - angenehmer Geruch - hält lange vor - nicht fettend - ich hatte schon das Gefühl, dass meine Falten weniger geworden sind. Sehr zu empfehlen

Rating: 5
Das Produkt erfüllt meine Erwärtungen in jeder Hinsicht-ich kaufe es gerne immer wieder

Rating: 5
riecht super, zieht schnell ein und hinterlsst ein tolles Hautgefhl

Rating: 3
ganz ok...die Creme fühlt sich nur etwas seltsam an auf der Haut...ich konnte auch nicht wirklich eine Verbesserung des Hautbildes erkennen

Rating: 4
Für meinen Geschmack ist das Produkt zu fettig/dick zum auftauen.

Rating: 1
Ich bin seit mehreren Jahren treuer Benutzer von L'oreal Produkten und habe bis jetzt immer das blaue Gesichtsgel verwendet. Mit dem war ich mehr als zufrieden. Jetzt habe ich die rote Creme gekauft und bin total enttäuscht. Nach ca. einer Stunde entwickelt sich ein sehr seltsamer Geruch, es riecht nach ranssigem Öl! Das ist im Gesicht nicht zu ertragen.

....
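The split/rsplit line in the code above is there because the response is JSONP: the JSON is wrapped in a call to the function named by the `callback` parameter. Here is a slightly more defensive version of that step, as a sketch. The helper name `unwrap_jsonp` and the sample string are made up for illustration, and the sample only mimics the structure the code above reads.

```python
import json
import re

# callback name, "(", payload, ")", optional ";" at the very end
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Parse 'callback({...})' or plain JSON and return the decoded object."""
    match = _JSONP_RE.match(text)
    body = match.group(1) if match else text
    return json.loads(body)


# Hypothetical sample shaped like the response used above.
sample = ('bv_1111_19110({"BatchedResults": {"q0": {"Results": '
          '[{"Rating": 5, "ReviewText": "Immer wieder zufrieden"}]}}})')

data = unwrap_jsonp(sample)
for review in data["BatchedResults"]["q0"]["Results"]:
    print(review["Rating"], review["ReviewText"])
```

Unlike the bare split, this also accepts a response that is already plain JSON, so the rest of the script does not need to care which form came back.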

Edit:

There is a ton of cleaning up to do to make this more compact, but here's the basic query:

import requests
import json

url = "https://api.bazaarvoice.com/data/batch.json"
num_reviews = 100

payload = {
'passkey': 'caYXUVe0XKMhOqt6PdkxGKvbfJUwOPDhKaZoAyUqWu2KE',
'apiversion': '5.5',
'displaycode': '18357-de_de',
'resource.q0': 'reviews',
'filter.q0': 'productid:eq:596141',
'sort.q0': 'submissiontime:desc',
'stats.q0': 'reviews',
'filteredstats.q0': 'reviews',
'include.q0': 'authors,products,comments',
'filter_reviews.q0': 'contentlocale:eq:de*,de_DE',
'filter_reviewcomments.q0': 'contentlocale:eq:de*,de_DE',
'filter_comments.q0': 'contentlocale:eq:de*,de_DE',
'limit.q0': str(num_reviews),
'offset.q0': '0',
'limit_comments.q0': '3',

'resource.q1': 'reviews',
'filter.q1': 'productid:eq:596141',
'sort.q1': 'submissiontime:desc',
'stats.q1': 'reviews',
'filteredstats.q1': 'reviews',
'include.q1': 'authors,products,comments',
'filter_reviews.q1': 'contentlocale:eq:de*,de_DE',
'filter_reviewcomments.q1': 'contentlocale:eq:de*,de_DE',
'filter_comments.q1': 'contentlocale:eq:de*,de_DE',
'limit.q1': str(num_reviews),
'offset.q1': '0',
'limit_comments.q1': '3',

'resource.q2': 'reviews',
'filter.q2': 'productid:eq:596141',
'sort.q2': 'submissiontime:desc',
'stats.q2': 'reviews',
'filteredstats.q2': 'reviews',
'include.q2': 'authors,products,comments',
'filter_reviews.q2': 'contentlocale:eq:de*,de_DE',
'filter_reviewcomments.q2': 'contentlocale:eq:de*,de_DE',
'filter_comments.q2': 'contentlocale:eq:de*,de_DE',
'limit.q2': str(num_reviews),
'offset.q2': '0',
'limit_comments.q2': '3',

'callback': 'bv_1111_19110'}


response = requests.get(url, params=payload)

# The response is JSONP, callback(...); keep only the JSON between the outer parentheses.
jsonStr = response.text.split('(',1)[-1].rsplit(')',1)[0]
jsonData = json.loads(jsonStr)

# Print the reviews of every batched query (q0, q1, q2).
for k, v in jsonData['BatchedResults'].items():
    for each in v['Results']:
        print ('Rating: %s\n%s\n' %(each['Rating'], each['ReviewText']))
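On the "more compact" point, the three near-identical blocks can be generated in a loop. The sketch below makes two assumptions that are not confirmed anywhere in this thread. First, as written, q0, q1 and q2 all use offset 0 and so appear to ask for the same page three times; here each query gets its own offset instead. Second, the repeated `filter.q0` values from the first URL are passed as a list, which `requests` sends as a repeated key. The function name `build_payload` and the `YOUR_PASSKEY` placeholder are made up; use the passkey from the code above.

```python
from urllib.parse import urlencode

# Per-query parameters copied from the payload above (without the .qN suffix).
PER_QUERY = {
    "resource": "reviews",
    "sort": "submissiontime:desc",
    "stats": "reviews",
    "filteredstats": "reviews",
    "include": "authors,products,comments",
    "filter_reviews": "contentlocale:eq:de*,de_DE",
    "filter_reviewcomments": "contentlocale:eq:de*,de_DE",
    "filter_comments": "contentlocale:eq:de*,de_DE",
    "limit_comments": "3",
}


def build_payload(product_id, passkey, num_queries=3, limit=100):
    """Build batch parameters for queries q0 .. q(num_queries - 1)."""
    payload = {
        "passkey": passkey,
        "apiversion": "5.5",
        "displaycode": "18357-de_de",
        "callback": "bv_1111_19110",
    }
    for i in range(num_queries):
        for name, value in PER_QUERY.items():
            payload[f"{name}.q{i}"] = value
        # A list value becomes a repeated key, as in the first URL.
        payload[f"filter.q{i}"] = [
            "isratingsonly:eq:false",
            f"productid:eq:{product_id}",
            "contentlocale:eq:de*,de_DE",
        ]
        payload[f"limit.q{i}"] = str(limit)
        # Assumption: query i should fetch page i, not page 0 again.
        payload[f"offset.q{i}"] = str(i * limit)
    return payload


payload = build_payload("596141", "YOUR_PASSKEY")
print(urlencode(payload, doseq=True))  # the query string that would be sent
```

The resulting dict can be passed straight to `requests.get(url, params=payload)`.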


I tried hard to scrape DM product detail pages properly with scrapy and bs4 but never got a 100% accurate scraper. That's why I decided to move to Selenium. It is slow, but it gives 100% accurate scraping results.

import random
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By


def scrape_reviews(url):
    # Assumption: the snippet was a function body. The function name and the
    # Chrome driver are placeholders; any Selenium WebDriver should work.
    driver = webdriver.Chrome()
    # Accumulators for the elements collected across all review pages.
    reviews, stars, titles, users = [], [], [], []
    hauttyps, dates, helpUp, helpDown = [], [], [], []

    try:
        driver.get(url)
        print("Current URL is Valid --> OK")
        print("Current URL : ", url)
    except Exception as e:
        print("URL : ", url, " -->> is Invalid!!!")
        print("Error Occurred : ", e)
        driver.quit()
        return None  # nothing to scrape without a loaded page

    driver.maximize_window()
    driver.set_page_load_timeout(10)

    ## close overlay and cookies
    time.sleep(round(random.uniform(1.0,1.5),2))  # give time to properly load the page initially
    try:
        driver.find_element(By.XPATH, '//*[@id="custom-layer-wrapper"]/section/header/button').click()
        driver.find_element(By.XPATH, '//*[@id="overlays"]/div[2]/div/div/div[2]/button').click()
    except Exception as e:
        print(e)

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.65);") # scroll down to next review page button
    time.sleep(round(random.uniform(4.5,5.5),2))  # give the review widget time to load

    while True:
        try:
            # iterate through each comment page
            response = driver.execute_script("return document.documentElement.outerHTML")  # Export rendered HTML
            # now extract the reviews
            soup = BeautifulSoup(response, 'lxml')
            soup = soup.find('ol', {'class': 'bv-content-list-reviews'})
            # product_title = product_title + soup.find('div',{'data-dmid' : 'detail-page-headline'}).text

            reviews += soup.find_all('div', {'class': 'bv-content-summary-body-text'})
            stars += soup.find_all('span', {'class': 'bv-content-rating bv-rating-ratio'})
            titles += soup.find_all('div', {'class': 'bv-content-title-container'})
            # The next three used {'class', '...'} (a set); a dict is what was meant.
            users += soup.find_all('div', {'class': 'bv-content-author-name'})
            hauttyps += soup.find_all('div', {'class': 'bv-content-tag-dimensions'})
            dates += soup.find_all('div', {'class': 'bv-content-datetime'})
            # for item in driver.find_elements(By.CSS_SELECTOR, '[itemprop="dateCreated"]'):
            #     dates.append(item.get_attribute('content'))

            helpUp += soup.find_all('button', {'class': 'bv-content-btn-feedback-yes'})
            helpDown += soup.find_all('button', {'class': 'bv-content-btn-feedback-no'})

            ## Go to next Review page
            # button_next = driver.find_element(By.XPATH, '//*[@id="BVRRContainer"]/div/div/div/div/div[3]/div/ul/li[2]/a/span[2]')
            # button_next = driver.find_element(By.CSS_SELECTOR, '#BVRRContainer > div > div > div > div > div.bv-content-pagination > div > ul > li.bv-content-pagination-buttons-item.bv-content-pagination-buttons-item-next > a > span.bv-content-btn-pages-next')
            button_next = driver.find_element(By.PARTIAL_LINK_TEXT, '►')
            button_next.location_once_scrolled_into_view  # reading this property scrolls the button into view
            button_next.click()
            time.sleep(round(random.uniform(2.5,3.0),2))  # give the next review page time to load
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.90);") # scroll down to next review page button
            time.sleep(round(random.uniform(4.5,5.0),2))  # give the page time to settle after scrolling

        except Exception as e:
            # Any failure, including a missing "next" button, ends the loop.
            print(e)
            print("----REACHED THE LAST PAGE-----")
            break

    time.sleep(3)
    driver.quit()
    return reviews, stars, titles, users, hauttyps, dates, helpUp, helpDown
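The scraper above ends up with several parallel lists (`reviews`, `stars`, `titles`, and so on). As a possible next step, here is a sketch that lines them up into one row per review and serializes the rows as CSV. The helper names `to_rows` and `to_csv` and the sample data are made up; with the real lists you would first pull the text out of each element, for example `[t.get_text(strip=True) for t in titles]`.

```python
import csv
import io
from itertools import zip_longest


def to_rows(**columns):
    """Turn parallel lists into a list of dicts, padding short lists with ''."""
    names = list(columns)
    return [dict(zip(names, values))
            for values in zip_longest(*columns.values(), fillvalue="")]


def to_csv(rows):
    """Serialize rows (dicts with identical keys) to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in data; one author is missing on purpose.
rows = to_rows(
    title=["Top", "Geht so"],
    stars=["5", "3"],
    user=["anna"],
)
print(to_csv(rows))
```

One caveat: padding by position only works if a field is missing at the end. If a review in the middle lacks, say, a skin type, the later values shift up by one row. Parsing each review container on its own would avoid that.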


Comments
  • What is your expected data? Is it the stars or the written reviews?
  • @Nick In the browser's dev tools I can see that the reviews are stored as JSON at the following location. This is exactly what I am trying to extract. pasteboard.co/IpLjENQ.jpg
  • Thanks a lot. Do you have any GitHub repos that I can have a look at?
  • @xollad I can't make the code I wrote available to you, I'm afraid, but to get you started, selenium-python.readthedocs.io should work. If you'd prefer a language other than Python, Selenium supports a variety of languages; just google for it. Also consider marking my answer with a checkmark if it helped you out :)
  • This is marvelous!
  • Do I understand correctly that you just unaccepted my answer, in which I gave an explanation and pointed you towards Selenium, and accepted your own answer instead, where you say you solved the issue using my suggestion, and then don't give credit? This might be just petty, but I'm actually butthurt about stuff like that.