How to get the page source with Mechanize/Nokogiri
I'm logged into a webpage/servlet using Mechanize.
I have a page object:
jobShortListPg = agent.get(addressOfPage)
When I use:
I get the "mechanized" version of the page which I don't want:
#<Mechanize::Page::Link "Home" "blahICScriptProgramName=WEBLIB_MENU.ISCRIPT3.FieldFormula.IScript_DrillDown&target=main0&Level=0&RL=&navc=3171">
How do I get the HTML source of the page instead?
Use .body:
puts jobShortListPg.body
Alternatively, use the content method of the page object.
Scraping web pages with Ruby, Mechanize and Nokogiri (How To): Web scraping is an approach for extracting data from websites that don't have an API. Mechanize uses the nokogiri gem internally to parse HTML responses. There are other tools built on top of Mechanize, like Wombat, but since my task is so simple I figured I could just write everything I needed with Mechanize and Nokogiri. It's usually a better idea to work with simple tools when you're first grasping concepts so you don't get lost in the weeds of some high-powered framework.
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<head></head>
<body>foo</body>
</html>
EOT

doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html>\n" +
#    " <head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>\n" +
#    " <body>foo</body>\n" +
#    "</html>\n"

doc.to_s
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html>\n" +
#    " <head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>\n" +
#    " <body>foo</body>\n" +
#    "</html>\n"
If the embedded newlines are distracting, this might help:
puts doc.to_s
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body>foo</body>
# >> </html>
HOWTO scrape websites with Ruby & Mechanize (ReadySteadyCode): the Nokogiri document for a Mechanize page is available through page.parser:
doc = page.parser # Same as Nokogiri::HTML(page.body)
Crawling pages with Mechanize and Nokogiri: By the end of this guide, you should be able to fetch pages, click links, fill out and submit forms, and scrape data. Mechanize returns a page object whenever you get a page, post, or submit a form. Mechanize uses nokogiri to parse HTML.
- The problem is that browsers often add them when rendering the page and display them when you look at the page's source, so don't trust the browser's HTML source view. Instead, ALWAYS use wget, curl, or nokogiri at the command line to view the actual page source and verify the actual markup. – the Tin Man Feb 14 at 2:27
Following links in mechanize is a hassle because you need to have the link object. Sometimes it is easier to get them all and find the link you want from the text (this snippet is from the Python mechanize library):
for link in br.links():
    print link.text, link.url
Following and clicking links work the same way as submit and click.
- In Ruby we use snake_case, not camelCase for variables or method names. ItIsAReadabilityThing.
- I still get the mechanized version of the page :(.
- Are you sure you did .body on the page that was returned by .get? I get a pure string back.
- Silly me, I was working on the wrong file. Did .body on the right file and it worked! Thank you Dogbert!
- The mechanized version is better than the source. It has many helpers.
- @carbonr unless calling an API and the source is the JSON you actually need