How to get the page source with Mechanize/Nokogiri

I'm logged into a webpage/servlet using Mechanize.

I have a page object:

jobShortListPg = agent.get(addressOfPage)

When I use:

puts jobShortListPg

I get the "mechanized" version of the page, which I don't want:

#<Mechanize::Page::Link "Home" "blahICScriptProgramName=WEBLIB_MENU.ISCRIPT3.FieldFormula.IScript_DrillDown&target=main0&Level=0&RL=&navc=3171">

How do I get the HTML source of the page instead?


Use .body:

puts jobShortListPg.body



Use the content method of the page object; in Mechanize, content is an alias for body, so it also returns the raw HTML string:

jobShortListPg.content



In Nokogiri, use to_s or to_html on the document node:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <head></head>
  <body>foo</body>
</html>
EOT

doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html>\n" +
#    "  <head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>\n" +
#    "  <body>foo</body>\n" +
#    "</html>\n"

or:

doc.to_s
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html>\n" +
#    "  <head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></head>\n" +
#    "  <body>foo</body>\n" +
#    "</html>\n"

If the embedded newlines in the returned string are distracting, print it instead:

puts doc.to_s

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >>   <body>foo</body>
# >> </html>



A note on verifying markup: browsers often add elements when rendering a page and show them in their source view, so don't trust the browser's "view source". Instead, use wget, curl, or Nokogiri at the command line to see the actual markup the server sent. – the Tin Man



