This is something I’ve wanted to do for a while, but I never took the time to do it.
I was always curious to see if it was possible to get the rendered source code of a website.
Recently I was faced with a situation where I needed to get some data from a website. However, the website didn’t provide an API. So I thought: let me get the source code, parse it, and extract the data from it.
What I wanted was to get the rendered source code, the one with the constructed DOM object, not the one served from their server.
What made me realize the task was possible was the Inspect Element tool in Google Chrome.
Using Inspect Element, it is possible to see the rendered HTML of any website.
If Chrome can do it, so can I.
With that mindset, I first started hacking the Inspect Element tool of Chrome.
So, without any ideas, where else should I go? Google, of course.
The first results from my search mentioned the use of iframes to load a page, and then access the DOM object of the iframe from the parent page. I gave it a try, and it worked!
The way I did it was to create two files: one was the parent page, the other was the page that would be loaded in the iframe. I was able to access the DOM object of the iframe and perform any operation I wanted.
However, here comes the tricky part. I can only access the DOM of an iframe if the page loaded in the iframe belongs to the same domain as the parent page. So when I tested with 411.ca, for example, it threw a permission denied error.
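To make the restriction concrete, here is a minimal sketch of the comparison behind it. Browsers treat two pages as belonging to the same origin only when scheme, host, and port all match. The helper names `origin` and `same_origin` and the file names `parent.html` and `child.html` are made up for illustration, and Python is used only to show the rule, since the real check happens inside the browser.

```python
from urllib.parse import urlsplit

# Default ports, used when a URL does not spell the port out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname or "", port)


def same_origin(url_a, url_b):
    """True when both URLs share scheme, host, and port."""
    return origin(url_a) == origin(url_b)


# Hypothetical parent page and iframe page on the same local host: allowed.
print(same_origin("http://mch.local/parent.html", "http://mch.local/child.html"))
# Parent on the local host, iframe pointing at 411.ca: blocked.
print(same_origin("http://mch.local/parent.html", "http://www.411.ca/"))
```

This is why the two-file test worked while the 411.ca test failed: in the first case both pages had the same host, in the second they did not.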
With that option out of the way, I started looking for different ways to do what I wanted.
One idea I had was to make an Ajax call and then load the response (the HTML source code) in the iframe of my page. That way I wouldn’t get the permission denied error when trying to access the DOM object of the iframe, since I would have loaded the source code from the parent page.
It is easier said than done. The same problem I had trying to access the DOM of a page loaded from another domain in the iframe, I had trying to request the page from another domain. I couldn’t make an XHR to a page on a different domain.
Even though I hit another wall, I felt that I was getting closer to a solution.
I narrowed my searches to how to make an Ajax request between different domains. I ended up here:
This library is amazing. It is written in Perl, and it uses the LWP::UserAgent and HTTP::Request classes to perform the operations. I haven’t fully understood the library yet, but from looking at the source code, it looks like it creates a different header for the request so that it matches the one from the URL being requested. By doing that, the cross-domain restriction doesn’t apply. Again, I’m not sure; for more information just contact Bart Van der Donck, the guy who wrote the library.
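The sketch below is not the library’s code, and I am not claiming it works this way internally. It only illustrates the general idea that a script running on the server is not bound by the browser’s cross-domain restriction, so it can request any page and hand the HTML back to my own page. Python’s `urllib` stands in for the Perl LWP::UserAgent and HTTP::Request classes; `build_request`, `fetch_html`, and the user agent string are hypothetical names chosen for this example.

```python
import urllib.request

# Hypothetical user agent string; any descriptive value would do.
USER_AGENT = "rendered-source-demo/0.1"


def build_request(url):
    """Prepare a GET request for the remote page with an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Fetch the page on the server side and return its HTML as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("http://www.google.com/")` from a script on my own server would return the HTML as served, which the parent page can then receive from its own domain.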
Another tool I found very cool was the HTML to DOM parser provided by Mozilla.
This tool is very interesting: it uses Components classes to perform the parsing. I’m still trying to understand how the safe parsing is actually done. However, I just want to document a few problems I had trying to make the parser work:
The first time I ran the code from the examples, I got the message “Permission denied for ‘localhost’ to get property XPCComponents classes”.
Again with the help of Google, I found that adding this code before trying to load a component would grant the access permission.
The only problem is that it only worked on Firefox; it didn’t work on Chrome or IE.
In the end, putting all the pieces together, I was able to request a page from a different domain, load it in my website, and manipulate the DOM object of the requested page.
Here are a few screenshots:
I’m using Google’s page as an example. You can notice that not all the images are loaded correctly. The reason is that it is not the browser that is making the request. The request is made using the ajax-cross-domain library, and the response (the HTML source code) is loaded in the iframe. So when the browser tries to load the images of the page, it resolves relative paths using the localhost domain and not Google’s domain. This could be fixed if, before loading the response in the iframe, the HTML code were parsed and all the relative paths were replaced by absolute paths using the original domain. So an image with a path like “/images/img1.jpg” would be replaced by “http://www.google.com/images/img1.jpg”, and the browser would then go to Google’s server to get the image. In the screenshot below, the path for the images is “/images/navlogo91.png”, so the browser resolves the relative path to “http://mch.local/images/navlogo91.png”.
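I did not implement this fix, so the following is only a minimal sketch of the idea. It assumes the rewriting happens on the server before the HTML reaches the iframe, and it only handles quoted `src` and `href` attributes. The function name `absolutize` is made up for this example, and a regular expression is a shortcut here; a real HTML parser would be more robust.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href="..." with single or double quotes.
# Unquoted attribute values are not handled by this simple pattern.
ATTR_RE = re.compile(r'''(\b(?:src|href)\s*=\s*)(["'])(.*?)\2''',
                     re.IGNORECASE | re.DOTALL)


def absolutize(html, base_url):
    """Rewrite relative src/href values so they point at the original domain."""
    def repl(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        # urljoin leaves URLs that are already absolute unchanged.
        return f"{prefix}{quote}{urljoin(base_url, value)}{quote}"
    return ATTR_RE.sub(repl, html)


page = '<img src="/images/navlogo91.png">'
print(absolutize(page, "http://www.google.com/"))
# prints: <img src="http://www.google.com/images/navlogo91.png">
```

Another option that avoids rewriting every attribute would be to insert a `<base href="http://www.google.com/">` element into the head of the fetched HTML, which tells the browser which address to resolve relative paths against.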
Here is a screenshot of the source code of the same Ajax search seen through the browser “View Source Code”
Highlighted are the only two occurrences of Ajax on the page, and they are not from the search result.