JavaScript: How to Extract Text From Contents of an External Web Page

What’s the Challenge?

Browsers enforce a security model, the same-origin policy, that prevents scripts on one page from reading data from another page that does not share its origin (the same scheme, host, and port). Hence, Ajax calls such as $.getJSON and $.get will fail if the requested URL belongs to a different origin, unless that server explicitly allows the request (source: https://en.wikipedia.org/wiki/Same-origin_policy).
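
To make "same origin" concrete, here is a minimal sketch that compares two URLs the way the policy does. The helper name isSameOrigin is made up for illustration; it relies only on the standard URL class.

```javascript
// Two URLs share an origin only when scheme, host, and port all match.
// isSameOrigin is a hypothetical helper, not part of any library.
function isSameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

console.log(isSameOrigin("https://kimconnect.com/a", "https://kimconnect.com/b")); // true
console.log(isSameOrigin("http://localhost:3000/", "http://localhost:8000/api")); // false (different port)
console.log(isSameOrigin("http://kimconnect.com/", "https://kimconnect.com/")); // false (different scheme)
```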

It’s quite tricky to work around this design. Often, one would opt for languages that run outside the browser, such as PowerShell, Python, and Bash, which are not bound by this policy. However, that route would defeat this article’s purpose: the objective here is to understand the browser’s security model and the methods available in JavaScript.

Where’s the Answer?

The key is to enable cross-origin requests on the server. Currently, there are several options to accomplish this:

a. JSONP (JSON with Padding)
The usual ways to call an API from a page, ‘fetch’ and ‘XMLHttpRequest’, are subject to the same-origin policy. JSONP sidesteps the restriction because a script tag may load code from any origin: the page declares a callback function, passes its name in the script URL, and the server responds with JavaScript that calls that function with the JSON data as its argument (the “padding”). The remote server must therefore be set up to wrap its response in the named callback and send it with a “text/javascript” Content-Type header. Following that logic, the remote server can also be localhost on a different port, since a different port counts as a different origin. The limitation of JSONP is that the browser can only make GET requests with it. Here is an illustration from freecodecamp.org:

// localhost:3000
<html>
<body>
<script type="text/javascript">
// 1️⃣ function declared here and passed as a query parameter in our script src
function excitedGreeting(name) {
console.log("Hello " + name + "!!!");
}
</script>
<script type="text/javascript" src="http://localhost:8000/api?callback=excitedGreeting"></script>
</body>
</html>

// localhost:8000
app.get("/api", function(req, res) {
// 2️⃣ The callback query parameter is used to construct our JavaScript file
const callbackFunction = req.query.callback;
const data = getImportantInformationFromDatabase();
// data === "World";

res.setHeader("Content-Type", "text/javascript");
// JSON.stringify adds the quotes, so the output is excitedGreeting("World");
res.send(`${callbackFunction}(${JSON.stringify(data)});`);
});

// localhost:8000/api
excitedGreeting("World");
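
Because the server echoes the callback name straight into executable script, it is worth checking that name first. The following is a minimal sketch of that step, not part of the freecodecamp.org example; buildJsonpBody and its validation pattern are assumptions made for illustration.

```javascript
// Build the JavaScript body of a JSONP response.
// buildJsonpBody is a hypothetical helper, not an Express or browser API.
function buildJsonpBody(callbackName, data) {
  // Accept only a plain identifier so arbitrary script cannot be injected
  // through the ?callback= query parameter.
  if (!/^[A-Za-z_$][\w$]*$/.test(callbackName)) {
    throw new Error("Invalid JSONP callback name");
  }
  // JSON.stringify turns the data into a valid JavaScript literal.
  return `${callbackName}(${JSON.stringify(data)});`;
}

console.log(buildJsonpBody("excitedGreeting", "World"));
// excitedGreeting("World");
```

In the Express handler above, the result of this function would be the value passed to res.send.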

b. CORS (Cross-Origin Resource Sharing)
With CORS, the destination server declares which foreign origins may read its responses by sending headers such as Access-Control-Allow-Origin, and the browser enforces that list. A related approach, for when the server’s owner can host a page, is for the client to load a proxy page from the server’s origin in an iframe and pass its fetch requests through that page. An example of this is Rorschach’s fetch-robot script, available at https://github.com/krakenjs/fetch-robot. Quoting the author:

In the parent window kimconnect.com:

<script src="https://rawgit.com/krakenjs/fetch-robot/master/dist/fetch-robot.min.js"></script>

<script>
// Create a proxy instance and open the iframe

let proxy = fetchRobot.connect({ url: 'https://www.dragoncoin.com/fetch-robot-proxy' });

// Use `proxy.fetch` in the same way as `fetch`

proxy.fetch('https://www.dragoncoin.com/api/foo', { method: 'POST' })
.then(response => response.text())
.then(console.log);
</script>

In the child window www.dragoncoin.com/fetch-robot-proxy:

<!-- Add a fetch polyfill for older browsers -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/fetch/2.0.3/fetch.min.js"></script>
<script src="https://rawgit.com/krakenjs/fetch-robot/master/dist/fetch-robot.min.js"></script>

<script>
// Enable requests to be passed through the current frame using fetchRobot

fetchRobot.serve({

allow: [
{
path: [
'/api/foo',
'/api/bar'
],

headers: [
'x-csrf'
]
},

{
// A regex literal keeps the dot escaped; inside a string, '\.' collapses to '.'
origin: /^https:\/\/(kimconnect|someothersite)\.com$/,

path: [
'/api/baz',
],

headers: [
'x-custom-header'
],

credentials: 'include'
}
]
});
</script>
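
For comparison with the iframe proxy, plain CORS comes down to the server sending an Access-Control-Allow-Origin response header that names the calling origin. Below is a minimal sketch of that decision; corsHeadersFor and the allow-list are hypothetical names, and wiring it into a server is left as a comment.

```javascript
// Decide which CORS headers to send for a given request Origin header.
// corsHeadersFor is a hypothetical helper, not a library function.
function corsHeadersFor(requestOrigin, allowedOrigins) {
  if (!allowedOrigins.includes(requestOrigin)) {
    return {}; // no header: the browser blocks the page from reading the response
  }
  return {
    "Access-Control-Allow-Origin": requestOrigin,
    "Vary": "Origin", // caches must not reuse this response for other origins
  };
}

const allowed = ["https://kimconnect.com"];
console.log(corsHeadersFor("https://kimconnect.com", allowed));
console.log(corsHeadersFor("https://someothersite.com", allowed)); // {}

// In an Express handler, this could be applied with:
//   res.set(corsHeadersFor(req.headers.origin, allowed));
```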

Demo Something Simple

Since displaying HTML contents from external sites is neither recommended nor technically easy, let’s take a look at some simple code that fetches the contents of a page within the same site (same-origin).

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.0/jquery.min.js"></script>
<script>
$(document).ready(function(){
$("button").click(function(){
$.get("https://kimconnect.com/", function(data){
console.log(data);
});
});
});
</script>

<button>HTTP GET request</button>

Clicking the button writes the fetched HTML to the console log. Nothing will appear on the screen until you open the browser’s developer tools to view the log.
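
The demo logs raw HTML. To get from there to plain text, one option is to parse the response with the browser’s DOMParser and read the body’s textContent. This is a minimal sketch under that assumption; extractText and logPageText are hypothetical names, and the URL must be same-origin (or CORS-enabled) for the fetch to succeed.

```javascript
// Turn an HTML string into plain text.
// `parser` defaults to the browser's DOMParser; it is a parameter only so the
// function can be exercised outside a browser.
function extractText(html, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, "text/html");
  // Drop elements whose contents are not readable text
  doc.querySelectorAll("script, style, noscript").forEach((el) => el.remove());
  // Collapse runs of whitespace left behind by the markup
  return doc.body.textContent.replace(/\s+/g, " ").trim();
}

// Fetch a page and log its text content
async function logPageText(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  console.log(extractText(await response.text()));
}

// Usage in the browser, from a page on the same site:
//   logPageText("https://kimconnect.com/");
```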
