Building a Web Scraper

ISYS 403.  Business-oriented programming.  Project statement: Program a news aggregator that scrapes a news page and posts the top news stories.  The assignment gives a view into how search engines and other scrapers work online.

Many sources publish the news: local radio and TV stations, the Associated Press, national and international outlets, and aggregators.  Aggregators like Google News don't actually create news stories; instead, they parse the stories created by other sources, identify similar ones, and publish a combined view.

The problem is that each site on the web publishes in HTML, a plain-text, free-flowing format.  You'll have to code a technique called page scraping: programmatically going through a retrieved HTML source file and picking out specific pieces of data, such as news headlines and links, from the HTML.

Regular expressions are one of the best text parsing techniques available.  Beyond parsing HTML, they are useful for searching through all types of free-form text.
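To see the idea in miniature before tackling a real page, here's a quick made-up example.  The snippet of HTML and the pattern are just for illustration (they're not Businessweek's actual markup), but they show how a regex with capture groups pulls a link and a headline out of free-form text:

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class RegexDemo {
            public static void main(String[] args) {
                // a made-up headline link, like what you'd see in a news page's HTML
                String html = "<h3><a href=\"/news/story-42\">Tech Giant Buys Startup</a></h3>";
                // group 1 captures the href value, group 2 captures the headline text
                Pattern p = Pattern.compile("href=\"(.*?)\".*?>(.*?)</a>");
                Matcher m = p.matcher(html);
                if (m.find()) {
                    System.out.println("Link:     " + m.group(1));
                    System.out.println("Headline: " + m.group(2));
                }
            }
        }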

Completed project using businessweek.com/technology:

I thought this project was awesome!  I've been wanting to see how Java interfaces with the web, and this was it.  It was my first time using Java to hit the web, and it really wasn't too bad.  Downloading the URL's HTML into a string was surprisingly simple.  My code looked something like this:

        // requires: import java.net.URL; import java.io.BufferedReader;
        //           import java.io.InputStreamReader; import java.io.IOException;
        String lineOfHTML = "";
        String content = "";
        // download the home page and collect its HTML into one string
        try {
            URL url = new URL("http://www.businessweek.com/technology/");
            BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
            while ((lineOfHTML = reader.readLine()) != null) {
                content += lineOfHTML;
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

As you can see, the hard part really isn't that hard.  I used the URL class to grab the webpage and a BufferedReader to read in all the HTML, then just stored it as a string.  The samples out there on Bing made it really easy too.  Try searching "download webpage in java" on Bing; you'll get a lot of hits, trust me.

Parsing the HTML to find the news stories meant looking through businessweek.com's HTML source, finding the tags surrounding each story, and using a regex to grab the information we wanted: the headline, link, and description in our case.  My regex ended up looking something like this in order to get the top two stories:

        String regex = ".*?href=\"(.*?)\".*?>(.*?)</a>.*?<p>(.*?)</p>"
                     + ".*?href=\"(.*?)\".*?>(.*?)</a>.*?<p>(.*?)</p>.*?";

Fun, eh?

So that was it.  Put it together and you have the news parser as pictured above!