Recursively Find Hyperlinks In A Website

I was trying to write a script to crawl a website and fetch all the hyperlinks pointing to a particular file type, e.g. .pdf or .mp3. Somehow the following command did not work for me.

wget -r -A .pdf <URL>

It did not recurse through the site and download all the PDF files. I may have to ask on Stack Overflow.

Anyway, I wrote my script in Python and it worked well, at least for the site I was trying to crawl. The following script gives all the URLs pointing to the desired type of file across the whole website. You may have to add a few more strings to the excludeList configuration variable to suit your target site, or else you may end up in an infinite loop.

import re
import urllib2
import urllib

## Configurations
# The starting point
baseURL = "<home page url>"
maxLinks = 1000
excludeList = ["None","/","./","#top"]
fileType = ".pdf"
outFile = "links.txt"

# Global list of links already visited; we don't want to get into a loop
vlinks = []
# This is where the output (the list of files) is stored
files = []

# A recursive function which takes a URL and adds the output links to the global
# output list.

def findFiles( baseURL ):
    #URL encoding
    baseURL = urllib.quote(baseURL, safe="/:=&?#+!$,;'@()*[]")
    print "Scanning URL "+baseURL

    #Check maximum number of links you want to store
    print "Number of link stored - " + str(len(files))
    if(len(files) > maxLinks):
        return

    # the current page
    website = ""
    try:
        website = urllib2.urlopen(baseURL)
    except urllib2.HTTPError, e:
        print baseURL + " NOT FOUND"
        return
    # HTML content of the current page
    html = website.read()
    # fetch the anchor tags using regular expression from the html
    # Beautiful Soup does it wonderfully in one go
    links = re.findall('(?<=href=["\']).*?(?=["\'])', html)
    for link in links:
        #print link
        url = str(link)
        # Found the file type, then store and move to the next link
        if url.endswith(fileType):
            print "file link stored" + url
            files.append(url)
            f = open(outFile, 'a')
            f.write(url + "\n")
            f.close()
            continue
        # Exclude external links and self links, else it will keep looping
        if not (url.startswith("http") or ( url in excludeList ) ):
            #Build the absolute URL and show it !
            print "abs url = " + baseURL.partition('?')[0].rpartition('/')[0]+"/"+url
            absURL =  baseURL.partition('?')[0].rpartition('/')[0]+"/"+ url
            #Do not revisit the URL
            if not (absURL in vlinks):
                vlinks.append(absURL)
                findFiles(absURL)

#Finally call the function
findFiles(baseURL)
print files
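
The script above builds absolute URLs by cutting strings. As a rough sketch of the same step (not the original script), the Python 3 standard library can parse the anchors and resolve relative links with urljoin. The names LinkCollector and classify_links and the example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html, file_type=".pdf"):
    """Return (file_links, page_links) as absolute URLs on the same host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    file_links, page_links = [], []
    for href in parser.hrefs:
        # Resolve relative links against the current page and drop fragments
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc != host:
            continue  # skip external links
        if absolute.lower().endswith(file_type):
            file_links.append(absolute)
        elif absolute != page_url:
            page_links.append(absolute)  # candidates for the next crawl step
    return file_links, page_links


# Hypothetical page content, only for demonstration
sample_html = (
    '<a href="docs/a.pdf">A</a> '
    '<a href="../about.html">About</a> '
    '<a href="http://other.example/x.pdf">X</a> '
    '<a href="#top">Top</a>'
)
files_found, pages_found = classify_links(
    "http://example.com/site/index.html", sample_html
)
print(files_found)
print(pages_found)
```

With this approach self links such as #top resolve back to the current page and are dropped, so they do not need to be listed in an exclude list.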

Getting started with XBMC

XBMC is a free and open source media player for various OS platforms, especially mobile. It is very useful for turning TV dongles and boxes, e.g. an Android PC, Apple TV or Raspberry Pi, into a media center for your TV. It can not only organize and play your local media but also stream movies and TV series. I downloaded the Android APK and installed it on my Rockchip MK806 Android TV.

To begin with, I scanned all the MP3s and videos on my SD card using XBMC and all of them were ready to play. I found a couple of plugins which list almost all recent movies and TV series, well organized by season and episode. As of now I have installed:

  1. Mash Up ( Installation Steps )
  2. 1 Channel ( Installation Steps )

Making it full screen: You will notice that the Android navigation bar at the bottom always appears, even while a movie is playing. This is sometimes distracting, so I found an app called Fullscreen which can help you get rid of the navigation bar.

Start on boot: If XBMC is the only app you are going to use every time you boot your Android TV, then it makes sense to have it launched automatically on boot. I found the app Android Startup Manager for this. In fact, you could disable all the user and system apps which you think are of no use to you on the device. For me, the only app I want Android to run is XBMC, because I use all the other apps on my phone.

Remote control: Having a remote control app for XBMC on your mobile is very important, because some of the features do not work with a wireless mouse. The official remote app is good, but I use the app Yatse because it has a seek bar for the video currently being played. However, I have not yet been able to make it auto-discover the IP of the XBMC device.

Moving the blog again

A year ago I moved from free shared Linux hosting to paid hosting with GoDaddy. Everything was good, and there was no downtime like there is with free hosting solutions. When I was on free hosting, my site sometimes got blocked by antivirus software because somebody else had hosted malicious content on the same server. However, I did not earn any revenue from this blog, so paying for hosting was not really my favorite idea.

I read several articles on the internet about why GoDaddy is not a very good choice for hosting blogs. I had also used GoDaddy for hosting the website of the local IEEE section chapter, but faced a lot of problems during renewal. First of all, only the first-year hosting price was attractive; the renewal charge was almost four times what I initially paid. I had my credit card stored in GoDaddy's payment methods and they wouldn’t let me remove it until I gave the details of another card. This inspired me to close the account itself. Also, lately my site got blocked by Websense several times, probably because of their blacklisted servers.

I had considered moving to WordPress's hosted service (the PaaS solution), but they were charging a lot for assigning the domain name and they did not have a domain transfer facility that would let me delete my GoDaddy account entirely and pay only WordPress for the domain name. I was also not happy dealing with GoDaddy customer support, who kept replying with templates from their manual instead of looking into my hosting problem.

I came across an old post by one of my friends which shows how a WordPress blog could be installed on Heroku, even in the free tier. Of course there were some limitations, but I wanted to get rid of my current hosting. Apart from the steps mentioned in the blog post, I had to do some additional steps to overcome the limitations of Heroku and get my blog up and running.

  1. Heroku now has ClearDB for MySQL, so I did not have to go for Heroku’s PostgreSQL service as mentioned in the blog post.
  2. Mine is not a new blog and I was moving all the content from the Linux host, so I used the WordPress export file and the wordpress-importer plugin to migrate the database. The images stored in wp-content I downloaded via FTP and pushed using git from my local directory. I was able to reduce the size of the content by using an orphan image checker plugin and deleting unattached files.
  3. I had to create the .htaccess file in the root directory of my blog on Heroku to make sure the permalinks of posts and pages are redirected to the appropriate query strings. This is normally created by the WordPress installer itself, but the Heroku file system cannot be altered permanently unless it is attached to a storage service, which is paid.
  4. The allowed DB size was small, so I added a plugin to optimize the WordPress database once in a while. It also made the website faster. Initially I saw that my export file was huge, but later I realized that there was a post into which I had dragged and dropped a lot of images from my desktop. All of those were stored using the data URI scheme (plenty of junk characters in the post text itself) instead of as separate image files.
  5. The MySQL user created by default by ClearDB did not have insert and update grants on the database, so the database upgrade did not go through after I manually upgraded WordPress to 3.6.
  6. The GoDaddy DNS manager did not have the option to forward the domain name (with masking) to the Heroku URL where my blog was hosted. It accepted only an IP address, which Heroku did not give me. So I forwarded it to a subdomain and created a CNAME record to forward the domain to the subdomain and subsequently to the Heroku URL.
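
The data URI problem from step 4 can be spotted before importing. The following is only a sketch, assuming the post HTML is available as a string; the function name find_inline_images and the sample post are made up for illustration and are not part of any WordPress tool.

```python
import re

# Matches src="data:image/<type>;base64,<payload>" inside post HTML.
DATA_URI = re.compile(
    r'src=["\'](data:image/[a-zA-Z0-9.+-]+;base64,[A-Za-z0-9+/=]+)["\']'
)


def find_inline_images(post_html):
    """Return the length in characters of every data URI image in the post."""
    return [len(match.group(1)) for match in DATA_URI.finditer(post_html)]


# Hypothetical post body: one inline image and one normal image file
sample_post = (
    '<p>Hello</p>'
    '<img src="data:image/png;base64,iVBORw0KGgo=" />'
    '<img src="/wp-content/uploads/a.png" />'
)
sizes = find_inline_images(sample_post)
print("inline images:", len(sizes), "total characters:", sum(sizes))
```

Posts that report a large total are the ones worth re-editing so that the images become separate files.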

Following are the challenges I am ready to face during maintenance of the site, but I think it is worth it if I can save some money. On the bright side, you get used to git commands because of frequent usage :) .

  1. Every plugin/theme needs to be added via the local git repository to the wp-content/plugins directory, because anything uploaded via the admin portal will not be persisted in storage. It may seem to persist temporarily, but eventually the files will go missing when Heroku moves the app as part of load balancing. Paid Amazon S3 storage is recommended, but as of now I am pushing everything via git. I will try to see if I can use Dropbox for this.
  2. Similarly, pictures in the posts should be uploaded through git; otherwise they can be hosted on some third-party site and embedded in the blog. The latter is a better option because it saves the blog’s bandwidth, and if the blog is moved, the HTML code still refers to the same image.

Thanks to my colleague Sridhar, who initially gave me the idea of Heroku. I think I will be happy with it for some time, until my hunger to play around with it stops. If I could generate some money out of the blog, Google Compute Engine and AWS/VPS are definitely some good areas to play around in. I also like the idea of static blogs, so that I don’t waste computing (DB operations and PHP interpretation) every time a user requests a page from my blog.

Screen scraping using YQL

I have been using Yahoo Pipes for a long time now, and I have built some screen-scraping mashups with it. While Yahoo Pipes provides a component to fetch the HTML content from a URL, it is a bit difficult to cut out a specific part because it relies entirely on string matching. I came across the YQL console, where I could write SQL-like queries and fetch the HTML content of any URL. The best part is that it supports XPath expressions for selecting the exact node of the HTML to extract data from. For example, I wrote the following query to get the stock price from the web page:

select *
from html
where url =""
and xpath='//td[p/text()="LAST TRADE PRICE"]/following-sibling::td[2]/p'

See above code running in YQL Console.
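
To show what the XPath expression selects, here is a small Python sketch that does the same lookup by hand on a made-up table fragment. The sample markup, the price value and the function name following_cell_text are assumptions for illustration only; the real page layout is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the table the XPath expects
SAMPLE = """
<table>
  <tr>
    <td><p>LAST TRADE PRICE</p></td>
    <td><p>:</p></td>
    <td><p>123.45</p></td>
  </tr>
</table>
"""


def following_cell_text(root, label, offset):
    """Mimic //td[p/text()=label]/following-sibling::td[offset]/p."""
    for row in root.iter("tr"):
        cells = row.findall("td")
        for index, cell in enumerate(cells):
            p = cell.find("p")
            if p is not None and p.text == label:
                # following-sibling::td[offset] is the offset-th td after this one
                if index + offset < len(cells):
                    target = cells[index + offset].find("p")
                    if target is not None:
                        return target.text
    return None


root = ET.fromstring(SAMPLE.strip())
price = following_cell_text(root, "LAST TRADE PRICE", 2)
print(price)
```

The label cell is matched by its text, and the value is read from the second td that follows it in the same row.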

Similarly, the YQL query can be made a little more complex and parameterized by stock symbol to form the appropriate URL:

select *
from html
where url in (
    select url
    from uritemplate
    where template="
    /trading_stock_quote.asp?Symbol={item}" and item=@item)
and xpath='//td[p/text()="LAST TRADE PRICE"]/following-sibling::td[2]/p | //td[p/text()="LAST TRADED TIME"]/following-sibling::td[1]/p'

See the above code in the YQL console. However, it will not run directly from the console. One could create a query alias and pass the required value in the query string.
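
What the uritemplate table does with {item} can be sketched in a few lines of Python. This is only an illustration of the substitution, not YQL's implementation; expand_template and the symbol value are made-up names.

```python
from urllib.parse import quote


def expand_template(template, **values):
    """Replace each {name} placeholder with its URL-encoded value."""
    expanded = template
    for name, value in values.items():
        expanded = expanded.replace("{" + name + "}", quote(str(value), safe=""))
    return expanded


# Hypothetical stock symbol containing characters that need encoding
path = expand_template("/trading_stock_quote.asp?Symbol={item}", item="ABC & CO")
print(path)
```

Encoding the value matters because a symbol containing & or a space would otherwise break the query string.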

This could have been done using the built-in YQL component in Yahoo Pipes itself, but that would be an extra layer if you just need to get the required content from the HTML and do not need to work with any feed (for which Pipes is still the best choice). Of course, there are some limits/quotas on such YQL queries, which I need to explore in the coming days.

For screen scraping I could directly use Google App Engine’s URLFetch or curl on PHP servers, but this would unnecessarily transfer the whole content, consuming quota and adding a time lag.

Sharing Photosphere

I have always been a fan of panorama images. There is a lot of photo stitching software which can join many overlapping images to create a single one. I have used Photosynth earlier with a lot of satisfaction. There are a lot of phones and digital cameras which can do it right out of the camera in panoramic mode, in which you slowly move the capturing device and it continuously takes and stitches photos to create a larger panorama. While there are some websites like CleVR and GigaPan which help with sharing horizontal panoramas and let you embed them in various sites, spherical panoramas like the ones taken with the Photo Sphere feature of the Android 4.2 camera could not be embedded in a straightforward way without doing some HTML coding. Finally I found a site for this purpose. Following is a photo I took, uploaded to this site and embedded.

(Click this link for a wider view. Because of the low width of the blog, the embedded one may not look good.)

This time the players were moving; I will try to get a more stable image when I go outdoors next time :)
Here is a horizontal panorama embedded from Photosynth. (Needs the Microsoft Silverlight plugin.)

Play music remotely from phone using Windows 7

I was searching for a Bluetooth transmitter which could transmit the music played on my cellphone to the speaker mounted on my wall, so that I could control what is being played from my phone without using a very long aux cable. I had posted about this earlier, but today I found it really useful to play and control music while lying in bed. After pairing the phone, you just have to click on the following setting in Windows to transmit music from the phone to the speaker via the laptop. I could tune in to any internet radio station, play any music on my phone and also take calls on the phone, and everybody in the house could hear it.


Night Trek at Khajaguda (!Supermoon)

Yes, it was supermoon night, but cloud cover ruined it. While the supermoon has proved to be one of the most overrated astronomical events, it is still a reason to go out and enjoy nature at night. Thanks to the city lights, though there was no moonlight, we could see the path and do the trek without a torch. We climbed some rocks as well (scrambling). It was monsoon and a great time to trek. Thanks to the organizers of hats club for holding this meetup and to Rajesh for the photos from his new Sony DSLR.


Google AppEngine Channel API

I am not a big fan of page refreshes for getting new data from the server. AJAX has been the go-to technology for this kind of requirement. However, it is unidirectional. For two-way communication, where server push is required, we have seen many technologies like Comet. WebSocket and WebRTC are really cool technologies which help in sending data between server and client in real time. These need special server code for handling such requests. I have already worked with JSR 356 for WebSocket and the GlassFish reference implementation (Tyrus). Google App Engine, however, provides the Channel API for bidirectional communication. While client-to-server communication is still over HTTP GET or POST, the server creates a specific “channel” which enables it to push data at any time to specific clients. Under the hood, it is actually the client which keeps polling with GET requests for new data to be sent by the server. In any case, this API can be useful in real-time game servers.

The server side code

I added the following code to the example given in the previous post to use the Channel API:

    private void boradCastNotes(Request request,Note note) {
        //hsr is the HttpServletRequest of the current call, declared elsewhere in the class
        ServletContext context = hsr.getSession().getServletContext();
        HashMap<String,ChannelPresence> liveUsers = (HashMap<String,ChannelPresence>)context.getAttribute("liveUsers");
        if(liveUsers != null){
            ChannelService channelService = ChannelServiceFactory.getChannelService();
            ObjectMapper mapper = new ObjectMapper();
            System.out.println("List of connected client ... ");
            String noteStr = null;
            try {
                //Serialize the note to JSON
                noteStr = mapper.writeValueAsString(note);
            } catch (IOException e) {
                e.printStackTrace();
            }
            for(ChannelPresence cp : liveUsers.values()){
                System.out.print(" Sending message to client --> " +noteStr);
                //Wrap the note JSON in a command envelope
                String channelMessageStr="{\"command\":\"note\",\"data\":"+noteStr+"}";
                System.out.print(" Sending message to client after wrapping --> " +channelMessageStr);
                channelService.sendMessage(new ChannelMessage(cp.clientId(),channelMessageStr));
            }
        }
    }
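
The broadcast step above boils down to wrapping the note in a command envelope and sending it to every tracked client. Here is a minimal Python sketch of that logic, assuming a send callable is supplied by the caller; wrap_note, broadcast and the client ids are made-up names and this is not the App Engine API.

```python
import json


def wrap_note(note):
    """Wrap a note dict in the {"command": ..., "data": ...} envelope."""
    return json.dumps({"command": "note", "data": note})


def broadcast(live_users, note, send):
    """Send the wrapped note to every client id in live_users."""
    message = wrap_note(note)
    for client_id in live_users:
        send(client_id, message)
    return message


# Stand-in for the channel service: just record what would be sent
sent = []
broadcast(
    {"alice": None, "bob": None},
    {"text": "hello"},
    lambda client_id, message: sent.append((client_id, message)),
)
print(sent)
```

Building the envelope with a JSON library instead of string concatenation avoids quoting mistakes when the note text itself contains quotes.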


Client Code

In the HTML I added an extra button which sends the user-entered data using an AJAX request, and some JavaScript code to create the channel using the token issued by the server earlier.

<script language="JavaScript">

    //Function called when the update button is pressed
    //This will make an AJAX POST request
    //to the webservice created using sitebricks
    postnote = function(){
        var noteObj = new Object();
    };

    //Generic method for sending any Ajax request
    sendMessage = function(path, method, param) {
        var xhr = new XMLHttpRequest();
        xhr.open(method, path, true);
        //Callback when response is received from the server
        xhr.onload = function () {
        };
        xhr.send(param);
    };

    onOpened = function() {
    };

    onMessage = function(message){
        //When a message is received from the server
        //message.data contains the actual string sent by the
        //Java code
        //Now convert the JSON string to a JavaScript object
        var data = JSON.parse(message.data);
        console.log("Received data from Server "+data)
        //Adds a new note row in the table
        insertRow(data.data);
    };

    onError = function(error) {
        console.log("Channel error");
    };

    onClose = function() {
        console.log("Channel closed");
    };

    insertRow = function(data){
        var table=document.getElementById("noteTable");
        var row=table.insertRow(1);
        var cell1=row.insertCell(0);
        var cell2=row.insertCell(1);
        cell1.innerHTML=new Date(data.date);
        cell2.innerHTML=data.text;
    };

    //Opens a channel with the server using the given token (provided by the server)
    //Internally it keeps polling the server for new messages
    channel = new goog.appengine.Channel('${token}');
    socket = channel.open();
    //Define all the listeners
    socket.onopen = onOpened;
    socket.onmessage = onMessage;
    socket.onerror = onError;
    socket.onclose = onClose;
</script>

Other features

To track the clients which connect to the server, listeners can be added at specific URLs. I track the clients so that I can broadcast messages to all the connected clients. To enable tracking, the following needs to be added to appengine-web.xml:

    <inbound-services>
        <service>channel_presence</service>
    </inbound-services>

Then write the POST endpoint handlers:

public class TrackerServlet extends HttpServlet {
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        ChannelService channelService = ChannelServiceFactory.getChannelService();
        ChannelPresence presence = channelService.parsePresence(req);
        System.out.print("Client trying to connect with ID " + presence.clientId());
        //Save the new client in servlet context
        ServletContext context = getServletContext();
        HashMap<String, ChannelPresence> liveUsers = (HashMap<String, ChannelPresence>) context.getAttribute("liveUsers");
        if (null == liveUsers) {
            System.out.println("Initialising client list");
            liveUsers = new HashMap<String, ChannelPresence>();
            context.setAttribute("liveUsers", liveUsers);
        }
        if(liveUsers.containsKey(presence.clientId())) {
            System.out.println("Err.... this guy was already connected ! ");
        } else {
            liveUsers.put(presence.clientId(), presence);
            System.out.println(" New client connected with ID  " + presence.clientId());
        }
    }
}

Similarly, remove the client from the servlet context when the client gets disconnected.

public class TrackerServlet1 extends HttpServlet {
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        ChannelService channelService = ChannelServiceFactory.getChannelService();
        ChannelPresence presence = channelService.parsePresence(req);
        System.out.print("Client disconnected with ID " + presence.clientId());

        ServletContext context = getServletContext();
        HashMap<String, ChannelPresence> liveUsers = (HashMap<String, ChannelPresence>) context.getAttribute("liveUsers");
        if (null != liveUsers) {
            if (liveUsers.containsKey(presence.clientId())) {
                //Forget the client so that no more messages are sent to it
                liveUsers.remove(presence.clientId());
                System.out.println("Client was disconnected");
            } else {
                System.out.println("Client was not connected");
            }
        } else {
            System.out.println("No client was ever connected");
        }
    }
}
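
The two servlets above share one map of live clients. The same bookkeeping can be sketched in Python as a small registry guarded by a lock, since connect and disconnect notifications may arrive on different threads. PresenceRegistry and its method names are made up for illustration; the Java code above uses a plain HashMap in the servlet context.

```python
import threading


class PresenceRegistry:
    """Keeps track of which client ids are currently connected."""

    def __init__(self):
        self._clients = set()
        self._lock = threading.Lock()

    def connect(self, client_id):
        """Record a client; return False if it was already connected."""
        with self._lock:
            if client_id in self._clients:
                return False
            self._clients.add(client_id)
            return True

    def disconnect(self, client_id):
        """Forget a client; return False if it was not connected."""
        with self._lock:
            if client_id not in self._clients:
                return False
            self._clients.remove(client_id)
            return True

    def client_ids(self):
        """Return a sorted snapshot of the connected client ids."""
        with self._lock:
            return sorted(self._clients)


registry = PresenceRegistry()
registry.connect("alice")
registry.connect("bob")
registry.disconnect("alice")
print(registry.client_ids())
```

Returning a snapshot from client_ids lets a broadcast loop iterate safely while other clients connect or disconnect.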




I have been trying to read about and use all these technologies for my project, but I did not find any single article/blog which includes setup instructions for all of them.

Most of these technologies are from Google and are optimized for their PaaS solution.

  • Google AppEngine – The platform as a service supporting several languages including Python and Java.
  • Maven – Build tool like Ant, but it lets you pull the libraries from the original repository dynamically at build time.
  • Guice – Dependency injection tool like Spring but without XML configuration. Configuration is done in the code itself.
  • Sitebricks – Libraries for creating REST webservices and dynamic HTML pages (separating HTML and data).
  • Objectify – Library to interact with the Google AppEngine datastore, with automatic memcache management.

Following are the major steps I took for creating a working project.

  1. Since I was trying to create a web app with Maven, I needed to create the folder structure for a web project containing WEB-INF etc. I could do this manually too.

    $ mvn archetype:generate -DgroupId=com.neil -DartifactId=NoteWebApp -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

  2. Add the Google App Engine dependency to the pom.xml of the newly created Maven project.
  3. Add the Google App Engine plugin for Maven, which is used for running the devserver and uploading the build to the App Engine cloud.
    Here we could also specify the debug port which any IDE can connect to.

    • The local server can be started using “mvn appengine:devserver”.
    • The local app can be deployed to the App Engine server using “mvn appengine:update”.

    Sometimes a “port in use” error occurs if the previous debug port or server port was not closed gracefully; then we have to find the process using the port and kill it.

    sudo lsof -i :8080 # checks port 8080 on a Mac
    kill -9 2828       # 2828 is the PID reported by lsof

  4. Add appengine-web.xml in the same directory as web.xml; it will hold the App Engine configuration.
    <appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
    <!-- create this unique id in appengine console in web-->
    <!-- I keep only one version during dev and keep overwriting it. -->
    <!-- I have no idea about it -->
    </appengine-web-app>
  5. Add the dependency for Guice.
  6. Add a listener in web.xml which will be executed when the container comes up. Also add a filter which will route all the URLs to the Guice filter,
    which in turn knows which class is to be executed depending on the request URL. This configuration is done in the listener.

  7. As promised in the last step, I will create the listener with the information about the URL mappings.
    public class MyGuiceServletConfig extends GuiceServletContextListener {
        @Override
        protected Injector getInjector() {
            return Guice.createInjector(
                    //Keep sending Guice the modules
                    new SitebricksModule() {
                        @Override
                        protected void configureSitebricks() {
                            //Should change this to a logger; this is just to prove that
                            //sitebricks scans the classes for annotations like @At etc.
                            scan(NotebookService.class.getPackage());
                            System.out.println("****** Scan complete ******");
                        }
                    }
                    , new ServletModule() {
                        @Override
                        protected void configureServlets() {
                            //Servlet classes have to be singletons to be consistent with the servlet specification
                            //In traditional cases the web.xml config tells the container to do so, I guess.
                            bind(NotebookServlet.class).in(Singleton.class);
                            //Analogous to typical servlet URL mappings
                            serve("/servlet").with(NotebookServlet.class);
                        }
                    });
        }
    }
  8. Following is a typical servlet class, but we will avoid this and use templates and webservices to render data in HTML or JSON format.
    public class NotebookServlet extends HttpServlet {
        //Register the entity class for the data persistence service
        static {
            ObjectifyService.register(Note.class);
        }

        /**
         * Get requests come here
         * @param req
         * @param resp
         * @throws ServletException
         * @throws IOException
         */
        public void doGet(
                HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            //Render the form for adding notes
            resp.getWriter().println(
                    "<form method=post  action=\"/servlet\" >" +
                            "<input name=\"note.text\" size=\"20\" type=text/>" +
                            "<input type=submit value=\"Add Note\">" +
                            "</form>");
            resp.getWriter().println("List of notes");
            //load all the data from datastore
            List<Note> notes = ObjectifyService.ofy().load().type(Note.class).list();
            //Render the notes in each row of the table.
            resp.getWriter().println("<table><tr style=\"background:grey\"><th>Date</th><th>Note</th></tr>");
            for (Note noteEntry : notes) {
                resp.getWriter().println("<tr><td>" + noteEntry.getDate().toString() + "</td><td>" + noteEntry.getText() + "</td></tr>");
            }
            resp.getWriter().println("</table>");
        }

        /**
         * Handles the form submit post requests and redirects to the same get request to display the list of
         * notes
         * @param req
         * @param resp
         * @throws ServletException
         * @throws IOException
         */
        public void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            Note note = new Note();
            note.setDate(new Date());
            note.setText(req.getParameter("note.text"));
            ObjectifyService.ofy().save().entity(note).now();
            doGet(req, resp);
        }
    }
  9. Create the entity which maps to the database.
    @Entity
    public class Note {
        @Id
        private Long id;
        private Date date;
        private String text;

        public Long getId() {
            return id;
        }

        public void setId(Long id) {
            this.id = id;
        }
        //rest of the getters and setters
    }
  10. Create the class with sitebricks and annotate it as a service. Annotate the URL mapping and the get and post methods.
    @At("/notes") //the URL shown here is a placeholder
    @Service
    public class NotebookService {
        private Note note = new Note();

        //Register the entity class for the Objectify persistence service.
        public NotebookService() {
            ObjectifyService.register(Note.class);
        }

        @Get
        Reply<List<Note>> showNotes() {
            //Prepare the HTTP headers
            Map<String, String> headers = new HashMap<String, String>();
            headers.put("Content-Type", "application/json");
            //Fetch data from database
            List<Note> notes = ObjectifyService.ofy().load().type(Note.class).list();
            //Convert the entity object to JSON and return.
            return Reply.with(notes).as(Json.class).headers(headers);
        }

        public Note getNote() {
            return note;
        }

        public void setNote(Note note) {
            this.note = note;
        }

        /**
         * Post request endpoint here inserts data in database
         * and returns the JSON of the single entry which was newly
         * created
         * @param request the body of the request containing data is
         *                obtained from here.
         * @return
         */
        @Post
        public Reply postNote(Request request) {
            Map<String, String> headers = new HashMap<String, String>();
            headers.put("Content-Type", "application/json");
            //Read JSON data and create entity object
            note = request.read(Note.class).as(Json.class);
            //System generated date instead of user
            note.setDate(new Date());
            //Store data
            ObjectifyService.ofy().save().entity(note).now();
            //Just return the newly added data
            return Reply.with(note).as(Json.class).headers(headers);
            //The following would redirect to a URL, but this class is only for
            //webservice calls, so we don't have to redirect to any page
            //return Reply.saying().redirect("/servlet");
        }
    }
  11. For using sitebricks with an HTML template, create another class and an HTML file.
    @At("/webnotes")
    public class Webnote {
        //When the HTML page is rendered, this instance variable is used to populate the placeholders
        private List<Note> notes;
        //The following instance variable is populated when the form is submitted from the HTML template
        //The names of the input fields are mapped to the entity fields
        private Note note = new Note();

        @Get
        public void get() {
            this.notes = ObjectifyService.ofy().load().type(Note.class).list();  //load from db
        }

        public Note getNote() {
            return note;
        }

        public void setNote(Note note) {
            this.note = note;
        }

        @Post
        public String post() {
            //Date is not provided by the form, server date is populated
            note.setDate(new Date());
            ObjectifyService.ofy().save().entity(note).now();
            //Redirect to same class and render the same HTML template
            return "webnotes";
        }

        public List<Note> getNotes() {
            return notes;
        }

        public void setNotes(List<Note> notes) {
            this.notes = notes;
        }
    }

    HTML template for sitebricks. This has the HTML form and table to enter and display the data.
    However, there are placeholders for sitebricks to replace with data before serving the page to the client.
    <form method=post action="/webnotes">
        <input name=note.text size=20 type=text/>
        <input type=submit value="Add Note">
    </form>
    <table>
        <tr style="background:grey"><th>Date</th><th>Note</th></tr>
        @Repeat(items=notes, var="note")
        <tr><td>${note.date}</td><td>${note.text}</td></tr>
    </table>
  12. Add the Objectify dependencies to pom.xml.
  13. Register the Objectify entity class whenever we create a service which will interact with the database. I do it in the constructor of the service class.
        public NotebookService() {
            ObjectifyService.register(Note.class);
        }
  14. Read/write data using Objectify. Note.class is the entity class.
    List<Note> notes = ObjectifyService.ofy().load().type(Note.class).list();

I have put all the code on GitHub and it is still evolving; however, you can get hold of the particular commit.