09 9 / 2013

#flower

#flower

04 9 / 2013

31 8 / 2013

04 7 / 2013

markittyblog:

Ostensibly, here are our job descriptions:

  • Unmana (me): define product requirements, manage marketing, define business strategy. Also, the apparent boss.
  • Nilesh: manage data, plan product development, manage everything else, help with marketing, do odd jobs like design and admin that no one…

19 6 / 2013

Scraping is always fun, whether you’re doing it just for fun or profit. I have created a couple of scrapers already for iTunes, paintbottle (deleted as per requested by the site-admin), cricinfo, Google’s didyoumean, etc. check em out on Github.

IMDB does not provide a API,. for fetching information about movies. So, had to choose scraping for fetching information on movies.

I did know about the couple of other unofficial API’s, but creating your own solution is always fun :)

If you don’t want to go much into the technical details, but are just looking to use it, I have hosted it at http://getimdb.herokuapp.com

The scraper is written in Python and uses lxml, for extracting content from the webpages. I m using XPath for selecting elements from the DOM.

Following are the dependencies for it:

requests==1.2.3
lxml==3.2.1

The code:


#!/usr/bin/env python

import sys
import requests
import lxml.html


def main(id):
    if id:
        hxs = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/" + id).content)
        movie = {}
        try:
            movie['title'] = hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')[0].strip()
        except IndexError:
            movie['title']
        try:
            movie['year'] = hxs.xpath('//*[@id="overview-top"]/h1/span[2]/a/text()')[0].strip()
        except IndexError:
            try:
                movie['year'] = hxs.xpath('//*[@id="overview-top"]/h1/span[3]/a/text()')[0].strip()
            except IndexError:
                movie['year'] = ""
        try:
            movie['certification'] = hxs.xpath('//*[@id="overview-top"]/div[2]/span[1]/@title')[0].strip()
        except IndexError:
            movie['certification'] = ""
        try:
            movie['running_time'] = hxs.xpath('//*[@id="overview-top"]/div[2]/time/text()')[0].strip()
        except IndexError:
            movie['running_time'] = ""
        try:
            movie['genre'] = hxs.xpath('//*[@id="overview-top"]/div[2]/a/span/text()')
        except IndexError:
            movie['genre'] = []
        try:
            movie['release_date'] = hxs.xpath('//*[@id="overview-top"]/div[2]/span[3]/a/text()')[0].strip()
        except IndexError:
            try:
                movie['release_date'] = hxs.xpath('//*[@id="overview-top"]/div[2]/span[4]/a/text()')[0].strip()
            except Exception:
                movie['release_date'] = ""
        try:
            movie['rating'] = hxs.xpath('//*[@id="overview-top"]/div[3]/div[3]/strong/span/text()')[0]
        except IndexError:
            movie['rating'] = ""
        try:
            movie['metascore'] = hxs.xpath('//*[@id="overview-top"]/div[3]/div[3]/a[2]/text()')[0].strip().split('/')[0]
        except IndexError:
            movie['metascore'] = 0
        try:
            movie['description'] = hxs.xpath('//*[@id="overview-top"]/p[2]/text()')[0].strip()
        except IndexError:
            movie['description'] = ""
        try:
            movie['director'] = hxs.xpath('//*[@id="overview-top"]/div[4]/a/span/text()')[0].strip()
        except IndexError:
            movie['director'] = ""
        try:
            movie['stars'] = hxs.xpath('//*[@id="overview-top"]/div[6]/a/span/text()')
        except IndexError:
            movie['stars'] = ""
        try:
            movie['poster'] = hxs.xpath('//*[@id="img_primary"]/div/a/img/@src')[0]
        except IndexError:
            movie['poster'] = ""
        try:
            movie['gallery'] = hxs.xpath('//*[@id="combined-photos"]/div/a/img/@src')
        except IndexError:
            movie['gallery'] = ""
        try:
            movie['storyline'] = hxs.xpath('//*[@id="titleStoryLine"]/div[1]/p/text()')[0].strip()
        except IndexError:
            movie['storyline'] = ""
        try:
            movie['votes'] = hxs.xpath('//*[@id="overview-top"]/div[3]/div[3]/a[1]/span/text()')[0].strip()
        except IndexError:
            movie['votes'] = ""
    else:
        return "invalid id"
    return movie


if __name__ == '__main__':
    print main(sys.argv[1])

You can use it by passing any valid `imdb id` as an argument:

$ python imdb.py tt1392170

I have used Xpath for extracting data from the DOM.

This will return a `JSON` object containing the data for the movie.

You can fork the code on Github.

You can try it out at http://getimdb.herokuapp.com

If you find this post useful, please consider sharing it.


Circle me on Google+