Ralsina.Me — Roberto Alsina's website

Scraping doesn't hurt

I am generally allergic to HTML, especially when it comes to parsing it. However, every now and then something comes up, and it's fun to keep the muscles stretched.

So, consider the TED Talks site. It has a really nice table with information about its talks, just in case you want to do something with them.

But how do you get that information? By scraping it. And what's an easy way to do it? By using Python and BeautifulSoup:

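# Python 3 with BeautifulSoup 4 (the bs4 package)
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Read the whole page.
data = urlopen('http://www.ted.com/talks/quick-list').read()
# Parse it, using the HTML parser bundled with Python
soup = BeautifulSoup(data, 'html.parser')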

# Find the table with the data
table = soup.findAll('table', attrs= {"class": "downloads notranslate"})[0]
# Get the rows, skip the first one
rows = table.findAll('tr')[1:]

items = []
# For each row, get the data
# And store it somewhere
for row in rows:
    cells = row.findAll('td')
    item = {}
    item['date'] = cells[0].text
    item['event'] = cells[1].text
    item['title'] = cells[2].text
    item['duration'] = cells[3].text
    item['links'] = [a['href'] for a in cells[4].findAll('a')]
    items.append(item)

And that's it! Surprisingly pain-free!
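
The loop above says to "store it somewhere" but leaves the where open. One possible way, sketched here with only the standard library, is to dump the items to CSV. The sample values, the `FIELDS` list and the `items_to_csv` helper are made up for illustration; only the dictionary keys come from the code above.

```python
import csv
import io

# Hypothetical sample rows in the same shape the scraping loop builds.
# The values are invented for illustration, not taken from the TED site.
items = [
    {
        "date": "Feb 2012",
        "event": "Example Event",
        "title": "An example talk",
        "duration": "18:00",
        "links": ["http://example.com/low.mp4", "http://example.com/high.mp4"],
    },
]

# Column order for the CSV; these are the keys used in the scraping loop.
FIELDS = ["date", "event", "title", "duration", "links"]


def items_to_csv(items, out):
    """Write the scraped items to a file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        row = dict(item)
        # A CSV cell holds a single string, so join the links with spaces.
        row["links"] = " ".join(item["links"])
        writer.writerow(row)


# Write into an in-memory buffer so the sketch runs without touching disk.
buffer = io.StringIO()
items_to_csv(items, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, something like `open('talks.csv', 'w', newline='')` could be passed as `out`; the file name is, again, just an example.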

jjconti / 2012-02-17 22:27:

You can't translate "scrap" :)

Roberto Alsina / 2012-02-18 00:52:

I am a contrarian.

Ale Sarco / 2012-02-17 23:00:

Wouldn't it be easier to parse the RSS?
http://www.ted.com/talks/rss

Roberto Alsina / 2012-02-18 00:52:

Yes, but I don't think all of them are there.

Grzegorz Śliwiński / 2012-02-18 10:02:

Have you tried Scrapy? It has some nice features to crawl and scrape web pages ;)

Roberto Alsina / 2012-02-18 15:02:

I have heard of scrapy, but have not tried it.


Contents © 2000-2023 Roberto Alsina