How to write a kludgy news crawler in Python and challenge Google News to its limits
A kludge (or, alternatively, kluge) is a clumsy or inelegant solution to a problem or difficulty. In engineering, a kludge is a workaround, typically using unrelated parts cobbled together. Especially in computer programs, a kludge is often used to fix an unanticipated problem in an earlier kludge; this is essentially a kind of cruft.
I was searching for data on my old disk and found some interesting code I had written (or rather, abandoned) a year and a half ago. At that time, I was fascinated by the concept of Google News, which scanned and gathered news from almost 450 sources and mashed them up together on one single page. Many sources, one destination. Needless to say, Google had created a smash hit product. Life appeared easy, all of a sudden.
Given my nature, it wasn't surprising that I wanted to write the next Google News Killer app. It began at night, around 10:30 to be precise. I was determined to finish the program in a night's time. Python was my original (and only) choice for creating the next big thing. Googling around, I found that a module called feedparser.py makes parsing RSS feeds easy (so to say). However, there was a problem: at that time, I had no clue what XML meant. That was only the beginning. Later, I also discovered that I had extremely limited knowledge of HTML. Then I realized that my Python basics were giving me plenty of surprises…
Bah, it looked so bad. Here I was trying to write a good program, and there were tons of difficulties at the very first step. However, determination took over desperation, and after tweaking and pondering for well over 46 minutes, I was able to produce an extremely kludgy, extremely basic, extremely primitive Google News Killer. Wow, the feeling was so good. Imagine writing something from scratch, and that too without any help (OK, I took help from Mark Pilgrim's feedparser.py and python.org). I chose to call it News Crawler.
Get the Python file by clicking the link – check-news. Don't forget to rename the file to check-news.py, and make sure that the indentation is correct.
Now something about the code.
1. As I said earlier, the code is extremely dumb, extremely kludgy, extremely primitive, extremely basic, and there's a lot of shoddiness in there. Don't laugh at it even if it appears funny.
2. The code has heard nothing of security and is meant to run in a controlled environment.
3. It doesn't make use of any SQL database backend, but it is wise enough to store the RSS feeds on the hard disk before dissecting them and extracting the useful content.
4. It expects the XML files to be in Unicode. Some rogue sites use shabby encodings, which raise an exception in the program.
5. I haven't added any exception handling. Just laziness, nothing more.
6. For reference, I have shown how we can incorporate Slashdot and Reddit feeds on a single page. You can add in your favourite feed.
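To give a rough idea of the flow described in the points above (save the feed to disk, parse it, put several feeds on one page), here is a minimal sketch. It is not the original check-news.py: it swaps feedparser for the standard library's xml.etree.ElementTree so that it runs without extra installs, it assumes plain RSS 2.0, and the sample feed and the function names (save_feed, parse_feed, render_page) are made up for illustration. Unlike the original, it also catches malformed XML instead of crashing.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample feed standing in for a downloaded RSS 2.0 file.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First &amp; foremost</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>
"""


def save_feed(raw_bytes, path):
    """Store the raw feed on disk before dissecting it."""
    with open(path, "wb") as fh:
        fh.write(raw_bytes)


def parse_feed(path):
    """Return (feed_title, [(item_title, link), ...]) or None on bad input."""
    try:
        # Parsing the raw bytes lets the parser honour the encoding
        # declared in the XML header.
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return None  # not well-formed XML
    channel = root.find("channel")
    if channel is None:
        return None  # not an RSS 2.0 document
    feed_title = channel.findtext("title", default="(untitled)")
    items = [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in channel.findall("item")
    ]
    return feed_title, items


def render_page(feeds):
    """Mash several parsed feeds into one bare-bones HTML page."""
    parts = ["<html><body>"]
    for feed_title, items in feeds:
        parts.append("<h2>%s</h2>" % html.escape(feed_title))
        parts.append("<ul>")
        for title, link in items:
            parts.append(
                '<li><a href="%s">%s</a></li>'
                % (html.escape(link, quote=True), html.escape(title))
            )
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

To use it, save each downloaded feed with save_feed, collect the non-None results of parse_feed in a list, and pass that list to render_page to get the combined page as a string.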
1. Make use of a good HTML templating system.
2. Solve the problem of Unicode.
3. Add error checking and improve its utility by making use of Python's object-oriented features.
4. Add a SQL backend for storing the parsed RSS data. To be honest, it's the toughest job to do.
5. Post a nice PowerPoint presentation describing the system. 🙂
PS: I will definitely not do any of the above unless someone seriously decides to fund me.
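For what it's worth, the SQL backend from point 4 need not be that scary for a first cut. Below is a small sketch using Python's built-in sqlite3 module. The table layout, the function names (open_store, store_items) and the choice of the item link as the primary key are all my assumptions for illustration; a real system would likely need more columns and a smarter notion of duplicates.

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Open (or create) the item store; in-memory by default for experiments."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               link  TEXT PRIMARY KEY,
               feed  TEXT NOT NULL,
               title TEXT NOT NULL
           )"""
    )
    return conn


def store_items(conn, feed_name, items):
    """Insert (title, link) pairs, skipping links that are already stored.

    Returns the number of rows actually added.
    """
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO items (link, feed, title) VALUES (?, ?, ?)",
            [(link, feed_name, title) for title, link in items],
        )
    return conn.total_changes - before
```

Calling store_items again with the same links adds nothing, so the crawler could be re-run on a schedule without piling up duplicates.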
After a long time, I am back in the programming world. I got so busy with other things that I had to abandon my dream project, but who knows, someday it may come true. 😉