More on searching

On my recent post about searching, I received a comment referencing a piece of technology that is new to me: Gnugol. It is a command-line interface to the most common search engines. There is work on an Emacs interface for it, but I think I have another application idea.

I’ve contemplated, for a while now, how one might actually get the results one wants in a similar format. As far as I can tell, Gnugol does not yet support regex as a search parameter (though it would be trivial to apply one to the results, which is why I’m now sharing this idea), but it does provide a command-line utility (compiled C) to fetch results from Google. I’ll wager that this will prove noticeably faster than having Python or PHP stream the data directly (it already seems faster than my browser).

So, here is the proposal. Have Python, specifically Django, perform the search using this library. Have it then spawn a thread for each result (this way you can open a large number of streams simultaneously) and actually retrieve the data from that website. Apply a regex to each retrieved page. If a page does not match the regex, drop that result and query a new result set until the target count is reached (this way there can be 10 results/page, for example). Obviously some keywords would need to be included first. There isn’t a way to “grep the web” (yet), but this will make my dreams a reality.
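
The proposal can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a finished implementation: `search` is a hypothetical callable that returns a list of URLs for a given result page (it stands in for Gnugol or any search API), and `fetch`, `grep_batch`, `grep_web`, `target` and `max_pages` are names made up for illustration. The Django part is left out entirely.

```python
import re
import threading
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its text, or '' on common failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def grep_batch(urls, pattern, fetch_page=fetch):
    """Fetch every URL in its own thread; return the URLs whose text matches."""
    regex = re.compile(pattern)
    matched = [False] * len(urls)

    def worker(i, url):
        # Each thread writes only to its own slot, so no lock is needed.
        matched[i] = regex.search(fetch_page(url)) is not None

    threads = [threading.Thread(target=worker, args=(i, u))
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [u for u, ok in zip(urls, matched) if ok]


def grep_web(search, pattern, target=10, max_pages=5, fetch_page=fetch):
    """Pull result pages from `search` until `target` matches are found.

    `search(page)` is an assumed interface: it returns a list of URLs
    for result page number `page`, or an empty list when exhausted.
    """
    hits = []
    for page in range(max_pages):
        urls = search(page)
        if not urls:
            break
        hits.extend(grep_batch(urls, pattern, fetch_page))
        if len(hits) >= target:
            break
    return hits[:target]
```

Pages that fail to download are treated as non-matching, and `max_pages` caps how far the loop will dig when the regex is too strict to ever fill a page of results.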

Python, so you know, is my language of preference here because it does not require specialized compiling to get the thread library to work. All you need is:

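import threading, time
t = threading.Thread(target=time.sleep, args=(6,))  # sleep 6 s in a background thread
t.start()  # nothing runs until start() is called
t.join()   # wait for the thread to finish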

The threading module is already part of Python’s standard library. PHP, on the other hand, requires the pcntl extension to be compiled in, which isn’t an option in most cases (mine in particular). There are ways to simulate it, but they are hacks.


2 Responses to More on searching

  1. Dave Täht says:

    Thanks for taking a look at gnugol.

    Some comments:

    1) Gnugol is not a library (yet). It has some major growing pains left to go through before it could become one, and while we’re designing towards that as an end goal it is proving more useful at present to add features/get feedback/refactor. At present the back end portions of the client are not multi-threadable, but I hope to fix that in the next week or so.

    2) That said the existing client is remarkably useful, particularly in org-mode or elinks and the output could be made even more parsable via other tools. I’ve also thunk hard on outputting Lisp forms for it.

    3) Yes, I too wanted to grep the web. I would also like to filter it much as we now filter out spam.

    However, the C regex library doesn’t do utf-8, and there are more than a few regex libs out there to choose from, and I don’t know which to choose. I personally mostly work on embedded stuff (no swap space) so I’m resistant to anything that blows through memory arbitrarily.

    4) I prototyped the thing in python originally. It took 25% longer for python to start up and complete a query than the C on a fast X86 box, and significantly longer on the ARM. This would not be a problem for a daemon that ran all the time, but it matters for a little command line tool. C is a universal, low level language that can interface with anything, as could an eventual library.

    5) json is also universal. There’s no need to actually use gnugol, just the gnugol concept, and the json apis made available by the major search engines, if you want to work in another language. Writing a basic search engine client is only a few dozen lines of code in C and less in Python.

    • I’m in the middle of working through some Lisp stuff recently (Land of Lisp, actually), but when I’m done I’ll definitely start looking into this more. Creating a plug-in for Python is actually trivial (and, more importantly, more portable than trying to get make to work on Windows), especially if you already have something to start with.

      I don’t think you’d actually need to have a multi-threaded search system as long as the thing which is interfacing with it is multi-threaded.
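
To illustrate that last point, here is a rough sketch of driving a single-threaded command line client from several threads at once. The commands are stand-ins (the Python interpreter echoing a query string), since I am not assuming anything about the real client's flags; `run_command` and `run_all` are made-up names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(argv):
    """Run one external command and return its stdout as text."""
    done = subprocess.run(argv, capture_output=True, text=True, check=False)
    return done.stdout


def run_all(commands, workers=8):
    """Run the commands on a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_command, commands))


if __name__ == "__main__":
    # Stand-in commands: a real setup would put the search client's
    # argv here instead of the Python interpreter.
    queries = ["grep the web", "land of lisp"]
    commands = [[sys.executable, "-c", "import sys; print(sys.argv[1])", q]
                for q in queries]
    print(run_all(commands))
```

Each call blocks only its own worker thread while the external process runs, so the client itself never has to be thread-safe.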
