PV-Spiders

Started July 2008, Ended June 2009

This is another product I created for Prime Vendor in Wilmington, NC, to collect the bid information posted by government agencies across the United States.

The core problem here was that there are somewhere in the hundreds of thousands of organizations in the United States government that post bids for private contractors to, well, bid upon. Many of these organizations post multiple new bids every day. Some of those bids are streamed out of a database, but, surprisingly, a lot of these agencies still relied on human beings to update their tables of offers by hand. When your job is to create a network of ‘spiders’ that visit these pages, check whether anything has changed since the last visit, and, if so, download the new data and submit it as a new bid... well, there are often complications.

Everyone who faced this problem before me wrote programs that issued raw HTTP requests to these sites and then tried to parse the HTML text that came back in some meaningful way, so they could either download the bid directly or move on to the next stage of the website until, hopefully, they eventually reached the actual bid file. Many sites required you to log in and spread hundreds to thousands of results over several HTML pages, and plenty of others were almost entirely JavaScript, which these spiders simply couldn’t handle.

Being a little lazy, I decided to forgo all of this nastiness and instead chose to extend the built-in .NET WebBrowser control to the point where you could register parsing events against given Uris; whenever the browser control hit one of these Uris, the event would fire and you’d have the entire page already parsed for you in the form of the DOM (Document Object Model). You could then use Linq (or old-school for-loops, should you prefer) to extract whatever information you needed from the page in a handful of lines of code. You could even execute JavaScript. This entire overhaul of the WebBrowser control took about 50 lines of code and less than an hour to develop.
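The original code is long gone, but the idea is small enough to sketch. Something along these lines is the shape of it, assuming the WinForms flavor of the control (the class and method names here are mine, not the originals):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;

// A minimal sketch of the idea: wrap the built-in WebBrowser control and let
// callers register a parsing callback against a Uri. When the browser finishes
// loading a matching page, the callback receives the fully-built DOM to pick apart.
public class SpiderBrowser : WebBrowser
{
    private readonly Dictionary<Uri, Action<HtmlDocument>> _handlers =
        new Dictionary<Uri, Action<HtmlDocument>>();

    public SpiderBrowser()
    {
        // Fires once the page (including whatever JavaScript the browser ran) is loaded.
        DocumentCompleted += OnDocumentCompleted;
    }

    // Register a parser to run whenever the browser lands on this Uri.
    public void RegisterParser(Uri uri, Action<HtmlDocument> parser)
    {
        _handlers[uri] = parser;
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        Action<HtmlDocument> parser;
        if (Document != null && _handlers.TryGetValue(e.Url, out parser))
            parser(Document);
    }
}
```

A spider then reads like a description of the site rather than a pile of string parsing (the URL and the .pdf filter below are made up for illustration):

```csharp
var browser = new SpiderBrowser();
browser.RegisterParser(new Uri("https://bids.example.gov/current"), doc =>
{
    // Pull the bid document links out of the results page and queue them for download.
    var bidLinks = doc.GetElementsByTagName("a")
                      .Cast<HtmlElement>()
                      .Select(a => a.GetAttribute("href"))
                      .Where(href => href.EndsWith(".pdf"));
    // ...hand each link off to the download queue...
});
browser.Navigate("https://bids.example.gov/current");
```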

Under this system, handling any number of pages of any level of complexity, with or without login screens, became trivial; writing a spider for a new government site became a task of only 5-15 minutes. Over the course of the year that I worked there, the sheer number of sites we were actively spidering grew so large that one machine could no longer run them all in serial. The first solution to this problem was to add a multi-threaded download manager to the browser, so the spiders could parse and traverse pages as quickly as they could and delegate the actual downloading of bids to this background worker. It did not take long for this to become prohibitive as well, and the need for a complete distributed spider network spanning several machines became apparent.
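I no longer have the original download manager either, but it amounted to a thread-safe queue drained by a handful of background workers. A minimal sketch, assuming plain threads and WebClient (the class name and the submission step are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;

// Rough sketch of the background download manager: spiders enqueue bid URLs as
// fast as they can parse them, and a small pool of worker threads drains the
// queue so page traversal never has to wait on file downloads.
public class DownloadManager
{
    private readonly Queue<Uri> _queue = new Queue<Uri>();
    private readonly object _lock = new object();

    public DownloadManager(int workerCount)
    {
        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(Work) { IsBackground = true };
            worker.Start();
        }
    }

    // Called by the spiders; returns immediately.
    public void Enqueue(Uri bidFile)
    {
        lock (_lock)
        {
            _queue.Enqueue(bidFile);
            Monitor.Pulse(_lock);
        }
    }

    private void Work()
    {
        while (true)
        {
            Uri next;
            lock (_lock)
            {
                while (_queue.Count == 0)
                    Monitor.Wait(_lock);
                next = _queue.Dequeue();
            }

            using (var client = new WebClient())
            {
                // Download the bid document and hand it off for submission.
                byte[] data = client.DownloadData(next);
                // ...submit 'data' as a new bid...
            }
        }
    }
}
```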

So that’s exactly what I built next: an entire network of spiders, all working independently, constantly monitoring government agencies for new contract bids and downloading them the instant they became available. The bids were then streamed in real time to all of our clients so they could bid on whichever ones they chose.
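I won’t pretend to reproduce the coordination layer faithfully here; purely as an illustration of the idea, a deterministic partition of the master site list is enough to let each machine run its own slice of spiders without stepping on the others (this particular scheme is my own stand-in, not necessarily what the production system did):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: each machine knows its index and the total machine count,
// and claims a deterministic slice of the shared site list. Spiders on different
// boxes never overlap, and adding a new site just means adding it to that list.
public static class SitePartitioner
{
    public static IEnumerable<Uri> SitesForMachine(
        IEnumerable<Uri> allSites, int machineIndex, int machineCount)
    {
        return allSites
            .OrderBy(site => site.AbsoluteUri, StringComparer.Ordinal)
            .Where((site, i) => i % machineCount == machineIndex);
    }
}
```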

As an extension of this, we also had an internal website listing all the contractors in the United States, which our sales team used to generate new leads and sell more software, but most of the list lacked any contact information. To fill in those gaps I added another button on the side that, when clicked, would spin up a spider on one of the servers in the back and google the company with a few extra keywords to home in on the right company and throw out invalid results. The spider would then return a ranked list of the top 5 results, of which the first was usually the correct one, and let the user pick the right match. That choice would update the website address on the company listing and, with a couple more clicks, they’d also have a phone number and e-mail address if the company published them on its site.
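The lookup spider was just another instance of the same browser wrapper pointed at a search engine. Roughly, and with the company name, query keywords, and result filtering entirely made up for illustration (this reuses the SpiderBrowser sketch from above):

```csharp
// Illustrative sketch of the lead-lookup spider: search for the company plus a
// few qualifying keywords, pull the organic result links out of the DOM, and
// hand the top five back to the internal site for a human to confirm.
var browser = new SpiderBrowser();
var query = Uri.EscapeDataString("\"Acme Paving\" contractor Wilmington NC");
var searchUri = new Uri("https://www.google.com/search?q=" + query);

browser.RegisterParser(searchUri, doc =>
{
    var topResults = doc.GetElementsByTagName("a")
                        .Cast<HtmlElement>()
                        .Select(a => a.GetAttribute("href"))
                        .Where(href => href.StartsWith("http")
                                    && !href.Contains("google.com"))
                        .Distinct()
                        .Take(5)
                        .ToList();
    // ...send 'topResults' back to the sales site for the user to pick from...
});
browser.Navigate(searchUri);
```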

I left Prime Vendor in June of 2009, at which point the project was taken over by a young developer I had interviewed to fill my position.