
Since some of you have checked in after I dropped off the planet... I've been working on a thing, every day (16-30 hour days), for the last two months. I'm writing a lot of Perl and regexes every day and, this week, seeing if I can rewrite this thing in C. I know more about both cryptographic and non-cryptographic hash functions than I ever would have imagined, including how fast they are and their collision rates (and the mucking up of the SHA-3 voting process). I know when to use a Boyer-Moore search function vs. character-by-character analysis. (For what I'm doing, any search algorithm is going to be too slow, particularly given that I'm running multiple searches concurrently, the needles are small strings, and there's a pre-load time for Boyer-Moore-Horspool to build its skip table. On top of that, a char-by-char analysis uses virtually no memory, whereas BMH creates an additional large data structure.)

And as much as everyone says that you shouldn't parse HTML with regexes, I can tell you that, without a doubt, if you're smart about it and can code around the fact that regexes are essentially stateless (even though they support recursion, seeking, and backtracking), you can parse HTML 100 times faster with regexes than with libxml, et al. I also rewrote a socket module for Perl that's lighter and faster than every other implementation on CPAN... because, even as lightweight as some of them are, all the others were too slow and used too many resources.

Six weeks ago, I knew none of that, had never considered writing Perl, and had never written a worthwhile regex. This week, I'm seeing if I can rewrite all of this in C, because my 24-core box is pegged on CPU, not its gigabit network connection, when it's scraping /and/ analyzing URLs at a rate of 50 million URLs a day. There are a number of things in the previous sentence that crack my prior self-identity. Also, gethostbyaddr is slow, non-configurable, and its default 10-second timeout gets in the way.
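To make the BMH-vs-naive trade-off concrete, here's a minimal C sketch (my own illustration, not code from the project): BMH must fill a 256-entry shift table per needle before it can search at all, while the naive scan needs no setup and no extra memory — which is exactly why, with many small needles, the preprocessing can cost more than it saves.

```c
#include <stddef.h>
#include <string.h>

/* Build the Boyer-Moore-Horspool bad-character shift table.
   This is the "pre-load time" mentioned above: 256 entries filled
   per needle, which dwarfs the search cost when needles are only
   a few bytes long. */
void bmh_table(const unsigned char *needle, size_t nlen, size_t shift[256]) {
    for (size_t i = 0; i < 256; i++) shift[i] = nlen;
    for (size_t i = 0; i + 1 < nlen; i++) shift[needle[i]] = nlen - 1 - i;
}

/* BMH search: returns the offset of the first match, or -1. */
long bmh_search(const unsigned char *hay, size_t hlen,
                const unsigned char *needle, size_t nlen) {
    if (nlen == 0 || hlen < nlen) return -1;
    size_t shift[256];
    bmh_table(needle, nlen, shift);        /* extra memory + setup cost */
    size_t pos = 0;
    while (pos <= hlen - nlen) {
        if (memcmp(hay + pos, needle, nlen) == 0) return (long)pos;
        pos += shift[hay[pos + nlen - 1]]; /* skip by last-byte shift */
    }
    return -1;
}

/* Naive char-by-char scan: no table, no setup, no extra memory. */
long naive_search(const unsigned char *hay, size_t hlen,
                  const unsigned char *needle, size_t nlen) {
    if (nlen == 0 || hlen < nlen) return -1;
    for (size_t pos = 0; pos <= hlen - nlen; pos++)
        if (memcmp(hay + pos, needle, nlen) == 0) return (long)pos;
    return -1;
}
```

Both return the same offsets; the difference is that the naive version starts searching immediately, which matters when the table-build cost is amortized over only a few bytes of needle.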
So I had to write something to open a raw UDP socket connection to various DNS servers and read the packet data to get host IPs faster, with a smaller timeout and revolving DNS servers for lookups. (And, yes, I looked at DNS caches, but: 1) a cache adds another layer of complexity in an unnecessary way when public DNS servers are as fast as they are; 2) if I used one, I'd use djbdns... I definitely have a man crush on Daniel Bernstein lately; and 3) we aren't hitting the same hosts often enough for a significant performance gain. Though I do have some ideas around all of this in terms of pre-loading hosts and IPs into a cache before the crawler gets to them... but that's a different thing.)

Oh, and all of the article-extraction tools out there (Boilerpipe, Readability, etc.) that separate the main content of a web page from things like navigation... they're simply way too slow. So I wrote my own analyzer in a very different way, and it is much faster and, in my opinion, far more accurate (and tunable in a way that makes clear sense, making it more useful over time). I don't even know who I am any more. I am not smart enough to know these things and keep waiting for everything to just explode in a bad way.

But that's the thing I'm working on. I'm excited for it. And it should be released in mid-January. What it does, in short: if you're a marketer, salesperson, PR person, etc., you can add 100,000 URLs a day (with a one million URL index cap) to your own fully-searchable database, for under a hundred dollars a month (500,000 URLs a day for $297, with a 5 million URL index cap). Throw in a bunch of reports from Open Site Explorer, for example, and you'll have a boatload of data about each URL in only 5 minutes... contact information, social accounts, etc., in a fully-searchable database with more fine-tuning than you get with Google, for any kind of outreach, content research, link building, or finding sales and marketing leads.
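For a sense of what "read the packet data" involves, here's a minimal C sketch of building the kind of DNS A-record query you'd send over UDP to a public resolver (per RFC 1035: a 12-byte header, the name as length-prefixed labels, then QTYPE and QCLASS). The function name and interface are mine for illustration — this isn't the post's actual resolver, and the real thing also needs the socket send/receive and answer parsing.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Build a minimal DNS A-record query packet (RFC 1035).
   Writes into buf and returns the packet length, or -1 if buf is
   too small. Assumes a well-formed host name with no trailing dot. */
int build_dns_query(const char *host, uint16_t id,
                    unsigned char *buf, size_t buflen) {
    /* header (12) + QNAME (strlen + leading length byte + trailing 0)
       + QTYPE (2) + QCLASS (2) */
    size_t need = 12 + strlen(host) + 2 + 4;
    if (buflen < need) return -1;

    memset(buf, 0, 12);
    buf[0] = (unsigned char)(id >> 8);  /* transaction ID, big-endian */
    buf[1] = (unsigned char)(id & 0xff);
    buf[2] = 0x01;                      /* flags: RD (recursion desired) */
    buf[5] = 0x01;                      /* QDCOUNT = 1 question */

    /* QNAME: each dot-separated label becomes <len><bytes> */
    size_t p = 12;
    const char *label = host;
    while (*label) {
        const char *dot = strchr(label, '.');
        size_t len = dot ? (size_t)(dot - label) : strlen(label);
        buf[p++] = (unsigned char)len;
        memcpy(buf + p, label, len);
        p += len;
        label = dot ? dot + 1 : label + len;
    }
    buf[p++] = 0;                       /* root label terminates QNAME */
    buf[p++] = 0; buf[p++] = 1;         /* QTYPE = A */
    buf[p++] = 0; buf[p++] = 1;         /* QCLASS = IN */
    return (int)p;
}
```

The payoff of going this low-level is exactly what the post describes: you control the timeout yourself (via the socket, e.g. a select/poll loop) instead of inheriting gethostbyaddr's fixed 10 seconds, and you can rotate which resolver each packet goes to.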
And by fully-searchable, I also mean that, when you search on it, you're getting your results in milliseconds, no matter how large your database is. There's no waiting or time delay while your report is being generated. You have a very large and fast search engine for your market research. I'll leave it at that and will say more in January.
Posted on: Wed, 10 Dec 2014 19:08:34 +0000
