[Totally off-topic and utterly gaga] This is way off-topic, I know, but for three days solid now I've been grappling with a really perplexing issue in some PHP code. The purpose of the code is to crawl a web page (it is ported from a .NET crawler and indexer that I wrote a while back).

(1) I manually go into a MySQL database, wipe out the last modified date for the web address that I'm interested in and replace it with a question mark (or, on occasion, with "WTF?!?"). That should make my program fetch the web page unconditionally, even if it hasn't changed since the last time I fetched it. The server should respond 200 OK and send me the page.

(2) The program loads the site and document records from the MySQL database. site_record and url_record are global arrays, since all manner of routines change their details, but the date in question is only assigned to a variable in one line out of several thousand lines of code. I get the program to echo the last modified date to the screen and, though it should read "?", it already reads "Thursday in some month in some year, at some time", which is the last modified date of the web page that we haven't even fetched yet. The page isn't fetched until I've dealt with robots.txt and any sitemap that the web site may contain. The date has already been changed ... unless, that is, I add a die("WTF!"); statement to terminate the program at this point. If I do, then the last modified date is correctly shown as still being "?".

(3) I then actually fetch the web page. If the last modified date was a "?", I ask the server to send the page unconditionally. If it's a valid date, I ask the server to send the page only if it has been modified since that date.

(4) Since the last modified date has somehow changed to "Thursday in some month in some year, at some time", the server responds with a 304 Not Modified header instead of returning the page.
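For anyone trying to picture the decision in steps (3) and (4), it can be sketched roughly as below. This is an illustrative Python sketch under stated assumptions, not the actual PHP from the crawler: the function name `build_request_headers` is hypothetical, and it assumes the stored value is either a placeholder such as "?" or a date in the usual HTTP/email format.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def build_request_headers(last_modified):
    """Return the extra request headers for a fetch (hypothetical helper).

    A placeholder such as "?" yields no conditional header, so the server
    should answer 200 OK with the page. A valid date yields
    If-Modified-Since, so the server may answer 304 Not Modified instead.
    """
    try:
        stamp = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Not a parseable date ("?", "WTF?!?", None): fetch unconditionally.
        return {}

    if stamp.tzinfo is None:
        # Assumption: a date stored without a zone is treated as UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)

    # HTTP dates are expressed in GMT.
    http_date = format_datetime(stamp.astimezone(timezone.utc), usegmt=True)
    return {"If-Modified-Since": http_date}
```

Seen this way, a 304 response can only happen if the value reaching this step is already a valid date, which is exactly what makes the behaviour so baffling.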
The funny thing is that I only actually check the server's last modified response if it sends a 200 OK along with the page; I skip this if it's a 304 Not Modified. So the program should have no way of knowing what the last modified date actually is.

There's only one place in the program where the last modified date is assigned and later updated in the database. If I echo "What the hell am I doing here?" just before that statement, the message does not show. If I put die("!!!") after it, the program is not terminated. But if I alter the date and prefix it with (say) "WTF?!?" then, you guessed it, when I look in the database once the run is complete, that's what I find in there. And quite how it changes the database before I've even loaded the document record, let alone fetched the document and got the last modified date, I just don't know. The calls to the database are not asynchronous: PHP waits until a query finishes before moving on. How can a piece of code that isn't called change a database entry before it even happens?

It's not simply something to do with caching, because if I edit the web page that I'm fetching, the new last modified date immediately shows up and the program uses that. I've also tried disabling the PHP cache / accelerator, APC; alternatives to using global variables for the site and document record arrays; closing and re-opening database connections; and much else. I've got echoed messages all over the place so that I can see exactly which bits of code are being called and what values the key variables hold.

Of course, this is most likely going to be one of those "Oh my God, that's why! Why on earth didn't I think of that?" moments, but right now (having been through the code and tried all manner of alternative approaches, including renaming an array that happened to have the same name as one used in a library I'm using), I'm flummoxed. Maybe God has a sense of humour after all? Either that, or it's those pesky gremlins at it again.
Posted on: Thu, 13 Nov 2014 18:05:50 +0000
