I have come to the realization that the flock function in PHP basically is atomic – if you use it right that is.
This article assumes that the underlying system has a proper implementation of file locking, so you should probably be using a Linux server with a file system known to support locking. PHP's official documentation only mentions Windows and NFS as having some quirks with locking; how to deal with those is not covered in this article. Here I make the assumption that you run Linux with a well-behaved file system. Okay. Good!
One concern with file locking is that the way it is implemented depends on a call to fopen and flock in quick succession, which, theoretically, means that flock is not atomic. In other words, it creates a small window for another process to write to the file. At least in theory.
In practice we are skilled web developers, and we know to limit access to our file through a single access point. That is, a single unique script or PHP class that is handling file system access properly, and obtaining an exclusive lock. If we do that, we can say that our implementation is "atomic" in the sense that nothing else will disrupt our work; no other process, binary, script or otherwise, should be trying to access the file in a manner that corrupts.
It is important to remember that other processes could choose to ignore our lock entirely anyway, because the lock obtained with flock is only advisory. However, since all access is controlled from a single script, that should not be a problem. Ideally we still need to prevent access on the file-system level, possibly with file permissions, but that is another, separate, discussion.
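A single access point could look something like the sketch below. The helper name with_exclusive_lock and the counter example are made up for illustration; the point is only that every caller goes through the same fopen and flock sequence before touching the file.

```php
<?php
// Hypothetical helper: every read or write of the file goes through this one function.
function with_exclusive_lock(string $path, callable $work)
{
    // 'c+' opens for reading and writing, creates the file if it is missing,
    // and does not truncate it, so nothing is changed before the lock is held.
    $handle = fopen($path, 'c+');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }
    try {
        // Blocks until the exclusive (advisory) lock is granted.
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Unable to lock $path");
        }
        return $work($handle);
    } finally {
        fflush($handle);
        flock($handle, LOCK_UN);
        fclose($handle);
    }
}

// Usage: increment a counter stored in a temporary file.
$file = sys_get_temp_dir() . '/counter_' . uniqid() . '.txt';
$increment = function ($handle) {
    $current = (int) stream_get_contents($handle);
    ftruncate($handle, 0); // empty the file
    rewind($handle);       // ftruncate does not move the file pointer
    fwrite($handle, (string) ($current + 1));
    return $current + 1;
};
with_exclusive_lock($file, $increment);
$result = with_exclusive_lock($file, $increment);
```

Because the file is opened without truncation, a caller that is delayed between fopen and flock has not changed anything yet when it finally gets the lock.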
Now, our script is still making a call to fopen and flock in quick succession, so a small window in between those two actions still exists, hence flock is not atomic. This is important to know if your application depends on the order of access being precise.
However, since all access takes place through our file handler, nothing will in practice be able to disrupt our implementation.
Consider the following scenario:
Event A. Caller 1 sends a request to our file, which goes through our file handler script; something delays this caller after the fopen call. Maybe the processes are running in parallel, and this caller happened to be a little slower, for whatever reason.
Event B. Caller 2 now also sends a request to our file, which also goes through our file handler script; this caller is not delayed, and it somehow makes a successful call to both fopen and flock.
Event C. Caller 1 is no longer delayed and now makes a call to flock to obtain an exclusive lock. What happens? By default the call blocks until caller 2 releases the lock, or, if the lock is not released in time, the request may end up being terminated by a timeout, so no harm should be done for most applications.
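If you would rather not wait indefinitely in Event C, flock can be called with LOCK_NB so that it returns immediately, and you can retry until a deadline of your own. The following is a minimal sketch; the helper lock_with_deadline is hypothetical, and the two "callers" are simulated with two separate file handles in one script, which assumes a Linux system where separately opened handles conflict with each other.

```php
<?php
// Hypothetical helper: try to take an exclusive lock, giving up after $timeout seconds.
function lock_with_deadline($handle, float $timeout): bool
{
    $deadline = microtime(true) + $timeout;
    do {
        // LOCK_NB makes flock return immediately instead of blocking.
        if (flock($handle, LOCK_EX | LOCK_NB)) {
            return true;
        }
        usleep(10000); // wait 10 ms before trying again
    } while (microtime(true) < $deadline);
    return false;
}

$file = sys_get_temp_dir() . '/lock_demo_' . uniqid() . '.txt';

// Caller 2 opens the file and obtains the lock first.
$caller2 = fopen($file, 'c+');
$caller2Locked = lock_with_deadline($caller2, 0.2);

// Caller 1 arrives late: it cannot get the lock while caller 2 holds it.
$caller1 = fopen($file, 'c+');
$caller1Early = lock_with_deadline($caller1, 0.2);

// Once caller 2 releases the lock, caller 1 succeeds.
flock($caller2, LOCK_UN);
$caller1Late = lock_with_deadline($caller1, 0.2);

flock($caller1, LOCK_UN);
fclose($caller1);
fclose($caller2);
```

Either way, caller 1 never writes while caller 2 holds the lock; only the order of access has changed.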
Again, it is important to consider that the order of access might be important. For example, if your application deals with finances, then the order is likely important – but then you should be using a database anyway!
When we say something is atomic, we just mean that we can no longer divide it into smaller parts. This is not true for flock in the sense that the order of access may be changed under certain conditions. It is important to understand, however, that this does not mean we cannot use it for reliably locking files.
If the small window between fopen and flock is relevant to your application, flock is not atomic in this sense, and should probably be considered unsafe.
This is still likely to be an extremely rare event, because the window is truly minuscule, and of course, it would not actually matter for most applications even if it did happen.
Our imagined scenario where access to a given file is strictly controlled through the same script shows that, theoretically, the worst to happen would be that the order of caller access is changed. In fact, it might even be different scripts; what matters is that they will be calling flock to obtain a file lock before making changes to the file.
The order of access is probably irrelevant for most applications. What matters is that we can reliably lock the file and expect nothing else to corrupt the data while the file is locked; this can only be done when access is strictly controlled from a single access point – or the scripts at least have to use the same locking mechanism, because these file locks, unfortunately, are only advisory.
Of course, once a PHP script exits, a file lock will be released regardless of whether it is done explicitly in the script, so we do not need to worry about that.
However, there is one issue with this setup; something might still cause the PHP process to exit while data is being written to a file. This can perhaps happen if PHP runs out of memory, and the process is killed with an out of memory message (OOM). In such a case a lock may or may not be released in a timely manner, and indeed, you might corrupt the file that is being written to.
Therefore, in order to avoid file corruption in the event of a failure, it is probably best to do the following:
1. Obtain an exclusive lock using fopen and flock as you would normally.
2. Make a copy of the file before writing any data to it.
3. Write the data to the file-copy.
4. If the write operation was successful, replace the original file with the copy.
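The four steps could be sketched roughly as follows. The function name update_via_copy and the ".lock" and ".tmp" file names are assumptions made for this example. Note one deviation from step 1: the lock is taken on a separate lock file rather than on the data file itself, because the data file is swapped out in step 4 and a lock on the old file would not protect the new one. The sketch also assumes the copy lives on the same file system as the original, so that rename can replace it in one step.

```php
<?php
// Hypothetical helper following the four steps: lock, copy, write to the copy, replace.
function update_via_copy(string $path, string $newData): bool
{
    // Step 1: take the exclusive lock. A separate lock file is used here (an
    // assumption of this sketch) because the original file is replaced in step 4.
    $lock = fopen($path . '.lock', 'c');
    if ($lock === false) {
        return false;
    }
    $copy = $path . '.tmp';
    try {
        if (!flock($lock, LOCK_EX)) {
            return false;
        }
        // Step 2: copy the original, or start from an empty copy if it does not exist yet.
        $copied = file_exists($path)
            ? copy($path, $copy)
            : file_put_contents($copy, '') !== false;
        if (!$copied) {
            return false;
        }
        // Step 3: write to the copy only; the original stays untouched.
        if (file_put_contents($copy, $newData, FILE_APPEND) === false) {
            return false;
        }
        // Step 4: replace the original with the finished copy.
        return rename($copy, $path);
    } finally {
        flock($lock, LOCK_UN);
        fclose($lock);
    }
}

// Usage: two updates in a row on a temporary file.
$file = sys_get_temp_dir() . '/article_' . uniqid() . '.txt';
update_via_copy($file, "first\n");
$ok = update_via_copy($file, "second\n");
```

If the process dies during step 3, only the ".tmp" copy is left half-written, and it is simply overwritten on the next update.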
Of course, this is an extremely costly set of actions that will probably not be practical in a busy environment. So what can be done instead?
We know that appending data to the end of a file is basically always safe, and yes, you do still need to lock the file. For example:
file_put_contents($file, $input, FILE_APPEND | LOCK_EX);
So, when making changes to a file, they should probably always be appended. You will have to implement a unique "string" that indicates both the beginning and end of an entry in your file, and if, then, the "end" is never reached for a given entry while reading from the file, it should just be ignored by processes reading the file.
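Here is a minimal sketch of that idea. The marker strings and the function names append_entry and read_entries are made up for this example, and the entry body is base64 encoded (my own choice, not a requirement) so that the markers can never appear inside an entry by accident.

```php
<?php
// Hypothetical markers; base64 output never contains "-" or newlines,
// so these strings cannot occur inside an encoded entry.
const ENTRY_BEGIN = "\n--BEGIN-ENTRY--\n";
const ENTRY_END = "\n--END-ENTRY--\n";

// Append one complete entry while holding an exclusive lock.
function append_entry(string $file, string $body): bool
{
    $record = ENTRY_BEGIN . base64_encode($body) . ENTRY_END;
    return file_put_contents($file, $record, FILE_APPEND | LOCK_EX) !== false;
}

// Return all complete entries; entries without an end marker are ignored.
function read_entries(string $file): array
{
    $entries = [];
    $chunks = explode(ENTRY_BEGIN, (string) file_get_contents($file));
    array_shift($chunks); // drop anything before the first begin marker
    foreach ($chunks as $chunk) {
        $end = strpos($chunk, ENTRY_END);
        if ($end === false) {
            continue; // the "end" was never reached: an interrupted write
        }
        $decoded = base64_decode(substr($chunk, 0, $end), true);
        if ($decoded !== false) {
            $entries[] = $decoded;
        }
    }
    return $entries;
}

// Usage: two good entries, one simulated interrupted write, then another good entry.
$file = sys_get_temp_dir() . '/entries_' . uniqid() . '.txt';
append_entry($file, "Hello world");
append_entry($file, "Second entry\nwith two lines");
file_put_contents($file, ENTRY_BEGIN . 'SGVsb', FILE_APPEND); // no end marker
append_entry($file, "Written after the failure");
$entries = read_entries($file);
```

The half-written entry stays in the file, but readers skip it, and entries appended after it are still found.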
The disadvantage is that if you make changes to, let's say an article for a website, it will have to be appended, and you will have to parse the entire file in order to find the most recent version of the article. It also has the advantage, however, that you can keep and retrieve previous versions if needed.
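To illustrate the versioning side of this, here is a small sketch that uses a simpler, hypothetical format: one JSON object per line, where the newline acts as the end of an entry. The function names and the "id" and "body" fields are assumptions for this example only.

```php
<?php
// Hypothetical format: one JSON object per line, e.g. {"id":"about","body":"..."}.
function save_version(string $file, string $id, string $body): bool
{
    // The leading newline keeps a previous half-written line from merging with this one.
    $line = "\n" . json_encode(['id' => $id, 'body' => $body]) . "\n";
    return file_put_contents($file, $line, FILE_APPEND | LOCK_EX) !== false;
}

// Parse the whole file and collect every complete version of one article, oldest first.
function all_versions(string $file, string $id): array
{
    $versions = [];
    $handle = fopen($file, 'r');
    if ($handle === false) {
        return $versions;
    }
    while (($line = fgets($handle)) !== false) {
        $entry = json_decode($line, true);
        // Blank, incomplete, or otherwise invalid lines decode to null and are skipped.
        if (is_array($entry) && isset($entry['id'], $entry['body']) && $entry['id'] === $id) {
            $versions[] = $entry['body'];
        }
    }
    fclose($handle);
    return $versions;
}

// The most recent version is simply the last complete one in the file.
function latest_version(string $file, string $id): ?string
{
    $versions = all_versions($file, $id);
    return $versions ? end($versions) : null;
}

// Usage: two articles, one of them edited, plus a simulated interrupted write.
$file = sys_get_temp_dir() . '/articles_' . uniqid() . '.txt';
save_version($file, 'about', 'v1');
save_version($file, 'contact', 'x');
save_version($file, 'about', 'v2');
file_put_contents($file, "\n" . '{"id":"about","bo', FILE_APPEND); // half-written line
$latest = latest_version($file, 'about');
$history = all_versions($file, 'about');
$missing = latest_version($file, 'no-such-article');
```

Note that finding the latest version means reading the file from start to end, which is exactly the cost described above.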
This is probably impractical when dealing with binary data; on the web, that would probably mostly be images and videos, and for such files, you would need to know how such data can be properly verified after writing it. However, this is still of much less concern than if working with a flat-file database of sorts.
You should also keep in mind that file access can be somewhat slow, so once you reach a few million lines in a file, access may slow down noticeably. That said, even for a file with millions of lines, you should usually still be able to parse it in less than a second. This depends on various factors, such as disk performance, line lengths, and buffer sizes, but you should get the idea.
But, at this point, you will have to come up with ways to access the data faster, e.g. by caching the file in memory, by dividing the data into multiple files, or even by storing each article in a separate file. Concurrency should also be factored into this, because if one user accesses a page in 0.1 seconds, two simultaneous users are likely to cause a slowdown for each that can be significant. It will not necessarily double the access time to 0.2 seconds for each user, but it can cause a noticeable slowdown. You may need to test this on your specific setup to find out what happens.
Databases also store data in the file system, but they also leverage caching, indexing, and other strategies to ensure quick access and reliable locking.
Once you begin to need optimization on this level, the use of a database will start to become relevant, because a database is essentially what you are on the path to building on your own, and that is just re-inventing the wheel. Remember, free and open source databases can fulfill those needs, and although it is fun to experiment on your own, it is mostly going to be a waste of time in this case, and you are probably never going to use those specialized skills for anything during your web development career. Also, a PHP-based database is going to be significantly slower than one implemented in C++ or another compiled programming language.