There are also a large number of third-party libraries for a wide range of applications, from networking to graphics. You should understand the following: the OCaml compilers know where the standard library is and use it systematically (try ocamlc -where), so you don't have to worry much about it. The same holds for the other libraries that ship with the OCaml distribution (str, unix, etc.). Third-party libraries, however, may be installed in various places, and even a given library can be installed in different places from one system to another.
If your program uses the unix library in addition to the standard library, for example, you must name the library archive (the file unix.cma, or unix.cmxa for native code) on the command line. If your program depends upon third-party libraries, you must pass them on the command line as well, and you must also indicate the libraries on which these libraries depend.
You must also pass the -I option to ocamlopt for each directory where they may be found. This becomes complicated, and the information is installation dependent, so we will use ocamlfind instead, which does these jobs for us.
The ocamlfind front-end is often used for compiling programs that use third-party OCaml libraries, and library authors themselves make their libraries installable with ocamlfind as well. You can install ocamlfind using the opam package manager, by typing opam install ocamlfind. To create a directory or remove an empty directory, we have mkdir and rmdir:
The second argument of mkdir determines the access rights of the new directory. Note that we can only remove a directory that is already empty. To remove a directory and its contents, it is thus necessary to first recursively empty the contents of the directory and then remove the directory.
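The recursive removal just described can be sketched as follows. This is an illustrative helper, not code from the original text; the function name is made up, and error handling is omitted.

```ocaml
(* Sketch: remove a directory and all of its contents recursively.
   Symbolic links are not followed: they are unlinked, not descended into. *)
let rec remove_rec path =
  let st = Unix.lstat path in
  match st.Unix.st_kind with
  | Unix.S_DIR ->
      (* first empty the directory, then remove it *)
      Array.iter
        (fun name -> remove_rec (Filename.concat path name))
        (Sys.readdir path);
      Unix.rmdir path
  | _ -> Unix.unlink path
```

For example, remove_rec "tmp" deletes the directory tmp and everything below it.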
The Unix command find lists the files of a hierarchy matching certain criteria (file name, type, permissions, etc.). In this section we develop a library function Findlib.find. The paths found under the root r include r as a prefix. Each found path p is given to the function action along with the data returned by Unix.lstat p (or Unix.stat p when symbolic links are followed). The function action returns a boolean indicating, for directories, whether the search should continue for its contents (true) or not (false).
Whenever an error occurs, the arguments of the exception are given to the handler function and the traversal continues. However, when an exception is raised by the functions action or handler themselves, we immediately stop the traversal and let it propagate to the caller. A directory is identified by the pair (line 12) made of its device and inode numbers.
The list visiting keeps track of the directories that have already been visited; in fact this information is only needed if symbolic links are followed. It is now easy to program the find command. The essential part of the code parses the command line arguments with the Arg module. Although our find command is quite limited, the library function Findlib.find is much more general. As an exercise, implement the function getcwd; getcwd is not a system call but is defined in the Unix module.
First describe the principle of your algorithm in words and then implement it (you should avoid repeating the same system call). Here are some hints.
We move up from the current position towards the root and construct backwards the path we are looking for.
The root can be detected as the only directory node whose parent is equal to itself. To find the name of a directory r, we need to list the contents of its parent directory and detect the entry that corresponds to r. The openfile function allows us to obtain a descriptor for a file of a given name (the corresponding system call is open, but open is a keyword in OCaml). The first argument is the name of the file to open; the second is a list of flags that determine, among other things, whether read or write calls can be done on the descriptor.
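The getcwd hints above can be sketched as follows. This is a rough illustrative version (not the one alluded to in the text): for simplicity it changes the working directory as it walks up, and omits error handling.

```ocaml
(* Sketch of getcwd: walk up through "..", matching device/inode numbers
   to recover the name of each path component. *)
let my_getcwd () =
  let rec loop path st =
    let parent_st = Unix.stat ".." in
    if parent_st.Unix.st_dev = st.Unix.st_dev
       && parent_st.Unix.st_ino = st.Unix.st_ino
    then (* the parent is the directory itself: we reached the root *)
      (if path = "" then "/" else path)
    else begin
      Unix.chdir "..";
      (* scan the parent directory for the entry matching st *)
      let dh = Unix.opendir "." in
      let rec find () =
        let name = Unix.readdir dh in   (* raises End_of_file if absent *)
        if name <> "." && name <> ".." then begin
          let e = Unix.lstat name in
          if e.Unix.st_dev = st.Unix.st_dev && e.Unix.st_ino = st.Unix.st_ino
          then name else find ()
        end else find ()
      in
      let name = find () in
      Unix.closedir dh;
      loop ("/" ^ name ^ path) parent_st
    end
  in
  loop "" (Unix.stat ".")
```

A full solution would record the starting directory and chdir back to it at the end.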
The call to openfile fails if the process requests an open in write (resp. read) mode on a file it is not allowed to write (resp. read). Most programs use 0o666 for the third argument to openfile, which means rw-rw-rw- in symbolic notation. With the default creation mask of 0o022, the file is thus created with the permissions rw-r--r--. With a more lenient mask of 0o002, the file is created with the permissions rw-rw-r--. If the file will contain executable code (e.g. a file produced by a linker), it is created with executable permissions. If the file must be confidential (e.g. a mailbox), it is created with permissions restricted to the owner. The last group of flags specifies how to synchronize read and write operations.
By default these operations are not synchronized. The system calls read and write read and write bytes in a file. The first argument is the file descriptor to act on, the second a buffer, the third the position in the buffer of the first byte to be written or read, and the fourth the number of bytes to be read or written.
After the system call, the current position is advanced by the number of bytes read or written. For writes, the number of bytes actually written is usually the number of bytes requested.
However there are exceptions, for instance when it is not possible to write the bytes (e.g. the disk is full), and, specific to OCaml, when the data exceeds the size of OCaml's internal buffers. The reason for the latter is that internally OCaml uses auxiliary buffers whose size is bounded by a maximal value; if this value is exceeded, the write will be partial. To work around this problem, OCaml also provides the function write, which iterates the writes until all the data is written or an error occurs.
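Putting openfile and write together, a minimal sketch (the file name hello.txt is made up for the example):

```ocaml
(* Open (creating/truncating) a file with 0o666 permissions, write a
   string to it, then close the descriptor. *)
let () =
  let fd =
    Unix.openfile "hello.txt" [Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC] 0o666
  in
  let data = Bytes.of_string "hello, world\n" in
  let n = Unix.write fd data 0 (Bytes.length data) in
  assert (n = Bytes.length data);   (* Unix.write iterates until done *)
  Unix.close fd
```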
For reads, it is possible that the number of bytes actually read is smaller than the number of requested bytes: for example, when the end of file is near, that is, when the number of bytes between the current position and the end of file is less than the number of requested bytes.
In particular, when the current position is at the end of file, read returns zero. For example, read on a terminal returns zero if we issue a ctrl-D on the input. Reading from a terminal is special in another way: read blocks until an entire line is available.
If the line length is smaller than the number of requested bytes, read returns immediately with the line, without waiting for more data to reach the number of requested bytes. This is the default behavior for terminals, but it can be changed to read character-by-character instead of line-by-line (see the discussion of terminals below).
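A sketch of the kind of expression the text alludes to: read at most 128 bytes from standard input (the size 128 is an arbitrary choice for the example).

```ocaml
(* Read at most 128 bytes from stdin. read may return fewer bytes than
   requested, so its result is used to cut the string to size. *)
let read_some () =
  let buf = Bytes.create 128 in
  let n = Unix.read Unix.stdin buf 0 128 in
  Bytes.sub_string buf 0 n
```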
The following expression reads characters from standard input (up to a fixed maximum) and returns them as a string. The system call close closes a file descriptor. Once a descriptor is closed, all attempts to read, write, or do anything else with the descriptor will fail. Descriptors should be closed when they are no longer needed, but it is not mandatory.
On the other hand, the number of descriptors allocated by a process is limited by the kernel (from several hundred to a few thousand).
Doing a close on an unused descriptor releases it, so that the process does not run out of descriptors. To copy a file, we first open a descriptor in read-only mode on the input file and another in write-only mode on the output file, created if needed with fixed default permissions. This is unsatisfactory: if we copy an executable file, we would like the copy to be executable too.
We will see later how to give a copy the same permissions as the original. If read returns zero, we have reached the end of file and the copy is over. Otherwise we write the r bytes we have read to the output file and start again.
Finally, we close the two descriptors. Examples of errors include the inability to open the input file because it does not exist, failure to read because of restricted permissions, failure to write because the disk is full, etc. Why not read byte by byte, or megabyte by megabyte? The reason is efficiency. The amount of data transferred is the same regardless of the size of the blocks.
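A sketch of the copy loop described above; the buffer size 8192 is one "reasonable block" choice, and error handling is omitted.

```ocaml
let buffer_size = 8192
let buffer = Bytes.create buffer_size

(* Copy input_name to output_name, block by block. *)
let file_copy input_name output_name =
  let fd_in = Unix.openfile input_name [Unix.O_RDONLY] 0 in
  let fd_out =
    Unix.openfile output_name
      [Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC] 0o666
  in
  let rec copy_loop () =
    match Unix.read fd_in buffer 0 buffer_size with
    | 0 -> ()                                        (* end of file *)
    | r -> ignore (Unix.write fd_out buffer 0 r); copy_loop ()
  in
  copy_loop ();
  Unix.close fd_in;
  Unix.close fd_out
```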
For small block sizes, the copy speed is almost proportional to the block size. Profiling more carefully shows that most of the time is spent in the calls to read and write. We conclude that a system call, even if it has not much to do, takes a minimum of about 4 microseconds on the machine used for the test.
For larger blocks, between 4KB and 1MB, the copy speed is constant and maximal. Here, the time spent in system calls and the loop is small relative to the time spent on the data transfer.
Also, the buffer size becomes bigger than the cache sizes used by the system, and the time spent by the system on the transfer dominates the cost of a system call.
Finally, for very large blocks (8MB and more) the speed is slightly under the maximum. Coming into play here is the time needed to allocate the block and assign memory pages to it as it fills up. The moral of the story is that a system call, even if it does very little work, costs dearly, much more than a normal function call: roughly 2 to 20 microseconds per call, depending on the architecture.
It is therefore important to minimize the number of system calls. In particular, read and write operations should be made in blocks of reasonable size and not character by character. However, other types of programs are more naturally written with character-by-character input or output, and for them the standard library provides a buffered I/O layer. This layer uses buffers to group sequences of character-by-character reads or writes into a single system call to read or write.
This results in better performance for programs that proceed character by character. Moreover, this additional layer makes programs more portable: to port all the programs that use this library to a new platform, we just need to implement the layer with the system calls provided by that operating system.
Here is the interface. When we open a file for reading, we create a buffer of reasonable size (large enough so as not to make too many system calls, small enough so as not to waste memory). To read a character, either the buffer contains at least one unread character and we return the next one, or the buffer is empty and we call read to refill it. Writing is symmetric; the only asymmetry is that the buffer now contains incomplete writes (characters that have been buffered but not yet written to the file descriptor) rather than reads in advance (characters that have been buffered but not yet read).
To write a character, either there is room in the buffer and we store the character there, or the buffer is full and we empty it with a call to write, then store the character at the beginning of the buffer. To write a string, the idea is to copy the string into the buffer. We need to take into account the case where there is not enough space in the buffer (in that case the buffer needs to be emptied), and also the case where the string is longer than the buffer (in that case it can be written directly).
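The buffered-writing scheme just described can be sketched like this. The type and function names are made up for the sketch; the standard library's own channels implement the same idea.

```ocaml
(* A buffered writer in the spirit of the text. *)
type out_chan = { fd : Unix.file_descr; buf : Bytes.t; mutable pos : int }

let open_out_fd fd = { fd; buf = Bytes.create 8192; pos = 0 }

(* Empty the buffer with a single call to write. *)
let flush oc =
  if oc.pos > 0 then begin
    ignore (Unix.write oc.fd oc.buf 0 oc.pos);
    oc.pos <- 0
  end

let output_char oc c =
  if oc.pos >= Bytes.length oc.buf then flush oc;   (* buffer full *)
  Bytes.set oc.buf oc.pos c;
  oc.pos <- oc.pos + 1

let output_string oc s =
  let len = String.length s in
  if len >= Bytes.length oc.buf then begin
    (* longer than the buffer: flush, then write directly *)
    flush oc;
    ignore (Unix.write_substring oc.fd s 0 len)
  end else begin
    if oc.pos + len > Bytes.length oc.buf then flush oc;  (* not enough room *)
    Bytes.blit_string s 0 oc.buf oc.pos len;
    oc.pos <- oc.pos + len
  end
```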
Here is a possible solution. The system call lseek changes the current position of a file descriptor. The first argument is the file descriptor and the second the desired position. The third argument, of an enumerated type, specifies how the position is interpreted:
An error is raised if a negative absolute position is requested. The requested position can be located after the end of file; in that case, a read returns zero (end of file reached) and a write extends the file with zeros up to that position and then writes the supplied data. Note that for a file opened with the O_APPEND flag, writes always go to the end of the file regardless of the current position; a call to lseek is then useless to set the write position, though it may be useful to set the read position. The behavior of lseek is undefined on certain types of files for which absolute access is meaningless: communication devices (pipes, sockets) but also many special files like the terminal.
In some implementations, lseek on a pipe or a socket triggers an error. As an exercise, consider the command tail, which displays the last n lines of a file. How can it be implemented efficiently on regular files? What can we do for the other kinds of files? How can the option -f be implemented?
A naive implementation of tail reads the file sequentially from the beginning, keeping the last n lines read in a circular buffer. When we reach the end of file, we display the buffer. When the data comes from a pipe or a special file which does not implement lseek, there is no better way.
However, if the data is coming from a normal file, it is better to read the file from the end. Using lseek, we read the last block of characters and scan it for ends of lines. If there are at least n of them, we extract and display the corresponding lines.
Otherwise, we start again, adding the preceding block of characters, and so on. To implement the option -f, we first proceed as above, then go back to the end of the file and try to read from there.
If read returns data, we display it immediately and start again. If it returns 0, we wait some time (sleep 1) and try again. In Unix, data communication is done via file descriptors representing either permanent files (regular files, peripherals) or volatile ones (pipes and sockets, see chapters 5 and 6).
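The lseek-from-the-end idea can be sketched with a helper like this (a hypothetical function, not the full tail):

```ocaml
(* Return the last (at most) n bytes of a regular file open on fd. *)
let read_last n fd =
  let size = (Unix.fstat fd).Unix.st_size in
  let start = max 0 (size - n) in
  ignore (Unix.lseek fd start Unix.SEEK_SET);
  let len = size - start in
  let buf = Bytes.create len in
  let rec fill pos =
    if pos < len then
      match Unix.read fd buf pos (len - pos) with
      | 0 -> ()                      (* unexpected end of file *)
      | r -> fill (pos + r)
  in
  fill 0;
  Bytes.to_string buf
```

tail would scan the returned block for newlines and, if fewer than n are found, call read_last again with a larger n.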
File descriptors provide a uniform and media-independent interface for data communication. Of course, the actual implementation of the operations on a file descriptor depends on the underlying media. However, this uniformity breaks when we need to access all the features provided by a given media: general operations (opening, writing, reading, etc.) work with all media, but some operations only work with certain kinds of media. For instance, we can shorten a regular file with the system calls truncate and ftruncate.
The first argument is the file to truncate and the second the desired size. All the data after this position is lost.
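A minimal check of ftruncate (the file name is made up for the example):

```ocaml
let () =
  let fd =
    Unix.openfile "trunc.dat" [Unix.O_RDWR; Unix.O_CREAT; Unix.O_TRUNC] 0o666
  in
  ignore (Unix.write_substring fd "0123456789" 0 10);
  Unix.ftruncate fd 4;                 (* keep only the first 4 bytes *)
  assert ((Unix.fstat fd).Unix.st_size = 4);
  Unix.close fd
```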
The two system calls symlink and readlink operate specifically on symbolic links. The call symlink f1 f2 creates the file f2 as a symbolic link to f1 (like the Unix command ln -s f1 f2). The call readlink returns the content of a symbolic link, i.e. the path it points to. Special files come in two kinds: character devices and block devices. The former are character streams: we can read or write characters only sequentially; these are the terminals, sound devices, printers, etc. The latter, typically disks, have a permanent medium: characters can be read by blocks and can even be seeked relative to the current position.
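The symlink and readlink calls mentioned above can be exercised as follows (the file names are made up):

```ocaml
let () =
  let oc = open_out "target.txt" in
  output_string oc "hi"; close_out oc;
  (* like the shell command: ln -s target.txt alias.txt *)
  Unix.symlink "target.txt" "alias.txt";
  (* readlink returns the path stored in the link itself *)
  assert (Unix.readlink "alias.txt" = "target.txt");
  (* opening the link reaches the target's content *)
  let ic = open_in "alias.txt" in
  assert (really_input_string ic 2 = "hi");
  close_in ic
```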
The usual file system calls can behave differently on special files. However, most special files (terminals, tape drives, disks, etc.) also support specific operations: for a tape drive, rewinding or fast-forwarding the tape; for a terminal, choosing the line editing mode, the behavior of special characters, the serial connection parameters (speed, parity, etc.).
These operations are accessed in Unix through the system call ioctl, which groups together all the particular cases. However, this system call is not provided by OCaml: it is ill-defined and cannot be treated in a uniform way.
Terminals and pseudo-terminals are special files of character type which can be configured from OCaml. The function tcgetattr returns a structure describing the attributes of such a peripheral; this structure can be modified and given to the function tcsetattr to change the attributes. The first argument of both functions is the file descriptor of the peripheral. As an example, when a password is read, the characters entered by the user should not be echoed if the standard input is connected to a terminal or a pseudo-terminal.
The function first fetches the current terminal settings, then defines a modified version of these in which characters are not echoed. If this fails, the standard input is not a control terminal and we just read a line. Otherwise we display a message, change the terminal settings, read the password, and put the terminal back in its initial state. Care must be taken to set the terminal back to its initial state even after a read failure. Sometimes a program needs to start another program and connect its standard input to a terminal or pseudo-terminal.
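The password-reading routine described above can be sketched as follows; this is a reconstruction along the lines of the text, not its exact code.

```ocaml
(* Read a password: if stdin is not a terminal, just read a line;
   otherwise disable echo around the read and restore the settings,
   even if the read fails. *)
let read_passwd message =
  let attrs =
    try
      let default = Unix.tcgetattr Unix.stdin in
      Some (default, { default with Unix.c_echo = false })
    with Unix.Unix_error _ -> None
  in
  match attrs with
  | None ->
      (* not a control terminal: read a plain line *)
      print_string message; flush stdout;
      read_line ()
  | Some (default, silent) ->
      print_string message; flush stdout;
      Unix.tcsetattr Unix.stdin Unix.TCSANOW silent;
      let pw =
        try read_line ()
        with e ->
          Unix.tcsetattr Unix.stdin Unix.TCSANOW default;
          raise e
      in
      Unix.tcsetattr Unix.stdin Unix.TCSANOW default;
      print_newline ();
      pw
```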
OCaml does not provide any support for this. Once a pseudo-terminal has been created, we can open the corresponding file and start the program with this file on its standard input. Four other functions control the stream of data of a terminal: flushing waiting data, waiting for the end of transmission, and suspending or restarting communication.
The function tcsendbreak sends an interrupt to the peripheral. The second argument is the duration of the interrupt 0 is interpreted as the default value for the peripheral. The function tcdrain waits for all written data to be transmitted. The function setsid puts the process in a new session and detaches it from the terminal.
Two processes can modify the same file in parallel; however, their writes may collide and result in inconsistent data. Appending at the end of the file is fine for log files, but it does not work for files that store, for example, a database, because writes are performed at arbitrary positions. In that case, processes using the file must collaborate in order not to step on each other's toes.
A lock on the whole file can be implemented with an auxiliary file (see page ??). For directories, we recursively copy their contents. The system call utime modifies the dates of access and modification; we use it to preserve this information for copied files. We use chmod and chown to re-establish the access rights and the owner.
For a normal user, an operation such as chown may be forbidden; we catch this error and ignore it. We begin by reading the information of the source file. If it is a symbolic link, we read where it points to and create a link pointing to the same object.
All other file types are ignored, with a warning. As an exercise, copy hard links cleverly: a file may have several names in the source hierarchy. Try to detect this situation, copy the file only once, and make hard links in the destination hierarchy.
Before each copy we consult the map to see if a file with the same identity has already been copied. To minimize the size of the map, we remember only the files which may have more than one name, i.e. whose link count is greater than one. The tar file format (for tape archive) can store a file hierarchy in a single file.
It can be seen as a mini file system. In this section we define functions to read and write tar files. We also program a command readtar such that readtar a displays the names of the files contained in the archive a, and readtar a f extracts the contents of the file f contained in a.
Extracting the whole file hierarchy of an archive and generating an archive for a file hierarchy is left as an exercise. A tar archive is a set of records.
Each record represents a file; it starts with a header which encodes information about the file (its name, type, size, owners, etc.). The header is a block of 512 bytes structured as shown in table 3. The file contents are stored right after the header, with the size rounded up to a multiple of 512 bytes (the extra space is filled with zeros). Records are stored one after the other. If needed, the archive is padded with empty blocks to reach at least 20 blocks.
Since tar archives are also designed to be written on brittle media and reread many years later, the header contains a checksum field which allows detection of a damaged header. Its value is the sum of all the bytes of the header (to compute that sum, the checksum field itself is taken to be filled with blanks). The kind header field encodes the file type in a byte as follows:
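The checksum computation can be sketched directly from this description; the field offsets below assume the standard ustar layout, where the 8-byte checksum field starts at offset 148.

```ocaml
(* Compute a tar header checksum: the sum of all 512 header bytes,
   with the checksum field itself counted as blanks (spaces). *)
let checksum (header : Bytes.t) =
  let sum = ref 0 in
  for i = 0 to 511 do
    let b =
      if i >= 148 && i < 156 then Char.code ' '   (* checksum field *)
      else Char.code (Bytes.get header i)
    in
    sum := !sum + b
  done;
  !sum
```

For an all-zero header the result is 8 × 32 = 256, the contribution of the blanked-out checksum field alone.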
LINK is for hard links, which must lead to another file already stored within the archive. CONT is for an ordinary file stored in a contiguous area of memory (this is a feature of some file systems); we can treat it like an ordinary file. These fields are not used in other cases. The value of the kind field is naturally represented by a variant type, and the header by a record. An archive ends either at the end of file (where a new record would start) or on a complete, but empty, block.
To read a header, we thus try to read a block, which must be either empty or complete; the end of file should not be reached when we try to read a block. To perform an operation on an archive, we need to read the records sequentially until we find the target of the operation. Usually we just need to read the header of each record, without its contents, but sometimes we also need to go back to a previous record to read its contents.
As such, we keep, for each record, its header and its location in the archive. We define a general iterator that reads and accumulates the records of an archive without their contents. To remain general, the accumulating function f is abstracted; this allows the same iterator function to be used to display records, destroy them, etc.
The iterator moves to the offset where a record should start, reads a header, constructs the record r, and starts again at the end of the record with the new, less partial, result f r accu. The command readtar a f must look for the file f in the archive and, if it is a regular file, display its contents.
If f is a hard link on g in the archive, we follow the link and display g: even though f and g are represented differently in the archive, they represent the same file.
The fact that f is a link on g or vice versa depends only on the order in which the files were traversed when the archive was created. For now, we do not follow symbolic links. If r is a regular file, r is returned; in all other cases, the function aborts. Once the record is found, we just need to display its contents. We read the records in the archive, but not their contents, until we find the record with the target name. This second, backward, search must succeed if the archive is well-formed.
The first search may however fail if the target name is not in the archive; in case of failure, the program takes care to distinguish between these two cases. Behind this apparently trivial requirement hide some difficulties: symbolic links are arbitrary paths, they can point to directories (which is not allowed for hard links), and they may not correspond to files contained in the archive.
Nodes of this in-memory file system are described by the inode type. The info field describes the file type, limited to ordinary files, symbolic links, and directories. Paths are represented by lists of strings, and directories by lists that associate a node to each file name in the directory. The record field stores the tar record associated with the node.
This field is optional because intermediate directories are not always present in the archive; it is mutable because a file may appear more than once in the archive, and the last occurrence takes precedence over the others. As in Unix, each directory contains a link to itself and to its parent, except for the root directory, which has no parent link (in contrast to Unix, where the root is its own parent).
This allows us to detect and forbid any access outside the hierarchy contained in the archive. The function find finds in the archive the node corresponding to path, starting from the initial node inode. If the search result is a link, the flag link indicates whether the link itself should be returned (true) or the file pointed to by the link (false). The function mkpath traverses the path path, creating missing nodes along the way.
The function explode parses a Unix path into a list of strings. The function add adds the record r to the archive.
The archive, represented by its root node, is modified by a side effect. As an exercise, write a command untar such that untar a extracts and creates all the files in the archive a (except special files), restoring if possible the information about the files (owners, permissions) as found in the archive.
The file hierarchy should be reconstructed in the current working directory of the untar command. If the archive tries to create files outside a sub-directory of the current working directory, this should be detected and prohibited. A further exercise combines the previous ones: write a program tar such that tar -xvf a extracts the files of the archive a, and tar -cvf a f1 … fn constructs the archive a from the files f1, …, fn. We reuse the data structures already defined above and collect them in a Tarlib module.
We define a warning function which neither stops the program nor alters its return code. We start with the function that writes a record header into a buffer. In particular, we must pay attention to the limits imposed by the file format.
The following function creates a record header for a file, provided the file fits the limits of the format; otherwise, an error is raised. An error is also raised in the abnormal case where a file is modified during the archival process. We now tackle the creation of the archive.
The files already written to the archive are stored in a hash table with their paths so that they are not copied more than once. The data needed to write an archive is: a file descriptor pointing to the file to write, the file and directory cache (see above), and a size variable that remembers the current archive size so that the archive can be padded to a minimal size if needed. The archive type collects all this information in a record. Here is the main function, which writes an entire hierarchy starting from a file path given on the command line.
This function is not difficult but needs some care with pathological cases. In particular, we saw above how to detect when a file is modified during archival; a sub-case of this is when the archive being written is itself archived…
We keep track of regular files that may have hard links in the regfiles table. A process is a program executing on the operating system. It consists of a program (machine code) and a state of the program (current control point, variable values, call stack, open file descriptors, etc.).
This section presents the Unix system calls to create new processes and make them run other programs. The system call fork creates a process. The new child process is a nearly perfect clone of the parent process which called fork.
Both processes execute the same code, are initially at the same control point the return from fork , attribute the same values to all variables, have identical call stacks, and hold open the same file descriptors to the same files.
The only thing which distinguishes the two processes is the return value of fork: zero in the child process, and a non-zero integer (the pid of the child) in the parent. By checking the return value of fork, a program can thus determine if it is in the parent process or the child and behave accordingly. The interpreter can also be used in batch mode, for running scripts: the name of the file containing the code to be interpreted is passed as an argument on the command line of the interpreter. Thus, if you inadvertently type the opening command, you may think that the interpreter is broken, because it swallows all your input without ever sending any output but the prompt.
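A minimal sketch of the fork pattern just described; the function name and the exit code 7 are made up for the example.

```ocaml
(* The child and the parent take different branches of the match on
   fork's return value; the parent waits for the child to terminate. *)
let fork_demo () =
  match Unix.fork () with
  | 0 ->
      (* child: fork returned 0; do some work, then terminate *)
      exit 7
  | child_pid ->
      (* parent: fork returned the child's pid *)
      let _pid, status = Unix.waitpid [] child_pid in
      status
```

Here fork_demo () evaluates to Unix.WEXITED 7 once the child has exited.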
For instance, try the infinite loop. The input is taken into account immediately, with no trailing carriage return, and produces the following message. Note that there is no notion of instruction or procedure, since all expressions must return a value; the unit value, of type unit, conveys no information: it is the unique value of its type. The -o cat option specifies the name of the resulting executable (cat in this case), just as it does for most C compilers. Leave it out and you'll get the traditional name a.out.
Compiling and linking a program made up of several source files is almost as easy; there is only the issue of doing it in dependency order. Consider a trivial program that prints each command line argument on a separate line, structured as three one-line source files (modules). Let's see what would happen if we compile the files in the wrong order: the compiler complains that the modules the first file depends on have not been compiled yet. We just need to compile and link in dependency order, and it's fine to do it in one invocation of ocamlc.
But the easiest thing is to use one of the many third-party utilities, like ocamake (see below), that compute dependencies for you. Let's demonstrate a more complex project with my ocolumn program from earlier.