
`perl` As a Compiler and Virtual Machine

It is a common misconception that perl is an interpreter for Perl. The misconception arises from the fact that perl has two components within the same actual binary. First, perl has a front-end compiler which includes a lexer and parser that analyzes a Perl program and produces an intermediate representation (IR) of the program, in the form of a syntax tree. Such an environment is typical of a compiler front-end as discussed in [2].

Second, perl has a back-end, which includes an implementation of the native Perl data types (such as scalar, array, and hash), as well as the Perl Virtual Machine (PVM). The PVM takes the IR generated by perl's front-end and, using the data type implementations, evaluates it, thereby executing the code written by the Perl programmer. Thus, perl is not an interpreter; it is a combination of a compiler and a virtual machine for Perl.
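
The two-phase structure can be illustrated with a toy sketch. The Python code below is not perl's implementation: the tiny arithmetic grammar, the `Node` class, and the function names `compile_source` and `run` are all hypothetical, chosen only to show a front-end that builds a syntax tree and a back-end that evaluates it.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    """One node of the toy syntax tree (the IR). Hypothetical, not perl's OP."""
    op: str            # "const", "add", or "mul"
    kids: tuple = ()   # child nodes
    value: int = 0     # payload for "const" nodes


def compile_source(text):
    """Front-end: lex and parse `text` into a syntax tree."""
    tokens = re.findall(r"\d+|[+*]", text)   # lexer
    pos = 0

    def term():
        # term := NUMBER ('*' NUMBER)*
        nonlocal pos
        node = Node("const", value=int(tokens[pos]))
        pos += 1
        while pos < len(tokens) and tokens[pos] == "*":
            right = Node("const", value=int(tokens[pos + 1]))
            node = Node("mul", (node, right))
            pos += 2
        return node

    def expr():
        # expr := term ('+' term)*
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            node = Node("add", (node, term()))
        return node

    return expr()


def run(node):
    """Back-end: evaluate the tree produced by the front-end."""
    if node.op == "const":
        return node.value
    left, right = (run(kid) for kid in node.kids)
    return left + right if node.op == "add" else left * right


tree = compile_source("2 + 3 * 4")
print(tree.op)    # root operation of the tree: add
print(run(tree))  # 14
```

The point of the sketch is only the separation of concerns: `compile_source` never computes a result, and `run` never looks at the source text.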

When seen in this fashion, the similarities between the perl environment and the Java environment are striking. However, there are some key differences:

The JVM has a detailed written specification. The PVM is documented primarily only in the source code for perl itself.
The JVM has fewer operation codes (OP-codes) than the PVM. Indeed, the PVM has a separate OP-code for nearly all of the over 200 Perl builtin functions. Overall, there are 346 different OP-codes in perl [1].
The JVM has very simple native data types, and relies on standard class libraries to provide more complex types. The PVM has a number of complex native data types (e.g., hash, scalar and array).
Java compilers and JVMs are usually implemented separately. The PVM and the front-end compiler are tightly coupled inside perl.

These differences are really drawbacks when using perl as a compiler for a new virtual machine environment. First, because the PVM lacks a written specification, changes to the PVM tend to be known only to core perl developers. The development model is open, of course, but keeping up with the details of development is a big job regardless.

The sheer size of the PVM also makes it somewhat unwieldy. Not only are there 346 OP-codes, but most OP-codes have a number of flags and options that change the semantics of how the OP-code is evaluated. Often, the only way to truly understand how a given OP-code works is to trace through the source code of perl.
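
To make the effect of flags concrete, here is a minimal Python sketch of a single operation whose result depends on a flag bit. The `Op` class, the function `eval_reverse`, and the flag constants `WANT_SCALAR` and `WANT_LIST` are invented for this illustration and do not correspond to perl's actual flags; the behavior is loosely modeled on the Perl builtin `reverse`, which acts differently in list and scalar context.

```python
from dataclasses import dataclass

# Hypothetical flag bits, invented for this sketch (not perl's real flags).
WANT_SCALAR = 0x1
WANT_LIST = 0x2


@dataclass
class Op:
    """A toy operation: a name plus a bit field of flags."""
    name: str
    flags: int


def eval_reverse(op, args):
    """Evaluate a toy 'reverse' op; the result depends on op.flags."""
    if op.flags & WANT_LIST:
        # List flavour: reverse the order of the arguments.
        return list(reversed(args))
    # Scalar flavour: join the arguments, then reverse the string.
    return "".join(args)[::-1]


print(eval_reverse(Op("reverse", WANT_LIST), ["ab", "cd"]))    # ['cd', 'ab']
print(eval_reverse(Op("reverse", WANT_SCALAR), ["ab", "cd"]))  # dcba
```

Even in this small case, the name of the operation alone does not say what it computes; one has to read the evaluation code to see how each flag is interpreted.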

Since the source code is available and unencumbered, these problems can be mitigated simply by reading the source and becoming familiar with the system. However, the tightly coupled nature of the front-end compiler, its IR and the PVM creates additional problems that are more difficult to overcome when porting to new virtual machines.

The IR generated by perl's front-end compiler relies on a number of complex data structures to represent Perl's native data types. Since the front-end has always lived in the same binary as the PVM, the PVM assumes those data structures are available (see footnote 3.1). To perform a direct translation to a new virtual machine using perl's IR, equivalent data structures must be developed on that new virtual machine. As it turns out, these data structures constitute much of the semantics of Perl; in particular, nearly all variable accesses in Perl go through them. Thus, using the IR to implement a port to the JVM is feasible, but a challenge.
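
As a rough illustration of what "equivalent data structures" might involve, the Python sketch below defines a hypothetical `Scalar` container whose value can be read either as a string or as a number, and a toy `add` operation that reads its operands only through that container. The class, its method names, and the simplified string-to-number rule (take a leading decimal prefix, else 0) are assumptions made for this sketch; they are not perl's actual structures, and Perl's real conversion rules are more involved.

```python
import re


class Scalar:
    """Hypothetical stand-in for a Perl scalar value container."""

    # Simplified rule: an optional sign, digits, optional fraction.
    _NUM_PREFIX = re.compile(r"\s*[-+]?\d+(\.\d+)?")

    def __init__(self, value=""):
        self.value = value

    def as_string(self):
        """Read the value in string context."""
        return str(self.value)

    def as_number(self):
        """Read the value in numeric context."""
        if isinstance(self.value, (int, float)):
            return self.value
        match = self._NUM_PREFIX.match(self.value)
        if not match:
            return 0  # non-numeric strings count as zero
        text = match.group(0)
        return float(text) if "." in text else int(text)


def add(a, b):
    """Toy 'add' operation: operands are read through the container."""
    return Scalar(a.as_number() + b.as_number())


total = add(Scalar("10"), Scalar("2.5 apples"))
print(total.as_string())  # 12.5
```

The arithmetic itself is trivial; the semantics live in `as_number` and `as_string`, which is why a port cannot reuse the IR without also reproducing the containers it assumes.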


Verbatim copying and distribution of this entire thesis is permitted in any medium, provided this notice is preserved.
