Learning Git Internals with Rust

In this latest blog on Git internals with Rust, I will focus on the ls-tree command in Git. You can find the earlier posts here:

Part 1 — git init
Part 2 — git hash-object
Part 3 — refactoring
Part 4 — git cat-file

The ls-tree command is similar to doing an ls in a terminal emulator. It lists the contents of tree objects in Git. The Git object model looks like this (copied from Part 2)

git object tree

A commit object contains (among other things) a “pointer” to a tree object in the form of a SHA-1 hash (`01a2` in the diagram). The tree object contains entries referencing other Git objects at that level of the hierarchy of objects. Consider this directory structure

my_project/
  Cargo.toml
  src/
    main.rs
    commands.rs
    commands/
      ls_tree.rs

A tree object would have Cargo.toml and src as entries. The src entry would have a type tree which could then be used to recurse down and list additional tree entries (e.g. git ls-tree -r ... similar to ls -r).

$ git ls-tree c086
100644 blob 942a0dd92845b91f0ed99178b782729f233ea780    Cargo.toml
040000 tree 92bf6c48606cbce7b9c9fcc926b4421afb6ed386    src

Tree Object Format

The contents of a tree object follows the pattern we saw in Part 3 for other objects like blob objects --- a header followed by content, with the header and content separated by a null byte (0x0), with the header containing the object type (tree) and the length of the content.

For tree objects the content contains one or more “entries”. Each entry is itself split into two parts separated by a null byte --- the file permissions (i.e. “mode”) as octal-encoded bits typical for *nix systems (e.g. 100644 for Cargo.toml in the example above) and the file name in the first part, and after the null byte the 20 byte SHA-1 hash. This is the raw bytes, not ASCII-encoded (otherwise it would be 40 bytes, right). So a tree object looks something like

raw tree object

The ls-tree command also allows printing the size of the file with the -l/--long option e.g.

$ git ls-tree -l c086
100644 blob 942a0dd92845b91f0ed99178b782729f233ea780     535    Cargo.toml
040000 tree 92bf6c48606cbce7b9c9fcc926b4421afb6ed386       -    src

You may have noticed though that neither the object type nor the object length is actually contained in the description of the tree object entries. You are correct. To retrieve this information we have to take the object hash in each entry and read the referenced object to retrieve that info.

Luckily we know how to do that already from earlier posts. So let’s get to it.

Implementing ls-tree

Refactoring reading objects

First off we’re going to refactor some of the code from earlier posts to make our lives a little easier. I have extracted the object reading code into its own module and struct, moving it from the util module.

#[derive(Debug, PartialEq)]
pub(crate) enum GitObjectType {
    Blob,
    Tree,
    Commit,
}
 
pub(crate) struct GitObject<'a> {
    pub(crate) kind: GitObjectType,
    pub(crate) sha1: &'a str,
    pub(crate) size: usize,
    pub(crate) body: Option<Vec<u8>>,
}
 
impl GitObject<'_> {
    pub(crate) fn read(obj_id: &str) -> GitResult<GitObject> {
        trace!("read({obj_id})");
        let path = find_object_file(obj_id)?;
        let file = std::fs::File::open(path)?;
        let reader = BufReader::new(file);
        let contents = GitObject::decode_obj_content(reader)?;
        let mut header_and_body = contents.splitn(2, |b| *b == 0);
        let header = header_and_body.next().unwrap();
        let body = header_and_body.next().unwrap();
        let (obj_type, size) = GitObject::get_object_header(header)?;
 
        Ok(GitObject {
            kind: obj_type.into(),
            sha1: obj_id,
            size,
            body: Some(body.to_vec()),
        })
    }
 
    fn get_object_header(content: &[u8]) -> GitResult<(String, usize)> {
        let header = &mut content.splitn(2, |x| *x == b' ');
        let obj_type = bytes_to_string(header.next().unwrap());
        let obj_len_bytes = header.next().unwrap();
        match u8_slice_to_usize(obj_len_bytes) {
            None => Err(GitError::ReadObjectError),
            Some(obj_len) => Ok((obj_type, obj_len)),
        }
    }
 
    fn decode_obj_content(mut reader: impl BufRead) -> GitResult<Vec<u8>> {
        let content: &mut Vec<u8> = &mut Vec::new();
        let _ = reader.read_to_end(content)?;
        let mut decoder = ZlibDecoder::new(&content[..]);
        let mut decoded_content: Vec<u8> = Vec::new();
        decoder.read_to_end(&mut decoded_content)?;
 
        Ok(decoded_content)
    }
}

A couple of the utility functions from util came along and are now part of the GitObject implementation. The GitObject read() method only decodes the object header, since they are all the same. The content specific to a specific type of Git object is stored as bytes in the body field of the GitObject struct. The util.rs module now essentially just contains stuff related to finding Git directories.

The other main refactoring was to move the tree object printing that was in cat-file for dumping tree objects into the new ls-tree command handling.

fn handle_cat_file_tree_object(obj: GitObject) -> GitResult<()> {
    let args = LsTreeArgs::default();
    ls_tree::print_tree_object(&args, obj)
}

I was also able to simplify the tree printing function itself, which I’ll show later.

ls-tree command

Following the pattern I’ve established I created a new ls-tree module in the commands module and a new arguments structure LsTreeArgs for the command-line options handled by clap.

#[derive(Debug, Args, Default)]
pub(crate) struct LsTreeArgs {
    /// Show only the named tree entry itself, not its children.
    #[arg(long, default_value = "false")]
    name_only: bool,
 
    /// Show only the named tree entry itself, not its children.
    #[arg(short, default_value = "false")]
    dir_only: bool,
 
    /// Recurse into sub-trees.
    #[arg(short, default_value = "false")]
    recurse: bool,
 
    /// Show tree entries even when going to recurse them. Has no effect if -r was not passed.  -d implies -t.
    #[arg(short = 't', default_value = "false")]
    show_trees: bool,
 
    /// Show object size of blob (file) entries.
    #[arg(short = 'l', long = "long", default_value = "false")]
    show_size: bool,
 
    #[arg(name = "tree-ish")]
    tree_ish: String,
 
    /// When paths are given, show them (note that this isn’t really raw pathnames, but rather a list of
    /// patterns to match). Otherwise implicitly uses the root level of the tree as the sole path argument.
    #[arg(name = "path")]
    path: Option<Vec<String>>,
}

Two things to call out here are the tree_ish option and the path option. One thing I haven’t done yet in this series is handle all the various ways you can specify specific objects in Git. All I’ve done is support partial object hashes (i.e. git cat-file 942a0 vs git cat-file 942a0dd92845b91f0ed99178b782729f233ea780). But you can specify objects in many different ways like by HEAD or relative to it like HEAD~3 or by tag etc. etc.. With ls-tree the “tree-ish” value is some identifier that resolves to a tree object. From the Git documentation:

A tree object or an object that can be recursively dereferenced to a tree
object. Dereferencing a commit object yields the tree object corresponding
to the revision's top directory. The following are all tree-ishes: a
commit-ish, a tree object, a tag object that points to a tree object,
a tag object that points to a tag object that points to a tree object, etc.

And a “commit-ish” is

A commit object or an object that can be recursively dereferenced to a commit
object. The following are all commit-ishes: a commit object, a tag object that
points to a commit object, a tag object that points to a tag object that points
to a commit object, etc.

So in my implementation I’ve tried to add this support for following these references (at least for following tags, not some things like HEAD).

The other command-line option path is not really a path to an object but more a (series of) filter on the output. So for example I could do

$ git ls-tree -r HEAD */*.rs
100644 blob b1782add9b7656ceb6c95756a7c9a43cd221d877    src/commands.rs
100644 blob 24406bb5d3df8ba3e667d6ee10d065075b174bc2    src/main.rs
100644 blob 5634f30e1edd28dd589d9dc805d792b1dda1adb5    src/object.rs
100644 blob ac364c6a99bc887e6358d73b2054f358eecc4612    src/tag.rs
100644 blob d03dfe603c78389df1575d8ca2dc3088204cfc76    src/util.rs

Without the path filter, if I used the -r recurse option I’d see

$ git ls-tree -r HEAD
100644 blob 942a0dd92845b91f0ed99178b782729f233ea780    Cargo.toml
100644 blob 89e5d2e19e146ebd87a4da93dc7050347b602549    README.md
100644 blob b1782add9b7656ceb6c95756a7c9a43cd221d877    src/commands.rs
100644 blob 6ec2255e0d5d1eafeb498c22f838a9063dc51a03    src/commands/cat_file.rs
100644 blob cf2573e62ad244f1c8ab2cf4343aea6608def1e5    src/commands/config.rs
100644 blob aff7dd1806bdedc2ae721cfdacf418771331986c    src/commands/hash_object.rs
100644 blob 8a68f0ab4175e5924f9f95a71fcb9719d1217fdc    src/commands/init.rs
100644 blob 6b10925706937c0aab8abc94d640157045f34a09    src/commands/ls_tree.rs
100644 blob 24406bb5d3df8ba3e667d6ee10d065075b174bc2    src/main.rs
100644 blob 5634f30e1edd28dd589d9dc805d792b1dda1adb5    src/object.rs
100644 blob ac364c6a99bc887e6358d73b2054f358eecc4612    src/tag.rs
100644 blob d03dfe603c78389df1575d8ca2dc3088204cfc76    src/util.rs

Also note that */*.rs did not show all the Rust files, even with the */. To fully match you’d want **/ as the filter …

$ git ls-tree -r HEAD **/*.rs
100644 blob b1782add9b7656ceb6c95756a7c9a43cd221d877    src/commands.rs
100644 blob 6ec2255e0d5d1eafeb498c22f838a9063dc51a03    src/commands/cat_file.rs
100644 blob cf2573e62ad244f1c8ab2cf4343aea6608def1e5    src/commands/config.rs
100644 blob aff7dd1806bdedc2ae721cfdacf418771331986c    src/commands/hash_object.rs
100644 blob 8a68f0ab4175e5924f9f95a71fcb9719d1217fdc    src/commands/init.rs
100644 blob 6b10925706937c0aab8abc94d640157045f34a09    src/commands/ls_tree.rs
100644 blob 24406bb5d3df8ba3e667d6ee10d065075b174bc2    src/main.rs
100644 blob 5634f30e1edd28dd589d9dc805d792b1dda1adb5    src/object.rs
100644 blob ac364c6a99bc887e6358d73b2054f358eecc4612    src/tag.rs
100644 blob d03dfe603c78389df1575d8ca2dc3088204cfc76    src/util.rs

The tree-ish support is in the ls_tree() method and handling the various output options is in the print_tree_object() method shown later.

pub(crate) fn ls_tree_command(args: LsTreeArgs) -> GitCommandResult {
    ls_tree(&args.tree_ish, &args)
}
 
pub(crate) fn ls_tree(obj_id: &String, args: &LsTreeArgs) -> GitCommandResult {
    trace!("ls_tree({obj_id})");
    match GitObject::read(obj_id) {
        Ok(obj) => match obj.kind {
            GitObjectType::Tree => {
                // format and print tree obj body
                print_tree_object(&args, obj, None)
            }
            GitObjectType::Commit => {
                // get tree object of commit and print that
                let commit = commit::Commit::from(obj);
                ls_tree(&commit.tree, args)
            }
            GitObjectType::Blob => {
                debug!("cannot ls-tree a blob");
                Err(GitError::InvalidObjectId {
                    obj_id: args.tree_ish.to_string(),
                })
            }
            _ => unreachable!("due to other branches")
        },
        Err(_) => {
            debug!("cannot read object file for id '{obj_id}'; trying as a tag ...");
            // could be that the arg_id is not an object (blob/commit/tree)
            // check for tag
            match tag::Tag::get_tag(obj_id) {
                Some(tag) => ls_tree(&tag.obj_id, args),
                None => {
                    debug!("not a tag {obj_id}");
                    // not a tree or a commit or a tag, no good
                    Err(GitError::InvalidObjectId {
                        obj_id: obj_id.to_string(),
                    })
                }
            }
        }
    }
}

I split the functions into a public ls_tree_command() called by main() and an ls_tree() that can be recursively called if the object hash is not a tree object identifier.

Commit Objects

I created a Commit object to contain the details of the commit. You can read the details about commit objects here. But to save some effort, what it says regarding the format of a commit object is:

The format for a commit object is simple: it specifies the top-level tree for
the snapshot of the project at that point; the parent commits if any; the
author/committerinformation (which uses your user.name and user.email
configuration settings and a timestamp); a blank line, and then the commit
message.

This turns out to be one of the simpler formats to read since it’s all just plain text. No fancy separators or segments or anything.

pub(crate) struct Commit {
    sha1: String,
    pub(crate) tree: String,
    parent: String,
    author: String,
    committer: String,
    comment: String,
}
 
impl From<GitObject<'_>> for Commit {
    fn from(object: GitObject) -> Self {
        let body = object.body.unwrap();
        let mut reader = body.reader();
 
        let tree = get_entry(&mut reader, "tree");
        let parent = get_entry(&mut reader, "parent");
        let author = get_entry(&mut reader, "author");
        let committer = get_entry(&mut reader, "committer");
 
        let mut comment = String::new();
        let _ = reader.read_to_string(&mut comment);
 
        Self {
            sha1: object.sha1.to_string(),
            tree,
            parent,
            author,
            committer,
            comment,
        }
    }
}
 
fn get_entry(reader: &mut impl BufRead, name: &str) -> String {
    let mut entry = String::new();
    let _ = reader.read_line(&mut entry);
    let mut n = entry.splitn(2, ' ');
    assert_eq!(n.next(), Some(name));
    n.next().unwrap().trim().to_string()
}

Printing tree objects

There are a few interesting aspects to printing the tree object, mostly around recursion and filtering. I have not implemented the filtering as of now.

pub fn print_tree_object(
    args: &LsTreeArgs,
    obj: GitObject,
    path_part: Option<String>,
) -> GitResult<()> {
    // each entry is 'mode name\0[hash:20]
    let mut body = obj.body.unwrap();
 
    loop {
        if body.is_empty() {
            break;
        }
 
        // 1. split into two buffers, `[mode_and_name]0[rest]` with the 0 discarded
        let mut split = body.splitn(2, |b| *b == 0);
        let mode_and_file = split.next().unwrap();
        let mut rest = split.next().unwrap();
 
        // 2. spit the mode_and_name buffer into the mode and the name, which are separated by ' '
        let mut split = mode_and_file.split(|b| *b == b' ');
        let mode = util::bytes_to_string(split.next().unwrap());
        let filename = util::bytes_to_string(split.next().unwrap());
 
        // 3. read the next 20 bytes from `rest` which is the object hash
        let mut hash_buf = [0u8; 20];
        rest.read_exact(&mut hash_buf)?;
 
        // 4. point body at the remaining bytes for the loop
        body = rest.to_vec();
 
        // 5. using the hash, look up the referenced object to get its type
        let hash = hex::encode(hash_buf);
        let entry_obj = GitObject::read(hash.as_str())?;
        let kind = &entry_obj.kind;
 
        let path = create_file_name(&path_part, filename);
 
        // 6. if name_only then only print the name :)
        if args.name_only {
            if *kind == GitObjectType::Tree && args.recurse {
                print_tree_object(args, entry_obj, Some(path))?;
            } else {
                println!("{}", path);
            }
 
            continue;
        }
 
        if *kind == GitObjectType::Tree && args.recurse {
            print_tree_object(args, entry_obj, Some(path))?;
        } else {
            print!("{:0>6} {} {}", mode, kind, hash);
 
            if args.show_size {
                let len = entry_obj.size;
                if entry_obj.kind == GitObjectType::Tree {
                    print!("{: >8}", "-");
                } else {
                    print!("{: >8}", len);
                }
            }
 
            println!("\t{}", path);
        }
    }
 
    Ok(())
}
 
fn create_file_name(path: &Option<String>, filename: String) -> String {
    match path {
        Some(p) => p.to_owned() + "/" + filename.as_str(),
        None => filename,
    }
}

Recall our diagram from earlier (slightly enhanced).

enhanced raw tree object

The print_tree_object() method loops over the body, advancing a “pointer” (the body variable) to the remaining data as we work through the entries. First we get the file and mode from the entry, by splitting on 0x0. Then we read exactly 20 bytes for the SHA1 hash. We need to do this regardless of whether we’re only printing the name (—name-only) to advance the pointer. It’s not really that much “overhead” to worry about conditionally advancing it … somehow we need to advance the pointer.

If we are only printing the name, we now get to the challenge of the -r recursion option. The ls-tree command does a depth first walk, so what I did is add this path_part: Option<String> parameter, where I build up the path as we recurse, with an initial value of None. If it’s not just the name we’re printing, we still have the same recursion to do, but then we format the entry appropriately, optionally adding in the object length if the -l/--long option was provided.

One little quirk I noticed when comparing my output to the real ls-tree output is that the final space separating the file name from what precedes it is that it’s a tab \t character. C’est la vie.

Conclusion

And there you have it, a mostly complete implementation of ls-tree in Rust. As I mentioned one thing not yet implemented (and perhaps never to be implemented) is the path filtering. But there’s also another interesting case with respect to the tree output, and that it’s relative --- or implicitly filtered if you will --- by what directory you are in, when listing a commit object recursively. Taking my simple example from the beginning of the post

my_project/
  Cargo.toml
  src/
    main.rs
    commands.rs
    commands/
      ls_tree.rs

if my current directory is my_project/src then even if Cargo.toml is part of the commit that I’m passing to ls-tree then only main.rs, commands.rs and commands/ls_tree.rs should be printed. I didn’t bother.

$  git ls-tree a8011
100644 blob 942a0dd92845b91f0ed99178b782729f233ea780    Cargo.toml
100644 blob 89e5d2e19e146ebd87a4da93dc7050347b602549    README.md
040000 tree 92bf6c48606cbce7b9c9fcc926b4421afb6ed386    src
$ git ls-tree -r a8011
100644 blob 942a0dd92845b91f0ed99178b782729f233ea780    Cargo.toml
100644 blob 89e5d2e19e146ebd87a4da93dc7050347b602549    README.md
100644 blob b1782add9b7656ceb6c95756a7c9a43cd221d877    src/commands.rs
100644 blob 6ec2255e0d5d1eafeb498c22f838a9063dc51a03    src/commands/cat_file.rs
100644 blob cf2573e62ad244f1c8ab2cf4343aea6608def1e5    src/commands/config.rs
100644 blob aff7dd1806bdedc2ae721cfdacf418771331986c    src/commands/hash_object.rs
100644 blob 8a68f0ab4175e5924f9f95a71fcb9719d1217fdc    src/commands/init.rs
100644 blob 6b10925706937c0aab8abc94d640157045f34a09    src/commands/ls_tree.rs
100644 blob 24406bb5d3df8ba3e667d6ee10d065075b174bc2    src/main.rs
100644 blob 5634f30e1edd28dd589d9dc805d792b1dda1adb5    src/object.rs
100644 blob ac364c6a99bc887e6358d73b2054f358eecc4612    src/tag.rs
100644 blob d03dfe603c78389df1575d8ca2dc3088204cfc76    src/util.rs
$ cd src
$ git ls-tree -r a8011
100644 blob b1782add9b7656ceb6c95756a7c9a43cd221d877    commands.rs
100644 blob 6ec2255e0d5d1eafeb498c22f838a9063dc51a03    commands/cat_file.rs
100644 blob cf2573e62ad244f1c8ab2cf4343aea6608def1e5    commands/config.rs
100644 blob aff7dd1806bdedc2ae721cfdacf418771331986c    commands/hash_object.rs
100644 blob 8a68f0ab4175e5924f9f95a71fcb9719d1217fdc    commands/init.rs
100644 blob 6b10925706937c0aab8abc94d640157045f34a09    commands/ls_tree.rs
100644 blob 24406bb5d3df8ba3e667d6ee10d065075b174bc2    main.rs
100644 blob 5634f30e1edd28dd589d9dc805d792b1dda1adb5    object.rs
100644 blob ac364c6a99bc887e6358d73b2054f358eecc4612    tag.rs
100644 blob d03dfe603c78389df1575d8ca2dc3088204cfc76    util.rs

In the next few posts in this series I’ll probably be tackling creating tree objects and commits and cloning a remote Git repo. Stay tuned!

As always, please comment if you notice anything I could do better with my Rust coding as I’m doing this to learn Rust better. If there are idiomatic things that I could do that I’m not, let me know! And any other comments on the content or possible future content, let me know!

If you enjoyed this content, and you’d like to support me, consider buying me a coffee

kidoni.dev

Table of Contents

Learning Git Internals with Rust - Part 5