Exploring Strings in Rust. A general overview for newcomers | by Ali Somay | Jan, 2022


Strings in Rust, image created by Başak Ünal
Thanks to Başak Ünal
  • Bytes can mean anything. We are the ones who agree on conventions and give them meaning. With this power, we can also interpret them as characters. Over time, people built tables to agree on which bytes map to which characters; see the ASCII and Unicode tables. ASCII is a minimal table in which every character fits in a single byte, whereas a single Unicode character may take up to 4 bytes when encoded as UTF-8. Unicode is a significantly larger table that covers a vast number of characters and still has room to spare.
  • Strings are sequences of bytes that we promise to interpret according to a character table. We make this promise through the type system of the programming language we use. In Rust, we do this by calling a chunk of memory a String or str or &str or &String or Box<str> or Box<&str> or..
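
To make the difference between bytes and characters concrete, here is a minimal sketch. The helper name byte_and_char_counts is made up for this illustration; it only relies on the standard len, chars and as_bytes methods.

```rust
/// Hypothetical helper: returns (bytes, Unicode scalar values) for a string slice.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // ASCII-only text: every character fits in one byte.
    println!("{:?}", byte_and_char_counts("banana"));
    // U+1F34C (a banana emoji) is one character but four bytes in UTF-8.
    println!("{:?}", byte_and_char_counts("\u{1F34C}"));
    // The very same memory can be viewed as plain numbers.
    println!("{:?}", "banana".as_bytes());
}
```

The type decides the interpretation: as_bytes hands back the raw numbers, while chars decodes them according to the Unicode table.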

From where I look, the majority of people who start learning Rust come from a JavaScript background or from a C or C++ background. It is worth mentioning how we deal with strings in some of those languages before digging into the Rust types.

let string = "banAna".toLowerCase();
let concatenatedString = "I want a cherry " + string;
let shout = concatenatedString.toUpperCase() + "!!";
// A String object wrapping the primitive, under a new name to avoid redeclaring `string`
let stringObject = new String("banana");

We went through the previous parts to internalize a few fundamentals:

  • To do something useful with strings, we need to know where that sequence of bytes starts and ends in computer memory.
  • Simple or complex data structures may be built over that sequence of bytes to store or derive properties about them and add functionality to do useful things with them.
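
As a rough sketch of the "where does it start and where does it end" idea, the snippet below prints the start address and length of a string slice and of a sub-slice taken from it. The helper offset_within is hypothetical and assumes that part was sliced out of whole.

```rust
/// Hypothetical helper: byte offset of `part` inside `whole`.
/// Assumes `part` is a sub-slice of `whole`.
fn offset_within(whole: &str, part: &str) -> usize {
    part.as_ptr() as usize - whole.as_ptr() as usize
}

fn main() {
    let text = "I want a cherry";
    // A sub-slice is just another (start, length) pair over the same bytes.
    let fruit = &text[9..];
    println!("start = {:p}, length = {}", text.as_ptr(), text.len());
    println!(
        "fruit = {:?}, offset = {}, length = {}",
        fruit,
        offset_within(text, fruit),
        fruit.len()
    );
}
```

Nothing is copied here: both slices describe regions of the same sequence of bytes.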

str

C strings do not enforce any encoding internally. They are just a sequence of plain bytes, ended by a null terminator, waiting to be interpreted.
JavaScript strings use UTF-16 encoding.
Rust strings are UTF-8 encoded.
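
A small sketch of what the encoding difference means in practice. The helper names utf8_len and utf16_len are made up for this example; encode_utf16 and std::str::from_utf8 are standard library functions.

```rust
/// Hypothetical helper: number of bytes in the UTF-8 encoding.
fn utf8_len(text: &str) -> usize {
    text.len()
}

/// Hypothetical helper: number of 16-bit code units in the UTF-16 encoding.
fn utf16_len(text: &str) -> usize {
    text.encode_utf16().count()
}

fn main() {
    // U+00E9 is a single character: two bytes in UTF-8, one unit in UTF-16.
    let text = "\u{00E9}";
    println!("utf8 = {}, utf16 = {}", utf8_len(text), utf16_len(text));

    // Rust refuses to treat arbitrary bytes as a string: 0xFF never appears in valid UTF-8.
    let invalid: &[u8] = &[0xff, 0xfe];
    println!("valid UTF-8? {}", std::str::from_utf8(invalid).is_ok());
}
```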

let string: str = "banana";
Compiler error when trying to use `str`.
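
One way to see why the compiler rejects a bare str is that values of this single type can have different sizes, so the size is only known at run time. The sketch below uses std::mem::size_of_val; the helper payload_size is hypothetical.

```rust
use std::mem::size_of_val;

/// Hypothetical helper: size in bytes of the `str` a reference points to.
fn payload_size(text: &str) -> usize {
    size_of_val(text)
}

fn main() {
    // Same type (`str`), different sizes: the size is a run-time property.
    println!("{}", payload_size("banana"));
    println!("{}", payload_size("hi"));
}
```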

Box<str>

One might think being explicit about where this data is stored could solve this issue. Why not box it? In other words, allocate it on the heap and take a pointer to it.

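// Does not compile: Box::new("banana") produces a Box<&str>, not a Box<str>.
let string: Box<str> = Box::new("banana");
// Does not compile: the size of a str is not known at compile time, so it cannot be passed to Box::new.
let string: Box<str> = Box::new(*"banana");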

&str

This brings us to &str. A much more common type which you’ll probably come across and use more frequently.

dbg!(std::mem::size_of::<&str>());
let string: &str = "banana";
dbg!(string.len());
// Outputs: string.len() = 6
let banana_bytes: [u8; 6] = [0x62, 0x61, 0x6e, 0x61, 0x6e, 0x61];
let heap_string: String = String::from("banana");
// Points to the stack
// Unwrapping is safe here because we feed the data directly.
// We know that it is valid data.
let string: &str = std::str::from_utf8(&banana_bytes).unwrap();
// Points to the heap
let string: &str = &heap_string;
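
Because a &str is only a pointer plus a length, handing out sub-slices is cheap. The following is a minimal sketch; first_word is a hypothetical helper, and the size check assumes the usual layout of a pointer followed by a length.

```rust
use std::mem::size_of;

/// Hypothetical helper: borrows the first whitespace-separated word, without copying.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

fn main() {
    let sentence = String::from("I want a cherry");
    // The returned slice points into the memory owned by `sentence`.
    println!("{}", first_word(&sentence));
    // A string slice reference is two machine words wide: address + length.
    println!("{}", size_of::<&str>() == 2 * size_of::<usize>());
}
```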

String

So far the most useful string type we’ve seen is &str.
Building on the previous part, if a string slice points to heap memory that we own, we can also take a mutable reference to it, written as &mut str.

let mut string: String = "banana".to_owned();
let string_slice: &mut str = &mut string;
string_slice.make_ascii_uppercase();
let mut banana_string: String = String::from("banana");
let mut cherry_string: String = String::from("I want a cherry");
banana_string += " ";
cherry_string += &banana_string;
dbg!(std::mem::size_of::<String>());
pub struct String {
    vec: Vec<u8>,
}
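
Since String wraps a Vec<u8>, it carries a capacity in addition to a pointer and a length, which is what lets it grow in place. Here is a small sketch of that behaviour; build_sentence is a hypothetical helper and the initial capacity of 32 is an arbitrary choice for this example.

```rust
/// Hypothetical helper: builds a sentence in a pre-allocated buffer.
fn build_sentence(fruit: &str) -> String {
    // Reserve room up front so the pushes below do not need to reallocate.
    let mut sentence = String::with_capacity(32);
    sentence.push_str("I want a ");
    sentence.push_str(fruit);
    sentence
}

fn main() {
    let sentence = build_sentence("cherry");
    // Length counts the bytes in use; capacity counts the bytes allocated.
    println!("{:?}: len = {}, capacity = {}", sentence, sentence.len(), sentence.capacity());
}
```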

&String

This one should be easy to understand with everything we’ve covered so far. It is just a pointer to a String.

dbg!(std::mem::size_of::<String>());
// 24 bytes
dbg!(std::mem::size_of::<&String>());
// 8 bytes
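
In practice a &String is usually passed wherever a &str is expected, because the compiler dereferences it automatically. A minimal sketch, with shout as a hypothetical function that mirrors the JavaScript example earlier in the article:

```rust
/// Hypothetical helper: accepts any string slice and returns a new, upper-cased String.
fn shout(text: &str) -> String {
    let mut result = text.to_uppercase();
    result.push_str("!!");
    result
}

fn main() {
    let owned: String = String::from("banana");
    // &String is coerced to &str at the call site.
    println!("{}", shout(&owned));
    // A string literal is already a &str.
    println!("{}", shout("cherry"));
}
```

Taking &str as the parameter type keeps the function usable with both owned strings and literals.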

Box<str>

We couldn’t allocate this one last time. With the new types we know, we can allocate it on the heap like the following example:

let string: Box<str> = String::from("banana").into_boxed_str();
// or
let string: Box<str> = Box::from("banana");
// from implementation will yield the same result.
dbg!(std::mem::size_of::<Box<str>>());
// 16 bytes
dbg!(std::mem::size_of::<String>());
// 24 bytes
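
The 8-byte difference comes from the capacity field that Box<str> does not keep. The sketch below shows a round trip between the two types; freeze is a hypothetical helper name.

```rust
/// Hypothetical helper: gives up the ability to grow in exchange for a smaller handle.
fn freeze(text: String) -> Box<str> {
    text.into_boxed_str()
}

fn main() {
    let mut growable = String::with_capacity(64);
    growable.push_str("banana");

    // The boxed string keeps only a pointer and a length.
    let boxed: Box<str> = freeze(growable);
    println!("{} ({} bytes)", boxed, boxed.len());

    // Converting back yields a String that can grow again.
    let mut back: String = boxed.into_string();
    back.push_str("!!");
    println!("{}", back);
}
```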

As mentioned before, Rust does not use null-terminated strings by default.
Instead, it uses fat pointers or heap-allocated structs to store the length information directly.
If we wish, though, we can still work with null-terminated strings.
These are especially useful when working with C libraries in FFI contexts.
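
As an illustration, the standard library offers std::ffi::CString and std::ffi::CStr for this purpose. The sketch below does not call any C code; to_c_string is a hypothetical wrapper that only shows the conversion and the terminator.

```rust
use std::ffi::{CStr, CString, NulError};

/// Hypothetical helper: builds an owned, null-terminated copy of a Rust string.
/// Fails if the input already contains a zero byte.
fn to_c_string(text: &str) -> Result<CString, NulError> {
    CString::new(text)
}

fn main() {
    let owned = to_c_string("banana").expect("input has no interior zero byte");
    // Six bytes of text plus the trailing terminator.
    println!("{:?}", owned.as_bytes_with_nul());

    // CStr is the borrowed counterpart, similar to how &str relates to String.
    let borrowed: &CStr = owned.as_c_str();
    println!("{:?}", borrowed.to_str());

    // An interior zero byte would end the string early in C, so it is rejected.
    println!("{}", to_c_string("ba\0nana").is_err());
}
```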

This article doesn’t claim to be an exhaustive list of all possible string types and usages in Rust.
