Understanding Strings in Go: A Journey through Bytes, Runes, and Unicode

Understanding Strings in Go: A Journey through Bytes, Runes, and Unicode

In this blog, we're diving deep into the fascinating world of strings in Go. Strings are a bit more complex than they first appear, and understanding their internal workings can give you a huge advantage in writing efficient and correct Go code. So, let's get into it!

Strings: Logical vs. Physical Representation

As we continue talking about types in Go, in this segment, I'd like to talk about strings. Strings are a bit of a curious type in Go because they have two natures: logical and physical. The reason for this is that strings in Go are all Unicode, which is a technique that allows us to represent international characters.

In the old days, programming languages used ASCII, which represented characters with 7 bits and essentially only covered American English characters. When we moved to international languages with accent marks and non-Roman scripts like Chinese or Arabic, we needed different techniques to represent those characters. Unicode was developed to handle this, using numbers that are bigger than what fits into a byte.

In Go, we have the concept of a rune, which is equivalent to what some languages call a wide character. A rune is a synonym for a 32-bit integer, big enough to represent any Unicode code point. However, to make programs efficient, we don't represent every character with four bytes all the time. Instead, we use UTF-8 encoding, a way to represent Unicode in bytes efficiently. Interestingly, UTF-8 was invented by some of the same people who worked on Go at Bell Labs.

Physically, strings in Go are the UTF-8 encoding of Unicode characters. There's another type in Go called the byte, which is a synonym for an 8-bit integer. A string is physically a sequence of bytes needed to encode Unicode characters. Logically, these are runes.

Let's look at some code to illustrate these concepts:

package main

import (
    "fmt"
)

func main() {
    s := "élite"
    fmt.Printf("Type: %T, Value: %v\n", s, s)
    // Output: Type: string, Value: élite
}

When you print the string, you get the Unicode output.

Understanding Runes and UTF-8 Encoding

Let's dive deeper into the concept of bytes vs. runes. I'll demonstrate this in the playground by creating a string with a non-ASCII character, like the French accented 'é':

package main

import (
    "fmt"
)

func main() {
    s := "élite"
    fmt.Printf("String: %s\n", s)
    // Output: String: élite

    bytes := []byte(s)
    fmt.Printf("Bytes: %v\n", bytes)
    // Output: Bytes: [195 169 108 105 116 101]

    runes := []rune(s)
    fmt.Printf("Runes: %v\n", runes)
    // Output: Runes: [233 108 105 116 101]

    fmt.Printf("Length of string (bytes): %d\n", len(s))
    // Output: Length of string (bytes): 6

    fmt.Printf("Length of runes: %d\n", len(runes))
    // Output: Length of runes: 5
}

When you run this program, you'll notice the length of the string is 6 bytes, not 5, because the accented 'é' is represented by two bytes in UTF-8. This illustrates the difference between logical characters (runes) and their physical byte representation.

Memory Representation of Strings

In Go, strings are immutable and represented by a string descriptor, which includes a pointer to the data and the length of the string in bytes. This allows Go to handle string operations efficiently without needing a terminating null byte.

Internal String Representation

Here's a visualization of how strings are stored in memory:

package main

import (
    "fmt"
)

func main() {
    s := "hello, world"
    h := s[:5]
    w := s[7:]

    fmt.Printf("Original: %s\n", s)
    // Output: Original: hello, world

    fmt.Printf("Substring 1: %s\n", h)
    // Output: Substring 1: hello

    fmt.Printf("Substring 2: %s\n", w)
    // Output: Substring 2: world
}

String Operations: Concatenation and Case Conversion

When you modify a string in Go, you're actually creating a new string. Here’s an example of concatenation and case conversion:

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "the quick brown fox"
    s = s + " jumps over the lazy dog"

    fmt.Println("Concatenated:", s)
    // Output: Concatenated: the quick brown fox jumps over the lazy dog

    upper := strings.ToUpper(s)
    fmt.Println("Uppercase:", upper)
    // Output: Uppercase: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
}

A Simple Search and Replace Tool

To wrap things up, let’s build a simple search and replace tool in Go. This will demonstrate some practical string manipulation.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    if len(os.Args) < 3 {
        fmt.Println("Usage: go run main.go <old> <new>")
        return
    }

    oldStr := os.Args[1]
    newStr := os.Args[2]
    scanner := bufio.NewScanner(os.Stdin)

    for scanner.Scan() {
        line := scanner.Text()
        modifiedLine := strings.ReplaceAll(line, oldStr, newStr)
        fmt.Println(modifiedLine)
    }

    if err := scanner.Err(); err != nil {
        fmt.Println("Error reading input:", err)
    }
}

Run this tool from the command line, providing the word to replace and the replacement word:

go run main.go "mat" "ed" < input.txt

If your input.txt contains:

mat went to Greece.
mat is a good friend.

The output will be:

ed went to Greece.
ed is a good friend.

Strings in Go are more than just sequences of characters. They involve complex concepts like runes, UTF-8 encoding, and immutability. By understanding these details, you can write more efficient and effective Go code. Keep exploring and happy coding!