In this blog, we're diving deep into the fascinating world of strings in Go. Strings are a bit more complex than they first appear, and understanding their internal workings can give you a huge advantage in writing efficient and correct Go code. So, let's get into it!
Strings: Logical vs. Physical Representation
As we continue talking about types in Go, in this segment, I'd like to talk about strings. Strings are a bit of a curious type in Go because they have two natures: logical and physical. The reason for this is that strings in Go are all Unicode, which is a technique that allows us to represent international characters.
In the old days, programming languages used ASCII, which represented characters with 7 bits and essentially only covered American English characters. When we moved to international languages with accent marks and non-Roman scripts like Chinese or Arabic, we needed different techniques to represent those characters. Unicode was developed to handle this, using numbers that are bigger than what fits into a byte.
In Go, we have the concept of a rune, which is equivalent to what some languages call a wide character. A rune is a synonym for a 32-bit integer, big enough to represent any Unicode code point. However, to make programs efficient, we don't represent every character with four bytes all the time. Instead, we use UTF-8 encoding, a way to represent Unicode in bytes efficiently. Interestingly, UTF-8 was invented by some of the same people who worked on Go at Bell Labs.
Physically, strings in Go are the UTF-8 encoding of Unicode characters. There's another type in Go called the byte, which is a synonym for an 8-bit integer. A string is physically a sequence of bytes needed to encode Unicode characters. Logically, these are runes.
Let's look at some code to illustrate these concepts:
package main
import (
"fmt"
)
func main() {
s := "élite"
fmt.Printf("Type: %T, Value: %v\n", s, s)
// Output: Type: string, Value: élite
}
When you print the string, you get the Unicode output.
Understanding Runes and UTF-8 Encoding
Let's dive deeper into the concept of bytes vs. runes. I'll demonstrate this in the playground by creating a string with a non-ASCII character, like the French accented 'é':
package main
import (
"fmt"
)
func main() {
s := "élite"
fmt.Printf("String: %s\n", s)
// Output: String: élite
bytes := []byte(s)
fmt.Printf("Bytes: %v\n", bytes)
// Output: Bytes: [195 169 108 105 116 101]
runes := []rune(s)
fmt.Printf("Runes: %v\n", runes)
// Output: Runes: [233 108 105 116 101]
fmt.Printf("Length of string (bytes): %d\n", len(s))
// Output: Length of string (bytes): 6
fmt.Printf("Length of runes: %d\n", len(runes))
// Output: Length of runes: 5
}
When you run this program, you'll notice the length of the string is 6 bytes, not 5, because the accented 'é' is represented by two bytes in UTF-8. This illustrates the difference between logical characters (runes) and their physical byte representation.
Memory Representation of Strings
In Go, strings are immutable and represented by a string descriptor, which includes a pointer to the data and the length of the string in bytes. This allows Go to handle string operations efficiently without needing a terminating null byte.
Here's a visualization of how strings are stored in memory:
package main
import (
"fmt"
)
func main() {
s := "hello, world"
h := s[:5]
w := s[7:]
fmt.Printf("Original: %s\n", s)
// Output: Original: hello, world
fmt.Printf("Substring 1: %s\n", h)
// Output: Substring 1: hello
fmt.Printf("Substring 2: %s\n", w)
// Output: Substring 2: world
}
String Operations: Concatenation and Case Conversion
When you modify a string in Go, you're actually creating a new string. Here’s an example of concatenation and case conversion:
package main
import (
"fmt"
"strings"
)
func main() {
s := "the quick brown fox"
s = s + " jumps over the lazy dog"
fmt.Println("Concatenated:", s)
// Output: Concatenated: the quick brown fox jumps over the lazy dog
upper := strings.ToUpper(s)
fmt.Println("Uppercase:", upper)
// Output: Uppercase: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
}
A Simple Search and Replace Tool
To wrap things up, let’s build a simple search and replace tool in Go. This will demonstrate some practical string manipulation.
package main
import (
"bufio"
"fmt"
"os"
"strings"
)
func main() {
if len(os.Args) < 3 {
fmt.Println("Usage: go run main.go <old> <new>")
return
}
oldStr := os.Args[1]
newStr := os.Args[2]
scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {
line := scanner.Text()
modifiedLine := strings.ReplaceAll(line, oldStr, newStr)
fmt.Println(modifiedLine)
}
if err := scanner.Err(); err != nil {
fmt.Println("Error reading input:", err)
}
}
Run this tool from the command line, providing the word to replace and the replacement word:
go run main.go "mat" "ed" < input.txt
If your input.txt
contains:
mat went to Greece.
mat is a good friend.
The output will be:
ed went to Greece.
ed is a good friend.
Strings in Go are more than just sequences of characters. They involve complex concepts like runes, UTF-8 encoding, and immutability. By understanding these details, you can write more efficient and effective Go code. Keep exploring and happy coding!