Mastering File I/O System Calls in C: A Byte-by-Byte Comparison Project
Learn how to compare files byte-by-byte using open, read, write, and close system calls in C. This tutorial covers dynamic memory allocation, timing with gettimeofday, and error handling—perfect for CSCI 1730 Project 3.
Introduction: Why System Calls Matter in Modern Programming
Even when AI-powered apps and real-time data processing are built on high-level libraries, understanding low-level file I/O remains useful. Whether you're building a file comparison tool for a coding bootcamp or optimizing a data pipeline, the system calls open, close, read, and write are the building blocks. This tutorial guides you through implementing a byte-by-byte file comparison program in C, similar to a CSCI 1730 Project 3 assignment, and connects the concepts to applications such as AI training datasets and cloud storage sync.
Understanding the Core Task
Your program will read two input files, compare them byte by byte, and write differing bytes into separate output files. All file access must go through the system calls open, close, read, and write; stat supplies file sizes and printf reports results, while malloc, free, and gettimeofday handle buffers and timing. The assignment splits the work into two steps:
- Step 1: Read one byte at a time from each file using a small buffer (2 bytes: one for data, one for null), compare, and write differences to differencesFoundInFile1.txt.
- Step 2: Read both files entirely into dynamically allocated arrays, compare them, store differences in a third dynamic array, and write to differencesFoundInFile2.txt.
You'll also time each step using gettimeofday and output the durations to compare performance. This mirrors real-world scenarios where file synchronization tools (like rsync) or version control systems (like Git) must efficiently detect changes.
Setting Up the Environment
Create a single C file named proj3.c. Use only these headers:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <string.h>
Your program should accept exactly two command-line arguments after the executable name: input file 1 and input file 2, so argc must equal 3. If it does not, print a usage message and exit. This is a common pattern in Linux system programming, and it keeps the program from passing a missing argv entry to open.
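The argument check can be sketched as a small function. Counting the program name, a run with two input files gives argc == 3. The helper name check_args and the wording of the usage line are assumptions made for illustration, not part of the assignment text.

```c
#include <stdio.h>

/* Hypothetical helper: returns 0 when exactly two file names follow the
 * program name; otherwise prints a usage line and returns -1. */
int check_args(int argc, char *argv[])
{
    if (argc != 3) {
        /* Fall back to an assumed program name if argv[0] is unavailable. */
        printf("Usage: %s <file1> <file2>\n", argc > 0 ? argv[0] : "proj3");
        return -1;
    }
    return 0;
}
```

In main, a call such as `if (check_args(argc, argv) != 0) exit(1);` before any open call would stop the program early on a bad invocation.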
Step 1: Byte-by-Byte Comparison
Opening Files with System Calls
Use open() with the appropriate flags. For input files, use O_RDONLY. For output files, use O_WRONLY | O_CREAT | O_TRUNC with permissions S_IRUSR | S_IWUSR (read/write for owner). This ensures the output files are created or overwritten with correct permissions.
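For the output files, the flags and permissions described above might be wrapped as follows. The helper name open_output is hypothetical; the flags and mode bits are the ones named in the paragraph.

```c
#include <fcntl.h>
#include <sys/stat.h>

/* Hypothetical helper: create the file if it is missing, empty it if it
 * already exists, and give the owner read/write permission.
 * Returns the file descriptor, or -1 if open fails. */
int open_output(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
}
```

The third argument to open is only consulted when O_CREAT actually creates the file. Input files need no mode argument, since O_RDONLY never creates anything.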
int fd1 = open(argv[1], O_RDONLY);
if (fd1 < 0) { perror("Error opening file"); exit(1); }
Reading and Writing One Byte at a Time
Allocate a buffer of 2 bytes for each file: one for the data byte and one for a null terminator. The terminator is unnecessary for binary data, but the spec calls for a size of 2. Loop until both files are exhausted, reading one byte from each file on every iteration. Compare the bytes; if they differ, write the byte from file1 to differencesFoundInFile1.txt.
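char buf1[2] = {0}, buf2[2] = {0};
ssize_t n1, n2;
for (;;) {
    // Read both files on every pass so neither read is skipped.
    n1 = read(fd1, buf1, 1);
    n2 = read(fd2, buf2, 1);
    if (n1 < 0 || n2 < 0) { printf("There was an error reading a file.\n"); exit(1); }
    if (n1 == 0 && n2 == 0) break;  // both files exhausted
    if (n1 == 0) buf1[0] = 0;       // treat missing as null
    if (n2 == 0) buf2[0] = 0;
    if (buf1[0] != buf2[0]) {
        if (write(out_fd1, buf1, 1) != 1) { printf("There was an error writing to a file.\n"); exit(1); }
    }
}
This method is simple but slow, since each byte costs a separate read call on each file, much like comparing two streaming data feeds in real-time analytics. Timing this step reveals the overhead of making that many system calls.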
Step 2: Full-File Comparison Using Dynamic Memory
Reading Entire Files into Dynamically Allocated Arrays
Use stat() to get file sizes, then allocate exactly that many bytes using malloc. Read the entire file into the buffer. This is similar to loading a machine learning dataset into memory for processing.
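A single read call is allowed to return fewer bytes than requested, so filling a whole-file buffer is safer with a loop. The helper below is a minimal sketch of that idea; the name read_all is hypothetical.

```c
#include <unistd.h>

/* Hypothetical helper: keep calling read until `count` bytes have arrived,
 * end of file is reached, or an error occurs.
 * Returns the number of bytes actually stored in buf, or -1 on error. */
ssize_t read_all(int fd, char *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, buf + total, count - total);
        if (n < 0)
            return -1;          /* read failed */
        if (n == 0)
            break;              /* end of file before count bytes */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Comparing the return value with st_size then tells the caller whether the whole file arrived.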
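struct stat st1, st2;
if (stat(argv[1], &st1) < 0 || stat(argv[2], &st2) < 0) { perror("stat"); exit(1); }
// Request at least one byte so an empty file does not lead to malloc(0).
char *data1 = malloc(st1.st_size > 0 ? st1.st_size : 1);
char *data2 = malloc(st2.st_size > 0 ? st2.st_size : 1);
if (data1 == NULL || data2 == NULL) { perror("malloc"); exit(1); }
// Open fd1 and fd2 again before this step: after Step 1 they are already at end of file.
// read can return fewer bytes than requested, so compare each count with the expected size.
if (read(fd1, data1, st1.st_size) != st1.st_size || read(fd2, data2, st2.st_size) != st2.st_size) {
    printf("There was an error reading a file.\n");
    exit(1);
}
Comparing and Writing Differences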
Allocate a third dynamic array to hold differing bytes from file2 (size up to the larger file). Loop through indices up to the maximum of the two sizes. For each index, compare bytes; if different, write the byte from file2 into the differences array. Finally, write that array to differencesFoundInFile2.txt.
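size_t size1 = (size_t)st1.st_size, size2 = (size_t)st2.st_size;
size_t max_size = (size1 > size2) ? size1 : size2;
char *diff2 = malloc(max_size > 0 ? max_size : 1);
if (diff2 == NULL) { perror("malloc"); exit(1); }
size_t diff_count = 0;
for (size_t i = 0; i < max_size; i++) {
    char c1 = (i < size1) ? data1[i] : 0;  // treat missing as null
    char c2 = (i < size2) ? data2[i] : 0;
    if (c1 != c2) {
        diff2[diff_count++] = c2;
    }
}
if (write(out_fd2, diff2, diff_count) != (ssize_t)diff_count) { printf("There was an error writing to a file.\n"); exit(1); }
Don't forget to free() all allocated memory (data1, data2, and diff2) to avoid leaks. Run Valgrind to verify.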
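Separating the comparison from the file handling makes it easy to check on small in-memory inputs. The function below is a sketch of the same loop working on plain buffers; the name collect_diffs and its parameter order are assumptions for illustration.

```c
#include <stdlib.h>

/* Hypothetical helper: compare buffers a (na bytes) and b (nb bytes) index
 * by index, treating positions past the end of the shorter buffer as 0.
 * Each differing byte from b is appended to out, which must have room for
 * max(na, nb) bytes. Returns the number of bytes stored in out. */
size_t collect_diffs(const char *a, size_t na,
                     const char *b, size_t nb, char *out)
{
    size_t longest = (na > nb) ? na : nb;
    size_t count = 0;
    for (size_t i = 0; i < longest; i++) {
        char ca = (i < na) ? a[i] : 0;
        char cb = (i < nb) ? b[i] : 0;
        if (ca != cb)
            out[count++] = cb;
    }
    return count;
}
```

For example, comparing "abcd" with "abXdY" stores 'X' and 'Y' in out and returns 2, because index 2 differs and index 4 exists only in the second buffer.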
Timing the Steps with gettimeofday
Use gettimeofday() before and after each step to capture the elapsed time. Convert to milliseconds or microseconds for output. This is a common technique in performance benchmarking for system software.
struct timeval start, end;
gettimeofday(&start, NULL);
step1(...);
gettimeofday(&end, NULL);
long seconds = end.tv_sec - start.tv_sec;
long microseconds = end.tv_usec - start.tv_usec;
double elapsed = seconds + microseconds*1e-6;
printf("Step 1 took %.6f seconds\n", elapsed);
Typically, Step 2 is faster for large files because it replaces one read call per byte with a single large read per file. For very small files, however, the two steps may take comparable time. This mirrors batch processing vs. streaming in modern data engineering.
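To report microseconds instead of fractional seconds, the subtraction can be done in integer arithmetic. This is a small sketch; the helper name elapsed_usec is hypothetical.

```c
#include <sys/time.h>

/* Hypothetical helper: microseconds elapsed between two gettimeofday
 * samples. The tv_usec difference may be negative when the end sample
 * falls early in a later second; adding it to the scaled seconds
 * difference still gives the right total. */
long long elapsed_usec(struct timeval start, struct timeval end)
{
    long long sec  = (long long)end.tv_sec  - (long long)start.tv_sec;
    long long usec = (long long)end.tv_usec - (long long)start.tv_usec;
    return sec * 1000000LL + usec;
}
```

With a start of 1 s + 900000 µs and an end of 2 s + 100000 µs, the seconds difference is 1 and the microseconds difference is -800000, giving 200000 µs. Dividing the result by 1000 yields milliseconds.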
Error Handling and Robustness
Check every system call for errors. Print appropriate messages: "There was an error reading a file." or "There was an error writing to a file." Validate command-line arguments. This is crucial for production-grade software and is a key requirement in C programming assignments.
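One way to keep the checks from cluttering the main logic is to wrap a system call together with its error message. The sketch below does this for write and also retries after a partial write; the name write_all is hypothetical, and the message text is the one quoted above.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write exactly `count` bytes from buf to fd.
 * Returns 0 on success. On failure, prints the error message and
 * returns -1 so the caller can clean up and exit. */
int write_all(int fd, const char *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        ssize_t n = write(fd, buf + total, count - total);
        if (n <= 0) {   /* error, or no progress was made */
            printf("There was an error writing to a file.\n");
            return -1;
        }
        total += (size_t)n;
    }
    return 0;
}
```

Returning a status instead of calling exit inside the helper leaves the caller free to close descriptors and release memory first.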
Memory Management Best Practices
Always free dynamically allocated memory. Use valgrind to ensure no leaks. Avoid global or static variables. This is essential for embedded systems and game development where memory is constrained.
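A single cleanup block at the end of a function is one common way to make sure every allocation is released on every path. The example below is an illustrative sketch, not the assignment's required structure; the function name with_two_buffers is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: allocate two buffers, use them, and release both
 * whether or not the allocations succeeded. free(NULL) does nothing, so
 * the same cleanup block is safe after a partial failure. */
int with_two_buffers(size_t n1, size_t n2)
{
    int status = -1;
    char *a = malloc(n1 > 0 ? n1 : 1);
    char *b = malloc(n2 > 0 ? n2 : 1);
    if (a == NULL || b == NULL)
        goto cleanup;
    memset(a, 0, n1);   /* stand-in for the real work on the buffers */
    memset(b, 0, n2);
    status = 0;
cleanup:
    free(a);
    free(b);
    return status;
}
```

Running the finished program under `valgrind --leak-check=full` then reports any block that was allocated but never freed.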
Real-World Connections
File comparison is used in version control (like Git diff), data synchronization (like Dropbox), and AI model validation (comparing output files). By understanding these system calls, you're building skills for systems programming, backend development, and DevOps.
Conclusion
This project teaches you the fundamentals of file I/O system calls, dynamic memory, and performance measurement. By implementing both streaming and batch approaches, you gain insight into trade-offs that affect real-world applications. Practice these concepts to excel in your CSCI 1730 course and beyond.