Simple data stitching

I found myself in a situation where I had a number of CSV files that all shared some key data and had to be combined into one larger dataset. I figured that the easiest way to do this would be to deserialize the files, then stitch them together using a portion of their data as a key.

I decided to try my hand at writing a macro to solve the issue, and I ended up with two of them: one for one-to-one relations, and one for one-to-many.

The problem I was solving

I had two files. One had two pieces of information, A and B, and the other had another two pieces, B and C. What I really needed was a HashMap from A to C. The information connecting the two files was the B column they shared.

Let's put this in easier terms with a useless toy example. The first file, user_email.csv, has a username and an email address. The other file, email_color.csv, has an email address and a favorite color. I want to be able to go from username to favorite color directly, and cut out the email address.

Example files

user_email.csv

username  email
alice     alice@example.com
bob       bob@example.com
carol     carol@example.com

email_color.csv

email              color
alice@example.com  red
bob@example.com    green
carol@example.com  blue

Some nice aliases

These things tend to be a little easier to read and understand if you make a couple of type aliases.


type Username = String;
type Email = String;
type Color = String;

Deserialization

Next, I needed two structs to deserialize the data into:


use serde::Deserialize;

/// The struct for user_email.csv.
/// The field names match the column headers in the file.
#[derive(Deserialize)]
struct UserEmail {
    username: Username,
    email: Email,
}

/// The struct for email_color.csv.
#[derive(Deserialize)]
struct EmailColor {
    email: Email,
    color: Color,
}

I used the csv crate to deserialize the files. It does a great job, and its documentation explains how to use it quite well.
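If you only want to follow along with the toy files and would rather not add a dependency yet, they are simple enough to parse by hand. The sketch below is a stand-in for the csv crate, not what I actually used: the helper name `parse_user_emails` is made up for illustration, and it assumes a header line, exactly two comma-separated fields per row, and no quoting or escaping.

```rust
/// Same shape as the `UserEmail` struct above, minus the serde derive.
#[derive(Debug, PartialEq)]
struct UserEmail {
    username: String,
    email: String,
}

/// Hypothetical helper: parses the contents of user_email.csv by hand.
/// Assumes a header line and two comma-separated fields per row.
/// Rows without a comma (such as blank lines) are skipped.
fn parse_user_emails(contents: &str) -> Vec<UserEmail> {
    contents
        .lines()
        .skip(1) // Skip the header row.
        .filter_map(|line| {
            let (username, email) = line.split_once(',')?;
            Some(UserEmail {
                username: username.trim().to_string(),
                email: email.trim().to_string(),
            })
        })
        .collect()
}
```

For real data, with quoted fields or embedded commas, the csv crate is the better choice.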

Using stitch_one_to_one

The next step is to run the macro. The macro will return a Vec<(Left, Right)>. In our case, Left will be UserEmail and Right will be EmailColor.


// Supplying these functions is left as an exercise to the reader :)
let user_emails: Vec<UserEmail> = deserialize_user_emails();
let email_colors: Vec<EmailColor> = deserialize_email_colors();

let result = stitch_one_to_one!(
    user_emails,
    (email),
    email_colors,
    (email)
);

The anatomy of the call above is this:


stitch_one_to_one!(
    lefty_items,
    (a, b),  // The lefty key to use.
    righty_items,
    (x, y)  // The righty key to use.
);

The keys above translate into tuples of (left.a, left.b) for all lefty items, and (right.x, right.y) for all righties. The keys must be of the same length, and they will be compared to each other, so they should probably be of the same type as well.

To clarify, every name in the left key must be a field of the left item. The same goes for the right item: every name in its key must be a field of the right item. For UserEmail, a valid key would be any combination of username and email.
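To make the key semantics concrete, here is a minimal sketch of a one-to-one stitch written as a plain function instead of a macro. It is not the macro's actual implementation. The name `stitch_one_to_one_by`, the closure-based keys, and the decision to silently drop items without a partner are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical function-based equivalent of a one-to-one stitch.
/// Pairs each left item with the right item that has an equal key.
/// Assumptions: items without a partner are dropped, and if several
/// right items share a key, the last one wins.
fn stitch_one_to_one_by<L, R, K>(
    lefts: Vec<L>,
    left_key: impl Fn(&L) -> K,
    rights: Vec<R>,
    right_key: impl Fn(&R) -> K,
) -> Vec<(L, R)>
where
    K: Eq + Hash,
{
    // Index the right items by key so each left item needs one lookup.
    let mut by_key: HashMap<K, R> = rights
        .into_iter()
        .map(|right| (right_key(&right), right))
        .collect();

    lefts
        .into_iter()
        .filter_map(|left| {
            // Removing the match means a right item is paired at most once.
            let right = by_key.remove(&left_key(&left))?;
            Some((left, right))
        })
        .collect()
}
```

A composite key such as (a, b) would correspond to a closure returning a tuple, for example `|item| (item.a.clone(), item.b.clone())`. Both closures have to return the same type `K`, which is the function-level version of the keys needing the same length and comparable types.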

Finishing up

As I mentioned earlier, the macro returns a Vec<(UserEmail, EmailColor)>, and it's now up to you to traverse the pairs and produce something that makes sense to you. In my case, this was a HashMap<Username, Color>.

This is how I produced it:


use std::collections::HashMap;

// `result` is the Vec<(UserEmail, EmailColor)> returned by stitch_one_to_one!.
let lookup: HashMap<Username, Color> = result
    .iter()
    .map(|(left, right)| {
        (
            left.username.clone(),
            right.color.clone(),
        )
    })
    .collect();
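With the map built, going from username to color is a single lookup. Here is a small self-contained sketch of that last step. The helpers `build_lookup` and `sample_pairs` are hypothetical names, and the stitched pairs are written out by hand from the example files in place of the macro's output.

```rust
use std::collections::HashMap;

struct UserEmail {
    username: String,
    email: String,
}

struct EmailColor {
    email: String,
    color: String,
}

/// Hypothetical helper wrapping the iterator chain above.
fn build_lookup(pairs: &[(UserEmail, EmailColor)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(left, right)| (left.username.clone(), right.color.clone()))
        .collect()
}

/// Stand-in for the macro's output, using the rows from the example files.
fn sample_pairs() -> Vec<(UserEmail, EmailColor)> {
    vec![
        ("alice", "alice@example.com", "red"),
        ("bob", "bob@example.com", "green"),
        ("carol", "carol@example.com", "blue"),
    ]
    .into_iter()
    .map(|(username, email, color)| {
        (
            UserEmail {
                username: username.to_string(),
                email: email.to_string(),
            },
            EmailColor {
                email: email.to_string(),
                color: color.to_string(),
            },
        )
    })
    .collect()
}
```

Calling `build_lookup(&sample_pairs())` and then `lookup.get("alice")` gives the color for alice without the email address ever appearing in the map.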

And we're done!