Refactoring types in my RSS reader

January 21, 2017

I recently did a large refactor of one of my Go projects, an RSS reader. The biggest code quality improvement came from redesigning the data types. In this post I talk about that redesign, which made the code significantly easier to understand.

I call my RSS reader Gorse. Gorse started out supporting only RSS feeds. Later I added support for RDF. The other day I found a blog I wanted to follow, but it published only an Atom feed. It was time for me to add support for Atom to Gorse. While doing so, I noticed several changes that would improve the code, so I ended up doing a bunch of refactoring. Part of my refactoring was reworking the data types.

If you want to see my actual commits, view those on January 18 and 19, 2017.

The types when I started

I started out with types like so:

type RSSFeed struct {
  ID                     int64
  Name                   string
  Description            string
  URI                    string
  UpdateFrequencySeconds int64
  LastUpdateTime         time.Time
  Items                  []RSSItem
}

type RSSItem struct {
  FeedID                int64
  FeedName              string
  ID                    int64
  Title                 string
  Description           string
  DescriptionHTML       template.HTML
  URI                   string
  PublicationDate       time.Time
  PublicationDateString string
  ReadState             string
}

type Channel struct {
  Title         string
  Link          string
  Description   string
  PubDate       string
  LastBuildDate string
  Items         []Item
}

type Item struct {
  Title       string
  Link        string
  Description string
  PubDate     string
  GUID        string
}

type RSSXML struct {
  XMLName xml.Name
  Channel ChannelXML `xml:"channel"`
  Version string     `xml:"version,attr"`
}

type ChannelXML struct {
  XMLName       xml.Name  `xml:"channel"`
  Title         string    `xml:"title"`
  Link          string    `xml:"link"`
  Description   string    `xml:"description"`
  PubDate       string    `xml:"pubDate"`
  LastBuildDate string    `xml:"lastBuildDate"`
  Items         []ItemXML `xml:"item"`
}

type ItemXML struct {
  XMLName     xml.Name `xml:"item"`
  Title       string   `xml:"title"`
  Link        string   `xml:"link"`
  Description string   `xml:"description"`
  PubDate     string   `xml:"pubDate"`
  GUID        string   `xml:"guid"`
}

type RDFXML struct {
  XMLName  xml.Name
  Channel  ChannelXML   `xml:"channel"`
  Version  string       `xml:"version,attr"`
  RDFItems []RDFItemXML `xml:"item"`
}

type RDFItemXML struct {
  XMLName     xml.Name `xml:"item"`
  Title       string   `xml:"title"`
  Link        string   `xml:"link"`
  Description string   `xml:"description"`
  PubDate     string   `xml:"date"`
  GUID        string   `xml:"guid"`
}

Conceptually, an item is something in a feed, such as a blog post.

There appeared to be a lot of duplication. What was the difference between RSSFeed and Channel, and between RSSFeed and RSSXML? Why was Channel named Channel and not Feed?

I omitted comments that gave a little more information, but even with comments I was confused. And this was something I wrote!

The purpose of each was:

RSSFeed/RSSItem were the types I used in the main reader program, and in any programs working with feeds (such as a feed builder that output XML). They were the true external types (though all the types were exported).
RSSXML/ChannelXML/ItemXML/RDFXML/RDFItemXML I used for parsing XML. They defined what fields to parse. I also used the first three when writing XML.
Channel/Item were the abstracted version of a feed, regardless of input format. When I parsed an XML payload, I parsed into RSSXML or RDFXML, and then transformed those into a Channel.

To add support for Atom, I added these types:

type AtomXML struct {
  XMLName xml.Name      `xml:"http://www.w3.org/2005/Atom feed"`
  Title   string        `xml:"title"`
  Links   []AtomLink    `xml:"link"`
  ID      string        `xml:"id"`
  Updated string        `xml:"updated"`
  Items   []AtomItemXML `xml:"entry"`
}

type AtomLink struct {
  Href string `xml:"href,attr"`
  Rel  string `xml:"rel,attr"`
}

type AtomItemXML struct {
  Title   string     `xml:"title"`
  Links   []AtomLink `xml:"link"`
  ID      string     `xml:"id"`
  Updated string     `xml:"updated"`
  Content string     `xml:"content"`
}

These served the same purpose as RSSXML and RDFXML, describing the fields to parse from XML.
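As a rough sketch of how types like these drive parsing, the following decodes a small, made-up Atom document with encoding/xml. The parseAtom helper and the sample payload are assumptions for illustration, not Gorse's actual code.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// The three types below are copied from the definitions above.
type AtomXML struct {
	XMLName xml.Name      `xml:"http://www.w3.org/2005/Atom feed"`
	Title   string        `xml:"title"`
	Links   []AtomLink    `xml:"link"`
	ID      string        `xml:"id"`
	Updated string        `xml:"updated"`
	Items   []AtomItemXML `xml:"entry"`
}

type AtomLink struct {
	Href string `xml:"href,attr"`
	Rel  string `xml:"rel,attr"`
}

type AtomItemXML struct {
	Title   string     `xml:"title"`
	Links   []AtomLink `xml:"link"`
	ID      string     `xml:"id"`
	Updated string     `xml:"updated"`
	Content string     `xml:"content"`
}

// parseAtom is a hypothetical helper: it decodes an Atom payload into
// AtomXML. Only the fields named in the struct tags are kept.
func parseAtom(data []byte) (*AtomXML, error) {
	var feed AtomXML
	if err := xml.Unmarshal(data, &feed); err != nil {
		return nil, err
	}
	return &feed, nil
}

// sampleAtom is an invented payload for the example.
const sampleAtom = `<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <link href="https://example.com/" rel="alternate"/>
  <id>urn:example:feed</id>
  <updated>2017-01-18T12:00:00Z</updated>
  <entry>
    <title>First post</title>
    <link href="https://example.com/first" rel="alternate"/>
    <id>urn:example:first</id>
    <updated>2017-01-18T12:00:00Z</updated>
    <content>Hello</content>
  </entry>
</feed>`

func main() {
	feed, err := parseAtom([]byte(sampleAtom))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("feed:", feed.Title)
	for _, item := range feed.Items {
		for _, link := range item.Links {
			fmt.Println("entry:", item.Title, link.Rel, link.Href)
		}
	}
}
```

Because the XMLName tag names the Atom namespace, decoding a document whose root is not an Atom feed element fails with an error instead of silently producing an empty struct.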

I thought there were several issues with the types as they were:

Their names were confusing. RSSFeed represented an arbitrary feed, which was not necessarily RSS (it might be RDF or Atom).
It was difficult to tell the difference between the types, and when one should be used instead of another. For example, why would I use Channel instead of RSSFeed?
Types were used for multiple purposes. I used RSSFeed/RSSItem to hold information from the database, to hold presentation information, and to describe a feed to write to XML. Some fields were used in one case but not the others. For example, DescriptionHTML and PublicationDateString were only used for presentation (HTML). Other types suffered from this as well. I used LastBuildDate and GUID only when writing XML, but because I used the same types for output and for parsing, I ended up worrying about parsing fields I didn't need.

These problems made it unclear when I needed to set a field and added conceptual overhead. The confusion also muddled the purpose of the types. For example, the types used in parsing were not taking full responsibility for parsing. The Channel and Item types represented the feed after parsing, yet the date/time field was still a string. It would make more sense to parse it into a time.Time when creating that type.
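Here is a minimal sketch of what parsing the date at conversion time could look like. The parseFeedTime helper and the list of layouts are assumptions for illustration. Feeds in the wild use more date formats than these three, so a real implementation would likely need a longer list.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// feedTimeLayouts lists the layouts to try, in order. This short list is
// an assumption for the example, not an exhaustive set.
var feedTimeLayouts = []string{
	time.RFC1123Z, // "Mon, 02 Jan 2006 15:04:05 -0700", common in RSS
	time.RFC1123,  // "Mon, 02 Jan 2006 15:04:05 MST"
	time.RFC3339,  // "2006-01-02T15:04:05Z07:00", used by Atom
}

// parseFeedTime is a hypothetical helper. It returns the first successful
// parse, or an error if no layout matches.
func parseFeedTime(s string) (time.Time, error) {
	for _, layout := range feedTimeLayouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, nil
		}
	}
	return time.Time{}, fmt.Errorf("unrecognized time format: %q", s)
}

func main() {
	for _, s := range []string{
		"Sat, 21 Jan 2017 10:30:00 +0000",
		"2017-01-21T10:30:00Z",
	} {
		t, err := parseFeedTime(s)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(t.UTC().Format(time.RFC3339))
	}
}
```

With a helper like this, a bad date surfaces as an error at the point where the parsed type is built, rather than as a malformed string that every later consumer has to handle.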

This is what I saw and decided to refactor.

Creating output types

I created types to use specifically when writing XML. This means there is now one set of types describing what we parse and a different set describing what we output. Previously I used one set for both. Having separate types for each makes it evident which fields I write out and which I parse, and lets me avoid parsing fields I don't need. This removed the confusion about where a field is set, or whether it is set at all.

type outXML struct {
  XMLName xml.Name      `xml:"rss"`
  Version string        `xml:"version,attr"`
  Channel outChannelXML `xml:"channel"`
}

type outChannelXML struct {
  Title         string       `xml:"title"`
  Link          string       `xml:"link"`
  Description   string       `xml:"description"`
  PubDate       string       `xml:"pubDate"`
  LastBuildDate string       `xml:"lastBuildDate"`
  Items         []outItemXML `xml:"item"`
}

type outItemXML struct {
  Title       string `xml:"title"`
  Link        string `xml:"link"`
  Description string `xml:"description"`
  PubDate     string `xml:"pubDate"`
  GUID        string `xml:"guid"`
}

This added three more types, so it is more verbose, but we see exactly what fields we output, and when looking at the input types, we see exactly what we parse.

I removed the LastBuildDate and GUID fields from the parsing types since I don't use them there.
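As a sketch of how output-only types might be used, the following fills them in and marshals them with encoding/xml. The encodeRSS helper, the sample values, and the choice of time.RFC1123Z for the date strings are assumptions for illustration, not necessarily what Gorse does.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"time"
)

// The three output types below are copied from the definitions above.
type outXML struct {
	XMLName xml.Name      `xml:"rss"`
	Version string        `xml:"version,attr"`
	Channel outChannelXML `xml:"channel"`
}

type outChannelXML struct {
	Title         string       `xml:"title"`
	Link          string       `xml:"link"`
	Description   string       `xml:"description"`
	PubDate       string       `xml:"pubDate"`
	LastBuildDate string       `xml:"lastBuildDate"`
	Items         []outItemXML `xml:"item"`
}

type outItemXML struct {
	Title       string `xml:"title"`
	Link        string `xml:"link"`
	Description string `xml:"description"`
	PubDate     string `xml:"pubDate"`
	GUID        string `xml:"guid"`
}

// encodeRSS is a hypothetical helper. It wraps a channel in the <rss>
// element and returns an indented document with the XML header prepended.
func encodeRSS(ch outChannelXML) ([]byte, error) {
	doc := outXML{Version: "2.0", Channel: ch}
	body, err := xml.MarshalIndent(doc, "", "  ")
	if err != nil {
		return nil, err
	}
	return append([]byte(xml.Header), body...), nil
}

func main() {
	// Invented sample data for the example.
	when := time.Date(2017, time.January, 18, 12, 0, 0, 0, time.UTC)
	stamp := when.Format(time.RFC1123Z)

	data, err := encodeRSS(outChannelXML{
		Title:         "Example Blog",
		Link:          "https://example.com/",
		Description:   "An example feed",
		PubDate:       stamp,
		LastBuildDate: stamp,
		Items: []outItemXML{{
			Title:       "First post",
			Link:        "https://example.com/first",
			Description: "Hello",
			PubDate:     stamp,
			GUID:        "https://example.com/first",
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(data))
}
```

Every field in these structs ends up in the document, so reading the type definitions is enough to know what the output contains.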

Type visibility

I previously exported all types, but this was unnecessary. It was also unclear which were only needed internally. I made all of the parsing and output types private. RSSXML became rssXML, RDFXML became rdfXML, ItemXML became itemXML, and so on. Afterwards, the only exported types were RSSFeed, RSSItem, Channel, and Item.

I also moved the parsing-only types to decode.go, and the output-only types to encode.go. This makes it clear where we use these types.

RDF type clarification

You may have noticed that RDFXML used ChannelXML, just like RSSXML. Do the two formats have an identical <channel> element? No. RDF's <channel> shares some elements with RSS's, but not all. In particular, when parsing RDF, the Items slice was never populated. How would you know that just from looking at the types?

I created a separate type for RDF's <channel> element:

type rdfChannelXML struct {
  XMLName     xml.Name `xml:"channel"`
  Title       string   `xml:"title"`
  Links       []string `xml:"link"`
  Description string   `xml:"description"`
  PubDate     string   `xml:"date"`
}

This clarifies exactly what fields to expect from each format.
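To make the difference concrete, here is a small sketch that parses a made-up RDF (RSS 1.0) document. The items are siblings of <channel> rather than children of it, which is why an Items slice inside the channel type would stay empty. The rdfXML and rdfItemXML definitions are my guess at the shape of the post-refactor types, and the parseRDF helper and sample payload are invented for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// rdfChannelXML is copied from the definition above.
type rdfChannelXML struct {
	XMLName     xml.Name `xml:"channel"`
	Title       string   `xml:"title"`
	Links       []string `xml:"link"`
	Description string   `xml:"description"`
	PubDate     string   `xml:"date"`
}

// rdfItemXML and rdfXML are assumed shapes, not taken from the source.
type rdfItemXML struct {
	XMLName     xml.Name `xml:"item"`
	Title       string   `xml:"title"`
	Link        string   `xml:"link"`
	Description string   `xml:"description"`
	PubDate     string   `xml:"date"`
}

type rdfXML struct {
	XMLName xml.Name
	Channel rdfChannelXML `xml:"channel"`
	Items   []rdfItemXML  `xml:"item"` // siblings of <channel>
}

// parseRDF is a hypothetical helper that decodes an RDF payload.
func parseRDF(data []byte) (*rdfXML, error) {
	var doc rdfXML
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return &doc, nil
}

// sampleRDF is an invented payload for the example.
const sampleRDF = `<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="https://example.com/">
    <title>Example</title>
    <link>https://example.com/</link>
    <description>An example feed</description>
    <dc:date>2017-01-18T12:00:00Z</dc:date>
  </channel>
  <item rdf:about="https://example.com/first">
    <title>First post</title>
    <link>https://example.com/first</link>
    <description>Hello</description>
    <dc:date>2017-01-18T12:00:00Z</dc:date>
  </item>
</rdf:RDF>`

func main() {
	doc, err := parseRDF([]byte(sampleRDF))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("channel:", doc.Channel.Title, doc.Channel.PubDate)
	fmt.Println("items at the top level:", len(doc.Items))
}
```

The date tag has no namespace, so encoding/xml matches it against the local name of the dc:date element.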

Overloaded feed type

A big problem was the conflict of responsibility between RSSFeed and Channel. RSSFeed was the form of the feed once we retrieved it from the database, yet it also included fields, for convenience, that I populated for presentation purposes. I was trying to use it for everything and it ended up not being much good at anything. Additionally, some of what I used RSSFeed for was better suited to Channel. For example, it was weird to require an RSSFeed when writing a feed's XML. RSSFeed has database fields such as ID, but a feed I write out does not have to come from a database, and I didn't use that field when outputting XML anyway.

I decided to completely rework these types. I defined new types by their responsibility:

Feed/Item: Hold a feed parsed from XML or to be written out as XML.
DBFeed/DBItem: Hold a feed retrieved from the database.
HTMLItem: Hold information needed when presenting a feed's items in the web interface.

Essentially what I did was break RSSFeed/RSSItem into DBFeed/DBItem and HTMLItem. I then renamed Channel to Feed, and it is now clear that its job is representing an arbitrary feed, free of database and presentation concerns.

This means:

When parsing or writing out a feed we always work with Feed and Item. Previously I parsed into Channel but wrote out using RSSFeed.
When we retrieve a feed and its items from the database, we now only have fields that are actually from the database.
When we want to show a feed's information in the interface, we now have a type just for that purpose, and it includes only what we need there. Fields that are only relevant to the presentation layer are now clearly separate from other contexts.

These types look like this:

type Feed struct {
  Title       string
  Link        string
  Description string
  PubDate     time.Time
  Items       []Item
}

type Item struct {
  Title       string
  Link        string
  Description string
  PubDate     time.Time
}

type DBFeed struct {
  ID                     int64
  Name                   string
  URI                    string
  UpdateFrequencySeconds int64
  LastUpdateTime         time.Time
}

type DBItem struct {
  ID              int64
  Title           string
  Description     string
  Link            string
  FeedID          int64
  PublicationDate time.Time
  FeedName        string
  ReadState       string
}

type HTMLItem struct {
  ID              int64
  FeedName        string
  Title           string
  Link            string
  PublicationDate string
  Description     template.HTML
}
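With the types split by responsibility, moving between layers becomes an explicit conversion. Below is a minimal sketch of turning a DBItem into an HTMLItem. The toHTMLItem function, the date layout, and the decision to escape the description are all assumptions for illustration. The real reader may sanitize or render the description differently.

```go
package main

import (
	"fmt"
	"html/template"
	"time"
)

// DBItem and HTMLItem are copied from the definitions above.
type DBItem struct {
	ID              int64
	Title           string
	Description     string
	Link            string
	FeedID          int64
	PublicationDate time.Time
	FeedName        string
	ReadState       string
}

type HTMLItem struct {
	ID              int64
	FeedName        string
	Title           string
	Link            string
	PublicationDate string
	Description     template.HTML
}

// toHTMLItem is a hypothetical conversion from the database type to the
// presentation type. Presentation-only values are derived here and
// nowhere else.
func toHTMLItem(it DBItem) HTMLItem {
	return HTMLItem{
		ID:       it.ID,
		FeedName: it.FeedName,
		Title:    it.Title,
		Link:     it.Link,
		// Assumed display layout for the example.
		PublicationDate: it.PublicationDate.Format("2006-01-02 15:04"),
		// template.HTML tells html/template not to escape the value again,
		// so this sketch escapes the raw description first.
		Description: template.HTML(template.HTMLEscapeString(it.Description)),
	}
}

func main() {
	// Invented sample data for the example.
	item := DBItem{
		ID:              1,
		Title:           "First post",
		Description:     "<b>Hello</b>",
		Link:            "https://example.com/first",
		FeedID:          7,
		PublicationDate: time.Date(2017, time.January, 18, 12, 0, 0, 0, time.UTC),
		FeedName:        "Example Blog",
		ReadState:       "unread",
	}
	fmt.Printf("%+v\n", toHTMLItem(item))
}
```

Fields such as FeedID and ReadState are simply not copied, so the presentation layer cannot come to depend on them by accident.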

Summary

I didn't start out intending to do this refactor, but as I added the types to support Atom, it became clear that I had strained my original design to the point that it was confusing.

I ended up with more types than I started with (17 versus 12!), but each type now has a clear responsibility. It is apparent when to use each type, and types no longer have context-specific fields, so I don't have to wonder whether I need to set a field or whether a field is set.

Thinking about how to better represent the data was a useful exercise. But could I have avoided needing to refactor this way in the first place? If I had supported multiple input formats from the start, or at least designed for that, I might have made different choices. I also might have broken up the types by purpose earlier. I think sharing the types made sense up to a point, but eventually it became a problem. A good time to rework the types might have been when I started writing feeds as well as reading them. At that point I started using the feed functionality as a library in programs beyond the reader, so the purposes of the types started shifting.

Tags: go, golang, programming, rss

The One and the Many