Refactoring types in my RSS reader
I recently did a large refactor of one of my Go projects, an RSS reader. The biggest code quality improvement came from redesigning the data types. In this post I talk about this redesign. I think it is interesting as it significantly improved the code's understandability.
I call my RSS reader Gorse. Gorse started out supporting only RSS feeds. Later I added support for RDF. The other day I found a blog I wanted to follow, but it published only an Atom feed. It was time for me to add support for Atom to Gorse. While doing so, I noticed several changes that would improve the code, so I ended up doing a bunch of refactoring. Part of my refactoring was reworking the data types.
If you want to see my actual commits, view those on January 18 and 19, 2017.
The types when I started
I started out with types like so:
type RSSFeed struct {
ID int64
Name string
Description string
URI string
UpdateFrequencySeconds int64
LastUpdateTime time.Time
Items []RSSItem
}
type RSSItem struct {
FeedID int64
FeedName string
ID int64
Title string
Description string
DescriptionHTML template.HTML
URI string
PublicationDate time.Time
PublicationDateString string
ReadState string
}
type Channel struct {
Title string
Link string
Description string
PubDate string
LastBuildDate string
Items []Item
}
type Item struct {
Title string
Link string
Description string
PubDate string
GUID string
}
type RSSXML struct {
XMLName xml.Name
Channel ChannelXML `xml:"channel"`
Version string `xml:"version,attr"`
}
type ChannelXML struct {
XMLName xml.Name `xml:"channel"`
Title string `xml:"title"`
Link string `xml:"link"`
Description string `xml:"description"`
PubDate string `xml:"pubDate"`
LastBuildDate string `xml:"lastBuildDate"`
Items []ItemXML `xml:"item"`
}
type ItemXML struct {
XMLName xml.Name `xml:"item"`
Title string `xml:"title"`
Link string `xml:"link"`
Description string `xml:"description"`
PubDate string `xml:"pubDate"`
GUID string `xml:"guid"`
}
type RDFXML struct {
XMLName xml.Name
Channel ChannelXML `xml:"channel"`
Version string `xml:"version,attr"`
RDFItems []RDFItemXML `xml:"item"`
}
type RDFItemXML struct {
XMLName xml.Name `xml:"item"`
Title string `xml:"title"`
Link string `xml:"link"`
Description string `xml:"description"`
PubDate string `xml:"date"`
GUID string `xml:"guid"`
}
Conceptually, an item is something in a feed, such as a blog post.
There appeared to be a lot of duplication. What was the difference between RSSFeed and Channel, and between RSSFeed and RSSXML? Why was Channel named Channel and not Feed?
I omitted comments that gave a little more information, but even with comments I was confused. And this was something I wrote!
The purpose of each was:
- RSSFeed/RSSItem were the types I used in the main reader program, and in any programs working with feeds (such as a feed builder that output XML). They were the true external types (though all the types were exported).
- RSSXML/ChannelXML/ItemXML/RDFXML/RDFItemXML I used for parsing XML. They defined what fields to parse. I also used the first three when writing XML.
- Channel/Item were the abstracted version of a feed, regardless of input format. When I parsed an XML payload, I parsed into RSSXML or RDFXML, and then transformed those into a Channel.
To add support for Atom, I added these types:
type AtomXML struct {
XMLName xml.Name `xml:"http://www.w3.org/2005/Atom feed"`
Title string `xml:"title"`
Links []AtomLink `xml:"link"`
ID string `xml:"id"`
Updated string `xml:"updated"`
Items []AtomItemXML `xml:"entry"`
}
type AtomLink struct {
Href string `xml:"href,attr"`
Rel string `xml:"rel,attr"`
}
type AtomItemXML struct {
Title string `xml:"title"`
Links []AtomLink `xml:"link"`
ID string `xml:"id"`
Updated string `xml:"updated"`
Content string `xml:"content"`
}
These served the same purpose as RSSXML and RDFXML, describing the fields to parse from XML.
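As a rough illustration of how types like these drive parsing, here is a minimal sketch that decodes a small Atom document with encoding/xml. The sample document and the decodeAtom helper are made up for this example; the reader's actual decoding code may look different.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// The three Atom parsing types from the post.
type AtomXML struct {
	XMLName xml.Name      `xml:"http://www.w3.org/2005/Atom feed"`
	Title   string        `xml:"title"`
	Links   []AtomLink    `xml:"link"`
	ID      string        `xml:"id"`
	Updated string        `xml:"updated"`
	Items   []AtomItemXML `xml:"entry"`
}

type AtomLink struct {
	Href string `xml:"href,attr"`
	Rel  string `xml:"rel,attr"`
}

type AtomItemXML struct {
	Title   string     `xml:"title"`
	Links   []AtomLink `xml:"link"`
	ID      string     `xml:"id"`
	Updated string     `xml:"updated"`
	Content string     `xml:"content"`
}

// sampleAtom is an invented document used only for this sketch.
const sampleAtom = `<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <link href="http://example.org/" rel="alternate"/>
  <id>urn:example:feed</id>
  <updated>2017-01-18T12:00:00Z</updated>
  <entry>
    <title>First post</title>
    <link href="http://example.org/first" rel="alternate"/>
    <id>urn:example:first</id>
    <updated>2017-01-18T12:00:00Z</updated>
    <content>Hello</content>
  </entry>
</feed>`

// decodeAtom is a hypothetical helper: the struct tags alone decide
// which elements and attributes are read; everything else is ignored.
func decodeAtom(data []byte) (*AtomXML, error) {
	var a AtomXML
	if err := xml.Unmarshal(data, &a); err != nil {
		return nil, err
	}
	return &a, nil
}

func main() {
	feed, err := decodeAtom([]byte(sampleAtom))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(feed.Title, len(feed.Items), feed.Items[0].Links[0].Href)
}
```

Because the XMLName tag names the Atom namespace, a document whose root element is not an Atom feed fails to decode instead of silently producing an empty struct.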
I thought there were several issues with the types as they were:
- Their names were confusing. RSSFeed represented an arbitrary feed, which was not necessarily RSS (it might be RDF or Atom).
- It was difficult to tell the difference between the types, and when one should be used instead of another. For example, why would I use Channel instead of RSSFeed?
- Types were used for multiple purposes. I used RSSFeed/RSSItem to hold information from the database, to hold presentation information, and to describe a feed to write to XML. There were fields used in one case but not others. For example, DescriptionHTML and PublicationDateString were only used for presentation (HTML). Other types suffered from this as well. I used LastBuildDate and GUID when writing XML, and because I used the same types for output as well as parsing, I ended up worrying about parsing fields I didn't need to.
These problems made it unclear when I needed to set a field and added conceptual overhead. From this confusion arose muddling of the purpose of the types. For example, types used in parsing were not taking full responsibility for parsing. The Channel and Item types represented the feed after parsing, yet the date/time field was a string. It would make more sense to parse it into a time.Time when creating that type.
This is what I saw and decided to refactor.
Creating output types
I created types to use specifically when writing XML. This means there is now one set of types describing what we parse, and a different set describing what we output. Previously I used one set for both. Having separate types for each makes it evident which fields I write out and which I parse, and lets me avoid parsing fields I don't need. This removed the confusion about where a field is set, or if it's set at all.
type outXML struct {
XMLName xml.Name `xml:"rss"`
Version string `xml:"version,attr"`
Channel outChannelXML `xml:"channel"`
}
type outChannelXML struct {
Title string `xml:"title"`
Link string `xml:"link"`
Description string `xml:"description"`
PubDate string `xml:"pubDate"`
LastBuildDate string `xml:"lastBuildDate"`
Items []outItemXML `xml:"item"`
}
type outItemXML struct {
Title string `xml:"title"`
Link string `xml:"link"`
Description string `xml:"description"`
PubDate string `xml:"pubDate"`
GUID string `xml:"guid"`
}
This added three more types, so it is more verbose, but we see exactly what fields we output, and when looking at the input types, we see exactly what we parse.
I removed the LastBuildDate and GUID fields from the parsing types since I don't use them there.
Type visibility
I previously exported all types, but this was unnecessary. It was also unclear which were only needed internally. I made all of the parsing and output types private. RSSXML became rssXML, RDFXML became rdfXML, ItemXML became itemXML, and so on. Afterwards, the only exported types were RSSFeed, RSSItem, Channel, and Item.
I also moved the parsing-only types to decode.go, and the output-only types to encode.go. This makes it clear where we use these types.
RDF type clarification
You may have noticed that RDFXML used ChannelXML, just like RSSXML. Do the two formats have an identical <channel> element? No. RDF's <channel> shares some elements with RSS's, but not all. In particular, when parsing RDF, the Items slice was never populated. How would you know that just from looking at the types?
I created a separate type for RDF's <channel> element:
type rdfChannelXML struct {
XMLName xml.Name `xml:"channel"`
Title string `xml:"title"`
Links []string `xml:"link"`
Description string `xml:"description"`
PubDate string `xml:"date"`
}
This clarifies exactly what fields to expect from each format.
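To show why a separate channel type fits RDF, here is a minimal sketch that decodes a trimmed RSS 1.0 (RDF) document, in which the item elements are siblings of <channel> rather than children. The rdfXML and rdfItemXML definitions below are adapted from the earlier RDFXML and RDFItemXML types and are assumptions about their post-refactor shape; the sample document and decodeRDF helper are invented for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// rdfXML is assumed to be the renamed RDFXML, now using rdfChannelXML.
type rdfXML struct {
	XMLName  xml.Name      // untagged, so any root element name is accepted
	Channel  rdfChannelXML `xml:"channel"`
	RDFItems []rdfItemXML  `xml:"item"`
}

// rdfChannelXML is the type from the post. Note it has no Items field.
type rdfChannelXML struct {
	XMLName     xml.Name `xml:"channel"`
	Title       string   `xml:"title"`
	Links       []string `xml:"link"`
	Description string   `xml:"description"`
	PubDate     string   `xml:"date"`
}

// rdfItemXML is assumed to be the renamed RDFItemXML, minus GUID.
type rdfItemXML struct {
	XMLName     xml.Name `xml:"item"`
	Title       string   `xml:"title"`
	Link        string   `xml:"link"`
	Description string   `xml:"description"`
	PubDate     string   `xml:"date"`
}

// sampleRDF is an invented, trimmed document used only for this sketch.
const sampleRDF = `<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
  <channel>
    <title>Example</title>
    <link>http://example.org/</link>
    <description>An example feed</description>
    <dc:date>2017-01-18T10:00:00Z</dc:date>
  </channel>
  <item>
    <title>First post</title>
    <link>http://example.org/first</link>
    <dc:date>2017-01-18T10:00:00Z</dc:date>
  </item>
</rdf:RDF>`

// decodeRDF is a hypothetical helper.
func decodeRDF(data []byte) (*rdfXML, error) {
	var r rdfXML
	if err := xml.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	return &r, nil
}

func main() {
	r, err := decodeRDF([]byte(sampleRDF))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(r.Channel.Title, r.Channel.PubDate, len(r.RDFItems))
}
```

The items land in RDFItems on the root type, which is why an Items field on the channel would stay empty for this format.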
Overloaded feed type
A big problem was the conflict of responsibility between RSSFeed and Channel. RSSFeed was the form of the feed once we retrieved it from the database, yet it also included fields, for convenience, that I populated for presentation purposes. I was trying to use it for everything and it ended up not being much good at anything. Additionally, it seemed like some of what I used RSSFeed for would be better suited to Channel. For example, it was weird expecting RSSFeed when writing a feed's XML. It has fields such as ID, but there's no requirement that I'm writing a feed from a database. When outputting a feed to XML, I didn't use this field anyway.
I decided to completely rework these types. I defined new types by their responsibility:
- Feed/Item: Hold a feed parsed from XML or to be written out as XML.
- DBFeed/DBItem: Hold a feed retrieved from the database.
- HTMLItem: Hold information needed when presenting a feed's items in the web interface.
Essentially what I did was break RSSFeed/RSSItem into DBFeed/DBItem and HTMLItem. I then renamed Channel to Feed, and it is now clear that its job is representing an arbitrary feed, free of database and presentation concerns.
This means:
- When parsing or writing out a feed we always work with Feed and Item. Previously I parsed into Channel but wrote out using RSSFeed.
- When we retrieve a feed and its items from the database, we now only have fields that are actually from the database.
- When we want to show a feed's information in the interface, I now have a type just for that purpose, and it includes only what we need there. Fields that are only relevant to the presentation layer are now clearly separate from other contexts.
These types look like this:
type Feed struct {
Title string
Link string
Description string
PubDate time.Time
Items []Item
}
type Item struct {
Title string
Link string
Description string
PubDate time.Time
}
type DBFeed struct {
ID int64
Name string
URI string
UpdateFrequencySeconds int64
LastUpdateTime time.Time
}
type DBItem struct {
ID int64
Title string
Description string
Link string
FeedID int64
PublicationDate time.Time
FeedName string
ReadState string
}
type HTMLItem struct {
ID int64
FeedName string
Title string
Link string
PublicationDate string
Description template.HTML
}
Summary
I didn't start out intending to do this refactor, but as I added the types to support Atom, it became clear that I had strained my original design to the point that it was confusing.
I ended up with more types than I started with (17 versus 12!), but each type now has a clear responsibility. It is apparent when to use each type, and types no longer have context-specific fields, so I don't have to wonder whether I need to set a field or whether a field is set.
Thinking about how to better represent the data was a useful exercise. But could I have avoided needing to refactor this way in the first place? If I had supported multiple input formats from the start, or at least designed for that, I may have made different choices. I also might have broken up the types by purpose earlier. I think sharing the types made sense up to a point, but it became a problem. A good time to rework the types might have been when I started writing feeds as well as reading them. At that point I started using the feed functionality as a library and using it in programs beyond the reader, so the purposes of the types started shifting.