This past summer I interned in San Francisco as a software engineering intern at a company called Doximity, a medical-based tech startup focused on doctors and clinicians. In this article, I will share an experience I had at the company by detailing an engineering decision I made, where I designed an internal tool based on an analysis of the GraphQL and REST APIs for GitHub.
Before diving into the problem statement and practical examples, I'd like to take some time to go over some technical terminology that will be used throughout the article.
Create, read, update, and delete, commonly known by the acronym CRUD, is a set of actions you may perform on a resource in persistent storage applications.
HTTP verbs, which are often used to build RESTful endpoints, may be expressed by the CRUD acronym as follows:
C: PUT/POST # Creates a resource (if the resource does not exist)
R: GET # Fetches an existing resource
U: PUT/POST/PATCH # Updates an existing resource
D: DELETE # Deletes an existing resource
Figure 1: HTTP methods, and the CRUD actions which they map to
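As a rough illustration, the mapping in figure 1 can be written down as a small lookup table. The names `CRUD_TO_HTTP` and `crud_actions_for` are made up for this sketch and are not part of any library.

```python
from typing import Dict, List

# CRUD actions and the HTTP methods from figure 1 that may express them.
CRUD_TO_HTTP: Dict[str, List[str]] = {
    "create": ["PUT", "POST"],
    "read": ["GET"],
    "update": ["PUT", "POST", "PATCH"],
    "delete": ["DELETE"],
}


def crud_actions_for(method: str) -> List[str]:
    """Return the CRUD actions a given HTTP method may express."""
    wanted = method.upper()
    return [action for action, methods in CRUD_TO_HTTP.items() if wanted in methods]


print(crud_actions_for("GET"))
print(crud_actions_for("post"))
```

Note that `PUT` and `POST` appear under both create and update, which is why the lookup returns a list rather than a single action.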
For problem scope and brevity's sake, I will only be focusing on the R in CRUD (the READ action), and how it pertained to a solution I devised at my last internship.
"REST (Representational State Transfer) is an application program interface (API) that uses HTTP requests to GET, PUT, POST and DELETE data." [1]
REST is the most common way to build CRUD APIs, and it is most often implemented over HTTP. Other approaches to web services exist, such as Simple Object Access Protocol (SOAP), but for the purpose of this article, I will only be focusing on RESTful endpoint architectures which use HTTP verbs.
Regarding CRUD, the READ action is equivalent to the GET HTTP verb for a particular Uniform Resource Identifier (URI) (see figure 1). Making a GET request for a particular endpoint will return a representation of a resource (or list of resources), which is commonly in the form of some JavaScript Object Notation (JSON) schema.
Example Request-Response Cycle
Request:
GET /users # Perform GET request on /users URI
Accept: application/json # Asks the server to respond in JSON format
Response:
200 OK
Content-Type: application/json
{
  "users": [
    {
      "id": 1,
      "first_name": "Andrew"
    },
    {
      "id": 2,
      "first_name": "Robert"
    }
  ]
}
Figure 2: example of a GET request-response cycle for the /users endpoint
In this example, the URI /users corresponds to the User resource. When a request is made, the endpoint returns an array of JSON objects with the User object schema.
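On the client side, consuming such a response usually comes down to decoding the JSON body and walking the resulting structure. The sketch below does this for the body shown in figure 2, hard-coded as a string so that it runs without a server.

```python
import json

# The response body from figure 2, reproduced as a string.
response_body = """
{
  "users": [
    {"id": 1, "first_name": "Andrew"},
    {"id": 2, "first_name": "Robert"}
  ]
}
"""

# Decode the JSON document into Python dictionaries and lists.
payload = json.loads(response_body)

# Each element of "users" follows the User object schema.
first_names = [user["first_name"] for user in payload["users"]]
print(first_names)
```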
"GraphQL is a query language for APIs and a runtime for fulfilling those queries with your existing data." [2]
GraphQL was developed by Facebook in 2012 as an alternative to conventional REST endpoints as a way to fetch and mutate data on a web server. It supports create, read, update, delete (CRUD) actions for persistent storage applications as REST does. It was designed as a specification, not an implementation, with a reference implementation provided by Facebook and written in JavaScript (Node.js).
By definition, GraphQL is a query language. To achieve the READ CRUD action functionality in GraphQL, a user may perform a query. A GraphQL query, in its most rudimentary form, is a way to ask a given object for its fields.
{
  user {
    id   # 'id' returns an 'Int' type
    name # 'name' returns a 'String' type
  }
}
Figure 3: rudimentary query asking to fetch a user's id and name
Figure 4: an example of a GraphQL query and response in GitHub's GraphQL (v4) API
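A GraphQL query is commonly sent to the server as an HTTP POST whose JSON body carries the query text and, optionally, its variables. The sketch below builds such a body for the query in figure 3. The helper name `build_graphql_body` is hypothetical, and no request is actually sent.

```python
import json
from typing import Any, Dict, Optional

# The query from figure 3, kept as a plain string.
QUERY = """
{
  user {
    id
    name
  }
}
"""


def build_graphql_body(query: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a query and its variables into a JSON request body."""
    return json.dumps({"query": query, "variables": variables or {}})


body = build_graphql_body(QUERY)
print(body)
```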
Since being open-sourced in 2015, GraphQL has gained popularity, and now has clients implemented in over 10 languages, including the popular programming languages Elixir, Java, Python, Clojure, and Ruby.
import sangria.schema._
import sangria.execution._
import sangria.macros._
// Executor.execute needs an implicit ExecutionContext in scope
import scala.concurrent.ExecutionContext.Implicits.global
// Defining a potential query type with a value for the field "hello"
val QueryType = ObjectType("Query", fields[Unit, Unit](
  Field("hello", StringType, resolve = _ => "Hello world!")
))
// Creating a GraphQL schema for the given query type
val schema = Schema(QueryType)
// Executing a query on the schema with the given potential query type
val query = graphql"{ hello }"
// The result is a Future; print it once it completes
Executor.execute(schema, query) map println
Figure 5: an example usage of a GraphQL client written in Scala
GraphQL is gaining traction and popularity in the developer community due to its ease of use and its performance benefits, such as reducing the number of requests required to fetch desired data. Several larger tech companies such as GitHub, Shopify, and Pinterest have adopted GraphQL into their stacks, and have started to write versions of their APIs against the GraphQL specification to take advantage of these benefits.
For one of my projects at Doximity, I was assigned a task to investigate and subsequently build a tool to help recruiters research potential software engineering candidates using GitHub as a means of finding them.
Initial Specifications
The initial specifications for the tool were as follows:
Specification 1: A recruiter at Doximity should be able to find software engineers to reach out to for potential engineering outsourcing
Specification 2: The said recruiter should be able to search candidates based on location
Specification 3: The said recruiter should be able to search candidates based on skills (e.g., rails, docker, neo4j)
Data to collect from candidates
In total, there were seven requirements for the data collected about a candidate:
Requirement 1: Name
Requirement 2: Count of projects in each language
Requirement 3: Email
Requirement 4: List of organizations they belong to
Requirement 5: Number of followers
Requirement 6: If they are looking for a job
Requirement 7: Location
I initially started researching ways I could leverage GitHub's REST API (v3) to find this information. However, I noticed a new API for GitHub in alpha stage utilizing the GraphQL specification (v4). I was curious about the benefits attained from using GraphQL over conventional REST architecture, and so I decided to compare the two in the context of my use case.
When investigating the differences between the APIs, I noticed some shortcomings with the REST implementation and some advantages of the GraphQL one when leveraging the GitHub API for my use case.
Shortcoming: Numerous requests needed to fetch desired data
In our situation, multiple calls were necessary to retrieve all of the data required by our recruiters. First, an initial request was made with a query string composed of a list of locations for potential candidates, ANDed with a list of languages said candidates had used in their repositories.
GET: "https://api.github.com/search/users?q=location:san-francisco+language:ruby"
Figure 6: example of the initial GET request, with a specified location and language
This example would return an array of User objects with the following JSON object schema:
{
"login": "mojombo",
"id": 1,
"avatar_url": "https://avatars0.githubusercontent.com/u/1?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/mojombo",
"html_url": "https://github.com/mojombo",
"followers_url": "https://api.github.com/users/mojombo/followers",
"following_url": "https://api.github.com/users/mojombo/following{/other_user}",
"gists_url": "https://api.github.com/users/mojombo/gists{/gist_id}",
"starred_url": "https://api.github.com/users/mojombo/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/mojombo/subscriptions",
"organizations_url": "https://api.github.com/users/mojombo/orgs",
"repos_url": "https://api.github.com/users/mojombo/repos",
"events_url": "https://api.github.com/users/mojombo/events{/privacy}",
"received_events_url": "https://api.github.com/users/mojombo/received_events",
"type": "User",
"site_admin": false,
"score": 1.0
}
Figure 7: JSON object schema for GitHub v3 REST API's search/users endpoint
As you can see from the response schema, none of the required information in the business requirements is present in this response. However, some other API URLs exist, which can be queried to find the desired information.
Endpoints for api.github.com:
GET: "search/users?q=location:#{locations}+language:#{languages}" # fetches array of users including urls and 'user_name'
GET: "search/code?q=#{skill} user:#{user_name}" # search for a particular skill or key term
GET: "users/#{user_name}" # name, email, location, isHireable, number of followers
GET: "users/#{user_name}/orgs" # organizations they belong to
GET: "users/#{user_name}/repos" # gets variable 'repo_name'
GET: "repos/#{user_name}/#{repo_name}/languages" # gets languages used in each repository
Figure 8: all the endpoints required in a tentative RESTful solution, using GitHub's v3 REST API
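To make the fan-out concrete, here is a minimal sketch that lists the per-candidate URLs from figure 8 for one user. The function name `candidate_endpoints` and the repository names `repo_a` and `repo_b` are placeholders invented for the example. Only the URL shapes come from the figure.

```python
from typing import List
from urllib.parse import quote

API_ROOT = "https://api.github.com"


def candidate_endpoints(user_name: str, repo_names: List[str], skills: List[str]) -> List[str]:
    """List every URL that must be fetched for a single candidate."""
    urls = [
        f"{API_ROOT}/users/{user_name}",        # name, email, location, hireable, followers
        f"{API_ROOT}/users/{user_name}/orgs",   # organizations they belong to
        f"{API_ROOT}/users/{user_name}/repos",  # repository names
    ]
    # One code search per skill.
    urls += [f"{API_ROOT}/search/code?q={quote(skill)}+user:{user_name}" for skill in skills]
    # One languages call per repository.
    urls += [f"{API_ROOT}/repos/{user_name}/{repo}/languages" for repo in repo_names]
    return urls


# Placeholder repositories and a single skill for the user from figure 7.
urls = candidate_endpoints("mojombo", ["repo_a", "repo_b"], ["rails"])
for url in urls:
    print(url)
```

Even with two repositories and one skill, a single candidate already costs six requests.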
In total, to fetch all the desired information using the traditional REST architecture, we had to make one request to get $n$ candidates, and then $3 + x + R_m$ calls for the $m$-th candidate to pull all the required data: three fixed calls (profile, organizations, and repository list), one code search per skill, and one languages call per repository.
let $n$ be the total number of users fetched in one query
let $x$ be the number of skills you wish to check for a given candidate
let $R_m$ be the number of repositories the $m$-th candidate has
Formula for number of requests: $1 + \sum_{m=1}^n (3 + x + R_m)$
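As a quick sanity check, the request count can be computed directly by adding up the per-candidate cost of $3 + x + R_m$ calls on top of the initial search request. The function name and the repository counts below are invented for illustration.

```python
from typing import List


def rest_request_count(repos_per_candidate: List[int], skills: int) -> int:
    """One search request, plus (3 + x + R_m) requests for each candidate m."""
    return 1 + sum(3 + skills + repos for repos in repos_per_candidate)


# Hypothetical example: three candidates with 10, 25, and 4 repositories,
# each checked for two skills.
total = rest_request_count([10, 25, 4], skills=2)
print(total)  # 1 + (15 + 30 + 9) = 55
```

The repository term dominates quickly, since every repository costs an extra call just to learn its languages.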
Shortcoming: Excess response information
Moreover, a significant shortcoming I found with RESTful requests arose when I only wanted one field from a response: by default, the endpoint would return the full resource, sending more data over the network than was required. This shortcoming will be covered in depth when highlighting the advantages of GraphQL in this regard.
Advantage: Inline fragments
GraphQL lets you benefit from a programming concept known as union types. In short, a union type in the context of GraphQL lets you specify the set of concrete types a value may take. For an abstract type, you may fragment on multiple concrete types. In GraphQL, this concept may be illustrated through an example with an abstract base class and derived concrete classes.
Consider the following class hierarchy example with abstract base class Animal, and derived concrete classes Dog, Cat, and Bird:
Figure 9: UML of simple class hierarchy with an abstract base class and three derived child classes
In the query example below, pet returns an abstract type Animal, which may have any of the concrete types Dog, Cat, or Bird. For the sake of argument, say I want to use a GraphQL API to get the type-specific fields of objects of type Dog and Bird, but not those of type Cat. I could fragment on the concrete types as follows:
query PetsInStore($store: Store!) {
  pet(store: $store) {
    age
    name
    ... on Dog {
      fur_colour
    }
    ... on Bird {
      feather_colour
    }
  }
}
Figure 10: An example of fragmenting on two concrete types for a union type comprised of derived types of an abstract type [3]
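This query would return an array of objects of abstract type Animal, each with the shared fields age and name. Objects of concrete type Dog additionally carry fur_colour, and objects of type Bird carry feather_colour. Objects of type Cat are still returned, but only with the shared fields, since no fragment matches them.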
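The effect of inline fragments can be imitated in a few lines of plain Python. This is only a simulation of how a server might apply the selection in figure 10, not how any GraphQL library is implemented. The pet records, the `__typename` tagging of each record, and the helper name are assumptions made for the sketch.

```python
from typing import Any, Dict, List

# Hypothetical records, tagged with their concrete type.
PETS: List[Dict[str, Any]] = [
    {"__typename": "Dog", "name": "Rex", "age": 3, "fur_colour": "brown"},
    {"__typename": "Cat", "name": "Tom", "age": 5, "eye_colour": "green"},
    {"__typename": "Bird", "name": "Tweety", "age": 1, "feather_colour": "yellow"},
]

# Fields requested on every Animal, and fields requested per concrete type.
COMMON_FIELDS = ["age", "name"]
FRAGMENTS = {"Dog": ["fur_colour"], "Bird": ["feather_colour"]}


def apply_inline_fragments(pets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the common fields, plus the fragment fields matching each pet's type."""
    result = []
    for pet in pets:
        fields = COMMON_FIELDS + FRAGMENTS.get(pet["__typename"], [])
        result.append({field: pet[field] for field in fields})
    return result


selected = apply_inline_fragments(PETS)
for pet in selected:
    print(pet)
```

In this sketch the Cat record is not filtered out. It simply comes back with the common fields only, because no fragment names its type.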
Advantage: Tell GraphQL what you need
In conventional REST architectures, you often end up fetching more data than is desired for your use case. Say you wanted to fetch a list of user ids from a given endpoint with the following schema:
{
"users": [
{
"id": 1,
"first_name": "Andrew",
"last_name": "McBurney",
"age": 20,
"citizenship": "Canadian",
"birthday": "October 13th, 1996",
"hometown": "Niagara Falls, Ontario"
},
...
]
}
Figure 11: REST response for /users endpoint on an arbitrary API
In this example, the only data you're concerned about is the id field. However, when you perform a GET request on the /users endpoint, you end up fetching a list of objects of type User, with the fields id, first_name, last_name, age, citizenship, birthday, and hometown.
In GraphQL, you can avoid this simply by specifying the fields of interest to you. The above example may be simplified as follows:
query {
  users {
    id
  }
}
Figure 12: GraphQL solution for business requirements entailing only user id field
This query will return a JSON response of the following form:
{
  "users": [
    {
      "id": 1
    },
    {
      "id": 2
    },
    ...
  ]
}
Figure 13: GraphQL response for objects of type User, limiting the fields to id only
In GraphQL, you simply tell the server what data you want through a query, and it returns only that information. The advantage over REST, in this case, is that you end up sending less data over the network since you're only fetching the data you care about.
In summary:
REST # potential for extra data sent over the network (fixed response)
GraphQL # tell it what you want, restrict data to what you need (dynamic response)
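The difference in payload size can be illustrated by trimming the REST-style record from figure 11 down to the requested field, which is roughly what a GraphQL server does on your behalf. The helper `select_fields` is a made-up name for this sketch.

```python
import json
from typing import Any, Dict, List

# The single user record shown in figure 11.
REST_RESPONSE = {
    "users": [
        {
            "id": 1,
            "first_name": "Andrew",
            "last_name": "McBurney",
            "age": 20,
            "citizenship": "Canadian",
            "birthday": "October 13th, 1996",
            "hometown": "Niagara Falls, Ontario",
        }
    ]
}


def select_fields(records: List[Dict[str, Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Return copies of the records that contain only the requested fields."""
    return [{field: record[field] for field in fields} for record in records]


trimmed = {"users": select_fields(REST_RESPONSE["users"], ["id"])}
print(trimmed)
print(len(json.dumps(REST_RESPONSE)), "characters versus", len(json.dumps(trimmed)))
```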
Advantage: One endpoint
Furthermore, there's no concept of multiple endpoints in GraphQL as there is with the conventional REST architecture. You can access all the information you require from a single endpoint by nesting the fields you need in one query (fragmenting on union types where necessary), rather than making multiple requests to attain the desired results.
Figure 14: in GraphQL, you can get the same information in fewer requests, from one endpoint rather than multiple like in REST architecture
Rate Limits
GitHub uses rate limits to restrict the number of requests a user may make to their API. Rate limits are commonly used as security measures to prevent malicious scraping of information.
For the v3 REST API, GitHub had a rate limit of 30 authenticated requests per minute [4] and 10 unauthenticated requests per minute [4] for their search API api.github.com/search.
For the v4 GraphQL API and the other v3 REST endpoints, the total number of authenticated requests you're allowed to make in an hour is 5000 (approximately 83 requests per minute).
Due to this limitation, I had to fetch the information in the fewest requests possible, to avoid being throttled for exceeding the API's rate limit. This was an important factor in the design of the tool I implemented.
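A back-of-the-envelope calculation shows why the limits mattered. The sketch below uses the 30 requests per minute search limit quoted above. The candidate and skill counts are invented, and the helper name is hypothetical.

```python
SEARCH_LIMIT_PER_MINUTE = 30  # authenticated v3 search API limit quoted above


def minutes_to_send(request_count: int, limit_per_minute: int) -> int:
    """Minimum number of whole minutes needed to stay under a per-minute limit."""
    # Integer ceiling division.
    return (request_count + limit_per_minute - 1) // limit_per_minute


# Hypothetical workload: 100 candidates, each checked for 3 skills via code search.
code_searches = 100 * 3
print(minutes_to_send(code_searches, SEARCH_LIMIT_PER_MINUTE))  # 300 / 30 = 10
```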
Compatibility Issues
Unfortunately, GitHub's GraphQL API is still in its alpha stage [5], and does not yet cover everything the v3 REST API offers. One limitation I encountered concerned the code search feature of the v3 API: there was no viable alternative to this endpoint in the GraphQL API at the time I implemented the solution.
Figure 15: code search issue opened up on GitHub's GraphQL API discussion forums
As was previously mentioned, when I was researching the GraphQL (v4) and REST (v3) APIs, I found two shortcomings with the REST implementation: numerous requests were needed to pull all the desired data, and excess response information was returned from the server. The GraphQL API mitigated these two shortcomings with the REST architecture for this use case. It reduced the problem I had with many requests, by enabling me to write a query that encompassed what would have been multiple requests to the REST API. Furthermore, it allowed me to specify only the fields relevant to my business logic, thus removing data I wasn't concerned with from the server response.
Because the v4 GraphQL API had no equivalent of the v3 REST API's search/code endpoint (api.github.com/search/code), I was tempted to implement the entire solution using the v3 REST API. However, I didn't want to lose out on the benefits GraphQL provides, and ended up creating a hybrid solution which leveraged both the v3 REST API and the v4 GraphQL one.
Hybrid Solution
I ended up using the v3 REST search API to search the code for specific keywords or skills in files. An example was searching for the term rails in the file Gemfile, to see if the candidate had rails projects on their GitHub.
GET: "https://api.github.com/search/code?q=rails+filename:Gemfile+user:AndrewMcBurney"
Figure 16: example using v3 REST API search/code endpoint to search for all repositories containing the keyword rails in the filename Gemfile for user AndrewMcBurney
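Building such a URL by hand is error-prone, so a small helper can assemble and encode the query string. The sketch assumes the search term is followed by the `filename:` and `user:` qualifiers separated by spaces. The function name is hypothetical.

```python
from urllib.parse import urlencode

SEARCH_CODE_URL = "https://api.github.com/search/code"


def code_search_url(term: str, filename: str, user: str) -> str:
    """Build a code search URL for a term within a given file name and user."""
    query = f"{term} filename:{filename} user:{user}"
    # Keep ':' readable; spaces are encoded as '+'.
    return SEARCH_CODE_URL + "?" + urlencode({"q": query}, safe=":")


url = code_search_url("rails", "Gemfile", "AndrewMcBurney")
print(url)
```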
For all other business requirements, I used the following GraphQL query from the GitHub v4 API:
# $cursor is nullable so that the first page can be requested with a null cursor
query($query_string: String!, $cursor: String, $m: Int!) {
  search(query: $query_string, type: USER, first: $m, after: $cursor) {
    userCount
    edges {
      cursor
      node {
        # Fragmenting on concrete class User
        ... on User {
          # Requirement 1: Name
          name
          # Requirement 3: Email
          email
          # Requirement 6: If they are looking for a job
          isHireable
          # Requirement 7: Location
          location
          # Requirement 5: Number of followers
          followers {
            totalCount
          }
          # Requirement 4: List of organizations they belong to
          organizations(first: 10) {
            nodes {
              name
            }
          }
          # Requirement 2: Count of projects in each language
          repositories(first: 100, orderBy: { field: PUSHED_AT, direction: DESC }) {
            nodes {
              languages(first: 10, orderBy: { field: SIZE, direction: DESC }) {
                nodes {
                  name
                }
              }
            }
          }
        }
      }
    }
  }
}
Figure 17: GraphQL query to fetch all required data (except code searching)
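The query returns each repository's languages rather than a ready-made tally, so Requirement 2 still needs a small amount of client-side work. Here is a minimal sketch of that step, assuming the response has been decoded into dictionaries shaped like the query. The sample user node and the function name are invented.

```python
from collections import Counter
from typing import Any, Dict

# Hypothetical decoded "node" for one user, shaped like the query in figure 17.
USER_NODE: Dict[str, Any] = {
    "name": "Example User",
    "repositories": {
        "nodes": [
            {"languages": {"nodes": [{"name": "Ruby"}, {"name": "JavaScript"}]}},
            {"languages": {"nodes": [{"name": "Ruby"}]}},
            {"languages": {"nodes": [{"name": "Python"}]}},
        ]
    },
}


def count_projects_per_language(user_node: Dict[str, Any]) -> Counter:
    """Count how many of a user's repositories use each language."""
    counts: Counter = Counter()
    for repo in user_node["repositories"]["nodes"]:
        for language in repo["languages"]["nodes"]:
            counts[language["name"]] += 1
    return counts


language_counts = count_projects_per_language(USER_NODE)
print(language_counts.most_common())
```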
While this query is certainly more complicated than the tentative REST solution in figure 8, it provides several advantages, such as reducing the number of requests needed to fetch all desired data and limiting the response to only contain the fields relevant to my business requirements.
let $m$ be the total number of users fetched in one query
let $x$ be the number of skills you wish to check for a given candidate
Formula for number of requests: $1 + mx$
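To put the two approaches side by side, the sketch below evaluates both request counts for the same made-up workload. The REST count adds $3 + x + R$ calls per candidate to the initial search, as described earlier. The hybrid count is the $1 + mx$ above. The function names and numbers are illustrative only.

```python
from typing import List


def rest_request_count(repos_per_candidate: List[int], skills: int) -> int:
    """REST only: one search, then (3 + x + R) requests per candidate."""
    return 1 + sum(3 + skills + repos for repos in repos_per_candidate)


def hybrid_request_count(candidates: int, skills: int) -> int:
    """Hybrid: one GraphQL query, then one code search per candidate per skill."""
    return 1 + candidates * skills


# Hypothetical workload: three candidates with 10, 25, and 4 repositories, two skills.
repos = [10, 25, 4]
rest_total = rest_request_count(repos, skills=2)
hybrid_total = hybrid_request_count(len(repos), skills=2)
print(rest_total, hybrid_total)  # 55 versus 7
```

The saving comes from the GraphQL query absorbing the three fixed calls and every per-repository languages call. Only the code searches remain.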
Looking back, there were plenty of lessons to be learned from the design decisions I made. Here are some of the lessons I'd like to share:
GraphQL isn't a silver bullet, and won't take over API design overnight
There are plenty of legacy REST APIs in production today, and only a handful of companies using GraphQL to design their new APIs. It will take a long time before GraphQL gains enough traction as a query language for it to take up a large portion of the market.
Tools in alpha stage are volatile and not always feature complete
As was discussed under Compatibility Issues, the GitHub v4 API does not currently offer everything the v3 REST API does. Hence, the new API is not feature complete with respect to the v3 REST endpoints. More generally, some GraphQL APIs are not compatible with their legacy counterparts (e.g., GitHub v3 to v4 compatibility).
Furthermore, GitHub has warned users that API features are volatile and subject to change before the beta release. Thus, breaking changes are a possibility, which would lead to further maintenance for developers using the API.
The chances of GitHub modifying the fields I used from the objects I searched are very small — but there is still an opportunity for things to change before the final release of the v4 API.
In many situations, developer time is more valuable than your program runtime
While the hybrid solution I implemented was faster, and used fewer requests than a solution comprised solely of the v3 REST API, it added extra complexity. Given the volatile nature of the API, perhaps it wasn't the best decision to implement the tool partially using the v4 GraphQL API.
After completing the project, I found myself asking questions such as "was the benefit gained from using part of the GraphQL API worth the developer time spent to implement both GraphQL and REST clients, and the added complexity of scraping from two distinct APIs?"
GraphQL is on the rise, and it's exciting to see how the developer community grows and responds to the new technology — as it is a relatively young query language.
If you're interested in learning more about GraphQL, and have a GitHub account, I would recommend playing around with the GitHub GraphQL API explorer. It's a great tool if you're interested in learning more about the query language, and the GitHub API itself.
As always, thank you for reading my article, and have a fantastic day!
— Andrew Robert
[1] http://searchcloudstorage.techtarget.com/definition/RESTful-API
[3] http://graphql.org/learn/queries/#inline-fragments
[4] https://developer.github.com/v3/search/#rate-limit
[5] https://developer.github.com/v4/
[6] https://medium.com/chute-engineering/graphql-in-the-age-of-rest-apis-b10f2bf09bba
GraphQL logos are trademarked by Facebook — all rights reserved.