Using Python + GraphQL + MongoDB to Find Promising Open Source Projects on GitHub

Written by Jordan H.

This week I’m launching a segment called “Freedom Friday,” where I showcase a promising GitHub project that I think deserves recognition but might have slipped under your radar.

But I’m a programmer. I’m going to go about this the way a programmer would: scour StackOverflow.com to figure out why my GraphQL query won’t work.

But before and after I did that, I designed and developed a Python application to “guess” which projects are promising. In this post I describe how I went about it and what I learned.

Reshare or comment on this article and get a shout-out in the first “Freedom Friday” video!


The Project

There are 5 steps to this project, as shown in the diagram below.

Overview

Steps 1 and 2 are handled by one Python script, and steps 3-5 are handled by another.

1. Query Projects from GraphQL

Here is the query used for the project:

query ($searchQuery: String!, $first: Int!, $after: String) {
  search(query: $searchQuery, type: REPOSITORY, first: $first, after: $after) {
    repositoryCount
    pageInfo {
      endCursor
      startCursor
    }
    edges {
      node {
        ... on Repository {
          id
          name
          url
          description
          updatedAt
          forkCount
          stargazerCount
          watchers {
            totalCount
          }
          mentionableUsers {
            totalCount
          }
          closedIssues: issues(states: CLOSED) {
            totalCount
          }
          totalIssues: issues {
            totalCount
          }
          repositoryTopics (first: 10) {
            edges {
              node {
                ... on RepositoryTopic {
                  topic {
                    name
                  }
                }
              }
            }
          }
          releases {
            totalCount
          }
        }
      }
    }
  }
}

Here are the GraphQL variables:

{
    "searchQuery": "is:public pushed:>={pushed_date} stars:{min_stars}..{max_stars} sort:interactions-desc topics:>={n_topics} created:>={created_date} NOT module NOT plugin NOT wrapper NOT \"for the\""
}

For now I don’t use the first variable (maybe later). I’m just interested in the top search results.

You might be freaked out by the curly braces. They are placeholders: I keep this search query as a Python string and fill in the values at runtime.
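Here’s a minimal sketch of how that placeholder filling could look with str.format. The names SEARCH_TEMPLATE, build_search_query, and build_variables are made up for illustration, and the page size of 10 is an assumption; the real project may do this differently.

```python
# Hypothetical template mirroring the searchQuery string shown above.
SEARCH_TEMPLATE = (
    "is:public pushed:>={pushed_date} stars:{min_stars}..{max_stars} "
    "sort:interactions-desc topics:>={n_topics} created:>={created_date} "
    'NOT module NOT plugin NOT wrapper NOT "for the"'
)


def build_search_query(pushed_date, created_date, min_stars, max_stars, n_topics):
    """Fill the curly-brace placeholders with concrete values."""
    return SEARCH_TEMPLATE.format(
        pushed_date=pushed_date,
        created_date=created_date,
        min_stars=min_stars,
        max_stars=max_stars,
        n_topics=n_topics,
    )


def build_variables(search_query, first=10):
    """Build the GraphQL variables dict (the default page size is an assumption)."""
    return {"searchQuery": search_query, "first": first}


if __name__ == "__main__":
    # Illustrative values only.
    query = build_search_query("2021-02-01", "2019-05-15", 1, 500, 3)
    print(build_variables(query))
```

Because json.dumps handles the escaping, the quotes around "for the" don’t need to be escaped by hand when the variables are serialized.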

Search Query parameters

Here’s a rundown of what each term means in the query:

is:public
By itself, this grabs all the public repositories in the default order
pushed:>={pushed_date}
Include only repositories pushed to GitHub on or after the given date (YYYY-MM-DD)
stars:{min_stars}..{max_stars}
Include only repos that have between {min_stars} and {max_stars} stars, inclusive
sort:interactions-desc
Sort by number of interactions (most to fewest)
topics:>={n_topics}
Include only repos that have at least {n_topics} topics
created:>={created_date}
Include only repos created on or after the given date (YYYY-MM-DD)
NOT module
Don’t include anything that mentions “module” in the title or description
NOT plugin
Don’t include anything that mentions “plugin” in the title or description
NOT wrapper
Don’t include anything that mentions “wrapper” in the title or description
NOT “for the”
Don’t include anything that mentions “for the” in the title or description, which I found was an indicator of a module

You can see all the possible search query parameters here.

Try out some of the search parameters yourself in GitHub to see what you get.
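To run the same search from code, the query and variables can be packaged into a POST request to GitHub’s GraphQL endpoint. This is only a sketch using the standard library: build_request is a hypothetical helper, the token is a placeholder for your own personal access token, and the original project may well use a different HTTP client.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"


def build_request(query: str, variables: dict, token: str) -> urllib.request.Request:
    """Package a GraphQL query and its variables into a POST request (not sent yet)."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder query and token, for illustration only.
    request = build_request(
        "query { viewer { login } }", {}, token="YOUR_TOKEN_HERE"
    )
    print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; the sketch stops short of that so it can be inspected without a network connection.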

GraphQL Response

The response looks something like this.

{
  "data": {
    "search": {
      "repositoryCount": 1435,
      "pageInfo": {
        "endCursor": "Y3Vyc29yOjQ=",
        "startCursor": "Y3Vyc29yOjE="
      },
      "edges": [
        {
          "node": {
            "updatedAt": "2021-05-01T13:15:01Z",
            "id": "MDEwOlJlcG9zaXRvcnkxNDM3NzQ1MjU=",
            "name": "TLC591x",
            "url": "https://github.com/Andy4495/TLC591x",
            "description": "Library for Texas Instruments TLC5916 and TLC5917 constant current LED sink driver for Arduino and Energia. ",
            "watchers": {
              "totalCount": 4
            },
            "forkCount": 0,
            "mentionableUsers": {
              "totalCount": 1
            },
            "closedIssues": {
              "totalCount": 2
            },
            "totalIssues": {
              "totalCount": 3
            },
            "repositoryTopics": {
              "edges": [
                {
                  "node": {
                    "topic": {
                      "name": "arduino"
                    }
                  }
                },
                {
                  "node": {
                    "topic": {
                      "name": "energia"
                    }
                  }
                }
                /// .... You get the idea ....
              ]
            }
          }
        },
        /// .... You get the idea ....
        {
          "node": {
            "updatedAt": "2021-05-01T15:24:42Z",
            "id": "MDEwOlJlcG9zaXRvcnkzMjQwODE3Njk=",
            "name": "raihaninfo",
            "url": "https://github.com/raihaninfo/raihaninfo",
            "description": "Md Abu Raihan portfolio",
            "watchers": {
              "totalCount": 1
            },
            "forkCount": 1,
            "mentionableUsers": {
              "totalCount": 1
            },
            "closedIssues": {
              "totalCount": 0
            },
            "totalIssues": {
              "totalCount": 0
            },
            "repositoryTopics": {
              "edges": []
            },
            "releases": {
              "totalCount": 0
            }
          }
        }
      ]
    }
  }
}
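The endCursor in pageInfo is what you would pass back as the after variable to get the next page. I don’t page through results yet, so treat the following as a sketch of how it could work: iter_edges and run_query are hypothetical names, and run_query stands in for whatever function sends the variables to GitHub and returns the decoded JSON.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_edges(
    run_query: Callable[[Dict[str, Any]], Dict[str, Any]],
    search_query: str,
    page_size: int = 10,
    max_pages: int = 5,
) -> Iterator[Dict[str, Any]]:
    """Yield search edges page by page, following pageInfo.endCursor."""
    after: Optional[str] = None
    for _ in range(max_pages):
        variables = {"searchQuery": search_query, "first": page_size, "after": after}
        search = run_query(variables)["data"]["search"]
        edges = search["edges"]
        if not edges:
            return  # Nothing left to read.
        yield from edges
        after = search["pageInfo"]["endCursor"]
        if after is None:
            return  # No cursor means there is no further page.
```

The max_pages cap is there so a mistake in the cursor handling can’t loop forever or burn through the API rate limit.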

Translating from GraphQL to MongoDB

It looks like a lot of weird data. I don’t need to know all about the edges and nodes just to get at the data. So let’s strip out the hierarchical structure and store only what we need. That’s what Python is here for! Python can translate the discombobulated GraphQL response into MongoDB documents.

Here’s the relevant portion of the Python code. Whenever I write a conversion function, I like to make it a @classmethod of the type I’m converting to:

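import typing as T
from datetime import datetime

from mongoengine import Document
# ...
# TIMESTAMP_FORMAT and RepositoryStats are defined elsewhere (not shown here).
class Repository(Document):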
    # ...
    @classmethod
    def from_graphql(cls, edge: T.Dict, autosave=True):
        node = edge["node"]
        id = node["id"]
        name = node["name"]
        url = node["url"]
        last_updated = datetime.strptime(node["updatedAt"], TIMESTAMP_FORMAT)
        description = node["description"]
        watchers = node["watchers"]["totalCount"]
        forks = node["forkCount"]
        mentionable_users = node["mentionableUsers"]["totalCount"]
        closed_issues = node["closedIssues"]["totalCount"]
        total_issues = node["totalIssues"]["totalCount"]
        topics = [t["node"]["topic"]["name"] for t in node["repositoryTopics"]["edges"]]
        releases = node["releases"]["totalCount"]
        stargazers = node["stargazerCount"]

        stats = RepositoryStats(
            watchers=watchers,
            forks=forks,
            mentionable_users=mentionable_users,
            closed_issues=closed_issues,
            total_issues=total_issues,
            releases=releases,
            last_updated=last_updated,
            stargazers=stargazers,
        )
        repo = cls(
            _id=id,
            name=name,
            url=url,
            description=description,
            stats=stats,
            topics=topics,
        )

        if autosave:
            repo.save()
        return repo
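If you want to try the flattening logic without a MongoDB connection, the same idea works with a plain dict. This is a sketch, not the project’s code: flatten_edge is a hypothetical helper, it only pulls a subset of the fields, and the TIMESTAMP_FORMAT value is my assumption based on timestamps like "2021-05-01T13:15:01Z".

```python
from datetime import datetime
from typing import Any, Dict

# Assumed format; it matches timestamps such as "2021-05-01T13:15:01Z".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def flatten_edge(edge: Dict[str, Any]) -> Dict[str, Any]:
    """Strip the edge/node nesting and keep only flat values."""
    node = edge["node"]
    return {
        "id": node["id"],
        "name": node["name"],
        "url": node["url"],
        "description": node["description"],
        "last_updated": datetime.strptime(node["updatedAt"], TIMESTAMP_FORMAT),
        "closed_issues": node["closedIssues"]["totalCount"],
        "total_issues": node["totalIssues"]["totalCount"],
        "releases": node["releases"]["totalCount"],
        "topics": [t["node"]["topic"]["name"] for t in node["repositoryTopics"]["edges"]],
    }


# Trimmed copy of the last edge from the sample response above.
sample_edge = {
    "node": {
        "updatedAt": "2021-05-01T15:24:42Z",
        "id": "MDEwOlJlcG9zaXRvcnkzMjQwODE3Njk=",
        "name": "raihaninfo",
        "url": "https://github.com/raihaninfo/raihaninfo",
        "description": "Md Abu Raihan portfolio",
        "closedIssues": {"totalCount": 0},
        "totalIssues": {"totalCount": 0},
        "repositoryTopics": {"edges": []},
        "releases": {"totalCount": 0},
    }
}

if __name__ == "__main__":
    print(flatten_edge(sample_edge))
```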

Calculating the Score

I wanted a way to calculate a sort of “popularity score.” I included this as a property in the RepositoryStats model so it acts kind of like a field:

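import math
from datetime import datetime

from mongoengine import DateTimeField, EmbeddedDocument, IntField

# FFMixin and AGE_FACTOR are defined elsewhere in the project (not shown here).
class RepositoryStats(EmbeddedDocument, FFMixin):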

    watchers = IntField(required=True)
    stargazers = IntField(required=True)
    forks = IntField(required=True)
    mentionable_users = IntField(required=True)
    total_issues = IntField(required=True)
    closed_issues = IntField(required=True)
    releases = IntField(required=True)
    last_updated = DateTimeField(required=True)

    @property
    def closure(self) -> float:
        return float(self.closed_issues + 1) / float(self.total_issues + 1)

    @property
    def activity(self) -> float:
        return float(
            (self.mentionable_users * 1)
            * (self.forks * 1)
            * (self.releases * 1)
            * (self.watchers * 1)
            * (self.stargazers * 1)
        )

    @property
    def age_in_days(self) -> float:
        # Because of timezone discrepancies I'm just fudging the
        # math a bit by adding a day by default.
        # Otherwise you'll get repositories that are created 3 hours
        # from now!
        return (datetime.today() - self.last_updated).days + 1

    @property
    def score(self) -> float:
        factor = self.activity / (self.closure)
        if self.age_in_days > 0:
            # This will be a divide-by-zero error if self.age_in_days is
            # less than or equal to 0.
            factor = (factor / self.age_in_days / AGE_FACTOR)
        if not (self.releases and factor):
            return self.releases or factor
        try:
            return math.log(factor * self.releases)
        except ValueError as err:
            import ipdb

            ipdb.set_trace()

The ipdb there is only for debugging in case I run into a math error. Basically it shouldn’t ever get to that point.
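To make the arithmetic concrete, here is the same formula as standalone functions with made-up numbers. It’s a sketch for following the math, not the model itself: the function names are hypothetical, and age_factor=1.0 is an assumption, since the real AGE_FACTOR value isn’t shown in this post.

```python
import math


def closure(closed_issues: int, total_issues: int) -> float:
    # The +1 on both sides avoids dividing by zero when a repo has no issues.
    return (closed_issues + 1) / (total_issues + 1)


def activity(mentionable_users, forks, releases, watchers, stargazers) -> float:
    # A plain product, so a single zero count zeroes the whole activity value.
    return float(mentionable_users * forks * releases * watchers * stargazers)


def score(activity_value, closure_value, releases, age_in_days, age_factor=1.0):
    factor = activity_value / closure_value
    if age_in_days > 0:
        factor = factor / age_in_days / age_factor
    if not (releases and factor):
        # Mirrors the fallback in the property: skip the log when a term is zero.
        return releases or factor
    return math.log(factor * releases)


# Made-up repo: 2 mentionable users, 3 forks, 2 releases, 4 watchers, 10 stars,
# 2 of 3 issues closed, last updated 4 days ago.
# activity = 480, closure = 0.75, so 480 / 0.75 = 640; 640 / 4 days = 160;
# log(160 * 2) = log(320), roughly 5.77.
example = score(activity(2, 3, 2, 4, 10), closure(2, 3), releases=2, age_in_days=4)

if __name__ == "__main__":
    print(example)
```

One thing the numbers make obvious: because activity is a product, a repo with zero forks (or zero watchers, and so on) ends up with an activity of 0 no matter how many stars it has.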

Getting the data

Then in order to fetch the data I simply run:

# I like to use poetry
poetry run python -m fffeed.fetch 2021-02-01 2019-05-15 -s 1 -S 500 -T 3
# This might hang for a bit
poetry run python -m fffeed.report > report.csv

And the following table is written to report.csv.

sample

Looks like it works! Take a look at the second-ranked project, cl-hackathon-app. Even though it has a lower closure rate than docs, that’s only because it’s not as active…plus cl-hackathon-app has more issues, which indicates more users are filing issues.

Try it out!

The project is open source at https://gitlab.com/srcrr/freedom-friday-feed. Clone the repo and try it out yourself!

And don’t forget to share this article with your colleagues on LinkedIn or Twitter. If you do, you’ll get a shout-out in my next Freedom Friday video!

Did you like this article?

Did you know I'm available for hire?

Send me an email here: hireme@jordanhewitt.pro

Or connect with me in Wire: @nswered