Introducing Kinase, B12’s Web Content Labeling Framework
The web contains a variety of content that is easily accessible to humans but ill-formed for machines. Many organizations have spent countless hours building user interfaces and algorithms for web scraping, schema inference, and structured data extraction from the vast corners of the web.
With Kinase, we hope that the many companies, academic institutions, and journalists who want to add structure to unstructured data on the web can focus on their applications without having to custom-design a user interface for acquiring that data.
A Brief History of Structuring Data on the Web
Web structured data extraction has been a commercial, academic, and journalistic fascination for a long time. Here is just a small set of efforts in this vein:
Various companies including import.io and the now-defunct Kimono Labs have been established with the goal of turning structured data embedded on websites into machine-accessible APIs. We hope to make user interfaces like these easier to build with Kinase.
Open source projects such as Scrapy and Portia help programmatically or visually create web crawlers to scrape large collections of structured data from websites.
Vertical-specific efforts like that at Locu use human- and machine-powered pipelines to extract valuable information such as price lists.
In journalism, data scraping is often a step along the way to identifying an insight for a story. There are many tutorials and explainers about what this process looks like.
Kinase fits into this space by providing a reusable user interface for labeling content on a website. It does not offer complex extraction algorithms or methods of extracting large paginated collections of data from websites. Instead, it makes it simple for users to click on text and other media (e.g., a product’s title and photo), specify which fields in your schema the clicked-on content is relevant to (e.g., picture), and save this labeled content via whatever API you provide.
There are three Kinase concepts that it helps to understand before using Kinase or creating your own Kinase extension: annotations are the fields you would like to label, mappings are the specific values of those fields on a particular website, and contexts are groupings of mappings for a particular labeling session. We expand on each concept below.
Annotations represent the labels for which you’d like to select content. For instance, you might want to map all the products from a company’s website to a products annotation.
Content selected for an annotation must be mapped to one of its fields. In the case of our products annotation, these fields might include a description and a photo. Each field in an annotation can only be mapped to a specific type of content: in our example, description holds text and photo holds an image. The products annotation can be mapped to a variable number of products containing that set of fields. Kinase also allows you to enforce a single mapping for an annotation, as described in the next section.
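To make the idea of typed fields concrete, here is a minimal sketch of how an annotation could be described as plain data. The property names (name, multiple, fields, type) and the helper canMapToField are assumptions made for illustration; they are not taken from Kinase's actual API.

```javascript
// Hypothetical annotation description. The shape is an assumption for
// illustration only and is not Kinase's real configuration format.
const productsAnnotation = {
  name: 'products',
  multiple: true, // a variable number of products may be mapped
  fields: [
    { name: 'description', type: 'text' },
    { name: 'photo', type: 'image' },
  ],
};

// Returns true when content of the given type may be mapped to the named field.
function canMapToField(annotation, fieldName, contentType) {
  const field = annotation.fields.find((f) => f.name === fieldName);
  return field !== undefined && field.type === contentType;
}

console.log(canMapToField(productsAnnotation, 'photo', 'image')); // true
console.log(canMapToField(productsAnnotation, 'photo', 'text')); // false
```

Keeping the type on each field lets a labeling interface reject, say, a paragraph of text selected for a photo field before anything is saved.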
The content a user selects for an annotation is called a mapping. A single mapping to products would include the actual text taken from the web for a product’s description, as well as an image for its photo.
Each mapped field in a mapping contains the content taken from the website (with any additional user edits) and the original source of that content (the URL of the website it was taken from, and a unique CSS selector specifying its container).
If specified, an annotation can support multiple mappings (when selecting content for products, the multiple mappings for that annotation might represent a store’s entire product catalog).
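The structure of a mapping described above can be sketched as follows. The helpers makeMappedField and addMapping, the example URLs, and the selectors are all hypothetical, and the choice to replace the existing mapping when only one is allowed is an assumption rather than documented Kinase behavior.

```javascript
// Each mapped field keeps the selected content plus where it came from.
function makeMappedField(content, url, selector) {
  return { content, source: { url, selector } };
}

// Assumed behavior: an annotation that allows multiple mappings accumulates
// them, while a single-mapping annotation keeps only the latest one.
function addMapping(existingMappings, mapping, allowMultiple) {
  return allowMultiple ? [...existingMappings, mapping] : [mapping];
}

const pageUrl = 'https://example.com/products'; // placeholder URL

const laptop = {
  description: makeMappedField('A lightweight laptop', pageUrl, '#product-1 > p'),
  photo: makeMappedField('https://example.com/img/laptop.png', pageUrl, '#product-1 > img'),
};
const desktop = {
  description: makeMappedField('A quiet desktop', pageUrl, '#product-2 > p'),
  photo: makeMappedField('https://example.com/img/desktop.png', pageUrl, '#product-2 > img'),
};

let catalog = addMapping([], laptop, true);
catalog = addMapping(catalog, desktop, true);
console.log(catalog.length); // 2

const single = addMapping([laptop], desktop, false);
console.log(single.length); // 1
```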
In Kinase, annotations and their mappings are grouped together in a context. This context is keyed by an arbitrary string and might represent a specific user labeling content in the extension or a project that content is being labeled for. If a user were doing research on both the laptop and desktop market, they might use a separate context to represent each market, with each context containing the products annotation.
By default, all mapped content is stored in a default context, so if you don’t need to switch contexts you can ignore the concept entirely.
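One way to picture contexts is as a map from an arbitrary string key to the mappings collected under each annotation, with a fallback key when none is given. The ContextStore class, its method names, and the 'default' key below are assumptions for illustration and do not reflect how Kinase stores data internally.

```javascript
const DEFAULT_CONTEXT = 'default'; // assumed name for the fallback context

// Hypothetical in-memory store grouping mappings by context and annotation.
class ContextStore {
  constructor() {
    this.contexts = new Map();
  }

  // Records a mapping under the given annotation within a context.
  addMapping(annotationName, mapping, contextKey = DEFAULT_CONTEXT) {
    if (!this.contexts.has(contextKey)) {
      this.contexts.set(contextKey, {});
    }
    const annotations = this.contexts.get(contextKey);
    if (!annotations[annotationName]) {
      annotations[annotationName] = [];
    }
    annotations[annotationName].push(mapping);
  }

  // Returns the mappings for an annotation in a context (empty if none).
  getMappings(annotationName, contextKey = DEFAULT_CONTEXT) {
    const annotations = this.contexts.get(contextKey);
    return (annotations && annotations[annotationName]) || [];
  }
}

const store = new ContextStore();
store.addMapping('products', { description: 'A lightweight laptop' }, 'laptops');
store.addMapping('products', { description: 'A quiet desktop' }, 'desktops');
store.addMapping('products', { description: 'An uncategorized item' });

console.log(store.getMappings('products', 'laptops').length); // 1
console.log(store.getMappings('products').length); // 1
console.log(store.getMappings('products', 'tablets').length); // 0
```

Callers that never pass a context key only ever touch the fallback context, which mirrors the point that the concept can be ignored when it isn't needed.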
Using Kinase involves creating your own instance of Kinase, programmatically telling Kinase what the data you’re labeling looks like, and then generating a Chrome extension that your organization can use to label data with that schema.
Here’s a brief walkthrough of how the process works:
First, install Kinase with npm install kinase (or better, yarn add kinase).
Then, you can create your own derived extension:
const Kinase = require('kinase')
const extension = new Kinase(options)
options.output parameter. You can configure your instance of Kinase with options such as a custom API for reading and saving data in structured form.
How We Use Kinase
Within B12, we use Kinase in various ways to power our Smart Websites:
Porting over old customer websites. When customers who have an existing website request a Launch Boost, our experts use Kinase to label all of the customers’ old structured data (e.g., their team, products, services) so that we can reuse this information on the website we are designing.
Training machine learning models. When a customer first joins B12, our robots algorithmically design a website for the customer in 60 seconds with as much of the content from their old website and social media presence as possible displayed correctly. To do this, we use a number of models including logo classifiers, about text classifiers, and collection (e.g., teams, products) classifiers that require training data. We’re currently exploring using the data that is generated from Kinase labeling sessions to train these models so that they become more accurate in the future.
These use cases highlight why building a Chrome plugin for these purposes is helpful. Our experts can enable Kinase on a customer’s old website or Facebook page, label the structured data they would like to ingest, and save this data via our internal API. We provide the schema of the data we’d like our experts to label and an API for saving the labeled data, and Kinase provides a reusable labeling interface that interacts nicely with any webpage.
Under the Hood
all: initial CSS rule on specific interface elements to combat commonly occurring style issues, but we’re still working to tackle this issue.
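As a rough illustration of the rule mentioned above, the sketch below builds the text of a CSS rule that applies all: initial to a selector. The function buildResetRule and the class name .labeling-ui-root are hypothetical and are not taken from Kinase's source; only the all: initial declaration itself is standard CSS, which resets every property except direction and unicode-bidi to its initial value.

```javascript
// Builds the text of a CSS rule that resets the matched elements' properties
// to their initial values, so host page styles are less likely to leak in.
function buildResetRule(selector) {
  return `${selector} { all: initial; }`;
}

// Hypothetical class name for an injected interface container.
console.log(buildResetRule('.labeling-ui-root'));
// .labeling-ui-root { all: initial; }
```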