Preface: This is the second in a series of 3 articles dedicated to the analysis of the Ethereum blockchain events:
Part I: Data scraping (fetching and decoding data)
Part II: Data pre-processing
Part III: Data analysis (a SuperRare contract data analysis)

This second part of the series is dedicated to the data pre-processing of a blockchain events dataset.

Since a Solidity contract can implement its own events, each with their own signature / semantic, I had to choose one contract to illustrate this article. My preferred choice was the SuperRare contract.

SuperRare is one of the most famous digital art marketplaces. You can purchase at sale price artworks certified on the blockchain, but you can also make open offers or place bids in auctions. Once you own a piece you can resell it in the secondary market.

The choice of a NFT marketplace contract

As a contract allowing the transfer, purchase and sale of NFTs, the SuperRare contract has several advantages:

  • event volume is low. Even if the volume has strongly increased starting end of 2019, it is still very manageable (141.409 events in the dataset).
  • the contract relies on the ERC721 standard which stipulate a solid foundation for event definitions.
  • NFTs are a great and fun way to explore the blockchain.

But first of all what are NFTs exactly?

Non-fungible tokens (NFTs) are unique, digital items with blockchain-managed ownership. Examples include collectibles, game items, digital art, event tickets, domain names, and even ownership records for physical assets. (opensea.io)

To know more about NFTs, this article on opensea.io blog is pretty good: The Non-Fungible Token Bible: Everything you need to know about NFTs.

Other examples of NFT Token Marketplaces:

  • CryptoKitties, a marketplace where you can buy/sell/breed digital cats. This is the first project to take NFTs to the mainstream.
  • Decentraland, NFTs for Decentraland land ownership and in-world assets.

NFT exchanges on marketplaces implements ERC721-compliant smart contracts.
The ERC721 standard provides a mapping of unique identifiers (each of which represents a single asset, ie. a token) to blockchain addresses, which represent the owner of that identifier. ERC721 also provides a permissioned way to transfer these assets, using the transferFrom method.
It defines a minimum interface a smart contract must implement to allow unique tokens to be managed, owned, and traded.

ERC721 functions:

balanceOf(owner)
ownerOf(tokenId)
safeTransferFrom(from, to, tokenId)
transferFrom(from, to, tokenId)
approve(to, tokenId)
getApproved(tokenId)
setApprovalForAll(operator, _approved)
isApprovedForAll(owner, operator)
safeTransferFrom(from, to, tokenId, data)

ERC721 events:

Transfer(from, to, tokenId)
Approval(owner, approved, tokenId)
ApprovalForAll(owner, operator, approved)

(See https://docs.openzeppelin.com/contracts/3.x/api/token/erc721)

The SuperRare events dataset

Note: The dataset is extracted from BigQuery but for an exploring purpose, you should instead fetch some samples from the Etherscan API (I provide a piece of code to perform this in my previous article).

The dataset has been extracted from the logs table of the bigquery-public-data.crypto_ethereum dataset which is available among the BigQuery public datasets.
You can query the dataset in SQL and filter by address and date to get the data you’re interested in on the period you’re interested in.
Be sure to to read Cost optimization best practices for BigQuery first and follow the best practices to reduce your query costs.
At least:

  1. Only query the data you need. So don’t select *
  2. Check if tables are partitioned. Most of the tables in the dataset are partitioned by day. So filter on dates!
  3. Estimate your query cost before running it: if you’re in the console, keep an eye on the estimation of the number of bytes read, if you’re in a script or if you’re using the bq command-line, use the dry_run option

BigQuery Pricing: $5.00 per TB, the first 1 TB per month is free (see BQ Query Pricing and BigQuery Quota).

That said, let’s dive into the dataset by loading it in a pandas dataframe:

df_events_raw.shape
(141409, 4)
df_events_raw.head()

The relevant information holds in the block_timestamp, topics and data columns.
block_timestamp refers to the block timestamp of the transaction where the event was emitted. The topics field encodes the event method and its indexed keys/parameters.
data encodes additional data (non indexed key/parameter).
The transaction_hash field refers to the hash of the transaction where the event was emitted.

Note that, on a SuperRare artwork page, you can view the transaction history. In the History section, just click the [view tx ] link to open the transaction on Etherscan:

The events dates range from 2018–04–05 to 2020–12–30.

Pre-processing of the dataset

What you’ll need:

  • A mapping dictionary which holds the keccak256 signatures of the method name and argument types (see my previous article for a full explanation on how to compute this dictionary):
signatures['0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef']
'Transfer(address,address,uint256)'
  • To perform sales analytics, you’ll need the price history of the pair ETH/USD or whatever pair fitting your purpose (you can download it here: ethereumprice):
eth_prices.head()
ethPrice        date
0  638.507606  2020-12-27
1  628.604621  2020-12-26
2  615.147380  2020-12-25
3  579.693058  2020-12-24
4  636.242994  2020-12-23

This will be used in the next article, the data analysis part.

  • Finally, you will need some utility functions to decode topic and data events (conversion to integer, string, etc.). Some examples:
from hexbytes import HexBytes

# Decode Ethereum address, extracting the 20 last bytes 
def decodeAddress(x):
  return HexBytes(x)[-20:].hex()

def decodeInteger(x):
  return int(x, 16)#decode transaction value

def decodeValue(x): 
  return decodeInteger(x) / 10**18

Your first pre-processing step will consist on:

  • The extraction of the 4 topics (topic_0, topic_1, topic_2, topic_3) from the topics array
  • The mapping of the topic_0 to the event method name according to the signatures mapping dictionary

This how your dataset should like after this first pre-processing:

df_events = preProcess(df_events_raw)

df_events.shape
(141409, 9)
df_events[['date', 'eventName', 'topic_1', 'topic_2', 'data']] \
  .sample(5)

The last step will be the extraction of the “Mint” event.

There is actually no such event.
A mint, that is a token creattion, is a transfer of a token from the genesis address (or zero address, ie. 0x0) to the address calling the mint function. So this event is emitted in the mint function but has the form Transfer(0x0, _to, _tokenId) and will be categorized as a Mint event in the dataset.

GENESIS_ADDRESS = \
    '0x0000000000000000000000000000000000000000000000000000000000000000'


def addCustomEvent(eventName, topic_1, topic_2):
  if eventName == 'Transfer' and topic_1 == GENESIS_ADDRESS:
    return 'Mint'
  else:
    return eventName


df['eventName'] = df.apply(lambda x: addCustomEvent(x.eventName,
                           x.topic_1, x.topic_2), axis=1)

Events Stats

Let’s start with some basic data analysis on events distribution and volume.

Although standard ERC721 stipulate a solid foundation to build upon, things like auctions or secondary sales are enforced by the platform. Hence in addition to the ERC721 events Transfer, Approval and ApprovalForAll, the contract implements events like Bid, CancelBid, AcceptBid, Sold and SetSalePrice:

kwargs = {
  "title": "Events Total Count by Method",
  "xlabel": "Event Method Name", 
  "ylabel": "Count"
}
df_events \
  .eventName \
  .value_counts() \
  .plot(kind="bar",**kwargs)

df_events \
  .eventName \
  .value_counts() \
  .nlargest(n=10, keep='first') \
  .plot.pie(title="Event Method Distribution")

Let’s see the monthly evolution of events volume:

kwargs = {
  "title": "Event volume by Month",
  "xlabel": "Month", 
  "ylabel": "Count"
}

df_events \
  .set_index("datetime") \
  .resample('1m') \
  .size()\
  .plot(**kwargs)

Let’s partition by event type by filtering on the most frequent event names: Bid, Mint, AccetpBid, Sold and SetSalePrice (Note that SalePriceSet and SetSalePrice events are merged together):

df_events.loc[df_events['eventName'] == "SalePriceSet", 'eventName'] = "SetSalePrice"
kwargs = {
  "title": "Events Total Count by Method by Month",
  "xlabel": "Month", 
  "ylabel": "",
  "figsize": (11,8)
}

df_events[df_events.eventName.isin(["Mint", "Bid", "Sold", "AcceptBid", "SetSalePrice"])] \
  .set_index("datetime") \
  .resample('1m') \
  .eventName \
  .value_counts() \
  .unstack() \
  .fillna(0) \
  .plot(**kwargs)

Obviously, here, the key months are November/December 2019 where the number of bids starts to explode.

More Pre-processing

We have not finished yet.
As explained in my previous article, the event signature drive the extraction and the conversion of the relevant information stored in the topics and data fields. For the Transfer event below:

event Transfer(address indexed _from, address indexed _to, uint256 _tokenId)

the _from address and the _to address values (both indexed) are located in the topic_1 and topic_2 fields respectively. To extract the Ethereum addresses from these fields, you’ll use the decodeAddress function:

df_transfer_events = df_events[df_events.eventName == 'Transfer']

df_transfer_events['from'] = df_transfer_events.topic_1.apply(lambda x: \
        decodeAddress(x))

df_transfer_events['to'] = df_transfer_events.topic_2.apply(lambda x: \
        decodeAddress(x))

The _tokenId (the non indexed parameter) is held by the data field. To extract the integer value from this field, you’ll use the decodeInteger function:

df_transfer_events['tokenId'] = df_transfer_events.data.apply(lambda x: \
        decodeInteger(x.data))

df_transfer_events[['data', 'tokenId']].head()

Following the same method, you will have a dataframe for each event type and where you’ll extract the relevant information in the context of this event type.

Note : There is a trick with the SuperRare contract. Actually, SuperRare has several contracts that can interact with each other.
That’s why you could find 2 event/functions with different names or parameters (usually an additional parameter for the contract address calling an other contract function). So the extraction of the event parameter values will not be so straight-forward.

We’ll go on with the data analysis of the SuperRare contract focusing on the Mint/ AcceptBid / Sold events on the next and last part of this series.

References:

https://opensea.io/blog/guides/non-fungible-tokens/
http://erc721.org/
https://docs.openzeppelin.com/contracts/3.x/api/token/erc721