Stateshaper

Reduce file size and generate content using small seeds

Stateshaper is a Python project that assists in tokenizing an effectively infinite, reproducible array of numbers. The tokens can be re-created from only a few bytes and used with mapping rules that call events or derive values for variables. Determinism is achieved with an algorithm that shares similarities with PRNGs (pseudo-random number generators), including LCGs (linear congruential generators).

The primary benefit of the package is that it allows for a reduction in an application's storage size. This in turn saves database costs, including those related to size, bandwidth and energy. This can amount to a savings of over 90% in many cases. It is most efficient when used for programs featuring content generation, personalization, synthetic data and procedural generation.

Stateshaper can also be used securely. If desired, the output created from the starting seed can be unique based on the chosen parameters. For example, in web applications the parameter values can be stored in environment variables the same way that access keys can.
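As a minimal sketch of the environment-variable approach, the snippet below reads the engine parameters from the environment and falls back to the documented defaults. The variable names STATESHAPER_CONSTANTS and STATESHAPER_MOD and the load_engine_params helper are hypothetical; they are not defined by this package.

```python
import json
import os

# Documented defaults for the optional engine parameters.
DEFAULT_CONSTANTS = {"a": 3, "b": 5, "c": 7, "d": 11}
DEFAULT_MOD = 9973


def load_engine_params(env=os.environ):
    """Read 'constants' and 'mod' from environment variables.

    STATESHAPER_CONSTANTS (a JSON object) and STATESHAPER_MOD (an integer)
    are hypothetical names chosen for this example.
    """
    raw_constants = env.get("STATESHAPER_CONSTANTS")
    constants = json.loads(raw_constants) if raw_constants else dict(DEFAULT_CONSTANTS)
    mod = int(env.get("STATESHAPER_MOD", DEFAULT_MOD))
    return {"constants": constants, "mod": mod}


params = load_engine_params()
# The result could then be passed on, e.g. RunEngine(data=your_data, **params)
```

Keeping these values out of the repository means the same stored seed only reproduces the same output where the matching parameters are configured.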

Recommended Uses Include:

Content Feeds
ML Training
Personalized Suggestions
QA Stress Testing
Procedural World Generation

This repository contains code written in Python, with other languages scheduled to be available soon.

Multiple demonstrations are live online (currently desktop only):

ML Training Demo

https://stateshaper-ml.vercel.app

Ruleset: Tokens

ML Training can require an enormous amount of data. Stateshaper is able to create nearly all possible test scenarios and re-create them at any time. This only takes a basic plugin file that derives values from tokenized numeric output. This demo shows how Stateshaper can be used in ML Training for self-driving cars.

Targeted Ads Compression Demo

https://ads-demo.vercel.app

Ruleset: Ratings

Demonstrates the engine's ability to generate data based on personalization. Ads shown are based on user preference ratings, which can be adjusted in the app. The data needed to recreate the entire profile is condensed into a ~50-250 byte JSON string.

Lesson Plans Demo

https://lessons-demo.vercel.app

Ruleset: Ratings

An example of a personalized learning plan based on a student's performance. Condenses the entire profile into a small seed.

Fintech QA Demo

https://stateshaper-qa-demo.vercel.app

Ruleset: Tokens

Using numbers as tokens, values are derived to stress test a fintech app's math calculations.

Drawing Graphics Demo

https://stateshaper-drawing.vercel.app

Ruleset: Tokens

With a plugin file that uses a numerical token to set each graphic object's attributes, an endless amount of onscreen content can be generated. A basic example of 2d shapes and colors is shown. When more precise calculations are used, this output can include even the most modern textures.

Features

Deterministic – same seed → same output on any machine.
Lightweight – core state is just a handful of integers (< 1 KB).
No training – no datasets, no GPUs, no model weights.
Semantic output – not just random noise; tokens can represent text, events, or structures.
Reproducible – perfect for QA, research, and simulation replays.
Offline-friendly – runs on laptops, servers, and small embedded devices.

Quick Start

Installation

Clone this repository:

git clone https://github.com/stateshaper/stateshaper.git
cd stateshaper

Make sure your data is in one of the formats listed in the "example_data" directory. The output that is generated depends on the values contained in this dataset. The data types are based on the following rules:

Instructions: Formatting Data for Input

If needed, use the FormatData class in the nested 'format_data' directory.

Compound - A collection of items that include a specified group. Only items from the defined groups will be part of the final output.

Example: Compound Dataset

Rating - Creates a sense of personalization for the output. The output is initially created based on a ratings preference. Afterward, the output is derived from the current included items and adjusted based on whatever parameters are decided upon (such as user input). The 'derived' dataset is all that needs to be saved on the backend, and does not include a 'rating' key. It can be used for all profiles in an application.

Example: Initial Rating Dataset, Derived Rating Dataset
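One way to picture rating-based output is a vocabulary in which higher-rated items appear more often, so a deterministic index stream selects them more frequently. The sketch below is an assumption made for illustration only: the `ratings` dict and the `build_weighted_vocab` helper are hypothetical and do not reflect the package's actual dataset format (see the example_data directory for that).

```python
def build_weighted_vocab(ratings):
    """Repeat each item once per rating point (hypothetical helper)."""
    vocab = []
    for item, rating in sorted(ratings.items()):
        vocab.extend([item] * max(0, int(rating)))
    return vocab


ratings = {"sports": 3, "cooking": 1, "travel": 0}
vocab = build_weighted_vocab(ratings)
print(vocab)  # ['cooking', 'sports', 'sports', 'sports']
```

Items rated 0 drop out entirely, which mirrors the idea that the derived dataset only needs the items that are currently included.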

Random - A seemingly random array of the included items is generated. Only one item from the master dataset is included per engine step.

Example: Random Dataset

Initialize a RunEngine class:

# data (REQUIRED) - the input data. must be in a format listed in the 'example_data' directory
# seed (optional) - only required to recreate a previous run of the engine. it is created after the first run of the engine. when used, no other parameters other than token count need to be specified. if no custom parameters are set, only the "v" key with state format data needs to be included. (ex. seed={"v": ["ABC12345", "BVCH457SZ"]})
# token_count (optional, default=10) - The desired size of the list containing your input terms.
# initial_state (optional, default=5) - The starting number to derive your output from. It can also be an array of integers for custom logic. 
# constants (optional, default={"a": 3, "b": 5, "c": 7, "d": 11}) - Only change this for custom morphing equations.
# mod (optional, default=9973) - Only change this for custom morphing equations. Its size indicates how much unique data can be generated from a seed. This can scale from 1 to infinity (or whatever the largest number the computer can handle is).

from stateshaper import RunEngine

# BASIC (first run)
engine = RunEngine(data=your_data, token_count=needed_tokens)

# RE-CREATE PREVIOUS OUTPUT
engine = RunEngine(data=your_data, seed=created_seed, token_count=needed_tokens)

# CUSTOM
engine = RunEngine(data=your_data, seed=created_seed, token_count=needed_tokens, constants=optional_custom_logic, mod=more_optional_logic)

# FULLY DEFINED
engine = RunEngine(data=your_data, token_count=needed_tokens, initial_state=optional_int_or_int_list, constants=optional_custom_logic, mod=more_optional_logic)


engine.start_engine()

Call the run_engine method:

engine.run_engine() # can pass an integer as a token count if a certain amount of output is needed. ex: engine.run_engine(25)

# OUTPUT    
#
# ["your", "input", "values", "are", "returned", "based", "on", "chosen", "stateshaper", "rules"]


# FOR ONE TOKEN
engine.one_token()

# OUTPUT
#
# ["one_item"]


# PREVIOUS TOKENS (based on passed token count)
engine.reverse()

# OUTPUT    
#
# ["rules", "stateshaper", "chosen", "on", "based", "returned", "are", "values", "input", "your"]


# REVERSE ONE
engine.one_reverse()

# OUTPUT
#
# ["one_item"]

For continuous use, the engine can be called in a loop using the run_engine method. For one-time use, call it once with a specific token count.

To create the same output again, start_engine needs to be called once more.
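The reverse and one_reverse methods suggest that each engine step can be undone. How Stateshaper does this is not described here; the sketch below only shows why a step of a plain linear congruential generator is reversible when the multiplier has a modular inverse. The `forward` and `backward` functions, and the use of 'a' and 'b' from the default constants as multiplier and increment, are assumptions for illustration.

```python
MOD = 9973   # default 'mod' from the parameter list (a prime number)
A, B = 3, 5  # 'a' and 'b' from the default constants

# Modular inverse of A; it exists because gcd(A, MOD) == 1.
A_INV = pow(A, -1, MOD)  # three-argument pow with -1 requires Python 3.8+


def forward(state):
    """One LCG-style step."""
    return (A * state + B) % MOD


def backward(state):
    """Undo one forward step."""
    return ((state - B) * A_INV) % MOD


state = 5
history = [state]
for _ in range(4):
    state = forward(state)
    history.append(state)

# Walking backward from the last state retraces the sequence in reverse.
retraced = [state]
for _ in range(4):
    state = backward(state)
    retraced.append(state)

assert retraced == history[::-1]
```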

Core Logic Example

Details of the Stateshaper Main Class

This section shows examples of the main classes included in the engine. They are not meant to be run individually.

from main.core import Stateshaper
from main.plugins.PluginData import PluginData

# Small numeric seed. This can usually be one number, but can also be an array if needed for custom logic. During each step of the engine, output is derived from this value. It is then morphed into a new number to be used for the next iteration. 
#
state = 5

# Tokens that are generated during each iteration of the program. For instance, this set of events can be used to generate sprites in a video game map.
#
vocab = ["plant", "office building", "pedestrian", "tree", "pavement"]  # ...and so on
#
# other examples include:
# vocab = ["string1", "string2", "string3"...]
# vocab = [event1, event2, event3...]
#  
# Class instantiation. The parameters are the only values that need to be stored other than your app's custom plugin file. In the most minimal cases, only the vocabulary is needed to be stored.
engine = Stateshaper(
    state=state,
    vocab=vocab,
    constants={"a": 3, "b": 5, "c": 7, "d": 11},
    mod=9973,
)

# Generate 20 tokens. 
tokens = [engine.next_token() for _ in range(20)]
# Example Output : ["tree", "tree", "pedestrian", "tree", "office building", "pedestrian", "pavement"....]

# Use the tokens to call events.
# `plugin` is your app's custom plugin object.
# The first parameter of draw() is the type of sprite to draw.
# The second parameter is a number from the state array (1 to mod, here 9973). This can be used in the drawing function to add variations like color, size and position.
# This line assumes `state` is the array form of the seed, with one number per token.
events = [plugin.draw(token, state[i]) for i, token in enumerate(tokens)]

Connector Class

The Connector class can take your data and process it to be ready for compression into seed format.

For more info, see the CONNECTOR documentation.

TinyState Class

Aside from the plugin file (which can be a template that does not include specific numbers), relevant data such as the event map and ratings can be condensed into Tiny State and/or Raw State format. Example:

Tiny State: ABC-12345

Raw State: QV589JX4

These values can be encoded and decoded in the TinyState class within the 'tools' directory. The main data map is represented as a long string of numbers. These numbers stand for positions in the map and are encoded into Tiny State format. A subset of numbers from the vocab used in the engine is also kept and encoded into Raw State format.

class TinyState:

   # return: Coded dataset.
   # return[0]: Tiny State seed. Represents the compressed values for the master dataset.
   # return[1]: Raw State seed. Represents the subset of personalized values chosen for the specific instance.
   def get_seed(data):
      # Encode logic.
      # This is the value that will be kept in the database. It is where most of the compression happens.
      return ["ABC-12345", "QV589JX4"]

   # return: Personalized events for the specific user/instance.
   def rebuild_data(master_seed, subset_seed):
      # Decode logic.
      # Number values from the seeds stand for key/value pairs from the master dataset and are kept in groups of four for datasets of length 100 or less. If more length is needed, the group size can be increased.
      #
      # Example: 0214 stands for key #3, value #15
      return ["event1", "event2", "event3"]  # ...and so on

For a given user, all that needs to be stored for the above example is:

["user-1234", ["ABC-12345", "QV589JX4"]]

From that, all other content can be generated during run time, and be personalized to each user.
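The four-digit grouping described in the rebuild_data comments can be sketched as follows. This assumes each group holds a zero-based, two-digit key index followed by a zero-based, two-digit value index, which is consistent with the 0214 example (key #3, value #15). The `decode_groups` helper is hypothetical and is not part of the TinyState class; how the digit string is packed into the Tiny State text itself is not covered here.

```python
def decode_groups(digits, group_size=4):
    """Split a digit string into (key_number, value_number) pairs.

    Each group is half key index, half value index, both zero-based.
    The returned numbers are one-based, as in "key #3, value #15".
    """
    if len(digits) % group_size != 0:
        raise ValueError("digit string length must be a multiple of group_size")
    half = group_size // 2
    pairs = []
    for start in range(0, len(digits), group_size):
        group = digits[start:start + group_size]
        key_index = int(group[:half])
        value_index = int(group[half:])
        pairs.append((key_index + 1, value_index + 1))
    return pairs


print(decode_groups("02140003"))  # [(3, 15), (1, 4)]
```

With two digits per index, each group can address up to 100 keys and 100 values, which matches the stated limit; a larger `group_size` widens both ranges.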

The ratings can be modified as needed, then re-encoded as a different seed.

For some uses, a longer seed may be required. This can happen when a custom initial state, mod or constants are required, or when a very large amount of data causes the Tiny State seed to need additional characters.

In total, there are four types of data used in Stateshaper. They are really just strings, dicts and lists in a certain format. The specific formats are as follows:

Full State:

seed = {"user_id": "johnq1234", "s": 5, "v": ["ABC12345", "567yQ90T34"], "c": {"a": 3,"b": 5,"c": 7,"d": 11}, "m": 9973}

# ~115 bytes

Short State:

seed = ["user_176551",5,["ABC12345", "567yQ90T34"]]

# ~45 bytes

Tiny State:

seed = "ABC12345"

# ~8 bytes

Raw State:

seed = "567yQ90T34"

# ~10 bytes

The format needed will vary depending on the needs for each application. For applications needing only continuous, random data Tiny or Raw format may be all that is needed. For those that require more complex, personalized data, Full State may be needed. A combination of any of these can be used, as long as the required 'vocab' parameter is passed into the engine.
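To check the approximate sizes quoted above for your own seeds, the serialized length can be measured directly. The sketch below uses compact JSON for the structured formats, which is an assumption about how a seed would be stored; `encoded_size` is a hypothetical helper.

```python
import json

full_state = {
    "user_id": "johnq1234",
    "s": 5,
    "v": ["ABC12345", "567yQ90T34"],
    "c": {"a": 3, "b": 5, "c": 7, "d": 11},
    "m": 9973,
}
short_state = ["user_176551", 5, ["ABC12345", "567yQ90T34"]]


def encoded_size(seed):
    """Size in bytes of the seed as compact UTF-8 JSON."""
    return len(json.dumps(seed, separators=(",", ":")).encode("utf-8"))


sizes = {
    "full": encoded_size(full_state),
    "short": encoded_size(short_state),
    "tiny": len("ABC12345".encode("utf-8")),
    "raw": len("567yQ90T34".encode("utf-8")),
}
print(sizes)
```

Compact separators drop the optional whitespace, so the measured values can come in slightly under the figures listed for the hand-written examples.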

For more info, see the TINY_STATE documentation.

How It Works

The 'seed' array, 'constants' and 'mod' value are used for calculations during each iteration. The array numbers during that iteration are used to call tokens from the list of values defined in the 'vocab' parameter. This can be seemingly random if needed, or designed to occur in a specific sequence.
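As a rough illustration of that loop, the sketch below morphs a single integer state with a plain linear congruential step and uses the result to index into 'vocab'. The morph rule shown is an assumption chosen for simplicity, not Stateshaper's actual equation, and `toy_token_stream` is a hypothetical name. A starting state of 7 is used only because it gives a varied sequence with this five-item list.

```python
def toy_token_stream(state, vocab, constants, mod, count):
    """Return `count` tokens from `vocab`, driven by an LCG-style state."""
    tokens = []
    for _ in range(count):
        # Assumed morph rule: a plain LCG step using constants 'a' and 'b'.
        state = (constants["a"] * state + constants["b"]) % mod
        # The morphed number selects a token by position.
        tokens.append(vocab[state % len(vocab)])
    return tokens


vocab = ["plant", "office building", "pedestrian", "tree", "pavement"]
constants = {"a": 3, "b": 5, "c": 7, "d": 11}

first = toy_token_stream(7, vocab, constants, 9973, 6)
second = toy_token_stream(7, vocab, constants, 9973, 6)

assert first == second  # same seed and parameters, same output
print(first)
# ['office building', 'tree', 'pavement', 'pedestrian', 'office building', 'tree']
```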

For basic use, no plugin is required. Only an array of the tokens (variables or functions) is needed. If no particular order is needed (such as generating data to stress test a system for QA, or a cooking app that suggests a random recipe), this may be all that is needed.

For more specialized designs, a custom plugin file can be written. This is used along with the Stateshaper 'Connector' class to define specific rules for the tokens included in the 'vocab' list. The rules can be based on developer needs, such as the attributes, sequence or frequency with which the tokens are called.

Considerations for Designing Custom Plugins

Define a Token List: This is the 'vocab' parameter. It can be an array of any type of values, including functions. A custom plugin file can be written if needed.
Are Custom 'seed', 'constants' or 'mod' Values Needed? If specific deterministic output is needed, these values can be adjusted to fit the morph equation.
Is a Custom Morph Rule Needed? The math done to change the array values can also be altered. This allows for further customization of the deterministic array.
Call the Stateshaper Class Object and Pass the Created Parameters.
Generate the Output: Create as many tokens as needed with the Stateshaper().generate_token(x) method. This can be called all at once or in a loop.
Modify the Stream if Needed: The data can be changed based on input such as user behavior or duration. The main class variables can be assigned new values in real time, or a new instance of the class can be created.
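A plugin in the sense used above can be as small as a function that maps a token plus a state number to concrete attributes. The sketch below is hypothetical: the `draw` signature mirrors the plugin.draw(...) call in the core logic example, but the attribute names, the color list and the formulas are assumptions for illustration.

```python
COLORS = ["red", "green", "blue", "yellow"]


def draw(token, number, mod=9973):
    """Turn a token and a state number (0 to mod - 1) into sprite attributes."""
    return {
        "sprite": token,
        "color": COLORS[number % len(COLORS)],
        # Scale the number into a size between 10 and 59 pixels.
        "size": 10 + (number * 50) // mod,
        # Spread positions over an 800 x 600 canvas.
        "x": number % 800,
        "y": (number // 800) % 600,
    }


sprite = draw("tree", 6923)
print(sprite)
# {'sprite': 'tree', 'color': 'yellow', 'size': 44, 'x': 523, 'y': 8}
```

Because every attribute is computed from the number, nothing about the sprite has to be stored; the same token and number always rebuild the same object.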

Running Tests

Tests can be run using the Tests class.

These tests compare Stateshaper's data generation against popular existing algorithms.

Areas of focus include determinism, reversibility, personalization, direct indexing, semantic flow and compression.

For more info, see the TESTS documentation.

Use Cases (Expanded)

Personalization Without Storing User Data

Personal Ads or News
Fitness Routine
Smart Home Scheduling
Student Test Sets/Lesson Plans

Assign each user a seed and derive their long-term content pattern from it, without storing behavioral data or personally identifiable information. The output evolves over time based on input such as user interaction.
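One way to assign each user a seed without storing anything extra is to derive the starting state from a stable identifier. The sketch below uses SHA-256 from the Python standard library; the `user_seed` helper and the mapping into the range 1 to mod - 1 are assumptions, not part of the package.

```python
import hashlib


def user_seed(user_id, mod=9973):
    """Map a user id to a stable integer in the range 1 to mod - 1."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % (mod - 1) + 1


seed = user_seed("user-1234")
assert seed == user_seed("user-1234")  # same id, same seed on any machine
assert 1 <= seed < 9973
```

The resulting integer could serve as the initial state for that user, so only the identifier itself needs to exist on the backend.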

Synthetic Data

Video Game Simulations
QA System Testing
Fintech Data
Experimental Values

Generate large, reproducible datasets from a single small seed. This avoids privacy issues and reduces cost compared to collecting real-world data. Relevant data can be continually created and called within an application.
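For QA-style synthetic data, numeric tokens can be turned directly into test inputs. The sketch below derives reproducible transaction amounts from a small seed; the LCG step and the `amounts_from_seed` helper are stand-ins chosen for illustration and are not the package's morph rule.

```python
from decimal import Decimal


def amounts_from_seed(seed, count, mod=9973, a=3, b=5):
    """Return `count` reproducible amounts between 0.00 and 99.72."""
    amounts = []
    state = seed
    for _ in range(count):
        state = (a * state + b) % mod  # stand-in LCG step
        # Interpret the state as a number of cents.
        amounts.append(Decimal(state) / Decimal(100))
    return amounts


batch = amounts_from_seed(7, 1000)

# Re-running with the same seed reproduces the exact same batch,
# so a failing case can be replayed from the seed alone.
assert batch == amounts_from_seed(7, 1000)
```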

Structures

Inventory
Application Content
Bookkeeping
Statistical Records

Condense large amounts of data into smaller objects. Generate it in real-time based on a set of defined terms/rules.

Project Structure

stateshaper/
├── api/
|     ├── run_api.py
|     ├── API.md
├── example_data
|     └── format_data/
|        ├── FormatData.py
|        ├── FORMAT_DATA.md
|     ├── compound.json
|     ├── random.json
|     ├── rating_derived.json
|     ├── rating_initial.json
|     ├── tokens.json
|     ├── EXAMPLE_DATA.md
├── research/
|     ├── flowchart.png
├── src/
│   └── main/
|        └── connector/
|              ├── Connector.py
|              ├── Modify.py
|              ├── Vocab.py
|              ├── CONNECTOR.md
|        └── demos/
|              └── ads/
|                    ├── ad_list.py
|                    ├── Ads.py
|              └── fintech_qa/
|                    ├── FintechQA.py
|                    ├── qa_data.py
|                    ├── FINTECH_QA.md
|              └── graphics/
|                    ├── Graphics.py
|                    ├── GRAPHICS.md
|              └── lesson_plan/
|                    ├── lessons_list.py
|                    ├── LessonPlan.py
|              └── ml_training/
|                    └── data/
|                       ├── environments.py
|                       ├── vehicles.py
|                    ├── BuildEnvironment.py
|                    ├── MachineLearning.py
|                    ├── TripTimeline.py
|              ├── DEMOS.md
|        └── tools/
|              └── compress_json/
|                 ├── CompressJson.py
|              └── derive_vocab/
|                 ├── DeriveVocab.py
|                 ├── DERIVE_VOCAB.md
|              └── tiny_state/
|                 ├── TinyState.py
|                 ├── TINY_STATE.md
|              ├── TOOLS.md              
│       ├── core.py
│       ├── stateshaper.py
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
├── pyproject.toml
├── QUICK_START.md
├── README.md
├── setup.py

Contributing

Contributions, ideas, and experiments are welcome!

See CONTRIBUTING instructions if you are interested in creating a custom plugin (or anything else). Right now you can fork this repo to experiment. An open source version of the code will be available soon.

License

This project is released under the MIT License. See LICENSE for details.

If you use this in research, products, or experiments, a mention or citation of "Stateshaper" and/or "Jason G. Dunn" is appreciated.
