transform

Common transformations for LLM data.
from nbdev.showdoc import show_doc
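import os
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Assumes `client` is an `openai.OpenAI` instance created earlier in this module.
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def chat(**kwargs):
    "A wrapper around `client.chat.completions.create` that has automatic retries."
    client.api_key = os.environ['OPENAI_API_KEY']
    return client.chat.completions.create(**kwargs)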


chat

 chat (**kwargs)

A wrapper around client.chat.completions.create that retries automatically, using randomized exponential backoff for up to 6 attempts.
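
To make the retry behavior concrete, here is a minimal standard-library sketch of randomized exponential backoff. It is not tenacity's implementation: `retry_with_backoff` and its parameters are hypothetical names chosen for illustration, and the wait formula only approximates `wait_random_exponential(min=1, max=60)`. The `sleep` argument is injected so the example runs without actually waiting.

```python
import random
import time
from functools import wraps

def retry_with_backoff(max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Hypothetical decorator: retry on any exception with randomized exponential backoff."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # The upper bound doubles each attempt, clamped to [min_wait, max_wait].
                    upper = max(min_wait, min(max_wait, 2.0 ** (attempt - 1)))
                    sleep(random.uniform(min_wait, upper))
        return wrapper
    return decorator

calls, waits = [], []

@retry_with_backoff(sleep=waits.append)  # record the waits instead of sleeping
def flaky():
    """Fail twice with a transient error, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = flaky()
print(result, len(calls), len(waits))  # ok 3 2
```

One difference worth noting: tenacity raises a `RetryError` by default once the attempts are exhausted, whereas this sketch re-raises the original exception.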



fetch_run_componets

 fetch_run_componets (run_id:str)

Return the inputs, output and funcs for a run of type ChatOpenAI.

_tst_run_id = '1863d76e-1462-489a-a8a7-e0404239fe47'

with _temp_env_var(tmp_env):  # context manager that sets specific environment variables for testing
    _inp, _out, _funcs = fetch_run_componets(_tst_run_id)

print(f"""first input:
{_inp[0]} 

output:
{_out} 

functions:
{_funcs}""")
first input:
{'role': 'system', 'content': "You are a helpful documentation Q&A assistant, trained to answer questions from LangSmith's documentation. LangChain is a framework for building applications using large language models.\nThe current time is 2023-09-05 16:49:07.308007.\n\nRelevant documents will be retrieved in the following messages."} 

output:
{'role': 'assistant', 'content': "Currently, LangSmith does not support project migration between organizations. However, you can manually imitate this process by reading and writing runs and datasets using the SDK. Here's an example of exporting runs:\n\n1. Read the runs from the source organization using the SDK.\n2. Write the runs to the destination organization using the SDK.\n\nBy following this process, you can transfer your runs from one organization to another. However, it may be faster to create a new project within your destination organization and start fresh.\n\nIf you have any further questions or need assistance, please reach out to us at [email protected]."} 

functions:
[]
import json
from typing import List
from pydantic import BaseModel

class RunData(BaseModel):
    "Key components of a run from LangSmith"
    inputs:List[dict]
    output:dict
    funcs:List[dict] 
    run_id:str

    @classmethod
    def from_run_id(cls, run_id:str):
        "Create a `RunData` object from a run id."
        inputs, output, funcs = fetch_run_componets(run_id)
        return cls(inputs=inputs, output=output, funcs=funcs, run_id=run_id)

    def to_msg_dict(self):
        "Transform the instance into a dict in the format that can be used for OpenAI fine-tuning."
        msgs = self.inputs + [self.output]
        return {"functions": self.funcs,
                "messages": msgs}

    def to_json(self):
        "The json version of `to_msg_dict`."
        return json.dumps(self.to_msg_dict())

    @property
    def outputs(self):
        "Return outputs for langsmith Datasets compatibility."
        return self.output

    @property
    def flat_input(self):
        "The input to the LLM in markdown."
        return self._flatten_data(self.inputs)

    @property
    def flat_output(self):
        "The output of the LLM in markdown."
        return self._flatten_data([self.output])

    @classmethod    
    def _flatten_data(cls, data):
        "Produce a flattened view of the data as human readable Markdown."
        md_str = ""
        for item in data:
            # Heading
            role = item['role']
            if role == 'assistant' and 'function_call' in item:
                role += ' - function call'
            if role == 'function':
                role += ' - results'
            
            md_str += f"### {role.title()}\n\n"

            content = item.get('content', '')
            if content: md_str += content + "\n"
                
            elif 'function_call' in item:
                func_name = item['function_call']['name']
                args = json.loads(item['function_call']['arguments'])
                formatted_args = ', '.join([f"{k}={v}" for k, v in args.items()])
                md_str += f"{func_name}({formatted_args})\n"
            md_str += "\n"
        return md_str


RunData

 RunData (inputs:List[dict], output:dict, funcs:List[dict], run_id:str)

Key components of a run from LangSmith



RunData.from_run_id

 RunData.from_run_id (run_id:str)

Create a RunData object from a run id.

with _temp_env_var(tmp_env):  # context manager that sets specific environment variables for testing
    rd = RunData.from_run_id(_tst_run_id)

print(f'Run {rd.run_id} has {len(rd.inputs)} inputs.')
print(f'Run {rd.run_id} output:\n{rd.output}')
Run 1863d76e-1462-489a-a8a7-e0404239fe47 has 3 inputs.
Run 1863d76e-1462-489a-a8a7-e0404239fe47 output:
{'role': 'assistant', 'content': "Currently, LangSmith does not support project migration between organizations. However, you can manually imitate this process by reading and writing runs and datasets using the SDK. Here's an example of exporting runs:\n\n1. Read the runs from the source organization using the SDK.\n2. Write the runs to the destination organization using the SDK.\n\nBy following this process, you can transfer your runs from one organization to another. However, it may be faster to create a new project within your destination organization and start fresh.\n\nIf you have any further questions or need assistance, please reach out to us at [email protected]."}


RunData.to_msg_dict

 RunData.to_msg_dict ()

Transform the instance into a dict in the format that can be used for OpenAI fine-tuning.

rd.to_msg_dict()['messages'][-2:]
[{'role': 'user',
  'content': 'How do I move my project between organizations?'},
 {'role': 'assistant',
  'content': "Currently, LangSmith does not support project migration between organizations. However, you can manually imitate this process by reading and writing runs and datasets using the SDK. Here's an example of exporting runs:\n\n1. Read the runs from the source organization using the SDK.\n2. Write the runs to the destination organization using the SDK.\n\nBy following this process, you can transfer your runs from one organization to another. However, it may be faster to create a new project within your destination organization and start fresh.\n\nIf you have any further questions or need assistance, please reach out to us at [email protected]."}]
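
As a sketch of the record layout, the snippet below builds the same kind of dictionary by hand for a made-up run that ends in a function call. The messages, the `get_weather` function, and its schema are hypothetical; only the `functions` and `messages` keys and the `inputs + [output]` ordering come from `to_msg_dict` as defined above.

```python
import json

# Hypothetical run components, shaped like the values stored on a RunData instance.
inputs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
]
output = {
    "role": "assistant",
    "content": None,
    "function_call": {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})},
}
funcs = [{
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# Same layout as `to_msg_dict`: the prompt messages followed by the model's reply.
record = {"functions": funcs, "messages": inputs + [output]}

# One training example becomes one line of JSON, as in `to_json`.
line = json.dumps(record)
print([m["role"] for m in record["messages"]])  # ['system', 'user', 'assistant']
```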


RunData.to_json

 RunData.to_json ()

The json version of to_msg_dict.

rd.to_json()[:100]
'{"functions": [], "messages": [{"role": "system", "content": "You are a helpful documentation Q&A as'

The properties flat_input and flat_output let you view the LLM input and output as human-readable Markdown:



RunData.flat_input

 RunData.flat_input ()

The input to the LLM in markdown.

print(rd.flat_input[:400])
### System

You are a helpful documentation Q&A assistant, trained to answer questions from LangSmith's documentation. LangChain is a framework for building applications using large language models.
The current time is 2023-09-05 16:49:07.308007.

Relevant documents will be retrieved in the following messages.

### System



Skip to main content

 **🦜️🛠️ LangSmith Docs**Python DocsJS/TS Docs

Sear


RunData.flat_output

 RunData.flat_output ()

The output of the LLM in markdown.

print(rd.flat_output)
### Assistant

Currently, LangSmith does not support project migration between organizations. However, you can manually imitate this process by reading and writing runs and datasets using the SDK. Here's an example of exporting runs:

1. Read the runs from the source organization using the SDK.
2. Write the runs to the destination organization using the SDK.

By following this process, you can transfer your runs from one organization to another. However, it may be faster to create a new project within your destination organization and start fresh.

If you have any further questions or need assistance, please reach out to us at [email protected].
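
The sample run above contains no function calls, so here is a small standalone sketch of how the flattening treats them. `flatten_messages` is a hypothetical free function that mirrors the logic of `_flatten_data`, and the messages are made up for illustration.

```python
import json

def flatten_messages(messages):
    """Render chat messages as Markdown, mirroring the logic of `_flatten_data`."""
    md_str = ""
    for item in messages:
        # Heading: annotate function calls and function results.
        role = item["role"]
        if role == "assistant" and "function_call" in item:
            role += " - function call"
        if role == "function":
            role += " - results"
        md_str += f"### {role.title()}\n\n"

        content = item.get("content", "")
        if content:
            md_str += content + "\n"
        elif "function_call" in item:
            call = item["function_call"]
            args = json.loads(call["arguments"])  # arguments arrive as a JSON string
            formatted_args = ", ".join(f"{k}={v}" for k, v in args.items())
            md_str += f"{call['name']}({formatted_args})\n"
        md_str += "\n"
    return md_str

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": None,
     "function_call": {"name": "add", "arguments": '{"a": 2, "b": 2}'}},
    {"role": "function", "name": "add", "content": "4"},
]
flat = flatten_messages(messages)
print(flat)
```

The three headings produced are `### User`, `### Assistant - Function Call`, and `### Function - Results`, with the call rendered as `add(a=2, b=2)`.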

Preparing .jsonl files

OpenAI fine-tuning expects training data as .jsonl files, with one JSON-encoded example per line.



write_to_jsonl

 write_to_jsonl (data_list:List[RunData], filename:str)

Writes a list of RunData objects to a .jsonl file.

Parameters:

- data_list (list of RunData): The data to be written.
- filename (str): The name of the output file.

_rids = ['59080971-8786-4849-be88-898d3ffc2b45', '8cd7deed-9547-4a07-ac01-55e9513ca1cd']
_tsfm_runs = [RunData.from_run_id(rid) for rid in _rids]
write_to_jsonl(_tsfm_runs, '_data/test_data.jsonl')
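
For readers who want to see the file format itself, here is a minimal standard-library sketch of writing JSON Lines. `write_dicts_to_jsonl` is a hypothetical helper that takes plain dictionaries rather than `RunData` objects; it is not the library's `write_to_jsonl`, and the records and temporary path are made up.

```python
import json
import os
import tempfile

def write_dicts_to_jsonl(records, filename):
    """Write each dict as one JSON object per line (hypothetical helper)."""
    with open(filename, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

records = [
    {"functions": [], "messages": [{"role": "user", "content": "Hi"},
                                   {"role": "assistant", "content": "Hello!"}]},
    {"functions": [], "messages": [{"role": "user", "content": "Bye"},
                                   {"role": "assistant", "content": "Goodbye!"}]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl")
    write_dicts_to_jsonl(records, path)
    # Read the file back: every line should parse to the record that produced it.
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```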

Validating .jsonl files before uploading them can save you time by catching formatting errors early.



validate_jsonl

 validate_jsonl (fname)

The code is adapted from https://cookbook.openai.com/examples/chat_finetuning_data_prep and updated to handle function calling.

validate_jsonl('_data/test_data.jsonl')
Num examples: 2
No errors found
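
To illustrate the kind of checks involved, here is a deliberately simplified sketch. It is not the library's `validate_jsonl` or the cookbook code: `check_jsonl_lines`, the set of accepted roles, and the specific checks are all assumptions chosen for illustration.

```python
import json
from collections import Counter

ALLOWED_ROLES = {"system", "user", "assistant", "function"}  # assumed role set

def check_jsonl_lines(lines):
    """Return a Counter of format problems found in an iterable of JSON Lines strings."""
    errors = Counter()
    for line in lines:
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            errors["invalid_json"] += 1
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            errors["missing_messages"] += 1
            continue
        for msg in messages:
            if msg.get("role") not in ALLOWED_ROLES:
                errors["unrecognized_role"] += 1
            # A message needs either text content or a function call.
            if not msg.get("content") and "function_call" not in msg:
                errors["empty_message"] += 1
        if not any(m.get("role") == "assistant" for m in messages):
            errors["no_assistant_message"] += 1
    return errors

good = json.dumps({"messages": [{"role": "user", "content": "Hi"},
                                {"role": "assistant", "content": "Hello!"}]})
bad = json.dumps({"messages": [{"role": "robot", "content": "Beep"}]})

print(check_jsonl_lines([good]))  # Counter()
print(dict(check_jsonl_lines([bad, "{oops"])))
```

Counting problems by type, rather than stopping at the first one, gives a summary of everything wrong with a file in a single pass.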