Parsing Multipart Form Data

I have a suite of simple CGI webapps that run on my localhost Apache webserver. Long ago they were Perl scripts. These days they are Python scripts. However, in some areas, Python can be a moving target. Case in point, the cgi module and the FieldStorage dictionary object.

The module was deprecated in Python 3.11 and removed in Python 3.13. Which I just installed on my new laptop. Which broke all my webapps. Which forced me to update them. Which went fine except for one app using multipart form data. This post documents the changes and some new code I wrote.

Prior to Python 3.13, handling CGI data was as simple as:

    from cgi import FieldStorage

    if __name__ == '__main__':
        form = FieldStorage()
        ...

Just import FieldStorage from the cgi module, then call it to get a dictionary of the submitted data. The keys are the field names, and the item values are the field values. A minor wrinkle is that when multiple fields share the same name, the field value is a list object of the multiple values.

For example, if the URL was something like this:

http://localhost/bin/myapp?x=21&y=42&z=63&a=foo&a=bar&a=moo

Then the form data would be:

'x': '21'
'y': '42'
'z': '63'
'a': ['foo', 'bar', 'moo']

(Note that it’s all string data.) There is one more wrinkle — a not so minor one — that I’ll get to below. It has to do with handling cases when a file is sent to the webserver.

For the record, now that I’m running Python 3.13, running the code above prints:

Traceback (most recent call last):
  File "fragment.py", line 1, in <module>
    from cgi import FieldStorage
ModuleNotFoundError: No module named 'cgi'

Which is what all my webapps started doing.

For all but the app handling file data, the fix was fairly simple:

    from sys import stdin
    from os import environ
    from urllib import parse

    if __name__ == '__main__':
        # GET data...
        qs = environ.get('QUERY_STRING', '')
        get_dict = parse.parse_qs(qs)

        # POST data...
        post_len = int(environ.get('CONTENT_LENGTH') or 0)
        post_txt = stdin.read(post_len)
        post_dict = parse.parse_qs(post_txt)

        form = {**get_dict, **post_dict}  # POST values win if a name is in both
        params = {}
        for name in sorted(form):
            value = form[name]
            # If it's a list of multiple items...
            if 1 < len(value):
                for ix,v in enumerate(value, start=1):
                    params[f'{name}-{ix}'] = v
            # Otherwise it's just a name:value pair...
            else:
                params[name] = value[0]
        ...

Rather than FieldStorage from the cgi module, we import the parse module from urllib.

CGI data can come from a GET request (parameters on the URL) or from a POST request (parameters in the stdin stream).

So, first the code obtains the QUERY_STRING text (if any) from the environment (line #7). It passes this to parse.parse_qs (line #8) and receives a dictionary object (get_dict). This contains the parameters from the URL.

Then (lines #11 and #12) the code obtains any text from stdin — this would be data submitted via a POST request. Line #13 calls parse_qs again with this text to get another dictionary object (post_dict).

Line #15 merges the two parse output dictionaries into a single dictionary (form).
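
When the same field name arrives in both the query string and the POST body, a plain dictionary merge cannot keep both sets of values. Here is a minimal sketch of a merge that concatenates the value lists instead. The helper name merge_params is hypothetical and not part of my webapps:

```python
def merge_params(get_dict, post_dict):
    """Merge two parse_qs-style dicts, concatenating lists for shared names."""
    # Copy the GET lists so the caller's dictionary is not modified.
    merged = {name: list(values) for name, values in get_dict.items()}
    # Append POST values, creating the list if the name is new.
    for name, values in post_dict.items():
        merged.setdefault(name, []).extend(values)
    return merged


merged = merge_params({'a': ['foo'], 'x': ['21']},
                      {'a': ['bar'], 'y': ['42']})
print(merged)
```

Whether GET and POST values should be combined like this or one should override the other depends on the app.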

One change is that the dictionary values parse_qs returns are always list objects. If there was only one parameter with that name, the list holds just one item. (Previously, there were list objects only for multiple items with the same name.) All my webapp code expects single items, so the code in lines #17 to #25 goes through the items in form and copies them to params (which is what my code uses) as single items (line #25). It also expands lists with multiple items to params with names having “-1”, “-2”, and so on appended (line #22).
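
To make the list behavior concrete, here is a small sketch that runs parse_qs on the query string from the example URL earlier and then flattens the result the same way. The function name flatten_params is just for illustration:

```python
from urllib.parse import parse_qs


def flatten_params(form):
    """Copy parse_qs output into a dict of single string values."""
    params = {}
    for name in sorted(form):
        values = form[name]
        if len(values) > 1:
            # Multiple values: append -1, -2, ... to the name.
            for ix, v in enumerate(values, start=1):
                params[f'{name}-{ix}'] = v
        else:
            # Single value: unwrap it from its list.
            params[name] = values[0]
    return params


form = parse_qs('x=21&y=42&z=63&a=foo&a=bar&a=moo')
params = flatten_params(form)
print(form)    # every value is a list, even for x, y, and z
print(params)  # single strings, with a-1, a-2, a-3 for the repeated name
```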

And that fixed my webapps. All but one.

That one was expecting file data in a POST request. As with all POST requests, the form data streams from stdin (note lines #11 and #12 in the code above). That isn’t the problem; the code above obviously solves that one.

The problem is, firstly, that my code previously accessed file data directly from the FieldStorage results for the associated parameter. I typically used a field named “fn” (filename) to post a file from my webform. Accessing the file was just a matter of:

    if __name__ == '__main__':
        ...
        for name in sorted(form):
            ...
            # Load filename and file object into params...
            if name == 'fn':
                fobj = form['fn']
                params['fn'] = fobj.filename
                params['fp'] = fobj.file
            ...
        ...

        # Use filename and file data...
        fn = params['fn']
        bs = params['fp'].read()
        ...

The code from line #3 to #9 was part of a loop similar to the one above that moves parameters from form to params. Line #7 gets the object FieldStorage returned. That object has a filename attribute (line #8) and a file attribute (line #9). The former is just a string, but the latter is a file object.

To use the file data (lines #13 to #15), the code accesses the ‘fn’ and ‘fp’ values in params. The file object had to be read to obtain the data (which is in bytes).

Without the parsing FieldStorage did, POST data in the stdin stream looks something like this:

------WebKitFormBoundaryFvgrhoe2FqQdLu4J
Content-Disposition: form-data; name="userId"

HT/50201
------WebKitFormBoundaryFvgrhoe2FqQdLu4J
Content-Disposition: form-data; name="sessId"

EWZJGUQLEDTZEQXWOMERLWZY
------WebKitFormBoundaryFvgrhoe2FqQdLu4J
Content-Disposition: form-data; name="ctrlId"

1768274923354599726
------WebKitFormBoundaryFvgrhoe2FqQdLu4J
Content-Disposition: form-data; name="fn"; filename="email_spam.py"
Content-Type: text/x-python

<<file-data>>
------WebKitFormBoundaryFvgrhoe2FqQdLu4J
Content-Disposition: form-data; name="size"

1
------WebKitFormBoundaryFvgrhoe2FqQdLu4J
Content-Disposition: form-data; name="cols"

16
------WebKitFormBoundaryFvgrhoe2FqQdLu4J
Content-Disposition: form-data; name="enc"

utf-8
------WebKitFormBoundaryFvgrhoe2FqQdLu4J--

The multipart form data above contains seven parts: six for simple input strings and one for the file contents. Each part starts with a separator line. In this sample the separator begins with six hyphens: two that the multipart format puts in front of every separator, plus four that belong to the browser’s boundary string. The text following the hyphens is unique to every submission. In CGI POST data, the boundary is defined in the CONTENT_TYPE environment string.

The value for the above form data looked like this:

multipart/form-data; boundary=----WebKitFormBoundaryFvgrhoe2FqQdLu4J
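
A minimal sketch of pulling the boundary out of that value, assuming the header uses the simple name=value parameter form shown here. The helper get_boundary is hypothetical and not part of the parser later in this post. It also shows how the separator line in the data is the boundary with two extra hyphens in front:

```python
def get_boundary(content_type):
    """Return the boundary parameter of a content-type value, or None."""
    # Everything after the first semicolon is a list of parameters.
    _, _, params = content_type.partition(';')
    for param in params.split(';'):
        key, _, value = param.strip().partition('=')
        if key.lower() == 'boundary':
            return value.strip('"')
    return None


ctype = 'multipart/form-data; boundary=----WebKitFormBoundaryFvgrhoe2FqQdLu4J'
boundary = get_boundary(ctype)
delimiter = b'--' + boundary.encode('ascii')
print(boundary)
print(delimiter)
```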

Following the separator are one or more headers. There is always a Content-Disposition header that provides the name of the form field. In the part for the actual file data, the Content-Disposition header also contains the original filename. That part has a second header, the Content-Type header that describes the file content.

Following the header(s) is a blank line followed by the part’s data. For simple input, the data is just a single line of text. For the file, it’s the actual file bytes, which can be binary (if sending an image, for instance). The potential for binary data basically means all the data has to be processed as bytes.
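
Processing bytes starts with reading bytes. Here is a sketch of reading the POST body from the binary side of stdin, assuming the webserver sets CONTENT_LENGTH. The function name read_post_bytes and the simulated request values are hypothetical:

```python
import io
import sys


def read_post_bytes(environ, stream=None):
    """Read the POST body as bytes, using CONTENT_LENGTH from the environment."""
    if stream is None:
        stream = sys.stdin.buffer   # binary view of stdin
    # A missing or empty CONTENT_LENGTH is treated as zero.
    length = int(environ.get('CONTENT_LENGTH') or 0)
    return stream.read(length) if length > 0 else b''


# Simulated request for demonstration (made-up values)...
fake_env = {'CONTENT_LENGTH': '5'}
fake_stdin = io.BytesIO(b'\x89PNG\r\nextra')
body = read_post_bytes(fake_env, fake_stdin)
print(body)
```

In a real CGI script the call would be read_post_bytes(os.environ), which reads from sys.stdin.buffer.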

I did look around for an existing module someone might have written for handling this and found one but couldn’t get it to work — it had unannounced dependencies I didn’t feel like chasing. After looking at the form data for a while, I decided it should be easy enough to code a parser.

What I’m posting here is the first version that I’ll use until it breaks or needs improvement somehow. It has worked on all the files I’ve tested it with, but I make no claims that it’ll handle everything. Take it as a starting point.

To handle parsing the headers, I made a separate class:

    txt = 'Content-Disposition: form-data; name="fn"; filename="foo.py"'

    class MimeHeader (tuple):
        '''\
    MIME Header class. Parses text to a defined tuple.

    Expected text has the syntax:
        <name>: <text> [; name=value [; ...]]

    Created tuple is:
        (name, text, fields)

    Properties:
        name        -field name
        text        -field text
        fields      -dictionary of name:value pairs

    Methods:
        str(obj)    -printable version
        repr(obj)   -debugging version
    '''
        def __new__ (cls, text):
            '''New MimeHeader instance.'''

            # Split text into name and value parts...
            parts = text.partition(': ')
            if parts[1] != ': ':
                raise ValueError(f'Invalid MIME Header (no ": "): "{text}"')

            # Split the value into fields...
            subparts = parts[2].split('; ')
            value = subparts[0]

            fields = {}
            for subp in subparts[1:]:
                # Split field into name and value...
                key,val = subp.split('=', maxsplit=1)
                # Add field; remove any surrounding double-quotes...
                fields[key] = val.strip('"')

            # Delegate creating new instance to tuple...
            return super().__new__(cls, (parts[0], value, fields))

        @property
        def name (self): return self[0]

        @property
        def text (self): return self[1]

        @property
        def fields (self): return self[2]

        def __str__ (self):
            return f'{self.name}: {self.text}'

        def __repr__ (self):
            return f'<{type(self).__name__} @{id(self):08x}>'


    if __name__ == '__main__':
        hdr = MimeHeader(txt)
        print(f'Header: {hdr} {hdr!r}')
        print(f'Header Name: {hdr.name}')
        print(f'Header Text: {hdr.text}')
        print('Header Fields:')
        for nam,val in hdr.fields.items():
            print(f'> {nam}: {val}')
        print()

Which should be pretty self-explanatory. It subclasses tuple, so MimeHeader objects are tuples with named fields plus nice debug and string representations. When run, this prints:

Header: Content-Disposition: form-data <MimeHeader @2388d272390>
Header Name: Content-Disposition
Header Text: form-data
Header Fields:
> name: fn
> filename: foo.py
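
For comparison, the standard library’s email package can pull the same pieces out of a header value. This is only a sketch of an alternative, not what the parser below uses, and the wrapper name parse_disposition is hypothetical:

```python
from email.message import Message


def parse_disposition(value):
    """Parse a Content-Disposition value with the stdlib email package."""
    msg = Message()
    msg['Content-Disposition'] = value
    # The disposition itself, e.g. 'form-data'.
    kind = msg.get_content_disposition()
    # get_params() returns (key, value) pairs; the first pair is the
    # disposition itself, so skip it to keep only the parameters.
    fields = dict(msg.get_params(header='content-disposition')[1:])
    return kind, fields


kind, fields = parse_disposition('form-data; name="fn"; filename="foo.py"')
print(kind)
print(fields)
```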

With that in hand, here’s the form data parser:

    from examples import MimeHeader

    EOL  = ord('\n')
    CR   = ord('\r')
    DASH = ord('-')

    def parse_multipart_form (content_type:str, byte_string:bytes) -> tuple:
        '''\
    Parse Multipart Form Data.

    Arguments:
        content_type    -content header text
        byte_string     -form data bytes

    Returns:
        tuple: (form, filename, filetype, filedata)

        Note: the last three fields will be None if no file data.
    '''
        form = {}               # form data to be returned
        parthdrs = {}           # temp dictionary for part headers
        content = []            # buffer for building file content
        buf = []                # temp buff for building strings
        state = 'prefix'        # current state
        boundary = None         # part separator string

        # Parse the content-type header to get the multipart boundary...
        ix = content_type.index('boundary=') + len('boundary=')
        boundary = content_type[ix:].lstrip('-')

        # Iterate over the bytes of the data...
        for bx in byte_string:

            if state == 'prefix':
                # Accumulate dashes...
                if bx == DASH:
                    buf.append(bx)
                    continue
                # Not a dash; separator text begins...
                if len(buf) < 6:
                    raise SyntaxError(f'Expected six hyphens, not {len(buf)}.')
                # Add first separator character to buffer...
                buf = [bx]
                state = 'sep'
                continue

            if state == 'sep':
                # End of separator...
                if bx == EOL:
                    sep_str = ''.join(chr(b) for b in buf)
                    if not sep_str.startswith(boundary):
                        raise SyntaxError(f'Invalid Separator: "{sep_str}"')
                    buf = []
                    parthdrs = {}
                    state = 'part.start'
                    continue
                # Ignore CR characters...
                if bx == CR:
                    continue
                # Add separator character to buffer...
                buf.append(bx)
                continue

            if state == 'part.start':
                # Blank line (instead of header text)...
                if bx == EOL:
                    # End of part headers...
                    content = []
                    state = 'part.data'
                    continue
                # Ignore CR characters...
                if bx == CR:
                    continue
                # Add first part header character to buffer...
                buf = [bx]
                state = 'part.hdr'
                continue

            if state == 'part.hdr':
                # Handle end of line...
                if bx == EOL:
                    txt = ''.join(chr(b) for b in buf)
                    hdr = MimeHeader(txt)
                    parthdrs[hdr.name.lower()] = hdr
                    state = 'part.start'
                    continue
                # Ignore CR characters...
                if bx == CR:
                    continue
                # Add part header character to buffer...
                buf.append(bx)
                continue

            if state == 'part.data':
                # A dash might mean a separator...
                if bx == DASH:
                    buf = [bx]
                    state = 'test.prefix'
                    continue
                # Add content byte to buffer...
                content.append(bx)
                continue

            if state == 'test.prefix':
                # Not a dash; start of separator...
                if bx != DASH:
                    # Not a separator; add to content and resume...
                    if len(buf) != 6:
                        # Not a separator...
                        content.extend(buf)
                        content.append(bx)
                        state = 'part.data'
                        continue
                    # Got 6 dashes; might be a separator...
                    buf = [bx]
                    state = 'test.sep'
                    continue
                # Add character to possible prefix string...
                buf.append(bx)
                continue

            if state == 'test.sep':
                # End of line; test separator...
                if bx == EOL:
                    txt = (''.join(chr(c) for c in buf)).strip()
                    if txt != boundary:
                        # Not a separator; restore the six dashes dropped
                        # when leaving test.prefix, then the rest of the line...
                        content.extend([DASH] * 6)
                        content.extend(buf)
                        content.append(bx)
                        state = 'part.data'
                        continue
                    # It is a separator; get disposition header...
                    cdisp = parthdrs['content-disposition']
                    name = cdisp.fields['name']
                    # Add file content (bytes) to form (strip trailing CR-LF)...
                    form[name] = bytes(content[0:-2])
                    # Add the disposition header...
                    form[f'{name}.disp'] = cdisp
                    # If content-type header provided...
                    if 'content-type' in parthdrs:
                        # Add to form...
                        form[f'{name}.type'] = parthdrs['content-type']
                    # Get ready for next part...
                    buf = []
                    parthdrs = {}
                    state = 'part.start'
                    continue
                # Add to separator string...
                buf.append(bx)
                continue

        # Extract the filename, filetype, and file data for convenience...
        filedata, filename, filetype = None,None,None
        for name in form:
            if name == 'fn':
                filedata = form['fn']
                filetype = form['fn.type'].text
                filename = form['fn.disp'].fields['filename']
                if filetype.startswith('text'):
                    filedata = str(filedata, encoding='utf8')
                continue

            # Also convert simple fields from bytes to strings...
            if '.' not in name:
                txt = form[name]
                if isinstance(txt,bytes):
                    form[name] = str(txt, encoding='utf8')

        form['fn.name'] = filename
        return (form, filename, filetype, filedata)

Kinda long, but state machines tend to be, because each state requires handling code. It’s not as well commented as I’d like for publication, but the code isn’t doing anything especially complicated. It also lacks as much vertical whitespace as I usually use for clarity, but I wanted to keep each state’s code together.
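
As a cross-check on a hand-written parser, the standard library’s email package can also split a multipart body. The sketch below is an alternative technique, not the method used above, and it assumes the body is prefixed with a Content-Type header line so the email parser can find the boundary. The function name parse_with_email and the demo data are hypothetical:

```python
from email import message_from_bytes


def parse_with_email(content_type, body):
    """Parse multipart/form-data bytes via the stdlib email package."""
    # The email parser expects headers first, so supply the content type.
    header = b'Content-Type: ' + content_type.encode('ascii') + b'\r\n\r\n'
    msg = message_from_bytes(header + body)
    fields = {}
    if msg.is_multipart():
        for part in msg.get_payload():
            name = part.get_param('name', header='content-disposition')
            # Store (filename or None, payload bytes) for each field.
            fields[name] = (part.get_filename(), part.get_payload(decode=True))
    return fields


body = (
    b'------DemoBoundary\r\n'
    b'Content-Disposition: form-data; name="cols"\r\n'
    b'\r\n'
    b'16\r\n'
    b'------DemoBoundary\r\n'
    b'Content-Disposition: form-data; name="fn"; filename="foo.py"\r\n'
    b'Content-Type: text/x-python\r\n'
    b'\r\n'
    b"print('hi')\r\n"
    b'------DemoBoundary--\r\n'
)
fields = parse_with_email('multipart/form-data; boundary=----DemoBoundary', body)
print(fields)
```

The demo uses only short text payloads; binary uploads would need separate testing before relying on this.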

Here’s the state diagram for the state machine in the above code:

[Image: state diagram]

Text in blue is the condition that moves the state. Blue text in parentheses is an action; blue text in square brackets is an extended condition. For instance, any character other than a dash moves from the prefix state, which is counting dashes, to the sep (separator) state, which will gather the separator string. The “not 6” and “got 6” conditions check that the separator prefix is six dashes. Refer to the sample of multipart form text above to follow the state diagram and code.
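
One detail in the sample data is worth keeping in mind when following the diagram: the final separator line ends with two extra hyphens, which marks the end of the whole form. The test.sep state appears to compare against the plain boundary only, so it may be worth checking that the last part of a form is captured. A hypothetical helper (not in the ZIP file) that tells the two kinds of separator apart might look like this sketch:

```python
def classify_separator(line, boundary):
    """Classify a line as a part separator, the closing separator, or neither.

    boundary is the value from the content-type header, leading hyphens
    included. Returns 'part', 'close', or None.
    """
    text = line.rstrip(b'\r\n')
    # Separator lines are the boundary with two hyphens in front.
    delimiter = b'--' + boundary.encode('ascii')
    if text == delimiter:
        return 'part'
    # The closing separator adds two more hyphens at the end.
    if text == delimiter + b'--':
        return 'close'
    return None


bnd = '----WebKitFormBoundaryFvgrhoe2FqQdLu4J'
print(classify_separator(b'------WebKitFormBoundaryFvgrhoe2FqQdLu4J\r\n', bnd))
print(classify_separator(b'------WebKitFormBoundaryFvgrhoe2FqQdLu4J--\r\n', bnd))
print(classify_separator(b'------ just some dashes in a file\r\n', bnd))
```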

We can exercise parse_multipart_form with this code:

    from sys import argv
    from examples import parse_multipart_form

    FName = r'C:\Demo\Python\multipart-1.txt'
    CType = r'multipart/form-data; boundary=----WebK...Lu4J'

    def test_parser (file_name:str, content_type:str) -> tuple:
        '''Function to exercise Multipart parser.'''

        # Read multipart form data from file (binary mode)...
        with open(file_name, mode='rb') as fp:
            data = fp.read()
        print(f'loaded: {file_name}')
        print(f'bytes: {len(data)}')
        print()

        # Parse the multipart form...
        form,fname,ftype,fdata = parse_multipart_form(content_type, data)

        # List the form data...
        for key in sorted(form):
            val = form[key]

            if key == 'fn':
                print(f'{key}: <filedata>')
                continue

            if isinstance(val, str):
                print(f'{key}: {val}')
                continue

            print(f'{key}: {form[key].text}')

        print()

        return (form, fname, ftype, fdata)

    if __name__ == '__main__':
        print(f'autorun: {argv[0]}')

        file_name = argv[1] if 1 < len(argv) else FName
        content_type = argv[2] if 2 < len(argv) else CType

        print()
        form,fname,ftype,fdata = test_parser(file_name, content_type)
        print()

Note that in the source above I’ve truncated the content type global variable CType (line #5) to make the line fit this post. The actual code (as in the ZIP file) contains the full-length separator string. The code also expects a file containing form data. An example file is included in the ZIP file. Note that the content type string defining the separator must use the same separator string as the file data does.
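
Without the ZIP file, a test input can be generated instead. Here is a sketch of a builder for multipart form bytes. The function name build_multipart, the boundary, and the field values are all made up for the demo, and file parts are given a generic application/octet-stream type as an assumption:

```python
def build_multipart(fields, boundary):
    """Build multipart/form-data bytes from (name, filename, data) tuples.

    filename is None for simple fields; data is always bytes.
    """
    delimiter = b'--' + boundary.encode('ascii')
    lines = []
    for name, filename, data in fields:
        disp = f'Content-Disposition: form-data; name="{name}"'
        lines.append(delimiter)
        if filename is not None:
            # File parts carry the filename and a content type header.
            disp += f'; filename="{filename}"'
            lines.append(disp.encode('utf-8'))
            lines.append(b'Content-Type: application/octet-stream')
        else:
            lines.append(disp.encode('utf-8'))
        lines.append(b'')       # blank line ends the part headers
        lines.append(data)
    lines.append(delimiter + b'--')   # closing separator
    lines.append(b'')
    return b'\r\n'.join(lines)


demo = build_multipart(
    [('cols', None, b'16'), ('fn', 'foo.py', b"print('hi')\n")],
    '----DemoBoundary',
)
print(demo.decode('ascii'))
```

The matching content type string for this demo data would be multipart/form-data; boundary=----DemoBoundary.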

When run, this prints:

autorun: fragment.py

loaded: C:\Demo\Python\multipart-1.txt
bytes: 3509

bd: le
bd.disp: form-data
cmd: Line#
cmd.disp: form-data
cols: 16
cols.disp: form-data
ctrlId: 1768274923354599726
ctrlId.disp: form-data
data: x
data.disp: form-data
dw: 2
dw.disp: form-data
fn: <filedata>
fn.disp: form-data
fn.name: email_spam.py
fn.type: text/x-python
nbrs: d
nbrs.disp: form-data
nw: 4
nw.disp: form-data
sessId: EWZJGUQLEDTZEQXWOMERLWZY
sessId.disp: form-data
size: 1
size.disp: form-data
userId: HT/50201
userId.disp: form-data

Not the prettiest output, but it suffices. Each field is from the webform I use to generate the input (and submit files for different kinds of processing). Here’s a screenshot of the UI I made:

[Image: screenshot of the webform]

It can produce hex dumps, line numbers, sorted output, and other processes including character histograms (output shown below webform) and word counts. It’s also capable of displaying CSV and TAB files as tables.

The Py2H buttons convert Python to colorized HTML (the two reflect different versions). Just a bunch of handy file utilities I use often enough to want simple apps to do them.

And I think that’s about all for this time.

Link: Zip file containing all code fragments used in this post.

∅
